Managing Apache HBasePDF version

Reading data from HBase

The Get and Scan are the two ways to read data from HBase, aside from manually parsing HFiles.

A Get is simply a Scan limited by the API to one row. A Scan fetches zero or more rows of a table. By default, a Scan reads the entire table from start to end. You can limit your Scan results in several different ways, which affect the Scan's load in terms of IO, network, or both, as well as processing load on the client side. This topic is provided as a quick reference. Refer to the API documentation for Scan for more in-depth information. You can also perform Get and Scan using the HBase Shell, the REST API, or the Thrift API.

  • Specify a startrow or stoprow or both. Neither startrow nor stoprow need to exist. Because HBase sorts rows lexicographically, it will return the first row after startrow would have occurred, and will stop returning rows after stoprow would have occurred.The goal is to reduce IO and network.
    • The startrow is inclusive and the stoprow is exclusive. Given a table with rows a, b, c, d, e, f, and startrow of c and stoprow of f, rows c-e are returned.
    • If you omit startrow, the first row of the table is the startrow.
    • If you omit the stoprow, all results after startrow (including startrow) are returned.
    • If startrow is lexicographically after stoprow, and you set Scan setReversed(boolean reversed) to true, the results are returned in reverse order. Given the same table above, with rows a-f, if you specify c as the stoprow and f as the startrow, rows f, e, and d are returned.
    Scan()
         Scan(byte[] startRow)
         Scan(byte[] startRow, byte[] stopRow)
  • Specify a scanner cache that will be filled before the Scan result is returned, setting setCaching to the number of rows to cache before returning the result. By default, the caching setting on the table is used. The goal is to balance IO and network load.
    public Scan setCaching(int caching)
  • To limit the number of columns if your table has very wide rows (rows with a large number of columns), use setBatch(int batch) and set it to the number of columns you want to return in one batch. A large number of columns is not a recommended design pattern.
    public Scan setBatch(int batch)
  • To specify a maximum result size, use setMaxResultSize(long), with the number of bytes. The goal is to reduce IO and network.
    public Scan setMaxResultSize(long maxResultSize)
  • When you use setCaching and setMaxResultSize together, single server requests are limited by either number of rows or maximum result size, whichever limit comes first.
  • You can limit the scan to specific column families or columns by using addFamily or addColumn. The goal is to reduce IO and network. IO is reduced because each column family is represented by a Store on each RegionServer, and only the Stores representing the specific column families in question need to be accessed.
    public Scan addColumn(byte[] family,
         byte[] qualifier)
         
         public Scan addFamily(byte[] family)
  • You can specify a range of timestamps or a single timestamp by specifying setTimeRange or setTimestamp.
    public Scan setTimeRange(long minStamp,
         long maxStamp)
         throws IOException
         
         public Scan setTimeStamp(long timestamp)
         throws IOException
  • You can retrieve a maximum number of versions by using setMaxVersions.
    public Scan setMaxVersions(int maxVersions)
  • You can use a filter by using setFilter. .
    public Scan setFilter(Filter filter)
  • You can disable the server-side block cache for a specific scan using the API setCacheBlocks(boolean). This is an expert setting and should only be used if you know what you are doing.