Reading Data from HBase

Get and Scan are the two ways to read data from HBase, aside from manually parsing HFiles. A Get is simply a Scan limited by the API to one row. A Scan fetches zero or more rows of a table. By default, a Scan reads the entire table from start to end. You can limit your Scan results in several different ways, which affect the Scan's load in terms of IO, network, or both, as well as processing load on the client side. This topic is provided as a quick reference. Refer to the API documentation for Scan for more in-depth information. You can also perform Gets and Scan using the HBase Shell.

  • Specify a startrow or stoprow or both. Neither startrow nor stoprow need to exist. Because HBase sorts rows lexicographically, it will return the first row after startrow would have occurred, and will stop returning rows after stoprow would have occurred.The goal is to reduce IO and network.
    • The startrow is inclusive and the stoprow is exclusive. Given a table with rows a, b, c, d, e, f, and startrow of c and stoprow of f, rows c-e are returned.
    • If you omit startrow, the first row of the table is the startrow.
    • If you omit the stoprow, all results after startrow (including startrow) are returned.
    • If startrow is lexicographically after stoprow, and you set Scan setReversed(boolean reversed) to true, the results are returned in reverse order. Given the same table above, with rows a-f, if you specify c as the stoprow and f as the startrow, rows f, e, and d are returned.
    Scan()
    Scan(byte[] startRow)
    Scan(byte[] startRow, byte[] stopRow)
  • Specify a scanner cache that will be filled before the Scan result is returned, setting setCaching to the number of rows to cache before returning the result. By default, the caching setting on the table is used. The goal is to balance IO and network load.
    public Scan setCaching(int caching)
  • To limit the number of columns if your table has very wide rows (rows with a large number of columns), use setBatch(int batch) and set it to the number of columns you want to return in one batch. A large number of columns is not a recommended design pattern.
    public Scan setBatch(int batch)
  • To specify a maximum result size, use setMaxResultSize(long), with the number of bytes. The goal is to reduce IO and network.
    public Scan setMaxResultSize(long maxResultSize)
  • When you use setCaching and setMaxResultSize together, single server requests are limited by either number of rows or maximum result size, whichever limit comes first.
  • You can limit the scan to specific column families or columns by using addFamily or addColumn. The goal is to reduce IO and network. IO is reduced because each column family is represented by a Store on each region server, and only the Stores representing the specific column families in question need to be accessed.
    public Scan addColumn(byte[] family,
                 byte[] qualifier)
    
    public Scan addFamily(byte[] family)
  • You can specify a range of timestamps or a single timestamp by specifying setTimeRange or setTimestamp.
    public Scan setTimeRange(long minStamp,
                    long maxStamp)
                      throws IOException
    
    public Scan setTimeStamp(long timestamp)
                      throws IOException
  • You can retrieve a maximum number of versions by using setMaxVersions.
    public Scan setMaxVersions(int maxVersions)
  • You can use a filter by using setFilter. Filters are discussed in detail in HBase Filtering and the Filter API.
    public Scan setFilter(Filter filter)
  • You can disable the server-side block cache for a specific scan using the API setCacheBlocks(boolean). This is an expert setting and should only be used if you know what you are doing.

Perform Scans Using HBase Shell

You can perform scans using HBase Shell, for testing or quick queries. Use the following guidelines or issue the scan command in HBase Shell with no parameters for more usage information. This represents only a subset of possibilities.

# Display usage information
hbase> scan

# Scan all rows of table 't1'
hbase> scan 't1'

# Specify a startrow, limit the result to 10 rows, and only return selected columns
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}

# Specify a timerange
hbase> scan 't1', {TIMERANGE => [1303668804, 1303668904]}

# Specify a custom filter
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}

# Disable the block cache for a specific scan (experts only)
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

Hedged Reads

Hadoop 2.4 introduced a new feature called hedged reads, in HDFS-5776. If a read from a block is slow, the HDFS client starts up another parallel, 'hedged' read against a different block replica. The result of whichever read returns first is used, and the outstanding read is cancelled. This feature helps in situations where a read occasionally takes a long time rather than when there is a systemic problem. Hedged reads can be enabled for HBase when the HFiles are stored in HDFS. This feature is disabled by default.

Enabling Hedged Reads for HBase Using the Command Line

To enable hedged reads for HBase, edit the hbase-site.xml file on each server. Set dfs.client.hedged.read.threadpool.size to the number of threads to dedicate to running hedged threads, and set the dfs.client.hedged.read.threshold.millis configuration property to the number of milliseconds to wait before starting a second read against a different block replica. Set dfs.client.hedged.read.threadpool.size to 0 or remove it from the configuration to disable the feature. After changing these properties, restart your cluster.

The following is an example configuration for hedged reads for HBase.

<property>
  <name>dfs.client.hedged.read.threadpool.size</name>
  <value>20</value>  <!-- 20 threads -->
</property>
<property>
  <name>dfs.client.hedged.read.threshold.millis</name>
  <value>10</value>  <!-- 10 milliseconds -->
</property>