Reading data from HBase
Get and Scan are the two ways to read data from HBase, aside from manually parsing HFiles. A Get is simply a Scan limited by the API to one row. A Scan fetches zero or more rows of a table. By default, a Scan reads the entire table from start to end. You can limit your Scan results in several different ways, which affect the Scan's load in terms of IO, network, or both, as well as processing load on the client side. This topic is provided as a quick reference. Refer to the API documentation for Scan for more in-depth information. You can also perform Gets and Scans using the HBase Shell, the REST API, or the Thrift API.
- Specify a startrow or stoprow, or both. Neither startrow nor stoprow needs to exist. Because HBase sorts rows lexicographically, the scan returns rows beginning with the first row at or after the position where startrow would occur, and stops returning rows at the position where stoprow would occur. The goal is to reduce IO and network.
  - The startrow is inclusive and the stoprow is exclusive. Given a table with rows a, b, c, d, e, f, a startrow of c, and a stoprow of f, rows c through e are returned.
  - If you omit startrow, the first row of the table is the startrow.
  - If you omit the stoprow, all results after startrow (including startrow) are returned.
  - If startrow is lexicographically after stoprow, and you set Scan setReversed(boolean reversed) to true, the results are returned in reverse order. Given the same table above, with rows a-f, if you specify c as the stoprow and f as the startrow, rows f, e, and d are returned.
Scan()
Scan(byte[] startRow)
Scan(byte[] startRow, byte[] stopRow)
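The startrow and stoprow rules above can be checked without a cluster. The following is a minimal sketch that models a table as a sorted set of row keys; it does not use the HBase API. The class and method names (RowRangeModel, scan) are hypothetical, and plain Java strings stand in for HBase's byte-array row keys, which sort the same way for ASCII keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of startrow/stoprow semantics; not part of the HBase API.
class RowRangeModel {
    // Compares two row keys in scan order: ascending normally, descending when reversed.
    private static int compareInScanOrder(String a, String b, boolean reversed) {
        return reversed ? b.compareTo(a) : a.compareTo(b);
    }

    // Returns the rows a scan would visit. A null bound means "not specified".
    public static List<String> scan(NavigableSet<String> rows, String startRow,
                                    String stopRow, boolean reversed) {
        List<String> visited = new ArrayList<>();
        for (String row : reversed ? rows.descendingSet() : rows) {
            if (startRow != null && compareInScanOrder(row, startRow, reversed) < 0) {
                continue; // still before startrow in scan order
            }
            if (stopRow != null && compareInScanOrder(row, stopRow, reversed) >= 0) {
                break; // reached stoprow, which is exclusive
            }
            visited.add(row);
        }
        return visited;
    }

    public static void main(String[] args) {
        NavigableSet<String> rows = new TreeSet<>(Arrays.asList("a", "b", "c", "d", "e", "f"));
        System.out.println(scan(rows, "c", "f", false));   // [c, d, e]
        System.out.println(scan(rows, null, "c", false));  // [a, b]
        System.out.println(scan(rows, "d", null, false));  // [d, e, f]
        System.out.println(scan(rows, "cc", "ee", false)); // [d, e]
        System.out.println(scan(rows, "f", "c", true));    // [f, e, d]
    }
}
```

The fourth call uses bounds that do not exist in the table, which illustrates that the scan starts and stops where those keys would have sorted.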
- Specify a scanner cache that is filled before the Scan result is returned, by setting setCaching to the number of rows to cache before returning the result. By default, the caching setting on the table is used. The goal is to balance IO and network load.
public Scan setCaching(int caching)
- To limit the number of columns if your table has very wide rows (rows with a
large number of columns), use setBatch(int batch) and set it to the
number of columns you want to return in one batch. A large number of
columns is not a recommended design pattern.
public Scan setBatch(int batch)
- To specify a maximum result size, use
setMaxResultSize(long), with the number of bytes. The goal is to reduce IO and network.
public Scan setMaxResultSize(long maxResultSize)
- When you use setCaching and setMaxResultSize together, single server requests are limited by either the number of rows or the maximum result size, whichever limit comes first.
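To make the "whichever limit comes first" rule concrete, here is a simplified model of how many rows fit into one server response. It is a sketch, not HBase code: the class and method names are hypothetical, and it assumes that a row is added to the response first and the limits are checked afterwards, so a response may exceed the byte limit by part of one row.

```java
import java.util.Arrays;

// Hypothetical model of one scan response; not part of the HBase API.
class ScanResponseModel {
    // Returns how many rows are sent in a single response, given each row's size,
    // the row limit (caching), and the byte limit (maxResultSize).
    // Assumption: both limits are checked after a row has been added.
    public static int rowsInOneResponse(long[] rowSizesInBytes, int caching, long maxResultSize) {
        int rows = 0;
        long bytes = 0;
        for (long size : rowSizesInBytes) {
            rows++;
            bytes += size;
            if (rows >= caching || bytes >= maxResultSize) {
                break; // one of the two limits was reached first
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        long[] sizes = new long[10];
        Arrays.fill(sizes, 100L); // ten rows of 100 bytes each

        System.out.println(rowsInOneResponse(sizes, 3, 10_000L));   // 3: row limit reached first
        System.out.println(rowsInOneResponse(sizes, 100, 450L));    // 5: byte limit reached first
        System.out.println(rowsInOneResponse(sizes, 100, 10_000L)); // 10: neither limit reached
    }
}
```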
- You can limit the scan to specific column families or columns by using addFamily or addColumn. The goal is to reduce IO and network. IO is reduced because each column family is represented by a Store on each RegionServer, and only the Stores representing the specific column families in question need to be accessed.
public Scan addColumn(byte[] family, byte[] qualifier)
public Scan addFamily(byte[] family)
- You can specify a range of timestamps or a single timestamp by using setTimeRange or setTimeStamp.
public Scan setTimeRange(long minStamp, long maxStamp) throws IOException
public Scan setTimeStamp(long timestamp) throws IOException
- You can retrieve a maximum number of versions by using setMaxVersions.
public Scan setMaxVersions(int maxVersions)
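The interaction between a timestamp range and a version limit can be sketched with a small model of one cell's versions. This is not HBase code, and the names (VersionSelectionModel, select) are hypothetical. It assumes that the time range is half-open, with minStamp inclusive and maxStamp exclusive, that versions are returned newest first, and that the version limit counts only versions inside the range. A single timestamp ts corresponds to the range from ts to ts + 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of version selection for one cell; not part of the HBase API.
class VersionSelectionModel {
    // Returns the timestamps of the versions that would be read, newest first.
    // Assumption: the range is [minStamp, maxStamp) and maxVersions applies within it.
    public static List<Long> select(NavigableSet<Long> timestamps,
                                    long minStamp, long maxStamp, int maxVersions) {
        List<Long> selected = new ArrayList<>();
        for (long ts : timestamps.descendingSet()) { // newest version first
            if (selected.size() == maxVersions) {
                break; // version limit reached
            }
            if (ts >= minStamp && ts < maxStamp) {
                selected.add(ts);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        NavigableSet<Long> versions = new TreeSet<>(Arrays.asList(10L, 20L, 30L, 40L));
        System.out.println(select(versions, 0L, Long.MAX_VALUE, 2)); // [40, 30]
        System.out.println(select(versions, 10L, 40L, 5));           // [30, 20, 10]
        System.out.println(select(versions, 20L, 21L, 1));           // [20]
    }
}
```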
- You can use a filter by using setFilter.
public Scan setFilter(Filter filter)
- You can disable the server-side block cache for a specific scan
using the API
setCacheBlocks(boolean). This is an expert setting and should only be used if you know what you are doing.
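Several of the options in this topic can be combined on one Scan. The following is a minimal sketch rather than a definitive implementation. It assumes an HBase 1.x or 2.x client library on the classpath (where the constructor and setters listed in this topic are available), a reachable cluster configured through the default configuration files, and a hypothetical table named example_table with a hypothetical column family cf.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LimitedScanExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();

        // Rows from "c" (inclusive) up to "f" (exclusive).
        Scan scan = new Scan(Bytes.toBytes("c"), Bytes.toBytes("f"));
        scan.setCaching(100);                              // at most 100 rows per server request
        scan.setMaxResultSize(2L * 1024 * 1024);           // or about 2 MB, whichever comes first
        scan.addFamily(Bytes.toBytes("cf"));               // hypothetical column family
        scan.setTimeRange(0L, System.currentTimeMillis()); // only cells written before now
        scan.setMaxVersions(3);                            // up to three versions per cell

        // "example_table" is a hypothetical table name.
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("example_table"));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        }
    }
}
```

The try-with-resources block closes the scanner, table, and connection in reverse order, which releases the server-side scanner even if iteration fails.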