Reading data from HBase
Get and Scan are the two ways to read data from HBase, aside from manually parsing HFiles. A Get is simply a Scan limited by the API to one row. A Scan fetches zero or more rows of a table. By default, a Scan reads the entire table from start to end. You can limit your Scan results in several different ways, which affect the Scan's load in terms of IO, network, or both, as well as the processing load on the client side. This topic is provided as a quick reference. Refer to the API documentation for Scan for more in-depth information. You can also perform Get and Scan operations using the HBase Shell, the REST API, or the Thrift API.
- Specify a startrow or stoprow, or both. Neither startrow nor stoprow needs to exist. Because HBase sorts rows lexicographically, the Scan starts at the first row at or after the position where startrow would occur and stops before the position where stoprow would occur. The goal is to reduce IO and network.
  - The startrow is inclusive and the stoprow is exclusive. Given a table with rows a, b, c, d, e, f, a startrow of c, and a stoprow of f, rows c-e are returned.
  - If you omit startrow, the first row of the table is the startrow.
  - If you omit the stoprow, all results after startrow (including startrow) are returned.
  - If startrow is lexicographically after stoprow, and you set Scan setReversed(boolean reversed) to true, the results are returned in reverse order. Given the same table above, with rows a-f, if you specify c as the stoprow and f as the startrow, rows f, e, and d are returned.

  Scan()
  Scan(byte[] startRow)
  Scan(byte[] startRow, byte[] stopRow)
- Specify a scanner cache that is filled before the Scan result is returned, by setting setCaching to the number of rows to cache before returning the result. By default, the caching setting on the table is used. The goal is to balance IO and network load.
  public Scan setCaching(int caching)
- To limit the number of columns if your table has very wide rows (rows with a large number of columns), use setBatch(int batch) and set it to the number of columns you want to return in one batch. A large number of columns is not a recommended design pattern.
  public Scan setBatch(int batch)
- To specify a maximum result size, use setMaxResultSize(long), with the number of bytes. The goal is to reduce IO and network.
  public Scan setMaxResultSize(long maxResultSize)
- When you use setCaching and setMaxResultSize together, single server requests are limited by either number of rows or maximum result size, whichever limit comes first.
- You can limit the scan to specific column families or columns by using addFamily or addColumn. The goal is to reduce IO and network. IO is reduced because each column family is represented by a Store on each RegionServer, and only the Stores representing the specific column families in question need to be accessed.
  public Scan addColumn(byte[] family, byte[] qualifier)
  public Scan addFamily(byte[] family)
- You can specify a range of timestamps or a single timestamp by using setTimeRange or setTimeStamp.
  public Scan setTimeRange(long minStamp, long maxStamp) throws IOException
  public Scan setTimeStamp(long timestamp) throws IOException
- You can retrieve up to a maximum number of versions of each column by using setMaxVersions.
  public Scan setMaxVersions(int maxVersions)
- You can apply a filter by using setFilter.
  public Scan setFilter(Filter filter)
- You can disable the server-side block cache for a specific scan using setCacheBlocks(boolean). This is an expert setting and should only be used if you know what you are doing.
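
The startrow and stoprow rules in the first item above can be checked with a small in-memory model. The sketch below is not HBase client code: the class RowRangeModel and its scan method are hypothetical names, a java.util.TreeMap stands in for a table, and Java String ordering stands in for HBase's byte-wise lexicographic ordering (the two agree for the single-letter keys used here). Returning no rows for an inverted range is a simplification of this model, not a statement about HBase behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical model of startrow/stoprow semantics; not HBase client code. */
class RowRangeModel {

    /**
     * Returns the row keys a scan would visit, in scan order.
     * startRow is inclusive, stopRow is exclusive, and null means "omitted".
     * Neither key has to exist in the table.
     */
    static List<String> scan(NavigableMap<String, String> table,
                             String startRow, String stopRow, boolean reversed) {
        if (startRow != null && stopRow != null) {
            int cmp = startRow.compareTo(stopRow);
            // Assumption of this model: an inverted range yields no rows.
            if (reversed ? cmp < 0 : cmp > 0) {
                return new ArrayList<>();
            }
        }
        // A descending view lets the same tail/head logic serve both directions.
        NavigableMap<String, String> view = reversed ? table.descendingMap() : table;
        if (startRow != null) {
            view = view.tailMap(startRow, true);   // inclusive start
        }
        if (stopRow != null) {
            view = view.headMap(stopRow, false);   // exclusive stop
        }
        return new ArrayList<>(view.keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String row : new String[] {"a", "b", "c", "d", "e", "f"}) {
            table.put(row, "value-" + row);
        }
        System.out.println(scan(table, "c", "f", false));  // prints [c, d, e]
        System.out.println(scan(table, null, "c", false)); // prints [a, b]
        System.out.println(scan(table, "d", null, false)); // prints [d, e, f]
        System.out.println(scan(table, "f", "c", true));   // prints [f, e, d]
    }
}
```

The forward and reversed cases reproduce the two worked examples in the list: rows c-e for startrow c and stoprow f, and rows f, e, d when the same table is scanned in reverse from f down to c.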
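
The interaction between setCaching and setMaxResultSize ("whichever limit comes first") can also be illustrated with a simplified model. The sketch below is hypothetical and does not call HBase: ScanChunkModel and chunk are made-up names, each row is represented only by its size in bytes, and the model assumes that the row which crosses the size limit is still included in the current request. Real server behavior may differ in such details.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of row-count and byte limits on one request; not HBase code. */
class ScanChunkModel {

    /**
     * Splits row sizes (in bytes) into per-request chunks. A chunk closes as
     * soon as it holds `caching` rows or its accumulated size reaches
     * `maxResultSize`, whichever happens first. Both limits are assumed positive.
     */
    static List<List<Integer>> chunk(int[] rowSizes, int caching, long maxResultSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long bytes = 0;
        for (int size : rowSizes) {
            current.add(size);
            bytes += size;
            // Assumption: the row that crosses the size limit stays in this chunk.
            if (current.size() >= caching || bytes >= maxResultSize) {
                chunks.add(current);
                current = new ArrayList<>();
                bytes = 0;
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current);   // final partial chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        int[] rows = {100, 100, 100, 400, 50, 50};
        // Row limit is reached first: prints [[100, 100, 100], [400, 50, 50]]
        System.out.println(chunk(rows, 3, 1000));
        // Size limit is reached first: prints [[100, 100, 100], [400], [50, 50]]
        System.out.println(chunk(rows, 10, 250));
    }
}
```

In the first call the row-count limit of 3 closes each chunk before 1000 bytes accumulate. In the second call the 250-byte limit closes chunks well before 10 rows are collected.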