Choose the right import method
Learn about how to choose the right data import method.
- The location, size, and format of your existing data
- Whether you need to import data once or periodically over time
- Whether you want to import the data in bulk or stream it into HBase regularly
- How fresh the HBase data needs to be
This topic helps you choose the correct method or composite of methods and provides example workflows for each method.
Always run HBase administrative commands as the HBase user (typically
hbase
).
If the data is already in an HBase table:
To move the data from one HBase cluster to another, use
snapshot
and either theclone_snapshot
orExportSnapshot
utility; or, use theCopyTable
utility.To move the data from one HBase cluster to another without downtime on either cluster, use replication.
If the data currently exists outside HBase:
If possible, write the data to HFile format, and use a BulkLoad to import it into HBase. The data is immediately available to HBase and you can bypass the normal write path, increasing efficiency.
If you prefer not to use bulk loads, and you are using a tool such as Pig, you can use it to import your data.
If you need to stream live data to HBase instead of import in bulk:
Write a Java client using the Java API, or use the Apache Thrift Proxy API to write a client in a language supported by Thrift.
Stream data directly into HBase using the REST Proxy API in conjunction with an HTTP client such as
wget
orcurl
.Use Flume or Spark.
Most likely, at least one of these methods works in your situation. If not, you can use MapReduce directly. Test the most feasible methods with a subset of your data to determine which one is optimal.