Managing Apache HBasePDF version

Choose the right import method

Learn about how to choose the right data import method.

The method you use for importing data into HBase depends on several factors:
  • The location, size, and format of your existing data
  • Whether you need to import data once or periodically over time
  • Whether you want to import the data in bulk or stream it into HBase regularly
  • How fresh the HBase data needs to be

This topic helps you choose the correct method or composite of methods and provides example workflows for each method.

Always run HBase administrative commands as the HBase user (typically hbase).

If the data is already in an HBase table:

  • To move the data from one HBase cluster to another, use snapshot and either the clone_snapshot or ExportSnapshot utility; or, use the CopyTable utility.

  • To move the data from one HBase cluster to another without downtime on either cluster, use replication.

If the data currently exists outside HBase:

  • If possible, write the data to HFile format, and use a BulkLoad to import it into HBase. The data is immediately available to HBase and you can bypass the normal write path, increasing efficiency.

  • If you prefer not to use bulk loads, and you are using a tool such as Pig, you can use it to import your data.

If you need to stream live data to HBase instead of import in bulk:

  • Write a Java client using the Java API, or use the Apache Thrift Proxy API to write a client in a language supported by Thrift.

  • Stream data directly into HBase using the REST Proxy API in conjunction with an HTTP client such as wget or curl.

  • Use Flume or Spark.

Most likely, at least one of these methods works in your situation. If not, you can use MapReduce directly. Test the most feasible methods with a subset of your data to determine which one is optimal.