Spark on HBase: Using the HBase Connector
The Spark-HBase connector (shc) is a Spark library that supports access to HBase tables as external sources or sinks. Application access is through Spark SQL at the data frame level, with support for optimizations such as partition pruning, predicate pushdown, and scanning.
The connector bridges the gap between the HBase key-value store and complex relational SQL queries. It is useful for Spark applications and interactive tools because it enables complex SQL queries over HBase tables within Spark, as well as joins between those tables and data frames. The connector leverages the standard Spark DataSource API for query optimization.
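As an illustration of the data-frame-level access described above, the following sketch follows the catalog-based pattern documented in the connector's readme: a JSON catalog maps data frame columns onto an HBase row key and column family, and the table is loaded through the DataSource API. The table name (table1), column family (cf1), and column names are hypothetical; filters on mapped columns are candidates for predicate pushdown.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object SHCReadExample {
  // Hypothetical catalog: maps data frame columns onto an HBase table
  // "table1" with row key "key" and one column family "cf1".
  def catalog = s"""{
                   |"table":{"namespace":"default", "name":"table1"},
                   |"rowkey":"key",
                   |"columns":{
                   |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
                   |"col1":{"cf":"cf1", "col":"col1", "type":"int"}
                   |}
                   |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SHCReadExample"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Load the HBase table as a data frame through the connector's
    // DataSource; simple filters can be pushed down to the HBase scan.
    val df: DataFrame = sqlContext.read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()

    df.filter($"col0" === "row050").select("col0", "col1").show()

    // Register the data frame so it can also be queried with SQL,
    // including joins against other data frames.
    df.registerTempTable("table1")
    sqlContext.sql("SELECT col0, col1 FROM table1 WHERE col1 > 10").show()
  }
}
```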
Note: The Spark HBase connector uses HBase jar files by default. If you want to submit jobs on an HBase cluster with Phoenix enabled, you must include /usr/hdp/current/phoenix-client/phoenix-server.jar in the --jars option of your spark-submit command; for example:

```
./bin/spark-submit --class your.application.class \
  --master yarn-client \
  --num-executors 2 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 \
  --repositories http://repo.hortonworks.com/content/groups/public/ \
  --jars /usr/hdp/current/phoenix-client/phoenix-server.jar \
  --files /etc/hbase/conf/hbase-site.xml /To/your/application/jar
```
The HBase connector library is available as a Spark package; you can download it from https://github.com/hortonworks-spark/shc. The repository readme file contains information about how to use the package with Spark applications.
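As a companion to the read sketch above, writing a data frame into HBase goes through the same catalog and DataSource format. This is a minimal sketch based on the readme's write pattern; the Record case class, the generated rows, and the region count passed via HBaseTableCatalog.newTable are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Hypothetical record type matching the catalog's columns.
case class Record(col0: String, col1: Int)

object SHCWriteExample {
  // Same hypothetical catalog as in the read sketch above.
  def catalog = s"""{
                   |"table":{"namespace":"default", "name":"table1"},
                   |"rowkey":"key",
                   |"columns":{
                   |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
                   |"col1":{"cf":"cf1", "col":"col1", "type":"int"}
                   |}
                   |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SHCWriteExample"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val data = (0 until 100).map(i => Record(f"row$i%03d", i))

    // Write the data frame to HBase; HBaseTableCatalog.newTable -> "5"
    // asks the connector to create the table with 5 regions if it does
    // not already exist.
    sc.parallelize(data).toDF.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
                   HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
}
```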
For more information about the Spark HBase connector, see Spark HBase Connector - a Year in Review.