Apache Spark Component Guide

Spark on HBase: Using the HBase Connector

The Spark-HBase connector (shc) is a Spark library that supports access to HBase tables as external data sources or sinks. Applications access the tables through Spark SQL at the DataFrame level, and the connector supports optimizations such as partition pruning, predicate pushdown, and scanning.

The connector bridges the gap between the HBase key-value store and complex relational SQL queries. It is useful for both Spark applications and interactive tools, because it allows complex SQL queries on top of an HBase table inside Spark, as well as joins between the table and DataFrames. The connector leverages the standard Spark DataSource API for query optimization.
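
As a rough illustration of the DataFrame-level access described above, the following Python sketch builds a connector catalog (a JSON string that maps DataFrame columns to HBase column families and qualifiers) and defines a helper that loads the table through the Spark SQL reader. The catalog layout and the data source name follow the connector's README as we understand it; the table name, column names, and the helper names build_catalog and read_hbase_table are hypothetical and only serve the example. Verify the details against the connector version you use.

```python
import json


def build_catalog(namespace, table, columns, rowkey="key"):
    """Return a catalog JSON string mapping DataFrame columns to HBase cells.

    `columns` maps a DataFrame column name to a
    (column family, qualifier, type) tuple. The row key column uses the
    reserved column family name "rowkey".
    """
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": rowkey,
        "columns": {
            name: {"cf": cf, "col": col, "type": dtype}
            for name, (cf, col, dtype) in columns.items()
        },
    })


# Hypothetical table layout, used only for illustration.
catalog = build_catalog(
    namespace="default",
    table="table1",
    columns={
        "col0": ("rowkey", "key", "string"),
        "col1": ("cf1", "col1", "boolean"),
        "col2": ("cf2", "col2", "double"),
    },
)

# Data source name under which the connector package is addressed
# (taken from the connector README; treat as an assumption).
SHC_FORMAT = "org.apache.spark.sql.execution.datasources.hbase"


def read_hbase_table(sql_context, catalog_json):
    """Load the HBase table described by catalog_json as a DataFrame.

    `sql_context` is an existing SQLContext, for example the one provided
    by the pyspark shell.
    """
    return (sql_context.read
            .options(catalog=catalog_json)
            .format(SHC_FORMAT)
            .load())
```

In a pyspark session started with the connector package on the classpath, calling read_hbase_table(sqlContext, catalog) should return a DataFrame that can be filtered, joined with other DataFrames, or registered as a temporary table and queried with sqlContext.sql. Filters on the row key column are the kind of predicate the connector can push down to HBase.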

Note

The Spark HBase connector uses HBase jar files by default. If you want to submit jobs on an HBase cluster with Phoenix enabled, you must include --jars phoenix-server.jar in your spark-submit command; for example:

./bin/spark-submit --class your.application.class \
  --master yarn-client \
  --num-executors 2 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 \
  --repositories http://repo.hortonworks.com/content/groups/public/ \
  --jars /usr/hdp/current/phoenix-client/phoenix-server.jar \
  --files /etc/hbase/conf/hbase-site.xml \
  /To/your/application/jar

The HBase connector library is available as a Spark package; you can download it from https://github.com/hortonworks-spark/shc. The repository readme file contains information about how to use the package with Spark applications.