Selecting a Connector
Use the following information to select an HBase connector for Spark.
The two connectors are designed to meet the needs of different workloads. In general, use the Hortonworks Spark-HBase Connector for SparkSQL, DataFrame, and other fixed-schema workloads. Use the RDD-Based Spark-HBase Connector for RDDs and other flexible-schema workloads.
Hortonworks Spark-HBase Connector
When you use the connector developed by Hortonworks, the underlying abstraction is the DataFrame, with support for optimizations such as partition pruning, predicate pushdown, and scanning. The connector is highly optimized to push filters down to the HBase level, which speeds up workloads. The tradeoff is limited flexibility, because you must specify your schema up front. The connector leverages the standard Spark DataSource API for query optimization.
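As a rough illustration of what specifying the schema up front looks like, the following sketch builds a JSON catalog of the kind described in the shc readme and passes it to the DataSource API from PySpark. The table name `web_events`, the column names, and the helper functions `build_catalog` and `load_table` are hypothetical and used only for illustration; check the readme for the catalog fields and options supported by your connector version.

```python
import json

# Data source name used by the Hortonworks connector, as given in the shc readme.
SHC_FORMAT = "org.apache.spark.sql.execution.datasources.hbase"


def build_catalog(namespace, table, rowkey_field, columns):
    """Return a catalog (the fixed schema) as a JSON string.

    `columns` maps a DataFrame field name to a tuple of
    (column family, column qualifier, type name).
    The row key field is declared with the reserved column family "rowkey".
    """
    cols = {rowkey_field: {"cf": "rowkey", "col": "key", "type": "string"}}
    for field, (family, qualifier, type_name) in columns.items():
        cols[field] = {"cf": family, "col": qualifier, "type": type_name}
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": "key",
        "columns": cols,
    })


def load_table(spark, catalog_json):
    """Load the HBase table as a DataFrame (assumes an existing SparkSession
    with the connector package on its classpath)."""
    return spark.read.options(catalog=catalog_json).format(SHC_FORMAT).load()


# Hypothetical table and columns, for illustration only.
catalog = build_catalog(
    "default",
    "web_events",
    "event_id",
    {"user": ("cf1", "user", "string"), "clicks": ("cf1", "clicks", "int")},
)
```

Because every column and type is declared before any data is read, filters on the resulting DataFrame, such as a filter on `clicks`, are the kind of predicate the connector can try to push down to HBase.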
The connector is open source. The Hortonworks Spark-HBase Connector library is available as a downloadable Spark package at https://github.com/hortonworks-spark/shc. The repository readme file contains information about how to use the package with Spark applications.
For more information about the connector, see the A Year in Review blog.
RDD-Based Spark-HBase Connector
The RDD-based connector is developed by the Apache community. It is designed for full flexibility: you can define the schema on read, so it is suitable for workloads where the schema is undefined at ingestion time. However, this architecture has some performance tradeoffs.
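To make "schema on read" concrete, here is a minimal plain-Python sketch. It does not use the hbase-spark API; the rows, the column names, and the `decode_row` function are hypothetical stand-ins for the raw byte cells that a scan returns. The point it shows is that types are applied by your own code when records are read, so rows with missing cells are handled at that moment rather than rejected at ingestion.

```python
import struct

# Raw cells as a scan might return them: (family, qualifier) -> bytes.
# Hypothetical data; HBase stores only bytes and no column types.
raw_rows = [
    {(b"cf1", b"user"): b"alice", (b"cf1", b"clicks"): struct.pack(">i", 7)},
    {(b"cf1", b"user"): b"bob"},  # this row has no "clicks" cell
]


def decode_row(cells):
    """Apply a schema at read time; a missing cell becomes None."""
    user = cells.get((b"cf1", b"user"))
    clicks = cells.get((b"cf1", b"clicks"))
    return {
        "user": user.decode("utf-8") if user is not None else None,
        # Assumes the value was written as a 4-byte big-endian integer.
        "clicks": struct.unpack(">i", clicks)[0] if clicks is not None else None,
    }


decoded = [decode_row(row) for row in raw_rows]
```

With an RDD of scan results, a function like `decode_row` would typically be applied in a `map` step. Every record is decoded in user code, and this per-record work is one reason the flexible approach gives up the filter pushdown that a fixed schema allows.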
Refer to the following table for other factors that might affect your choice of connector, along with the source repos and code examples for each.
Table 8.1. Comparison of the Spark-HBase Connectors
Factor | Hortonworks Spark-HBase Connector | RDD-Based Spark-HBase Connector |
---|---|---|
Source | Hortonworks | Apache HBase community |
Apache Open Source? | Yes | Yes |
Requires a Schema? | Yes: Fixed schema | No: Flexible schema |
Suitable Data for Connector | SparkSQL or DataFrame | RDD |
Main Repo | shc git repo | Apache hbase-spark git repo |
Sample Code for Java | Not available | Apache hbase.git repo |
Sample Code for Scala | shc git repo | Apache hbase.git repo |