The HBase-Spark Connector bridges the gap between the simple HBase key-value store
and complex relational SQL queries. It lets you run complex data analytics on
top of HBase using Spark.
An HBase DataFrame is a standard Spark DataFrame and can interact with any
other data source, such as Hive, ORC, Parquet, or JSON.
-
Go to the Spark or Spark3 service.
-
Click the Configuration tab.
-
Ensure that the HBase service is selected in Spark
Service as a dependency.
-
Select .
-
Select .
-
Locate the Spark Client Advanced Configuration Snippet (Safety Valve)
for spark-conf/spark-defaults.conf property, or search for it by
typing its name in the Search box.
-
Add the following properties to ensure that all required Phoenix and HBase
platform dependencies are available on the classpath for the Spark executors and
drivers:
For Spark 2:
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-protocol-shaded.jar:/opt/cloudera/parcels/CDH/jars/scala-library.jar
spark.driver.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/hbase-spark-protocol-shaded.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/scala-library.jar
For Spark 3:
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3-protocol-shaded.jar
spark.driver.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3-protocol-shaded.jar
If Spark and HBase are running on the same cluster, skip to step 8.
If you are using the HBase-Spark connector to connect to an HBase
instance outside of the cluster, run the hbase mapredcp command
on the remote cluster.
- Copy all JAR files listed in the output to the local cluster, and
add the JAR files to both *extraClassPath properties.
- Copy the directory containing the
hbase-site.xml from the remote cluster, and
add it to both *extraClassPath properties.
spark.executor.extraClassPath=/copied/hbase-conf/:/copied/hbase_connectors/lib/hbase-spark.jar:/copied/hbase_connectors/lib/hbase-spark-protocol-shaded.jar:/copied/hbase_connectors/lib/scala-library.jar:/copied/hbase-shaded-mapreduce-2.1.6.3.1.5.0-152,...
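The remote-cluster steps above can be sketched as a small shell helper. This is an illustration only: it assumes that hbase mapredcp prints a colon-separated list of JAR paths, that the JAR files were copied flat into a single local directory, and that the directory holding hbase-site.xml was copied separately. The function name and the /copied/... paths are hypothetical and not part of the connector.

```shell
# Sketch: map the JAR list printed by `hbase mapredcp` on the remote cluster
# to the local copies of those JARs, with the copied HBase configuration
# directory placed first.
#   $1  colon-separated JAR list (output of `hbase mapredcp`)
#   $2  local directory that holds the copied JAR files (assumed flat)
#   $3  local directory that holds the copied hbase-site.xml
# The body runs in a subshell so the IFS and noglob changes do not leak.
remote_classpath() (
  mapredcp=$1
  jar_dir=$2
  conf_dir=$3
  result=$conf_dir
  IFS=:      # split the JAR list on colons
  set -f     # do not expand glob characters in paths
  for entry in $mapredcp; do
    [ -n "$entry" ] || continue
    # keep only the file name and re-root it under the local JAR directory
    result="$result:$jar_dir/${entry##*/}"
  done
  printf '%s\n' "$result"
)

# Example call (paths are placeholders):
#   remote_classpath "/r/lib/a.jar:/r/lib/b.jar" /copied/jars /copied/hbase-conf
```

The printed value can then be appended to both extraClassPath properties, next to the connector JAR files.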
-
Enter a Reason for change, and then click Save Changes
to commit the changes.
-
Restart the role and service when Cloudera Manager prompts you to
restart.
- Enable your IDE by adding the following dependency to your
build:
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-spark</artifactId>
  <version>[***VERSION EXAMPLE: 6.0.0.7.2.10.0-297***]</version>
  <scope>provided</scope>
</dependency>
- Build a Spark or Spark3 application using the dependencies that you provide when
you run your application. If you followed the previous instructions, Cloudera
Manager automatically configures the connector. If you did not, add the
necessary parameters to the command line when running the
spark-submit or spark3-submit command.
--conf spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase-spark-protocol-shaded.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/scala-library.jar --conf spark.driver.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase-spark-protocol-shaded.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors/lib/scala-library.jar
For a Spark3 application, the equivalent command looks like the following example:
spark3-shell --conf spark.driver.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3-protocol-shaded.jar:/etc/hbase/conf/ --conf spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3.jar:/opt/cloudera/parcels/CDH/lib/hbase_connectors_for_spark3/lib/hbase-spark3-protocol-shaded.jar:/etc/hbase/conf/
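Before passing a long extraClassPath value on the command line, it can help to confirm that every entry exists on the host. The following is a minimal sketch of such a check, not a Cloudera-provided tool. It assumes that entries are separated by colons, and the function name is hypothetical.

```shell
# Sketch: print each classpath entry that does not exist on this host.
#   $1  colon-separated classpath value
# Returns 0 when every entry exists and 1 when at least one is missing.
# The body runs in a subshell so the IFS and noglob changes do not leak.
missing_classpath_entries() (
  IFS=:      # split the classpath on colons
  set -f     # do not expand glob characters in paths
  status=0
  for entry in $1; do
    [ -n "$entry" ] || continue
    if [ ! -e "$entry" ]; then
      printf 'missing: %s\n' "$entry"
      status=1
    fi
  done
  exit "$status"
)

# Example call (path is a placeholder):
#   missing_classpath_entries "/etc/hbase/conf/:/path/to/connector.jar"
```

Because the value is split only on colons, two paths joined by any other separator are reported as a single missing entry, which makes separator mistakes easy to spot.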