You can configure Spark in your Cloudera Data Engineering cluster to interact with
Cloudera Operational Database. You can use this integration to read from and write to Cloudera Operational Database
from Spark on Cloudera Data Engineering using the spark-hbase connector.
If you want to use Phoenix instead of HBase, see Cloudera Operational Database-Cloudera Data Engineering using
Phoenix.
-
Download the Cloudera Operational Database client configurations.
For Spark in Cloudera Data Engineering to connect to Cloudera Operational Database, it needs the
hbase-site.xml configuration of the Cloudera Operational Database cluster. Complete
the following steps to download it.
-
Go to the Cloudera Operational Database UI and click on the test-cod
database.
-
On the database details page, copy the command under the
HBase Client Configuration URL field.
curl -f -o "hbase-config.zip" -u "<YOUR WORKLOAD USERNAME>" "https://cod--4wfxojpfxmwg-gateway.XXXXXXXXX.cloudera.site/clouderamanager/api/v41/clusters/cod--4wfxojpfxmwg/services/hbase/clientConfig"
- Make sure that you provide your workload password when the
curl command prompts for it.
- Extract the downloaded zip file to obtain the
hbase-site.xml file, for example as shown below.
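For instance, a minimal extraction from the command line could look like the following; the directory layout inside the archive can vary, so locate hbase-site.xml after extracting.
unzip hbase-config.zip -d hbase-config
find hbase-config -name hbase-site.xml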
-
Create the HBase table.
Create a new HBase table inside the Cloudera Operational Database database using the
Hue link.
-
Go to the Cloudera Operational Database UI and click on the test-cod
database.
-
Click on the Hue link under the SQL
EDITOR field.
-
On the Hue UI, click the HBase menu item on the
left navigation panel. Click New Table.
-
Choose a table name and column families, and click
Submit. For example, use
the table name testtable with a single column family
testcf. An equivalent HBase shell command is shown below.
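If you prefer the HBase shell to the Hue UI, you can create the same example table from an HBase client session connected to the database with the following command.
create 'testtable', 'testcf'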
-
Configure the job using Cloudera Data Engineering CLI.
-
Configure the Cloudera Data Engineering CLI to point to the virtual cluster created in the
previous step. For more details, see Configuring the CLI
client. A sample CLI configuration is sketched below.
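As a minimal sketch, assuming the default configuration file location ~/.cde/config.yaml and the user and vcluster-endpoint keys described in the CLI documentation, the configuration could look like the following; replace the placeholders with your workload user name and the Jobs API URL of your virtual cluster.
user: <YOUR WORKLOAD USERNAME>
vcluster-endpoint: <JOBS API URL OF YOUR VIRTUAL CLUSTER>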
-
Create a Cloudera Data Engineering resource using the following command.
cde resource create --name cod-spark-resource
-
Upload hbase-site.xml.
cde resource upload --name cod-spark-resource --local-path /your/path/to/hbase-site.xml --resource-path conf/hbase-site.xml
-
Upload the demo application JAR that was built earlier; a sketch of such an application is shown below.
cde resource upload --name cod-spark-resource --local-path /path/to/your/spark-hbase-project.jar --resource-path spark-hbase-project.jar
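The demo application itself is built in an earlier section. As a rough illustration only, a main class for such an application could look like the following sketch; the object name, the tiny sample DataFrame, and the column mapping are assumptions for the testtable/testcf example above, not the exact code of the demo application.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.sql.SparkSession

object SparkHBaseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-hbase-demo").getOrCreate()
    import spark.implicits._

    // hbase-site.xml is picked up from the classpath; see the extraClassPath
    // settings in the job definition below.
    val conf = HBaseConfiguration.create()
    // Creating the HBaseContext registers it for use by the connector.
    val hbaseContext = new HBaseContext(spark.sparkContext, conf)

    // Write a tiny DataFrame into the testtable/testcf table created above.
    val df = Seq(("row1", "value1"), ("row2", "value2")).toDF("key", "value")
    df.write
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", "testtable")
      .option("hbase.columns.mapping", "key STRING :key, value STRING testcf:value")
      .save()

    spark.stop()
  }
}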
-
Create the Cloudera Data Engineering job using a JSON definition.
{
  "mounts": [
    {
      "resourceName": "cod-spark-resource"
    }
  ],
  "name": "my-cod-spark-job",
  "spark": {
    "className": "<YOUR MAIN CLASS>",
    "conf": {
      "spark.executor.extraClassPath": "/app/mount/conf",
      "spark.driver.extraClassPath": "/app/mount/conf"
    },
    "args": ["<YOUR ARGS IF ANY>"],
    "driverCores": 1,
    "driverMemory": "1g",
    "executorCores": 1,
    "executorMemory": "1g",
    "file": "spark-hbase-project.jar",
    "pyFiles": [],
    "files": ["conf/hbase-site.xml"],
    "numExecutors": 4
  }
}
-
Import the job using the following command, assuming that the above
JSON is saved as my-job-definition.json.
cde job import --file my-job-definition.json
The spark.driver.extraClassPath and
spark.executor.extraClassPath settings in the job definition
point to the directory into which hbase-site.xml was uploaded in the Cloudera Data Engineering
resource (the conf resource path, mounted under /app/mount).
This way, hbase-site.xml is loaded from the
classpath automatically and you do not need to refer to it explicitly in your Spark code; you can
simply create the HBase configuration and context as follows.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
val conf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(spark.sparkContext, conf)
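After the import, you can trigger the job from the CLI. Assuming the job name from the definition above, run:
cde job run --name my-cod-spark-job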