Configure data engineering Spark to use with COD

You can configure Spark in your Data Engineering cluster to interact with the Cloudera Operational Database (COD). You can use this integration to read from and write to COD from Spark on CDE (Cloudera Data Engineering) using the spark-hbase connector.

If you want to leverage Phoenix instead of HBase, see COD-CDE using Phoenix.

  • COD is already provisioned and the database is created. For more information, see Onboarding COD users.

    For this example, let us assume test-cod as the database name.

  • CDE is already provisioned and the virtual cluster is already created. For more information, see Cloudera Data Engineering service.
  1. Download the COD client configurations.
    For Spark in CDE to connect to COD, it requires the hbase-site.xml configuration file of the COD cluster. Perform the following steps to obtain it.
    1. Go to the COD UI and click on the test-cod database.
    2. Go to the Connect > HBase tab of the COD database, and copy the command under the HBase Client Configuration URL field.
      curl -f -o "hbase-config.zip" -u "<YOUR WORKLOAD USERNAME>" "https://cod--4wfxojpfxmwg-gateway.XXXXXXXXX.cloudera.site/clouderamanager/api/v41/clusters/cod--4wfxojpfxmwg/services/hbase/clientConfig"
      • Ensure that you provide your workload password when the curl command prompts for it.
      • Extract the downloaded zip file to obtain the hbase-site.xml file.
  2. Create the HBase table.
    Create a new HBase table inside the COD database using the Hue link.
    1. Go to the COD UI and click on the test-cod database.
    2. Click on the Hue link under the SQL EDITOR field.
    3. On the Hue UI, click the HBase menu item on the left navigation panel. Click New Table.
    4. Choose a table name and column families, and click Submit. For example, use the table name testtable and a single column family testcf.
  3. Configure the job using CDE CLI.
    1. Configure the CDE CLI to point to the virtual cluster that you created earlier. For more details, see Configuring the CLI client.
    2. Create a resource using the following command.
      cde resource create --name cod-spark-resource
    3. Upload hbase-site.xml.
      cde resource upload --name cod-spark-resource --local-path /your/path/to/hbase-site.xml --resource-path conf/hbase-site.xml
    4. Upload the demo application jar that was built earlier. A minimal sketch of such an application is shown at the end of this section.
      cde resource upload --name cod-spark-resource --local-path /path/to/your/spark-hbase-project.jar --resource-path spark-hbase-project.jar
    5. Create the CDE job using a JSON definition.
      {
        "mounts": [
          {
            "resourceName": "cod-spark-resource"
          }
        ],
        "name": "my-cod-spark-job",
        "spark": {
          "className": "<YOUR MAIN CLASS>",
          "conf": {
            "spark.executor.extraClassPath": "/app/mount/conf",
            "spark.driver.extraClassPath": "/app/mount/conf"
          },
          "args": [ "<YOUR ARGS IF ANY>" ],
          "driverCores": 1,
          "driverMemory": "1g",
          "executorCores": 1,
          "executorMemory": "1g",
          "file": "spark-hbase-project.jar",
          "pyFiles": [],
          "files": ["conf/hbase-site.xml"],
          "numExecutors": 4
        }
      }
    6. Import the job using the following command, assuming that the above JSON is saved as my-job-definition.json.
      cde job import --file my-job-definition.json

The spark.driver.extraClassPath and spark.executor.extraClassPath settings in the job definition point to the same path that is used to upload the hbase-site.xml into the CDE resource.

This way, the hbase-site.xml is automatically loaded from the classpath, and you do not need to refer to it explicitly in your Spark code. You can create the HBase configuration as follows.

val conf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(spark.sparkContext, conf)
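
Building on the snippet above, the following is a minimal sketch of what a complete demo application (the main class referenced in the job definition) might look like, assuming the testtable table and testcf column family created earlier. The object name SparkHBaseDemo, the column qualifier col1, and the sample rows are illustrative placeholders only, not part of the original example.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Put, Scan}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

// Hypothetical demo application; object, qualifier, and row names are placeholders.
object SparkHBaseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cod-spark-hbase-demo").getOrCreate()

    // hbase-site.xml is resolved from the classpath via the extraClassPath settings,
    // so no explicit ZooKeeper quorum configuration is needed here.
    val conf = HBaseConfiguration.create()
    val hbaseContext = new HBaseContext(spark.sparkContext, conf)

    // Write a few sample rows into testtable under the testcf column family.
    val rows = spark.sparkContext.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
    hbaseContext.bulkPut[(String, String)](
      rows,
      TableName.valueOf("testtable"),
      (row: (String, String)) => {
        val put = new Put(Bytes.toBytes(row._1))
        put.addColumn(Bytes.toBytes("testcf"), Bytes.toBytes("col1"), Bytes.toBytes(row._2))
        put
      }
    )

    // Read the rows back and print their row keys.
    val readRdd = hbaseContext.hbaseRDD(TableName.valueOf("testtable"), new Scan())
    readRdd.map(r => Bytes.toString(r._1.copyBytes())).collect().foreach(println)

    spark.stop()
  }
}

Package an application along these lines into the jar that you upload with the cde resource upload command, and set its object name as the className in the job definition.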