Running Phoenix and HBase Spark Applications using CDE

Cloudera Data Engineering (CDE) supports running Spark applications in the CDP public cloud. You can use CDE to run Phoenix and HBase Spark applications against Cloudera Operational Database (COD).

  • Set your CDP workload password. For more information, see Setting the workload password.
  • Synchronize users from the User Management Service in the CDP Control Plane into the environment in which your COD database is running.
  • Ensure that the CDE service is enabled and a virtual cluster is created in the Data Engineering experience. For more information, see Enabling a Cloudera Data Engineering service and Creating virtual clusters.
  1. Set HBase and Phoenix versions in your Maven project.
    1. Use the describe-client-connectivity command to look up the HBase and Phoenix version information. The following snippet fetches the database connectivity information and parses out the HBase and Phoenix versions that you need to build your application.
      echo "HBase version"
      cdp opdb describe-client-connectivity --database-name my-database --environment-name my-env | jq ".connectors[] | select(.name == \"hbase\") | .version"
      echo "Phoenix Connector Version"
      cdp opdb describe-client-connectivity --database-name my-database --environment-name my-env | jq ".connectors[] | select(.name == \"phoenix-thick-jdbc\") | .version"
      HBase Version
      2.4.6.7.2.14.0-133
      Phoenix Spark Version
      "6.0.0.7.2.14.0-133"
    2. Update the HBase and Phoenix connector versions in your Maven project or configuration.
      <properties>
      ...
          <phoenix.connector.version>6.0.0.7.2.14.0-133</phoenix.connector.version>
          <hbase.version>2.4.6.7.2.14.0-133</hbase.version>
      ...
      </properties>
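      These properties are typically referenced from the dependency declarations in the same POM. The following is a minimal sketch that assumes the Apache groupIds and the artifact names used later in this procedure; adjust the coordinates to match your project:
      <dependencies>
      ...
          <!-- HBase shaded MapReduce artifact, versioned by the property above (groupId assumed) -->
          <dependency>
              <groupId>org.apache.hbase</groupId>
              <artifactId>hbase-shaded-mapreduce</artifactId>
              <version>${hbase.version}</version>
          </dependency>
          <!-- Phoenix Spark connector, versioned by the property above (groupId assumed) -->
          <dependency>
              <groupId>org.apache.phoenix</groupId>
              <artifactId>phoenix5-spark-shaded</artifactId>
              <version>${phoenix.connector.version}</version>
          </dependency>
      ...
      </dependencies>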
  2. Download the hbase-site.xml and hbase-omid-client-config.yml configuration files.
    1. Use the describe-client-connectivity command to determine the client configuration URL.
      cdp opdb describe-client-connectivity --database-name spark-connector --environment-name cod-7213 | jq ".connectors[] | select(.name == \"hbase\") | .configuration.clientConfigurationDetails[] | select(.name == \"HBASE\") | .url "
      "https://cod--XXXXXX-gateway0..xcu2-8y8x.dev.cldr.work/clouderamanager/api/v41/clusters/XXXXX/services/hbase/clientConfig"
    2. Using the URL obtained from the previous command, run the curl command to download the HBase configuration files.
      curl -f -o "hbase-config.zip" -u "<csso_user>" "https://cod--XXXXXX-gateway0.cod-7213....xcu2-8y8x.dev.cldr.work/clouderamanager/api/v41/clusters/cod--XXXX/services/hbase/clientConfig"
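      The two commands can also be chained so that you never copy the URL by hand; a minimal sketch using the same database and environment names as above:
      CONFIG_URL=$(cdp opdb describe-client-connectivity --database-name spark-connector --environment-name cod-7213 | jq -r ".connectors[] | select(.name == \"hbase\") | .configuration.clientConfigurationDetails[] | select(.name == \"HBASE\") | .url")
      curl -f -o "hbase-config.zip" -u "<csso_user>" "${CONFIG_URL}"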
    3. Unzip hbase-config.zip, and copy hbase-site.xml and hbase-omid-client-config.yml to the src/main/resources path in your Maven project.
      unzip hbase-config.zip
      cp hbase-conf/hbase-site.xml <path to src/main/resources>
      cp hbase-conf/hbase-omid-client-config.yml <path to src/main/resources>
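      If you are unsure of the archive layout, list it first; the client configuration typically extracts into an hbase-conf directory containing both files:
      unzip -l hbase-config.zip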
  3. Build the project.
    $ mvn package
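    The upload commands in the next step read the connector jars from the target/connector-libs directory. If your project does not already populate that directory, the maven-dependency-plugin is one way to do it; the following is a minimal sketch, not the only possible setup (the artifact list is assumed from the jar names used later in this procedure):
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <executions>
            <execution>
                <!-- Copy the connector jars into target/connector-libs during package -->
                <id>copy-connector-libs</id>
                <phase>package</phase>
                <goals>
                    <goal>copy-dependencies</goal>
                </goals>
                <configuration>
                    <outputDirectory>${project.build.directory}/connector-libs</outputDirectory>
                    <includeArtifactIds>hbase-shaded-mapreduce,phoenix5-spark-shaded,opentelemetry-api,opentelemetry-context</includeArtifactIds>
                </configuration>
            </execution>
        </executions>
    </plugin>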
  4. Create a CDE job.
    1. Configure CDE CLI to point to the virtual cluster. For more information, see Downloading the Cloudera Data Engineering command line interface.
    2. Create a resource using the following command.
      cde resource create --name phoenix-spark-app-resource
    3. Upload the required jars that the build downloaded to the target/connector-libs directory.
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/connector-libs/hbase-shaded-mapreduce-2.4.6.7.2.14.0-133.jar --resource-path hbase-shaded-mapreduce-2.4.6.7.2.14.0-133.jar
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/connector-libs/opentelemetry-api-0.12.0.jar --resource-path opentelemetry-api-0.12.0.jar
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/connector-libs/opentelemetry-context-0.12.0.jar --resource-path opentelemetry-context-0.12.0.jar
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/connector-libs/phoenix5-spark-shaded-6.0.0.7.2.14.0-133.jar --resource-path phoenix5-spark-shaded-6.0.0.7.2.14.0-133.jar
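      The four uploads can also be scripted as a loop over the same jar list; a minimal sketch:
      for jar in hbase-shaded-mapreduce-2.4.6.7.2.14.0-133.jar \
                 opentelemetry-api-0.12.0.jar \
                 opentelemetry-context-0.12.0.jar \
                 phoenix5-spark-shaded-6.0.0.7.2.14.0-133.jar; do
        cde resource upload --name phoenix-spark-app-resource --local-path "./target/connector-libs/${jar}" --resource-path "${jar}"
      done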
      
    4. Upload the Spark application jar that you built earlier.
      cde resource upload --name phoenix-spark-app-resource --local-path ./target/phoenix-spark-transactions-0.1.0.jar --resource-path phoenix-spark-transactions-0.1.0.jar
    5. Replace the HBase, Phoenix, and Phoenix Spark connector versions in spark-job.json, as shown in the following sample, and then create a CDE job using the cde job import command.
      {
         "mounts":[
            {
               "resourceName":"phoenix-spark-app-resource"
            }
         ],
         "name":"phoenix-spark-app",
         "spark":{
            "className":"com.cloudera.cod.examples.spark.SparkApp",
            "args":[
               "{{ phoenix_jdbc_url }}"
            ],
            "driverCores":1,
            "driverMemory":"1g",
            "executorCores":1,
            "executorMemory":"1g",
            "file":"phoenix-spark-transactions-0.1.0.jar",
            "pyFiles":[
               
            ],
            "files":[
              "hbase-shaded-mapreduce-2.4.6.7.2.14.0-133.jar",
              "opentelemetry-api-0.12.0.jar",
              "opentelemetry-context-0.12.0.jar"
              "phoenix5-spark-shaded-6.0.0.7.2.14.0-133.jar",
            ],
            "numExecutors":4
         }
      }
      cde job import --file spark-job.json
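      After the import succeeds, you can verify the job definition before running it; a quick check, assuming your CDE CLI version provides the job describe subcommand:
      cde job describe --name phoenix-spark-app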
      
  5. Run the job.
    1. Use the describe-client-connectivity command to determine the base JDBC URL to pass.
      cdp opdb describe-client-connectivity --database-name my-database --environment-name my-env | jq ".connectors[] | select(.name == \"phoenix-thick-jdbc\") | .configuration.jdbcUrl"
    2. Run the job, passing the JDBC URL obtained from the previous command as an argument to the job.
      cde job run --name phoenix-spark-app --variable phoenix_jdbc_url=<phoenix_jdbc_url>
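      The two steps can also be combined in a small script; a minimal sketch using the same database and environment names as in step 1, with jq -r stripping the quotes so that the URL is passed cleanly:
      PHOENIX_JDBC_URL=$(cdp opdb describe-client-connectivity --database-name my-database --environment-name my-env | jq -r ".connectors[] | select(.name == \"phoenix-thick-jdbc\") | .configuration.jdbcUrl")
      cde job run --name phoenix-spark-app --variable phoenix_jdbc_url="${PHOENIX_JDBC_URL}"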