Configuring Cloudera Octopai Connector for Apache Spark

Learn how to install and configure the Spline-based Cloudera Octopai Data Lineage Connector for Apache Spark to capture metadata lineage automatically.

License requirement: Ensure Spark is included in your Cloudera Octopai subscription before proceeding.
Figure 1. Cloudera Octopai Connector architecture overview

Supported capabilities

The connector captures lineage under the following conditions:
  • Spline agent lineage – Lineage capture is limited to what the Spline agent can parse from Spark SQL execution plans.

  • Active jobs – Only running or newly executed jobs are collected.

  • Persistent actions – Read and write operations that touch persistent storage (tables or files) are recorded.

  • Cluster configuration – Spark must be configured with the Spline properties in spark-defaults.conf.

  • Explicit application name – For job names to appear in lineage, the job must set the application name explicitly:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("Spark UDF Example") \
        .getOrCreate()
  • Customer-managed environments – Spark clusters are deployed and managed within the customer environment.

Limitations

The following constraints apply to the connector:
  • Successful jobs only – Lineage is generated for jobs that finish without errors.

  • Persistent storage focus – Operations that remain in-memory are excluded from lineage capture.

  • Named jobs required – Jobs without an explicit name produce lineage records without a meaningful identifier.

  • Kerberos support – Kerberos and delegation tokens are not yet supported; use basic authentication when sending lineage to the Spline server.

  • Spline parsing scope – Only Spark operations that Spline supports will appear in lineage.

  • Streaming jobs – Spark Structured Streaming workloads (for example, Kafka flows) are not captured.

  • Partial execution – Only code paths that are executed (for example, a conditional branch that runs) appear in lineage.

  • User-defined functions – UDF logic is not parsed, although UDF invocations appear in the execution plan.

Prerequisites

Before starting the installation, ensure the following:

  • A running Spark cluster (Spark 2.x or 3.x)
  • Access to HDFS for storing lineage files
  • Cloudera Manager or similar access to configure Spark cluster properties
  • Access permissions to upload JAR files to HDFS and edit Spark configurations

Installation steps

  1. Clone the Cloudera Octopai customized Spline Agent.

    Clone the repository from the Cloudera Octopai customized branch:

    git clone https://github.com/OCTOPAILTD/spline-spark-agent.git
    cd spline-spark-agent
    git checkout OCT-27187_Enable_writing_to_files
  2. Build the Spline Agent bundle.

    Navigate to the bundle folder that matches your Spark version. The JAR is shipped with the Cloudera Octopai Agent, and you must build and upload the JAR that corresponds to your Spark version.

    For example, for Spark 3.5:

    cd bundle-3.5
    mvn clean package

    After the build, you will find the Spline Agent JAR file under:

    bundle-3.5/target/spark-3.5-spline-agent-bundle_2.12-2.2.1.jar
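
The bundle folder and JAR name follow a predictable pattern built from the Spark, Scala, and agent versions. As a convenience, a hypothetical helper (not part of the Spline agent) can derive the expected paths; the Scala and agent versions shown as defaults are taken from the example above and may differ in your checkout:

```python
# Hypothetical helper: derive the Spline bundle folder and built JAR path
# from the Spark, Scala, and agent versions. Versions other than the
# documented example (Spark 3.5 / Scala 2.12 / agent 2.2.1) are assumptions.
def spline_bundle_paths(spark_version: str,
                        scala_version: str = "2.12",
                        agent_version: str = "2.2.1") -> tuple:
    bundle_dir = f"bundle-{spark_version}"
    jar = (f"spark-{spark_version}-spline-agent-bundle"
           f"_{scala_version}-{agent_version}.jar")
    return bundle_dir, f"{bundle_dir}/target/{jar}"

bundle_dir, jar_path = spline_bundle_paths("3.5")
print(jar_path)  # → bundle-3.5/target/spark-3.5-spline-agent-bundle_2.12-2.2.1.jar
```

Verify the actual JAR name under target/ after the build, since the agent version in your branch may not match the default above.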
  3. Upload the JAR to HDFS.

    Upload the built Spline Agent JAR file to your HDFS /tmp folder:

    hdfs dfs -put spark-3.5-spline-agent-bundle_2.12-2.2.1.jar /tmp/
  4. Configure Spark defaults.

    Add the following properties to your Spark cluster configuration (spark-defaults.conf) through Cloudera Manager or equivalent:

    spark.jars=hdfs:///tmp/spark-3.5-spline-agent-bundle_2.12-2.2.1.jar
    spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
    spark.spline.mode=ENABLED
    spark.spline.lineageDispatcher=hdfs
    spark.spline.lineageDispatcher.hdfs.className=za.co.absa.spline.harvester.dispatcher.HDFSLineageDispatcher
    spark.spline.lineageDispatcher.hdfs.directory=hdfs:///tmp/spline
    spark.driver.memory=4g
    Figure 2. Sample Spark defaults configuration
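
A typo in any of these property names silently disables lineage capture. A minimal sanity-check sketch, assuming the property list above is the complete set required for this setup (it is not an exhaustive Spline reference):

```python
# Check a spark-defaults.conf fragment for the Spline properties listed
# in this guide. The required-key set mirrors the configuration above.
REQUIRED_KEYS = {
    "spark.jars",
    "spark.sql.queryExecutionListeners",
    "spark.spline.mode",
    "spark.spline.lineageDispatcher",
    "spark.spline.lineageDispatcher.hdfs.className",
    "spark.spline.lineageDispatcher.hdfs.directory",
}

def missing_spline_keys(conf_text: str) -> set:
    present = set()
    for line in conf_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            present.add(line.split("=", 1)[0].strip())
    return REQUIRED_KEYS - present

sample = """
spark.spline.mode=ENABLED
spark.driver.memory=4g
"""
print(missing_spline_keys(sample))  # reports the five Spline keys still missing
```

Run it against your actual spark-defaults.conf content before restarting the cluster.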
  5. Create the HDFS lineage directory.

    Create the directory where lineage files will be written and set permissions:

    hdfs dfs -mkdir /tmp/spline
    hdfs dfs -chown hive /tmp/spline
    Figure 3. HDFS directory creation example
  6. Set permissions.

    Ensure that the user running the Spark jobs has permission to write lineage files to /tmp/spline in HDFS; typically, run the jobs as a user that owns /tmp/spline or otherwise has write access to it.

Verification

After completing the installation, verify the following:

  • Spline Agent JAR is built and uploaded to HDFS.
  • Spark cluster configuration is updated with Spline properties.
  • /tmp/spline folder is created and write access is configured.
  • Spark cluster is restarted or configuration is refreshed.
  • Test Spark jobs are producing lineage files in /tmp/spline.
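
To confirm the last item, list the lineage directory with hdfs dfs -ls /tmp/spline after running a test job. If you want to check the listing programmatically, a small sketch that extracts file paths from hdfs dfs -ls output; the file names in the sample are illustrative only, since actual Spline output file names may differ:

```python
# Extract file paths from `hdfs dfs -ls` output: skip the "Found N items"
# header and directory entries, and take the last field of each file line.
def lineage_files(ls_output: str) -> list:
    paths = []
    for line in ls_output.splitlines():
        line = line.strip()
        if not line or line.startswith("Found"):
            continue
        if line.startswith("d"):  # directory entry
            continue
        paths.append(line.split()[-1])
    return paths

# Illustrative listing, not real Spline output.
sample = """Found 2 items
-rw-r--r--   3 hive hive   4096 2024-05-01 10:15 /tmp/spline/lineage-0001.json
-rw-r--r--   3 hive hive   4096 2024-05-01 10:16 /tmp/spline/lineage-0002.json
"""
print(lineage_files(sample))
# → ['/tmp/spline/lineage-0001.json', '/tmp/spline/lineage-0002.json']
```

An empty result after a successful test job usually points back to the configuration or permissions steps above.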