CDS 3.5 Powered by Apache Spark Overview

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala. This document describes CDS 3.5 Powered by Apache Spark. CDS (Cloudera Distribution of Spark) enables you to install and evaluate the features of Apache Spark 3 without upgrading your CDP Private Cloud Base cluster.

For detailed API information, see the Apache Spark project site.

CDS 3.5 Powered by Apache Spark is an add-on service for CDP Private Cloud Base, distributed as a parcel. The Cloudera Service Descriptor file is available in Cloudera Manager for CDP 7.1.9.

On CDP Private Cloud Base, a Spark 3 service can coexist with the existing Spark 2 service. The configurations of the two services do not conflict and both services use the same YARN service. The port of the Spark History Server is 18088 for Spark 2 and 18089 for Spark 3.
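Because the two History Servers listen on different ports, tooling that links to job history has to choose the port by Spark version. The following is a minimal sketch of that lookup, not a Cloudera-provided utility. The function name, the host name, and the plain `http` scheme are assumptions made for illustration (a TLS-enabled cluster would use a different scheme and possibly different ports); only the two port numbers come from the text above.

```python
# Ports are taken from the paragraph above; everything else is illustrative.
HISTORY_SERVER_PORTS = {2: 18088, 3: 18089}


def history_server_url(host: str, spark_major_version: int) -> str:
    """Return the Spark History Server URL for the given Spark major version.

    Assumes an unencrypted (http) endpoint; adjust for TLS-enabled clusters.
    """
    try:
        port = HISTORY_SERVER_PORTS[spark_major_version]
    except KeyError:
        raise ValueError(
            f"unsupported Spark major version: {spark_major_version}"
        ) from None
    return f"http://{host}:{port}"


if __name__ == "__main__":
    # "history.example.com" is a hypothetical host, not a real cluster address.
    for version in (2, 3):
        print(history_server_url("history.example.com", version))
```

Keeping the mapping in one place means that a cluster with customized ports only needs the dictionary changed.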

CDS 3.5 for GPUs

The RAPIDS Accelerator for Apache Spark is not supported with CDS 3.5.

Unsupported connectors

This release does not support the following connector:

SparkR

Unsupported features

This release does not support the following features:

Hudi
Push-based shuffle
The use of Iceberg tables is not supported in Spark Streaming
Read/write operations on Hive bucketed tables are not supported
Spark RAPIDS
Spark Connect
The HBase Connector's Atlas lineage generation is not supported in the Spark-Atlas Connector
The Hive Warehouse Connector's Atlas lineage generation is not supported in the Spark-Atlas Connector
Atlas lineage is not captured for MERGE INTO operations on Iceberg tables

Limitations of Spark in CDP

Compared with Apache Spark 3.5, Spark in CDP has the following limitation:

The spark.sql.orc.compression.codec configuration does not accept the zstd value.
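Jobs ported from upstream Apache Spark 3.5 may request zstd for ORC output. One way to handle this limitation is to substitute another codec before the setting reaches Spark. The sketch below is only an illustration: the helper name and the choice of snappy as the fallback are assumptions, and the set of upstream codec names is taken from the Apache Spark 3.5 configuration documentation rather than from this release.

```python
# Codec names documented upstream for spark.sql.orc.compression.codec
# (assumed from the Apache Spark 3.5 configuration documentation).
UPSTREAM_ORC_CODECS = {"none", "uncompressed", "snappy", "zlib", "lzo", "zstd", "lz4"}

# Per the limitation above, zstd is not accepted in CDP.
CDP_REJECTED_ORC_CODECS = {"zstd"}

CONF_KEY = "spark.sql.orc.compression.codec"


def orc_codec_conf(requested: str, fallback: str = "snappy") -> str:
    """Build a key=value setting, replacing a rejected codec with a fallback.

    The fallback default (snappy) is an illustrative choice, not a
    Cloudera recommendation.
    """
    codec = requested.strip().lower()
    if codec not in UPSTREAM_ORC_CODECS:
        raise ValueError(f"unknown ORC codec: {requested!r}")
    if codec in CDP_REJECTED_ORC_CODECS:
        codec = fallback
    return f"{CONF_KEY}={codec}"


if __name__ == "__main__":
    # The resulting string can be passed to spark-submit as: --conf <value>
    print(orc_codec_conf("zstd"))
    print(orc_codec_conf("zlib"))
```

Substituting a codec changes file sizes and read/write cost, so a job that depends on zstd specifically should fail loudly instead of falling back silently.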