Support for Spark Structured Streaming in Cloudera Data Engineering (Technical Preview)

Understand the supported features and limitations related to Spark Structured Streaming in Cloudera Data Engineering (CDE). Spark Structured Streaming is compatible with both Spark 2 and Spark 3.

Spark Structured Streaming is supported with the following limitations:
  • For checkpointing on S3 for Amazon Web Services (AWS), Cloudera recommends using the abortable stream-based checkpoint file manager by passing the following configuration:
    • --conf spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.internal.io.cloud.AbortableStreamBasedCheckpointFileManager
  • Authentication with delegation tokens is supported for Kafka. To use it, pass the configuration that matches your Spark version:
    • For Spark 2, pass the following configuration: --conf spark.kafka.bootstrap.servers=<kafka_broker_list>
    • For Spark 3, pass the following configuration, where <clusterName> is an arbitrary name used for grouping the configuration: --conf spark.kafka.clusters.<clusterName>.auth.bootstrap.servers=<kafka_broker_list>
  • Hive Warehouse Connector (HWC) is not supported.
  • Cloudera Observability does not report information on Spark Streaming jobs.
  • The visual profiling and CDE deep analysis features are not supported.
  • Checkpointing with Amazon Elastic File System (EFS) is not supported.
  • Spark dynamic allocation is not supported.

    You can use spark.streaming.dynamicAllocation, but this option is available only for Discretized Streams (DStreams).

  • S3 checkpointing may have issues with Cloudera Runtime 7.2.8 or earlier.
  • When running streaming jobs with Kafka, set the auto.offset.reset parameter to "latest" to avoid out-of-memory errors.
  • Schema Registry with Spark 3 is not supported.
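
The checkpointing and Kafka configurations listed above can be assembled programmatically before a job is submitted. The following is a minimal sketch in Python, not part of CDE: the helper name streaming_conf_args, its parameters, the default cluster name, and the broker addresses are hypothetical and used only for illustration. It emits only the --conf properties named in this topic.

```python
def streaming_conf_args(spark_major, kafka_brokers,
                        cluster_name="cluster1", s3_checkpoint=True):
    """Return a list of --conf arguments for a streaming job.

    Hypothetical helper: the property names come from the list above,
    everything else (function name, parameters, defaults) is assumed.
    """
    if spark_major not in (2, 3):
        raise ValueError("spark_major must be 2 or 3")

    confs = {}

    # Recommended checkpoint file manager for checkpoints stored on S3 (AWS).
    if s3_checkpoint:
        confs["spark.sql.streaming.checkpointFileManagerClass"] = (
            "org.apache.spark.internal.io.cloud."
            "AbortableStreamBasedCheckpointFileManager"
        )

    # Kafka delegation-token authentication: the property name depends
    # on the Spark major version.
    if spark_major == 2:
        confs["spark.kafka.bootstrap.servers"] = kafka_brokers
    else:
        # cluster_name is an arbitrary label that groups the settings.
        key = f"spark.kafka.clusters.{cluster_name}.auth.bootstrap.servers"
        confs[key] = kafka_brokers

    # Flatten into ["--conf", "key=value", ...] in insertion order.
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Placeholder broker list; replace with your own Kafka brokers.
    print(" ".join(streaming_conf_args(3, "broker1:9093,broker2:9093",
                                       cluster_name="demo")))
```

The returned list can be appended to whatever job submission command you already use. Keeping the Spark 2 and Spark 3 property names in one place makes it harder to pass the wrong one for a given Spark version.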