CDS 2 Powered by Apache Spark Overview

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala.

For detailed API information, see the Apache Spark project site.

CDS 2 Powered by Apache Spark consists of Spark core and several related projects:

Spark SQL
Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
Spark Streaming
API that allows you to build scalable fault-tolerant streaming applications.
MLlib
API that implements common machine learning algorithms.
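
As a rough illustration of mixing SQL queries with a Spark program, the following Python sketch filters data with a SQL statement and then continues with DataFrame calls. The function name `adult_names`, the view name `people`, and the `name`/`age` columns are hypothetical and chosen only for this example. It assumes the caller passes in an existing Spark 2 `SparkSession`.

```python
def adult_names(spark, people):
    """Return the names of people aged 18 or over, sorted alphabetically.

    `spark` is assumed to be an existing SparkSession; `people` is a list of
    (name, age) tuples. Both the schema and the view name are illustrative.
    """
    # Build a DataFrame from plain Python tuples.
    df = spark.createDataFrame(people, ["name", "age"])
    # Register the DataFrame under a name that SQL text can refer to.
    df.createOrReplaceTempView("people")
    # SQL step: the query result is itself a DataFrame.
    adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    # Program step: keep working on the result with the DataFrame API.
    return [row["name"] for row in adults.orderBy("name").collect()]
```

With a session obtained from `SparkSession.builder.getOrCreate()`, a call such as `adult_names(spark, [("Ana", 34), ("Bo", 12)])` should return only the first name, because the second row is removed by the SQL filter.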

Cloudera distributes two versions of Apache Spark: 1.6 and 2.0.

Spark 1.6 is included in Cloudera Enterprise 5.7.x and higher, and is documented in Cloudera Enterprise 5.7.x Documentation.

This document describes the separately released CDS 2.0.

A Spark 1.6 service can coexist with a Spark 2.0 service. The configurations of the two services do not conflict, and both services use the same YARN service. The Spark History Server listens on port 18088 for Spark 1.6 and on port 18089 for Spark 2.0.
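
When both services run side by side, scripts that link to the History Server need to pick the right port. The following Python sketch shows one way to do that using the two port numbers stated above. The helper name `history_server_url` and the host `history.example.com` are hypothetical, and the sketch assumes plain HTTP, which may not match a cluster with TLS enabled.

```python
# History Server ports for each Spark service, as stated in this document.
HISTORY_SERVER_PORTS = {"1.6": 18088, "2.0": 18089}


def history_server_url(host, spark_version):
    """Build a History Server URL for Spark version '1.6' or '2.0'.

    Assumes plain HTTP; adjust the scheme if the cluster uses TLS.
    """
    try:
        port = HISTORY_SERVER_PORTS[spark_version]
    except KeyError:
        raise ValueError("unsupported Spark version: %r" % (spark_version,))
    return "http://%s:%d" % (host, port)


# Hypothetical host name, used only for illustration.
print(history_server_url("history.example.com", "2.0"))
```

Keeping the mapping in one dictionary means that a site which overrides either port only has to change a single line.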

Unsupported Features

Consult CDS Powered by Apache Spark Known Issues for a comprehensive list of features that are not supported with CDS 2 Powered by Apache Spark.