CDS 2 Powered by Apache Spark Overview

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala.

For detailed API information, see the Apache Spark project site.

CDS 2 Powered by Apache Spark consists of Spark core and several related projects:

Spark SQL: Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
Spark Streaming: API that allows you to build scalable fault-tolerant streaming applications.
MLlib: API that implements common machine learning algorithms.

Cloudera distributes these versions of Apache Spark: 1.6, 2.0, and 2.1.

Spark 1.6 is distributed as part of Cloudera Enterprise 5.7.x and higher. For the latest Spark 1.6 documentation, see the Cloudera Enterprise documentation.

This document describes the separately released CDS 2.1 Powered by Apache Spark. Spark 2 ships separately from CDH so that you can install and upgrade it without a full upgrade of the CDH cluster.

A Spark 1.6 service can coexist with a Spark 2.1 service. The configurations of the two services do not conflict, and both services use the same YARN service. The Spark History Server port is 18088 for Spark 1.6 and 18089 for Spark 2.
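
When both services run on the same cluster, scripts that link to the History Server need to pick the port that matches the Spark line in use. The following is a minimal sketch of that lookup, not part of CDS. The function name history_server_url, the example hostname, and the plain http scheme are assumptions made for illustration; only the two port numbers come from the text above.

```python
# History Server ports by Spark major version, as stated above.
HISTORY_SERVER_PORTS = {
    "1": 18088,  # Spark 1.6
    "2": 18089,  # Spark 2 (for example 2.0 or 2.1)
}


def history_server_url(host: str, spark_version: str) -> str:
    """Build the History Server base URL for a given Spark version.

    `host` is a hypothetical hostname supplied by the caller. The http
    scheme is an assumption; a TLS-enabled cluster would differ.
    """
    # Use only the major version, so "2.0" and "2.1" share one port.
    major = spark_version.strip().split(".")[0]
    if major not in HISTORY_SERVER_PORTS:
        raise ValueError(f"Unsupported Spark version: {spark_version!r}")
    return f"http://{host}:{HISTORY_SERVER_PORTS[major]}"
```

For example, calling history_server_url("history.example.com", "2.1") returns a URL on port 18089, while passing "1.6" returns one on port 18088. Keying on the major version keeps the mapping valid for both Spark 2 releases mentioned in this document.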

Unsupported Features

Consult Spark 2 Known Issues for a comprehensive list of features that are not supported with CDS 2 Powered by Apache Spark.