Overview of CDS 2 Powered by Apache Spark
Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala.
CDS 2 Powered by Apache Spark consists of Spark core and several related projects:
- Spark SQL: Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
- Spark Streaming: API that allows you to build scalable, fault-tolerant streaming applications.
- MLlib: API that implements common machine learning algorithms.
Cloudera distributes these versions of Apache Spark: 1.6, 2.0, 2.1, and 2.2.
Spark 1.6 is distributed as part of Cloudera Enterprise 5.7.x and higher. Its latest documentation is available in the Cloudera Enterprise documentation.
This documentation describes the separately released CDS 2.2 Powered by Apache Spark. Spark 2 is shipped separately so that customers can install and upgrade it without going through a full upgrade of the CDH cluster.
A Spark 1.6 service can coexist with a Spark 2 service. The configurations of the two services do not conflict, and both services use the same YARN service. The Spark History Server port is 18088 for Spark 1.6 and 18089 for Spark 2.
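Because the two History Servers listen on different ports, scripts that link to the History Server UI need to pick the port by Spark major version. The following is a minimal sketch of that lookup in Python. The helper name `history_server_url`, the host `history.example.com`, and the plain `http` scheme are assumptions made for illustration; only the port numbers come from this page, and a cluster with TLS enabled or a customized port would need different values.

```python
# Default History Server ports as stated in this overview.
# Keys are Spark major versions: 1 stands for Spark 1.6, 2 for Spark 2.
HISTORY_SERVER_PORTS = {1: 18088, 2: 18089}


def history_server_url(host, spark_major_version):
    """Build the History Server base URL for a Spark major version.

    Hypothetical helper: assumes plain HTTP and the default ports above.
    """
    try:
        port = HISTORY_SERVER_PORTS[spark_major_version]
    except KeyError:
        raise ValueError(
            "unsupported Spark major version: %r" % (spark_major_version,)
        )
    return "http://%s:%d" % (host, port)


# The host name is a placeholder, not a real endpoint.
print(history_server_url("history.example.com", 1))
print(history_server_url("history.example.com", 2))
```

Keeping the mapping in one dictionary means that a site-specific port change touches a single line rather than every script that builds a History Server link.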
Unsupported Features
Consult CDS Powered by Apache Spark Known Issues for a comprehensive list of features that are not supported with CDS 2.