Spark Guide

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and consists of Spark core and several related projects.

You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Interactive sessions are common during the data-exploration phase and for ad hoc analysis.

To run applications distributed across a cluster, Spark requires a cluster manager. In CDH 6, Cloudera supports only the YARN cluster manager. When run on YARN, Spark application processes are managed by the YARN ResourceManager and NodeManager roles. Spark Standalone is no longer supported.
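
As a rough illustration of submitting an application to YARN, the sketch below assembles a spark-submit command line in Python. The `--master yarn` and `--deploy-mode` options are standard spark-submit flags. The helper name `build_submit_command`, the script path `wordcount.py`, and its input argument are hypothetical placeholders, not names defined by CDH.

```python
import shlex


def build_submit_command(app_path, app_args=(), deploy_mode="cluster"):
    """Assemble a spark-submit argument list that targets YARN.

    Hypothetical helper, for illustration only.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",            # YARN is the only supported cluster manager
        "--deploy-mode", deploy_mode,  # where the driver runs: client or cluster
        app_path,
        *app_args,
    ]


# Hypothetical application path and argument.
command = build_submit_command("wordcount.py", ["/tmp/input.txt"])
print(shlex.join(command))
```

On a host where spark-submit is installed, the resulting list could be passed to `subprocess.run`. For interactive work, the `pyspark` and `spark-shell` launchers accept the same `--master yarn` option.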

For detailed API information, see the Apache Spark project site.

The Apache Spark 2 service in CDH 6 consists of Spark core and several related projects:

Spark SQL
  Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
Spark Streaming
  API that allows you to build scalable fault-tolerant streaming applications.
MLlib
  API that implements common machine learning algorithms.

Cloudera Enterprise includes Spark features that roughly correspond to the feature set and bug fixes of Apache Spark 2.4. The Spark 2.x service was previously shipped as its own parcel, separate from CDH.

CDH 6 does not include the Spark 1.6 service. The Spark History Server listens on port 18088, the same port that Spark 1.6 used and a change from port 18089, which the Spark 2 parcel used.
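
To make the port change concrete, here is a minimal sketch that builds a URL for the Spark History Server. It assumes plain HTTP (no TLS) and uses a hypothetical host name. The `/api/v1/applications` path belongs to Spark's monitoring REST API, and `history_server_url` is an illustrative helper rather than part of any Spark or Cloudera library.

```python
from urllib.parse import urlunsplit

# Default History Server port in CDH 6, as stated in this guide.
HISTORY_SERVER_PORT = 18088


def history_server_url(host, path="/api/v1/applications", port=HISTORY_SERVER_PORT):
    """Build an HTTP URL for the Spark History Server (assumes no TLS)."""
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


# Hypothetical host name; substitute the host that runs the History Server role.
print(history_server_url("historyserver.example.com"))
```

Scripts or bookmarks that still point at port 18089 from the Spark 2 parcel would need the same adjustment.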

Unsupported Features

The following Spark features are not supported:

  • Apache Spark experimental features and APIs are not supported unless stated otherwise.
  • Using the JDBC Datasource API to access Hive or Impala is not supported.
  • ADLS is not supported for all Spark components. Microsoft Azure Data Lake Store (ADLS) is a cloud-based filesystem that you can access through Spark applications. Spark with Kudu is not currently supported for ADLS data. (Hive on Spark is available for ADLS in CDH 5.12 and higher.)
  • IPython / Jupyter notebooks are not supported. The IPython notebook system was renamed to Jupyter as of IPython 4.0.
  • Certain Spark Streaming features are not supported. The mapWithState method is unsupported because it is a nascent, unstable API.
  • The Thrift JDBC/ODBC server is not supported.
  • The Spark SQL CLI is not supported.
  • GraphX is not supported.
  • SparkR is not supported.
  • Structured Streaming is supported, but the following features of it are not:
    • Continuous processing, which is still experimental, is not supported.
    • Stream-static joins with HBase have not been tested and therefore are not supported.
  • The Spark cost-based optimizer (CBO) is not supported.

Consult Apache Spark Known Issues for a comprehensive list of Spark 2 features that are not supported with CDH 6.
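
Two of the restrictions in this guide can be spotted from an application's settings alone: a Spark Standalone master (a `spark://` URL) and the cost-based optimizer, which Spark SQL enables through `spark.sql.cbo.enabled`. The sketch below is a hypothetical pre-submission check, not a Cloudera tool. The function name and warning text are assumptions, and it makes no attempt to cover the rest of the list.

```python
def find_unsupported_settings(master, conf):
    """Return warnings for settings that this guide lists as unsupported.

    Hypothetical helper; it checks only two items and is not exhaustive.
    """
    warnings = []
    # Spark Standalone masters use the spark:// URL scheme.
    if master.startswith("spark://"):
        warnings.append("Spark Standalone master is not supported; use 'yarn'")
    # spark.sql.cbo.enabled is the Spark SQL switch for the cost-based optimizer.
    if str(conf.get("spark.sql.cbo.enabled", "false")).lower() == "true":
        warnings.append("spark.sql.cbo.enabled=true turns on the unsupported CBO")
    return warnings


# Example: a YARN job that tries to enable the cost-based optimizer.
for warning in find_unsupported_settings("yarn", {"spark.sql.cbo.enabled": "true"}):
    print(warning)
```

A check like this could run before the job is submitted. Restrictions that depend on application code, such as calls to mapWithState or continuous-processing triggers, would need a different approach.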