Spark Guide

Apache Spark is a general-purpose framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala, and consists of Spark Core and several related projects:

  • Spark SQL - Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
  • Spark Streaming - API that allows you to build scalable fault-tolerant streaming applications.
  • MLlib - API that implements common machine learning algorithms.
  • GraphX - API for graphs and graph-parallel computation.

You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Interactive use is most common during the data-exploration phase and for ad hoc analysis.

To run applications distributed across a cluster, Spark requires a cluster manager. Cloudera supports two cluster managers: YARN and Spark Standalone. When run on YARN, Spark application processes are managed by the YARN ResourceManager and NodeManager roles. When run on Spark Standalone, Spark application processes are managed by Spark Master and Worker roles.
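As a rough illustration of how the choice between local execution and a cluster manager shows up in practice, the following Python sketch builds the argument list for a spark-submit invocation. The helper name build_submit_command, its parameters, and the example host name are hypothetical and introduced only for this illustration. The master values local[*], yarn, and spark://HOST:PORT are standard Spark master URL forms, and 7077 is the usual default port for a Spark Standalone Master, but the accepted forms can differ between Spark releases (some older releases use yarn-client or yarn-cluster instead of yarn), so check the documentation for your version.

```python
from typing import List, Optional


def build_submit_command(
    app: str,
    target: str = "local",
    master_host: Optional[str] = None,
    master_port: int = 7077,
) -> List[str]:
    """Hypothetical helper: return a spark-submit argument list for a target.

    target is one of "local", "yarn", or "standalone" (names chosen for
    this sketch, not Spark terminology).
    """
    if target == "local":
        # Run on the local machine, using as many worker threads as cores.
        master = "local[*]"
    elif target == "yarn":
        # Processes are managed by the YARN ResourceManager and NodeManagers.
        # Assumption: the Spark version accepts "yarn" as a master value.
        master = "yarn"
    elif target == "standalone":
        # Processes are managed by the Spark Master and Worker roles.
        if master_host is None:
            raise ValueError("standalone target requires master_host")
        master = f"spark://{master_host}:{master_port}"
    else:
        raise ValueError(f"unknown target: {target!r}")
    return ["spark-submit", "--master", master, app]


if __name__ == "__main__":
    # Print one command line per deployment target.
    print(" ".join(build_submit_command("my_app.py")))
    print(" ".join(build_submit_command("my_app.py", "yarn")))
    print(" ".join(build_submit_command("my_app.py", "standalone", "master.example.com")))
```

The same master values can generally be passed to the interactive shells (for example, spark-shell or pyspark) through the --master option, so the distinction between local and cluster execution is independent of whether you work interactively or submit an application.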

Unsupported Features

The following Spark features are not supported:

  • Spark SQL:
    • spark.ml
    • ML pipeline APIs
  • Spark MLlib:
    • spark.ml
  • SparkR
  • GraphX
  • Spark on Scala 2.11
  • Mesos cluster manager