Chapter 1. Analyzing Data with Apache Spark

Hortonworks Data Platform (HDP) supports Apache Spark, a fast, large-scale data processing engine.

Deep integration of Spark with YARN allows Spark to operate as a cluster tenant alongside other Apache engines such as Hive, Storm, and HBase, all running simultaneously on a single data platform. Instead of creating and managing a set of dedicated clusters for Spark applications, you can store data in a single location, access and analyze it with multiple processing engines, and share cluster resources across those workloads.

Spark on YARN uses YARN services for resource allocation, runs Spark executors in YARN containers, and supports workload management and Kerberos security features. Spark on YARN has two deployment modes:

YARN-cluster mode, optimized for long-running production jobs
YARN-client mode, best for interactive use such as prototyping, testing, and debugging

Spark shell and the Spark Thrift server run in YARN-client mode only.
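
As a rough illustration of the two modes, the sketch below builds the spark-submit argument list for each one. It assumes the standard `--master yarn` and `--deploy-mode` options of spark-submit; the helper name `spark_submit_args` and the application file `nightly_etl.py` are hypothetical and used only for illustration.

```python
def spark_submit_args(app, deploy_mode, app_args=()):
    """Build a spark-submit argument list for Spark on YARN.

    deploy_mode is "cluster" (driver runs inside a YARN container)
    or "client" (driver runs in the local process that submits the job).
    """
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        app,
        *app_args,
    ]


# Hypothetical application file and argument, for illustration only.
batch_cmd = spark_submit_args("nightly_etl.py", "cluster", ["2017-04-01"])
debug_cmd = spark_submit_args("nightly_etl.py", "client")

print(" ".join(batch_cmd))
print(" ".join(debug_cmd))
```

The resulting list can be passed to `subprocess.run` on a host where the Spark client is installed. Cluster mode suits a scheduled job because the driver does not depend on the submitting session staying open, while client mode keeps driver output in the local terminal for debugging.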

HDP 2.6 supports Spark 1.6 and Spark 2.1 (versions 1.6.3 and 2.1.0 in HDP 2.6.0, as listed in Table 1.1). HDP 2.6 also supports Livy, which provides local and remote access to Spark through the Livy REST API.
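
To give a sense of what access through the Livy REST API looks like, the following sketch prepares an HTTP request that asks Livy to start an interactive session. It assumes the Livy `POST /sessions` endpoint with a `kind` field in the JSON body; the host name, the port, and the helper name `livy_session_request` are placeholders, and the sketch does not cover authentication on a Kerberos-enabled cluster.

```python
import json
import urllib.request


def livy_session_request(base_url, kind="pyspark"):
    """Prepare (but do not send) a request that creates a Livy session.

    kind selects the session type, for example "spark", "pyspark",
    or "sparkr".
    """
    payload = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/sessions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and port; substitute the address of your Livy server.
req = livy_session_request("http://livy-host.example.com:8998")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` sends the request to a reachable Livy server. Building the request separately from sending it keeps the example runnable on a machine that has no access to a cluster.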

Table 1.1. Spark and Livy Feature Support by HDP Version

HDP Version(s)	2.6.0	2.5.0, 2.5.3	2.4.3	2.4.2	2.4.0	2.3.4, 2.3.4.7, 2.3.6	2.3.2	2.2.8, 2.2.9, 2.3.0	2.2.4, 2.2.6
Spark Version(s)	1.6.3, 2.1.0	1.6.2	1.6.2	1.6.1	1.6.0	1.5.2	1.4.1	1.3.1	1.2.1
Livy Version(s)	0.3
Feature:
Spark Core	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Spark on YARN	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Spark on YARN for Kerberos-enabled clusters	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Spark history server	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Hive support	1.2.1	1.2.1	1.2.1	1.2.1	1.2.1	1.2.1	0.13.1	0.13.1
Spark MLlib	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
ML Pipeline API	Yes	Yes	Yes	Yes	Yes	Yes	Yes
DataFrame API	Yes	Yes	Yes	Yes	Yes	Yes	Yes	TP
Optimized Row Columnar (ORC) Files	Yes	Yes	Yes	Yes	Yes	Yes	Yes	TP
PySpark	Yes	Yes	Yes	Yes	Yes	Yes	Yes	TP
SparkR	Yes	Yes	TP	TP	TP	TP	TP
Spark SQL	Yes	Yes	Yes	Yes	Yes	Yes	TP	TP	TP
Spark SQL Thrift server (JDBC, ODBC)	Yes	Yes	Yes	Yes	Yes	Yes	TP	TP
Spark SQL row- and column-level access control	TP
Spark Streaming	Yes	Yes	Yes	Yes	Yes	Yes	TP	TP	TP
Dynamic resource allocation	Yes*	Yes*	Yes*	Yes*	Yes*	Yes*	TP	TP
HBase connector	Yes	Yes	TP	TP
GraphX	TP	TP	TP	TP	TP	TP

TP: Tech Preview

* Note: Dynamic resource allocation does not work with Spark Streaming.
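
The note above can be turned into a small configuration helper. The sketch below is a minimal example, assuming the standard Spark properties `spark.dynamicAllocation.enabled`, `spark.shuffle.service.enabled`, `spark.dynamicAllocation.minExecutors`, and `spark.dynamicAllocation.maxExecutors`. The helper names and the executor bounds are hypothetical values chosen for illustration, not recommended settings.

```python
def executor_allocation_conf(uses_streaming, min_executors=1, max_executors=10):
    """Return Spark properties for executor allocation.

    Dynamic resource allocation is switched off for Spark Streaming
    applications, because the two do not work together.
    """
    if uses_streaming:
        return {"spark.dynamicAllocation.enabled": "false"}
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation on YARN relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        # Illustrative bounds only; tune these for your cluster.
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def as_submit_flags(conf):
    """Render properties as spark-submit --conf flags, sorted by key."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


batch_flags = as_submit_flags(executor_allocation_conf(uses_streaming=False))
streaming_flags = as_submit_flags(executor_allocation_conf(uses_streaming=True))
print(streaming_flags)
```

Keeping the decision in one function makes it harder to enable dynamic allocation on a streaming job by accident when settings are copied between applications.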

The following features are available as technical previews and are considered under development. Do not use these features in production systems. If you have questions about these features, contact Support by logging a case on the Hortonworks Support Portal at https://support.hortonworks.com.

GraphX
Dataset API

The following features and associated tools are not officially supported by Hortonworks:

Spark Standalone
Spark on Mesos
Jupyter (formerly IPython) Notebook
