Chapter 1. Introduction

Hortonworks Data Platform supports Apache Spark, a fast, large-scale data processing engine.

Deep integration of Spark with YARN allows Spark to operate as a cluster tenant alongside other engines such as Hive, Storm, and HBase, all running simultaneously on a single data platform. YARN gives you the flexibility to choose the right processing tool for each job. Instead of creating and managing a set of dedicated clusters for Spark applications, you can store data in a single location, access and analyze it with multiple processing engines, and share cluster resources among them. In a modern data architecture with multiple processing engines using YARN and accessing data in HDFS, Spark on YARN is the leading Spark deployment mode.

Spark Features

Spark on HDP supports the following features:

Spark on YARN
Spark Core
Spark SQL
Spark SQL Thrift Server (JDBC/ODBC)
Spark MLlib
Spark Streaming
Spark History Server
DataFrame API
Optimized Row Columnar (ORC) files
ML Pipeline API
Support for Hive 1.2.1
PySpark
Dynamic Resource Allocation

The following features and associated tools are available as technical previews:

SparkR
GraphX
Apache Zeppelin
Dataset API

The following features and associated tools are not officially supported by Hortonworks:

Spark Standalone
Spark on Mesos
Jupyter/IPython Notebook

Spark on YARN leverages YARN services for resource allocation, and runs Spark Executors in YARN containers. Spark on YARN supports workload management and Kerberos security features. It has two modes:

YARN-cluster mode, optimized for long-running production jobs.
YARN-client mode, best for interactive use such as prototyping, testing, and debugging. Spark Shell and Spark Thrift Server both run in YARN-client mode only.
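
The two modes can be illustrated with a minimal sketch that assembles a `spark-submit` command line for each. It assumes the Spark 1.x master values `yarn-cluster` and `yarn-client`; the helper names `build_spark_submit` and `as_shell`, and the application file names in the usage note, are hypothetical and used only for illustration.

```python
import shlex

# The two YARN modes described above, as spark-submit master values (Spark 1.x).
YARN_MODES = {"cluster": "yarn-cluster", "client": "yarn-client"}


def build_spark_submit(mode, app_path, app_args=(), num_executors=None):
    """Return a spark-submit argument list for the given YARN mode.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if mode not in YARN_MODES:
        raise ValueError("mode must be 'cluster' or 'client', got %r" % (mode,))
    cmd = ["spark-submit", "--master", YARN_MODES[mode]]
    if num_executors is not None:
        # Number of Spark Executors, each of which runs in a YARN container.
        cmd += ["--num-executors", str(num_executors)]
    cmd.append(app_path)
    cmd.extend(app_args)
    return cmd


def as_shell(cmd):
    """Render an argument list as a single shell-quoted command string."""
    return " ".join(shlex.quote(part) for part in cmd)
```

For example, `as_shell(build_spark_submit("cluster", "etl_job.py", num_executors=4))` produces a command suited to a long-running production job, while `build_spark_submit("client", "explore.py")` targets interactive work.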

Table 1.1. Spark - HDP Version Support

HDP	Ambari	Spark
2.4.2	2.2.2	1.6.1
2.4.0	2.2.1	1.6.0
2.3.4	2.2.0	1.5.2
2.3.2	2.1.2	1.4.1
2.3.0	2.1.1	1.3.1
2.2.9	2.1.1	1.3.1
2.2.8	2.1.1	1.3.1
2.2.6	2.1.1	1.2.1
2.2.4	2.0.1	1.2.1
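
When scripting against several clusters, the rows of Table 1.1 can be kept as data and queried. The following is a sketch only; the names `VERSION_SUPPORT`, `spark_version_for`, and `hdp_versions_for` are hypothetical, and the values are copied from the table above.

```python
# Rows of Table 1.1 as (HDP, Ambari, Spark) tuples, in table order.
VERSION_SUPPORT = [
    ("2.4.2", "2.2.2", "1.6.1"),
    ("2.4.0", "2.2.1", "1.6.0"),
    ("2.3.4", "2.2.0", "1.5.2"),
    ("2.3.2", "2.1.2", "1.4.1"),
    ("2.3.0", "2.1.1", "1.3.1"),
    ("2.2.9", "2.1.1", "1.3.1"),
    ("2.2.8", "2.1.1", "1.3.1"),
    ("2.2.6", "2.1.1", "1.2.1"),
    ("2.2.4", "2.0.1", "1.2.1"),
]


def spark_version_for(hdp_version):
    """Return the Spark version listed for an HDP version, or None if unlisted."""
    for hdp, _ambari, spark in VERSION_SUPPORT:
        if hdp == hdp_version:
            return spark
    return None


def hdp_versions_for(spark_version):
    """Return every HDP version listed with the given Spark version."""
    return [hdp for hdp, _ambari, spark in VERSION_SUPPORT if spark == spark_version]
```

Returning `None` for an unlisted HDP version, rather than guessing the nearest release, keeps the helper from implying support that the table does not state.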

Table 1.2. Spark Feature Support by Version

Feature	1.2.1	1.3.1	1.4.1	1.5.2	1.6.0	1.6.1
Spark Core	Yes	Yes	Yes	Yes	Yes	Yes
Spark on YARN	Yes	Yes	Yes	Yes	Yes	Yes
Spark on YARN, Kerberos-enabled clusters	Yes	Yes	Yes	Yes	Yes	Yes
Spark History Server	Yes	Yes	Yes	Yes	Yes	Yes
Spark MLlib	Yes	Yes	Yes	Yes	Yes	Yes
Hive 0.13 (or later) support, including `collect_list` UDF		Hive version 0.13.1	Hive version 0.13.1	Hive version 1.2.1	Hive version 1.2.1	Hive version 1.2.1
ML Pipeline API			Yes	Yes	Yes	Yes
DataFrame API		TP	Yes	Yes	Yes	Yes
ORC Files		TP	Yes	Yes	Yes	Yes
PySpark		TP	Yes	Yes	Yes	Yes
Spark SQL	TP	TP	TP	Yes	Yes	Yes
Spark Thrift Server (JDBC/ODBC)		TP	TP	Yes	Yes	Yes
Spark Streaming	TP	TP	TP	Yes	Yes	Yes
Dynamic Resource Allocation		TP	TP	Yes*	Yes*	Yes*
SparkR			TP	TP	TP	TP
GraphX				TP	TP	TP
Spark HBase connector						TP

TP: Technical Preview

* Note: Dynamic Resource Allocation does not work with Spark Streaming.
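
The restriction in the note can be checked before an application is submitted. The sketch below assumes the application's settings are available as a dictionary of Spark properties, and that Dynamic Resource Allocation is switched on through the `spark.dynamicAllocation.enabled` property; the function name `dynamic_allocation_conflict` is hypothetical.

```python
def dynamic_allocation_conflict(conf, uses_streaming):
    """Return True if the settings combine Dynamic Resource Allocation
    with Spark Streaming, which the note above says does not work.

    conf: dict of Spark property names to values (strings or booleans).
    uses_streaming: whether the application uses Spark Streaming.
    """
    raw = conf.get("spark.dynamicAllocation.enabled", "false")
    # Property files carry values as text, so compare case-insensitively.
    enabled = str(raw).strip().lower() == "true"
    return bool(enabled and uses_streaming)
```

A caller might refuse to submit, or switch the property off, when this function returns `True`.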
