Chapter 1. Introduction

Hortonworks Data Platform supports Apache Spark 1.3.1, a fast, large-scale data processing engine.

Deep integration of Spark with YARN allows Spark to operate as a cluster tenant alongside other engines such as Hive, Storm, and HBase, all running simultaneously on a single data platform. YARN provides flexibility: you can choose the right processing tool for each job. Instead of creating and managing a set of dedicated clusters for Spark applications, you can store data in a single location, access and analyze it with multiple processing engines, and share the same cluster resources across all of them. In a modern data architecture with multiple processing engines using YARN and accessing data in HDFS, Spark on YARN is the leading Spark deployment mode.

Spark Features

Spark on HDP supports the following features:

Spark Core
Spark on YARN
Spark on YARN on Kerberos-enabled clusters
Spark History Server
Spark MLlib
Support for Hive 0.13.1, including the collect_list UDF

The following features are available as technical previews:

Spark DataFrame API
ORC file support
Spark SQL
Spark Streaming
Spark SQL Thrift Server
Dynamic Executor Allocation

The following features and tools are not officially supported in this release:

ML Pipeline API
SparkR
Spark Standalone
GraphX
IPython
Zeppelin

Spark on YARN uses YARN services for resource allocation, running Spark Executors in YARN containers. Spark on YARN supports workload management and Kerberos security features. It has two modes:

YARN-Cluster mode, optimized for long-running production jobs.
YARN-Client mode, best for interactive use such as prototyping, testing, and debugging. Spark Shell runs in YARN-Client mode only.
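
As a rough illustration of the two modes, the sketch below builds the `spark-submit` command line for each. In Spark 1.x the mode is selected with `--master yarn-cluster` or `--master yarn-client`. The helper `spark_submit_args`, the class name `org.example.MyApp`, and the JAR name `my-app.jar` are hypothetical placeholders, not part of HDP or Spark. The script only prints the commands and does not launch anything.

```python
# Hypothetical helper: builds (but does not run) a spark-submit command line.
def spark_submit_args(main_class, jar, mode, app_args=()):
    """Return an argv list for submitting a Spark 1.x application to YARN."""
    # Spark 1.x selects the YARN deployment mode through the --master value.
    masters = {"cluster": "yarn-cluster", "client": "yarn-client"}
    if mode not in masters:
        raise ValueError("mode must be 'cluster' or 'client'")
    return (
        ["spark-submit", "--class", main_class, "--master", masters[mode], jar]
        + list(app_args)
    )


if __name__ == "__main__":
    # Placeholder application names, assumed for illustration only.
    main_class, jar = "org.example.MyApp", "my-app.jar"

    # Long-running production job: submit in YARN-Cluster mode.
    print(" ".join(spark_submit_args(main_class, jar, "cluster")))

    # Prototyping, testing, or debugging: submit in YARN-Client mode.
    print(" ".join(spark_submit_args(main_class, jar, "client")))
```

In YARN-Cluster mode the driver runs inside the YARN Application Master, so the job keeps running after the submitting client disconnects. In YARN-Client mode the driver stays in the local client process, which is why interactive tools such as Spark Shell use it.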

The following tables summarize Spark versions and feature support across HDP and Ambari versions.

Table 1.1. Spark Support in HDP, Ambari

HDP	Ambari	Spark
2.2.4	2.0.1	1.2.1
2.2.6	2.1.1	1.2.1
2.2.8	2.1.1	1.3.1
2.2.9	2.1.1	1.3.1

Table 1.2. Spark Feature Support by Version

Feature	1.2.1	1.3.1
Spark Core	Yes	Yes
Spark on YARN	Yes	Yes
Spark on YARN, Kerberos-enabled clusters	Yes	Yes
Spark History Server	Yes	Yes
Spark MLlib	Yes	Yes
Hive 0.13.1, including `collect_list` UDF		Yes
ML Pipeline API (PySpark)
DataFrame API		TP
ORC Files		TP
Spark SQL	TP	TP
Spark Streaming	TP	TP
Spark SQL Thrift Server		TP
Dynamic Executor Allocation		TP
SparkR
Spark Standalone
GraphX

TP: Tech Preview. A blank cell indicates that the feature is not officially supported in that version.
