Chapter 1. Introduction
Hortonworks Data Platform (HDP) supports Apache Spark 1.6, a fast, large-scale data processing engine.
Deep integration of Spark with YARN allows Spark to operate as a cluster tenant alongside other engines such as Hive, Storm, and HBase, all running simultaneously on a single data platform. YARN gives you the flexibility to choose the right processing tool for each job. Instead of creating and managing a set of dedicated clusters for Spark applications, you can store data in a single location, access and analyze it with multiple processing engines, and make better use of your cluster resources. In a modern data architecture with multiple processing engines using YARN and accessing data in HDFS, Spark on YARN is the leading Spark deployment mode.
Spark Features
Spark on HDP supports the following features:
Spark Core
Spark on YARN
Spark on YARN on Kerberos-enabled clusters
Spark History Server
Spark MLlib
DataFrame API
Optimized Row Columnar (ORC) files
Spark SQL
Spark SQL Thrift Server
Spark Streaming
Support for Hive 1.2.1
ML Pipeline API
PySpark
Dynamic Resource Allocation
The following features and associated tools are available as technical previews:
SparkR
GraphX
The following features and associated tools are not officially supported by Hortonworks:
Spark Standalone
Spark on Mesos
Jupyter/IPython Notebook
Oozie Spark action (a tech note is available for HDP customers)
Spark on YARN leverages YARN services for resource allocation, and runs Spark Executors in YARN containers. Spark on YARN supports workload management and Kerberos security features. It has two modes:
YARN-Cluster mode, optimized for long-running production jobs.
YARN-Client mode, best for interactive use such as prototyping, testing, and debugging. Spark Shell runs in YARN-Client mode only.
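The choice between the two modes is made when an application is submitted. The following is a minimal sketch, not an HDP-provided tool: the helper `spark_submit_command` and the application file names are hypothetical, and it assumes the standard `spark-submit` options `--master yarn` and `--deploy-mode`, which are not described in this chapter.

```python
def spark_submit_command(app, interactive=False, app_args=()):
    """Build a spark-submit argument list for Spark on YARN.

    interactive=True selects client deploy mode (YARN-Client), where the
    driver runs in the submitting process; otherwise cluster deploy mode
    (YARN-Cluster) is used, where the driver runs inside the YARN cluster.
    """
    deploy_mode = "client" if interactive else "cluster"
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        app,
    ] + list(app_args)


if __name__ == "__main__":
    # Hypothetical application files, used only for illustration.
    print(" ".join(spark_submit_command("nightly_etl.py")))
    print(" ".join(spark_submit_command("explore.py", interactive=True)))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.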
Table 1.1. Spark - HDP Version Support
HDP | Ambari | Spark |
---|---|---|
2.4.0 | 2.2.1 | 1.6.0 |
2.3.4 | 2.2.0 | 1.5.2 |
2.3.2 | 2.1.2 | 1.4.1 |
2.3.0 | 2.1.1 | 1.3.1 |
2.2.9 | 2.1.1 | 1.3.1 |
2.2.8 | 2.1.1 | 1.3.1 |
2.2.6 | 2.1.1 | 1.2.1 |
2.2.4 | 2.0.1 | 1.2.1 |
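When scripting against several clusters, it can help to carry Table 1.1 as data. The following is an illustrative sketch only: the dictionary copies the rows of the table, and the names `HDP_SUPPORT`, `spark_version_for`, and `hdp_versions_with` are hypothetical.

```python
# Rows of Table 1.1: HDP version -> (Ambari version, Spark version).
HDP_SUPPORT = {
    "2.4.0": ("2.2.1", "1.6.0"),
    "2.3.4": ("2.2.0", "1.5.2"),
    "2.3.2": ("2.1.2", "1.4.1"),
    "2.3.0": ("2.1.1", "1.3.1"),
    "2.2.9": ("2.1.1", "1.3.1"),
    "2.2.8": ("2.1.1", "1.3.1"),
    "2.2.6": ("2.1.1", "1.2.1"),
    "2.2.4": ("2.0.1", "1.2.1"),
}


def spark_version_for(hdp_version):
    """Return the Spark version listed for an HDP release, or None if unlisted."""
    row = HDP_SUPPORT.get(hdp_version)
    return row[1] if row else None


def hdp_versions_with(spark_version):
    """Return the HDP releases listed with the given Spark version, in table order."""
    return [hdp for hdp, (_, spark) in HDP_SUPPORT.items() if spark == spark_version]


if __name__ == "__main__":
    print(spark_version_for("2.4.0"))
    print(hdp_versions_with("1.3.1"))
```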
Table 1.2. Spark Feature Support by Version
Feature | 1.2.1 | 1.3.1 | 1.4.1 | 1.5.2 | 1.6.0 |
---|---|---|---|---|---|
Spark Core | Yes | Yes | Yes | Yes | Yes |
Spark on YARN | Yes | Yes | Yes | Yes | Yes |
Spark on YARN, Kerberos-enabled clusters | Yes | Yes | Yes | Yes | Yes |
Spark History Server | Yes | Yes | Yes | Yes | Yes |
Spark MLlib | Yes | Yes | Yes | Yes | Yes |
Hive 13 (or later) support, including collect_list UDF | | Hive version 0.13.1 | Hive version 0.13.1 | Hive version 1.2.1 | Hive version 1.2.1 |
ML Pipeline API | | | Yes | Yes | Yes |
DataFrame API | | TP | Yes | Yes | Yes |
ORC Files | | TP | Yes | Yes | Yes |
PySpark | | TP | Yes | Yes | Yes |
Spark SQL | TP | TP | TP | Yes | Yes |
Spark Thrift Server | | TP | TP | Yes | Yes |
Spark Streaming | TP | TP | TP | Yes | Yes |
Dynamic Resource Allocation | | TP | TP | Yes* | Yes* |
SparkR | | | TP | TP | TP |
GraphX | | | | TP | TP |
TP: Technical Preview
* Note: Dynamic Resource Allocation does not work with Spark Streaming.
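The restriction in the note can be enforced when building job settings. The following is a hedged sketch: the helper `dynamic_allocation_args` is hypothetical, and the property names `spark.dynamicAllocation.enabled` and `spark.shuffle.service.enabled` come from the Apache Spark configuration reference rather than from this chapter.

```python
def dynamic_allocation_args(uses_streaming):
    """Return spark-submit --conf arguments for Dynamic Resource Allocation.

    Returns an empty list for Spark Streaming jobs, because Dynamic
    Resource Allocation does not work with Spark Streaming.
    """
    if uses_streaming:
        return []
    return [
        "--conf", "spark.dynamicAllocation.enabled=true",
        # Assumption: dynamic allocation relies on the external shuffle
        # service, so it is enabled alongside it here.
        "--conf", "spark.shuffle.service.enabled=true",
    ]


if __name__ == "__main__":
    print(dynamic_allocation_args(uses_streaming=False))
    print(dynamic_allocation_args(uses_streaming=True))
```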