Chapter 1. Analyzing Data with Apache Spark
Hortonworks Data Platform (HDP) supports Apache Spark, a fast, large-scale data processing engine.
Deep integration of Spark with YARN allows Spark to operate as a cluster tenant alongside other Apache engines such as Hive, Storm, and HBase, all running simultaneously on a single data platform. Instead of creating and managing a set of dedicated clusters for Spark applications, you can store data in a single location, access and analyze it with multiple processing engines, and share cluster resources among them.
Spark on YARN uses YARN services for resource allocation, runs Spark executors in YARN containers, and supports workload management and Kerberos security features. Spark on YARN has two deployment modes:
YARN-cluster mode, optimized for long-running production jobs
YARN-client mode, best for interactive use such as prototyping, testing, and debugging
Spark shell and the Spark Thrift server run in YARN-client mode only.
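As a rough illustration of the two modes, the following Python sketch builds a spark-submit command line for either one. It assumes the standard spark-submit options --master, --deploy-mode, and --class; the helper name spark_submit_command, the class name com.example.NightlyReport, and the jar name nightly-report.jar are hypothetical placeholders, not part of HDP.

```python
import shlex

# Modes named in the text above, mapped to spark-submit's --deploy-mode values.
DEPLOY_MODES = {"yarn-cluster": "cluster", "yarn-client": "client"}


def spark_submit_command(mode, main_class, app_jar, app_args=()):
    """Build a spark-submit argument list for a Spark-on-YARN application."""
    if mode not in DEPLOY_MODES:
        raise ValueError("mode must be one of: " + ", ".join(sorted(DEPLOY_MODES)))
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", DEPLOY_MODES[mode],
        "--class", main_class,
        app_jar,
    ] + list(app_args)


# Hypothetical production job: the class name and jar path are placeholders.
cmd = spark_submit_command(
    "yarn-cluster", "com.example.NightlyReport", "nightly-report.jar", ["2016-11-01"]
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing "yarn-client" instead would produce the form suited to interactive prototyping and debugging, where the driver runs in the process that launched the job.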
Table 1.1. Spark Feature Support by Version
Spark Version | 1.6.2 | 1.6.2 | 1.6.1 | 1.6.0 | 1.5.2 | 1.4.1 | 1.3.1 | 1.2.1 |
---|---|---|---|---|---|---|---|---|
HDP Version(s) | 2.5.0, 2.5.3 | 2.4.3 | 2.4.2 | 2.4.0 | 2.3.4, 2.3.4.7, 2.3.6 | 2.3.2 | 2.2.8, 2.2.9, 2.3.0 | 2.2.4, 2.2.6 |
Feature: | | | | | | | | |
Spark Core | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Spark on YARN | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Spark on YARN, Kerberos-enabled clusters | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Spark history server | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Hive support | 1.2.1 | 1.2.1 | 1.2.1 | 1.2.1 | 1.2.1 | 0.13.1 | 0.13.1 | |
Spark MLlib | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
ML Pipeline API | Yes | Yes | Yes | Yes | Yes | Yes | | |
DataFrame API | Yes | Yes | Yes | Yes | Yes | Yes | TP | |
ORC Files | Yes | Yes | Yes | Yes | Yes | Yes | TP | |
PySpark | Yes | Yes | Yes | Yes | Yes | Yes | TP | |
SparkR | Yes | TP | TP | TP | TP | TP | | |
Spark SQL | Yes | Yes | Yes | Yes | Yes | TP | TP | TP |
Spark Thrift server (JDBC, ODBC) | Yes | Yes | Yes | Yes | Yes | TP | TP | |
Spark Streaming | Yes | Yes | Yes | Yes | Yes | TP | TP | TP |
Dynamic resource allocation | Yes* | Yes* | Yes* | Yes* | Yes* | TP | TP | |
HBase connector | Yes | TP | TP | | | | | |
TP: Technical Preview
* Note: Dynamic Resource Allocation does not work with Spark Streaming.
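The note above can be turned into a simple guard when assembling job settings. The sketch below is one possible approach, not a Hortonworks-provided tool: it enables dynamic resource allocation only for jobs that do not use Spark Streaming. The property names are standard Spark configuration keys; the helper names and the default executor counts are assumptions chosen for illustration.

```python
def dynamic_allocation_conf(uses_streaming, min_executors=1, max_executors=10):
    """Return Spark properties, enabling dynamic allocation only for non-streaming jobs."""
    if uses_streaming:
        # Per the table note, dynamic resource allocation does not work with Spark Streaming.
        return {"spark.dynamicAllocation.enabled": "false"}
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        # Executor bounds are illustrative defaults, not recommended values.
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def as_conf_flags(conf):
    """Render properties as spark-submit --conf arguments, sorted for stable output."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", "{}={}".format(key, conf[key])]
    return flags


batch_flags = as_conf_flags(dynamic_allocation_conf(uses_streaming=False))
streaming_flags = as_conf_flags(dynamic_allocation_conf(uses_streaming=True))
print(batch_flags)
print(streaming_flags)
```

The resulting flags can be appended to a spark-submit command line; a streaming job gets only the single property that turns dynamic allocation off.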
The following features are available as technical previews and are considered under development. Do not use them in production systems. If you have questions about these features, contact Support by logging a case on the Hortonworks Support Portal at https://support.hortonworks.com.
Spark 2.0, including side-by-side installation with Spark 1.6.2 (see Installing Spark)
GraphX
Dataset API
The following features and associated tools are not officially supported by Hortonworks:
Direct use of the Livy server and REST API. (The Livy server is accessible through the %livy interpreter, within Zeppelin.)
Spark Standalone
Spark on Mesos
Jupyter (formerly IPython) Notebook