Apache Spark Component Guide

Chapter 1. Analyzing Data with Apache Spark

Hortonworks Data Platform (HDP) supports Apache Spark, a fast, large-scale data processing engine.

Deep integration of Spark with YARN allows Spark to operate as a cluster tenant alongside other Apache engines such as Hive, Storm, and HBase, all running simultaneously on a single data platform. Instead of creating and managing a set of dedicated clusters for Spark applications, you can store data in a single location, access and analyze it with multiple processing engines, and share cluster resources across those workloads.

Spark on YARN leverages YARN services for resource allocation, runs Spark executors in YARN containers, and supports workload management and Kerberos security features. It has two modes:

  • YARN-cluster mode, optimized for long-running production jobs

  • YARN-client mode, best for interactive use such as prototyping, testing, and debugging

Spark shell and the Spark Thrift server run in YARN-client mode only.
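As a rough illustration of the two modes, the sketch below assembles a `spark-submit` command line for each one. It assumes the standard `--master yarn` and `--deploy-mode` options of `spark-submit`; the helper name `build_submit_command` and the application file `my_job.py` are hypothetical and used only for illustration.

```python
import shlex

# The two YARN deploy modes described above.
VALID_MODES = ("cluster", "client")


def build_submit_command(app_path, deploy_mode, app_args=()):
    """Return the argv list for submitting app_path to YARN in the given mode.

    build_submit_command is a hypothetical helper, not part of Spark or HDP.
    """
    if deploy_mode not in VALID_MODES:
        raise ValueError("deploy_mode must be 'cluster' or 'client'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        app_path,
        *app_args,
    ]


if __name__ == "__main__":
    # my_job.py is a placeholder application name.
    for mode in VALID_MODES:
        argv = build_submit_command("my_job.py", mode)
        print(" ".join(shlex.quote(arg) for arg in argv))
```

In cluster mode the driver runs inside a YARN container, which suits unattended production jobs; in client mode the driver runs in the submitting process, which is why interactive tools such as the Spark shell use it. Spark 1.6 also accepts the older `--master yarn-cluster` and `--master yarn-client` forms.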

HDP 2.6 supports Spark versions 1.6 and 2.1; Livy, for local and remote access to Spark through the Livy REST API; and Apache Zeppelin, for browser-based notebook access to Spark. (For more information about Zeppelin, see the Zeppelin Component Guide.)
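To give a sense of what access through the Livy REST API looks like, the sketch below builds (but does not send) the two HTTP requests typically used to start an interactive session and run a statement in it. It assumes Livy's `POST /sessions` and `POST /sessions/{id}/statements` endpoints; the host name, port, and helper function names are placeholders, not values taken from this guide.

```python
import json
from urllib.request import Request

# Assumption: placeholder host and port; substitute your own Livy server address.
LIVY_URL = "http://livy-host.example.com:8999"

JSON_HEADERS = {"Content-Type": "application/json"}


def create_session_request(base_url, kind="pyspark"):
    """Build a request that asks Livy to start an interactive session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(base_url + "/sessions", data=body,
                   headers=JSON_HEADERS, method="POST")


def run_statement_request(base_url, session_id, code):
    """Build a request that submits a code snippet to an existing session."""
    body = json.dumps({"code": code}).encode("utf-8")
    url = "{}/sessions/{}/statements".format(base_url, session_id)
    return Request(url, data=body, headers=JSON_HEADERS, method="POST")


if __name__ == "__main__":
    session_req = create_session_request(LIVY_URL)
    print(session_req.get_method(), session_req.full_url)
    # Session id 0 is illustrative; a real id comes from the server's response.
    statement_req = run_statement_request(LIVY_URL, 0, "sc.parallelize(range(10)).sum()")
    print(statement_req.get_method(), statement_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it to a running Livy server. On a secured cluster, additional authentication setup is needed that this sketch does not cover.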

Table 1.1. Spark and Livy Feature Support by HDP Version

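| HDP Version(s) | 2.6.1 | 2.6.0 | 2.5.0, 2.5.3 | 2.4.3 | 2.4.2 | 2.4.0 | 2.3.4, 2.3.4.7, 2.3.6 | 2.3.2 | 2.2.8, 2.2.9, 2.3.0 | 2.2.4, 2.2.6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Spark Version | 1.6.3, 2.1.1 | 1.6.3, 2.1.0 | 1.6.2 | 1.6.2 | 1.6.1 | 1.6.0 | 1.5.2 | 1.4.1 | 1.3.1 | 1.2.1 |
| Support for Livy | 0.3 | 0.3 | | | | | | | | |
| Support for Hive | 1.2.1 | 1.2.1 | 1.2.1 | 1.2.1 | 1.2.1 | 1.2.1 | 1.2.1 | 0.13.1 | 0.13.1 | |
| Spark Core | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Spark on YARN | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Spark on YARN for Kerberos-enabled clusters | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Spark history server | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Spark MLlib | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ML Pipeline API | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| DataFrame API | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | TP | |
| Optimized Row Columnar (ORC) Files | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | TP | |
| PySpark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | TP | |
| SparkR | ✓ | ✓ | ✓ | TP | TP | TP | TP | TP | | |
| Spark SQL | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | TP | TP | TP |
| Spark SQL Thrift server for JDBC, ODBC access | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | TP | TP | |
| Spark-LLAP: SQL row- and column-level access control (Spark 2 only) | TP | TP | | | | | | | | |
| Spark Streaming | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | TP | TP | TP |
| Dynamic resource allocation | ✓* | ✓* | ✓* | ✓* | ✓* | ✓* | ✓* | TP | TP | |
| HBase connector | ✓ | ✓ | ✓ | TP | TP | | | | | |
| GraphX | TP | TP | TP | TP | TP | TP | TP | | | |
| DataSet API | TP | TP | TP | TP | TP | | | | | |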

* Dynamic Resource Allocation does not work with Spark Streaming.

TP: Technical Preview. Technical previews are considered under development. Do not use these features in production systems. If you have questions regarding these features, contact Support through the Hortonworks Support Portal, https://support.hortonworks.com.
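The note above that dynamic resource allocation does not work with Spark Streaming can be reflected in how job configuration is assembled. The following is a minimal sketch, assuming the standard Spark properties `spark.dynamicAllocation.enabled`, `spark.shuffle.service.enabled`, `spark.dynamicAllocation.minExecutors`, and `spark.dynamicAllocation.maxExecutors`; the helper names and the executor counts are hypothetical.

```python
def dynamic_allocation_conf(is_streaming, min_executors=1, max_executors=10):
    """Return Spark properties for a job, as a dict of string keys and values.

    Hypothetical helper: streaming jobs get dynamic allocation switched off,
    batch jobs get it switched on together with the external shuffle service.
    The default executor bounds are illustrative, not recommendations.
    """
    if is_streaming:
        return {"spark.dynamicAllocation.enabled": "false"}
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_conf_flags(conf):
    """Render a properties dict as spark-submit --conf arguments, sorted by key."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", "{}={}".format(key, conf[key])]
    return flags


if __name__ == "__main__":
    print(to_conf_flags(dynamic_allocation_conf(is_streaming=False)))
    print(to_conf_flags(dynamic_allocation_conf(is_streaming=True)))
```

The resulting flags can be appended to a `spark-submit` command line. The same properties could instead be set in `spark-defaults.conf` if they should apply cluster-wide.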

The following features and associated tools are not officially supported by Hortonworks:

  • Spark Standalone

  • Spark on Mesos

  • Jupyter Notebook (formerly IPython)