Spark Guide

Chapter 1. Introduction

Hortonworks Data Platform supports Apache Spark, a fast, large-scale data processing engine.

Deep integration of Spark with YARN allows Spark to operate as a cluster tenant alongside other engines such as Hive, Storm, and HBase, all running simultaneously on a single data platform. This gives you the flexibility to choose the right processing tool for each job. Instead of creating and managing a set of dedicated clusters for Spark applications, you can store data in a single location, access and analyze it with multiple processing engines, and share cluster resources among them. In a modern data architecture with multiple processing engines using YARN and accessing data in HDFS, Spark on YARN is the leading Spark deployment mode.

Spark Features

Spark on HDP supports the following features:

  • Spark on YARN

  • Spark Core

  • Spark SQL

  • Spark SQL Thrift Server (JDBC/ODBC)

  • Spark MLlib

  • Spark Streaming

  • Spark History Server

  • DataFrame API

  • Optimized Row Columnar (ORC) files

  • ML Pipeline API

  • Support for Hive 1.2.1

  • PySpark

  • Dynamic Resource Allocation

The following features and associated tools are available as technical previews (marked TP for Spark 1.6.1 in Table 1.2):

  • SparkR

  • GraphX

  • Spark HBase connector

The following features and associated tools are not officially supported by Hortonworks:

  • Spark Standalone

  • Spark on Mesos

  • Jupyter/IPython Notebook

Spark on YARN leverages YARN services for resource allocation, and runs Spark Executors in YARN containers. Spark on YARN supports workload management and Kerberos security features. It has two modes:

  • YARN-cluster mode, optimized for long-running production jobs.

  • YARN-client mode, best for interactive use such as prototyping, testing, and debugging. Spark Shell and Spark Thrift Server both run in YARN-client mode only.
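As a rough illustration of the two modes, the sketch below builds a spark-submit command line for each. It assumes the standard `--master yarn` and `--deploy-mode` options of spark-submit; the helper name `spark_submit_args` and the application file `my_job.py` are hypothetical and only serve the example.

```python
# Hypothetical helper: builds a spark-submit argument list for Spark on YARN.
# It only assembles the command; it does not launch anything.
def spark_submit_args(app_path, interactive=False, app_args=()):
    """Return the argv for spark-submit in YARN-client or YARN-cluster mode."""
    # YARN-client suits interactive work; YARN-cluster suits production jobs.
    deploy_mode = "client" if interactive else "cluster"
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        app_path,
    ] + list(app_args)


if __name__ == "__main__":
    # my_job.py is a placeholder application name.
    print(" ".join(spark_submit_args("my_job.py")))
    print(" ".join(spark_submit_args("my_job.py", interactive=True)))
```

The resulting list could be passed to `subprocess.call` on a host where the Spark client is installed. In YARN-cluster mode the driver runs inside the cluster, so the submitting session does not need to stay connected for the job to finish.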

Table 1.1. Spark - HDP Version Support

HDP       Ambari    Spark
2.4.2     2.2.2     1.6.1
2.4.0     2.2.1     1.6.0
2.3.4     2.2.0     1.5.2
2.3.2     2.1.2     1.4.1
2.3.0     2.1.1     1.3.1
2.2.9     2.1.1     1.3.1
2.2.8     2.1.1     1.3.1
2.2.6     2.1.1     1.2.1
2.2.4     2.0.1     1.2.1
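For scripts that need to check which Spark version ships with a given HDP release, the rows of Table 1.1 can be held in a small lookup structure. The sketch below is one possible arrangement; the names `HDP_SUPPORT`, `spark_version_for`, and `hdp_versions_for` are hypothetical.

```python
# Data transcribed from Table 1.1: HDP version -> (Ambari version, Spark version).
HDP_SUPPORT = {
    "2.4.2": ("2.2.2", "1.6.1"),
    "2.4.0": ("2.2.1", "1.6.0"),
    "2.3.4": ("2.2.0", "1.5.2"),
    "2.3.2": ("2.1.2", "1.4.1"),
    "2.3.0": ("2.1.1", "1.3.1"),
    "2.2.9": ("2.1.1", "1.3.1"),
    "2.2.8": ("2.1.1", "1.3.1"),
    "2.2.6": ("2.1.1", "1.2.1"),
    "2.2.4": ("2.0.1", "1.2.1"),
}


def spark_version_for(hdp_version):
    """Return the Spark version listed for an HDP release, or None if not listed."""
    entry = HDP_SUPPORT.get(hdp_version)
    return entry[1] if entry else None


def hdp_versions_for(spark_version):
    """Return the HDP releases that list the given Spark version."""
    return sorted(h for h, (_, s) in HDP_SUPPORT.items() if s == spark_version)


if __name__ == "__main__":
    print(spark_version_for("2.4.2"))
    print(hdp_versions_for("1.3.1"))
```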

Table 1.2. Spark Feature Support by Version

Feature                                                   1.2.1   1.3.1   1.4.1   1.5.2   1.6.0   1.6.1
Spark Core                                                Yes     Yes     Yes     Yes     Yes     Yes
Spark on YARN                                             Yes     Yes     Yes     Yes     Yes     Yes
Spark on YARN, Kerberos-enabled clusters                  Yes     Yes     Yes     Yes     Yes     Yes
Spark History Server                                      Yes     Yes     Yes     Yes     Yes     Yes
Spark MLlib                                               Yes     Yes     Yes     Yes     Yes     Yes
Hive 0.13 (or later) support, including collect_list UDF          0.13.1  0.13.1  1.2.1   1.2.1   1.2.1
ML Pipeline API                                                           Yes     Yes     Yes     Yes
DataFrame API                                                     TP      Yes     Yes     Yes     Yes
ORC Files                                                         TP      Yes     Yes     Yes     Yes
PySpark                                                           TP      Yes     Yes     Yes     Yes
Spark SQL                                                 TP      TP      TP      Yes     Yes     Yes
Spark Thrift Server (JDBC/ODBC)                                   TP      TP      Yes     Yes     Yes
Spark Streaming                                           TP      TP      TP      Yes     Yes     Yes
Dynamic Resource Allocation                                       TP      TP      Yes*    Yes*    Yes*
SparkR                                                                    TP      TP      TP      TP
GraphX                                                                            TP      TP      TP
Spark HBase connector                                                                             TP

   TP: Tech Preview

* Note: Dynamic Resource Allocation does not work with Spark Streaming.
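The note above can be reflected in job configuration. The following sketch assembles Spark properties for Dynamic Resource Allocation and turns the feature off for streaming applications. The property names `spark.dynamicAllocation.enabled`, `spark.shuffle.service.enabled`, `spark.dynamicAllocation.minExecutors`, and `spark.dynamicAllocation.maxExecutors` are standard Spark settings; the helper names and the executor bounds are hypothetical values chosen for illustration.

```python
# Hypothetical helper: chooses Dynamic Resource Allocation settings for a job.
def dynamic_allocation_conf(uses_streaming, min_executors=1, max_executors=10):
    """Return Spark properties, disabling dynamic allocation for streaming jobs."""
    if uses_streaming:
        # Dynamic Resource Allocation does not work with Spark Streaming.
        return {"spark.dynamicAllocation.enabled": "false"}
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        # Illustrative bounds only; tune for the actual cluster.
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_flags(conf):
    """Render a property dict as spark-submit --conf arguments."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", "{}={}".format(key, conf[key])]
    return flags


if __name__ == "__main__":
    print(to_submit_flags(dynamic_allocation_conf(uses_streaming=False)))
    print(to_submit_flags(dynamic_allocation_conf(uses_streaming=True)))
```

The same properties could instead be placed in spark-defaults.conf. Passing them per job keeps the streaming exception explicit at submission time.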