Apache Spark Component Guide

Hortonworks Data Platform

Copyright © 2012-2017 Hortonworks, Inc.

Except where otherwise noted, this document is licensed under the Creative Commons Attribution-ShareAlike 4.0 License:

http://creativecommons.org/licenses/by-sa/4.0/legalcode

The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing, and analyzing large volumes of data. It is designed to handle data from many sources and formats quickly, easily, and cost-effectively. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects, including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper, and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process, and installation and configuration tools are also included.

Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain, free and open source.

Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to contact us directly to discuss your specific needs.

Contents

1. Analyzing Data with Apache Spark

2. Installing Spark

Installing Spark Using Ambari
Installing Spark Manually
Verifying Spark Configuration for Hive Access
Installing the Spark Thrift Server After Deploying Spark
Validating the Spark Installation

3. Configuring Spark

Configuring the Spark SQL Hive Warehouse Directory

Configuring the Spark Thrift Server

Enabling Spark SQL User Impersonation for the Spark Thrift Server
Customizing the Spark Thrift Server Port

Configuring the Livy Server

Configuring SSL for the Livy Server
Configuring High Availability for the Livy Server

Configuring the Spark History Server

Configuring Dynamic Resource Allocation

Customizing Dynamic Resource Allocation Settings on an Ambari-Managed Cluster
Configuring Cluster Dynamic Resource Allocation Manually
Configuring a Job for Dynamic Resource Allocation
Dynamic Resource Allocation Properties

Configuring Spark for Wire Encryption

Configuring Spark for Wire Encryption
Configuring Spark2 for Wire Encryption

Configuring Spark for a Kerberos-Enabled Cluster

Configuring the Spark History Server
Configuring the Spark Thrift Server
Setting Up Access for Submitting Jobs

4. Running Spark

Specifying Which Version of Spark to Run

Running Sample Spark 1.x Applications

Spark Pi
WordCount

Running Sample Spark 2.x Applications

Spark Pi
WordCount

5. Submitting Spark Applications Through Livy

Using Livy with Spark Versions 1 and 2

Using Livy with Interactive Notebooks

Using the Livy API to Run Spark Jobs: Overview

Running an Interactive Session With the Livy API

Livy Objects for Interactive Sessions
Setting Path Variables for Python
Livy API Reference for Interactive Sessions

Submitting Batch Applications Using the Livy API

Livy Batch Object
Livy API Reference for Batch Jobs

6. Running PySpark in a Virtual Environment

7. Automating Spark Jobs with Oozie Spark Action

Configuring Oozie Spark Action for Spark 1
Configuring Oozie Spark Action for Spark 2

8. Developing Spark Applications

Using the Spark DataFrame API

Using Spark SQL

Accessing Spark SQL through the Spark Shell
Accessing Spark SQL through JDBC or ODBC: Prerequisites
Accessing Spark SQL through JDBC
Accessing Spark SQL through ODBC
Spark SQL User Impersonation

Calling Hive User-Defined Functions

Using Built-in UDFs
Using Custom UDFs

Using Spark Streaming

Prerequisites
Building and Running a Secure Spark Streaming Job
Running Spark Streaming Jobs on a Kerberos-Enabled Cluster
Sample pom.xml File for Spark Streaming with Kafka

HBase Data on Spark with Connectors

Selecting a Connector
Using the Connector with Apache Phoenix

Accessing HDFS Files from Spark

Specifying Compression
Accessing HDFS from PySpark

Accessing ORC Data in Hive Tables

Accessing ORC Files from Spark
Predicate Push-Down Optimization
Loading ORC Data into DataFrames Using Predicate Push-Down
Optimizing Queries Through Partition Pruning
Enabling Vectorized Query Execution
Reading Hive ORC Tables
Additional Resources

Using Custom Libraries with Spark

9. Using Spark from R: SparkR

Prerequisites
SparkR Example

10. Tuning Spark

Provisioning Hardware

Checking Job Status

Checking Job History

Improving Software Performance

Configuring YARN Memory Allocation for Spark

List of Tables

1.1. Spark and Livy Feature Support by HDP Version
3.1. Dynamic Resource Allocation Properties
3.2. Optional Dynamic Resource Allocation Properties
8.1. Comparison of the Spark-HBase Connectors