Spark Guide

Hortonworks Data Platform

Copyright © 2012-2016 Hortonworks, Inc.

Except where otherwise noted, this document is licensed under the Creative Commons Attribution-ShareAlike 4.0 License:

http://creativecommons.org/licenses/by-sa/4.0/legalcode

The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing, and analyzing large volumes of data. It is designed to handle data from many sources and in many formats quickly, easily, and cost-effectively. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects, including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper, and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process, and installation and configuration tools are also included.

Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain, free and open source.

Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to contact us directly to discuss your specific needs.

Contents

1. Introduction

2. Prerequisites

3. Installing and Configuring Spark

Installing and Configuring Spark Over Ambari

(Optional) Configuring Spark for Hive Access
(Optional) Installing the Spark Thrift Server After Deploying Spark

Configuring Dynamic Resource Allocation and Thrift Server Settings

Customizing Cluster Dynamic Resource Allocation Settings (Ambari)
Configuring Cluster Dynamic Resource Allocation Manually
Configuring a Job for Dynamic Resource Allocation
Dynamic Resource Allocation Properties
Customizing the Spark Thrift Server Port

(Optional) Configuring Spark for a Kerberos-Enabled Cluster

Configuring the Spark Thrift Server on a Kerberos-Enabled Cluster
Configuring Spark Streaming with Kafka in a Kerberos-Enabled Environment

(Optional) Configuring the Spark History Server

Validating the Spark Installation

4. Developing Spark Applications

Spark Pi Program
WordCount Program

5. Using the Spark DataFrame API

Additional DataFrame API Examples
Specify Schema Programmatically

6. Accessing ORC Files from Spark

Accessing ORC in Spark
Reading and Writing with ORC
Column Pruning
Predicate Push-down
Partition Pruning
DataFrame Support
Additional Resources

7. Using Spark SQL

Accessing Spark SQL Through the Spark Shell

Accessing Spark SQL through JDBC and ODBC

Accessing Spark SQL through JDBC
Accessing Spark SQL through ODBC

Forming JDBC Connection Strings for Spark SQL

Calling Hive User-Defined Functions

Using Custom UDFs

8. Using Spark Streaming

Using Spark Streaming with Kafka on a Kerberos-Enabled Cluster

Adding the spark-kafka-streaming jar File to a Deployed Build
Running Spark Streaming - Kafka Jobs on a Kerberos-Enabled Cluster

9. Adding Libraries to Spark

10. Using Spark with HDFS

Specifying Compression
Accessing HDFS from PySpark: Setting HADOOP_CONF_DIR

11. Tuning Spark

Provisioning Hardware
Checking Job Status
Checking Job History
Improving Software Performance
Configuring YARN Memory Allocation for Spark
Specifying codec Files

12. Appendix: Sample pom.xml File for Spark Streaming with Kafka

List of Tables

1.1. Spark - HDP Version Support
1.2. Spark Feature Support by Version
2.1. Prerequisites for Running Spark
3.1. Dynamic Resource Allocation Properties
3.2. Dynamic Resource Allocation: Optional Settings