Spark Guide

1. Introduction
2. Prerequisites
3. Installing and Configuring Spark
Installing and Configuring Spark Over Ambari
(Optional) Configuring Spark for Hive Access
(Optional) Installing the Spark Thrift Server After Deploying Spark
Configuring Dynamic Resource Allocation and Thrift Server Settings
Customizing Cluster Dynamic Resource Allocation Settings (Ambari)
Configuring Cluster Dynamic Resource Allocation Manually
Configuring a Job for Dynamic Resource Allocation
Dynamic Resource Allocation Properties
Customizing the Spark Thrift Server Port
(Optional) Configuring Spark for a Kerberos-Enabled Cluster
Configuring the Spark Thrift Server on a Kerberos-Enabled Cluster
Configuring Spark Streaming with Kafka in a Kerberos-Enabled Environment
(Optional) Configuring the Spark History Server
Validating the Spark Installation
4. Developing Spark Applications
Spark Pi Program
WordCount Program
5. Using the Spark DataFrame API
Additional DataFrame API Examples
Specify Schema Programmatically
6. Accessing ORC Files from Spark
Accessing ORC in Spark
Reading and Writing with ORC
Column Pruning
Predicate Push-down
Partition Pruning
DataFrame Support
Additional Resources
7. Using Spark SQL
Accessing Spark SQL through the Spark Shell
Accessing Spark SQL through JDBC and ODBC
Accessing Spark SQL through JDBC
Accessing Spark SQL through ODBC
Forming JDBC Connection Strings for Spark SQL
Calling Hive User-Defined Functions
Using Custom UDFs
8. Using Spark Streaming
Using Spark Streaming with Kafka on a Kerberos-Enabled Cluster
Adding the spark-kafka-streaming jar File to a Deployed Build
Running Spark Streaming - Kafka Jobs on a Kerberos-Enabled Cluster
9. Adding Libraries to Spark
10. Using Spark with HDFS
Specifying Compression
Accessing HDFS from PySpark: Setting HADOOP_CONF_DIR
11. Tuning Spark
Provisioning Hardware
Checking Job Status
Checking Job History
Improving Software Performance
Configuring YARN Memory Allocation for Spark
Specifying codec Files
12. Appendix: Sample pom.xml File for Spark Streaming with Kafka