Spark Guide

Contents

1. Introduction
2. Prerequisites
3. Installing and Configuring Spark
    Installing Spark Over Ambari
    (Optional) Configuring Spark for Hive Access
    (Optional) Installing the Spark Thrift Server After Deploying Spark
    Configuring Dynamic Resource Allocation and Thrift Server Settings
        Customizing Cluster Dynamic Resource Allocation Settings (Ambari)
        Configuring Cluster Dynamic Resource Allocation Manually
        Configuring a Job for Dynamic Resource Allocation
        Dynamic Resource Allocation Properties
        Customizing the Spark Thrift Server Port
    (Optional) Configuring Spark for a Kerberos-Enabled Cluster
        Configuring the Spark Thrift Server on a Kerberos-Enabled Cluster
    (Optional) Configuring the Spark History Server
    Validating the Spark Installation
4. Developing Spark Applications
    Spark Pi Program
    WordCount Program
5. Using the Spark DataFrame API
    Additional DataFrame API Examples
    Specify Schema Programmatically
6. Accessing ORC Files from Spark
    Accessing ORC in Spark
    Reading and Writing with ORC
    Column Pruning
    Predicate Push-down
    Partition Pruning
    DataFrame Support
    Additional Resources
7. Using Spark SQL
    Accessing Spark SQL Through the Spark Shell
    Accessing Spark SQL Through JDBC
        Forming JDBC Connection Strings for Spark SQL
    Calling Hive User-Defined Functions
        Using Custom UDFs
8. Using Spark Streaming
9. Adding Libraries to Spark
10. Using Spark with HDFS
    Specifying Compression
    Accessing HDFS from PySpark: Setting HADOOP_CONF_DIR
11. Tuning and Troubleshooting Spark
    Hardware Provisioning
    Checking Job Status
    Checking Job History
    Configuring Spark JVM Memory Allocation
    Configuring YARN Memory Allocation for Spark
    Specifying Codec Files