Apache Spark Component Guide
1. Analyzing Data with Apache Spark
2. Installing Spark
    Installing Spark Using Ambari
    Verify Spark Configuration for Hive Access
    Installing the Spark Thrift Server After Deploying Spark
    Validating the Spark Installation
3. Configuring Spark
    Customizing the Spark Thrift Server Port
    Configuring the Livy Server
    Configuring the Spark History Server
    Configuring Dynamic Resource Allocation
        Customizing Dynamic Resource Allocation Settings on an Ambari-Managed Cluster
        Configuring Cluster Dynamic Resource Allocation Manually
        Configuring a Job for Dynamic Resource Allocation
        Dynamic Resource Allocation Properties
    Configuring Spark for Wire Encryption
    Configuring Spark for a Kerberos-Enabled Cluster
        Configuring the Spark History Server
        Configuring the Spark Thrift Server
        Setting Up Access for Submitting Jobs
4. Developing and Submitting Spark Applications
    Running Spark Applications
        Spark Pi
    Specifying Which Version of Spark to Use
    Using the Spark DataFrame API
    Adding Libraries to Spark
    Using Spark SQL
        Accessing Spark SQL Through the Spark Shell
        Accessing Spark SQL Through JDBC or ODBC
        Forming JDBC Connection Strings for Spark SQL
        Calling Hive User-Defined Functions
    Using Spark Streaming
        Building and Running a Secure Spark Streaming Job
        Running Spark Streaming Jobs on a Kerberos-Enabled Cluster
        Sample pom.xml File for Spark Streaming with Kafka
    Spark on HBase: Using the HBase Connector
    Accessing ORC Data in Hive Tables
        Accessing ORC Files from Spark
        Enabling Predicate Push-Down Optimization
        Loading ORC Data into DataFrames by Using Predicate Push-Down
        Optimizing Queries Through Partition Pruning
        Additional Resources
    Accessing HDFS Files from Spark
        Specifying Compression
        Accessing HDFS from PySpark
5. Using Spark from R: SparkR
    SparkR Example
    Additional Resources
6. Tuning Spark
    Provisioning Hardware
    Checking Job Status
    Checking Job History
    Improving Software Performance
        Configuring YARN Memory Allocation for Spark
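The Spark Pi entry in chapter 4 refers to the classic sample application that estimates π by Monte Carlo sampling: scatter random points in the unit square and count the fraction that land inside the quarter circle. As a minimal local sketch of that idea (plain Python, no Spark cluster required; the function name `estimate_pi` is ours, not part of the Spark distribution):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling points in the unit square.

    The fraction of points with x^2 + y^2 <= 1 approximates
    the quarter-circle area pi/4, so multiplying by 4 gives pi.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible estimate
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

In the actual Spark Pi application the sampling loop is distributed across executors (each partition counts its own hits, and the counts are summed with a reduce), which is what makes it a useful smoke test after installation.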