Apache Spark Component Guide

Contents

1. Analyzing Data with Apache Spark
2. Installing Spark
    Installing Spark Using Ambari
    Installing Spark Manually
    Verifying Spark Configuration for Hive Access
    Installing the Spark Thrift Server After Deploying Spark
    Validating the Spark Installation
3. Configuring Spark
    Configuring the Spark SQL Hive Warehouse Directory
    Configuring the Spark Thrift Server
        Enabling Spark SQL User Impersonation for the Spark Thrift Server
        Customizing the Spark Thrift Server Port
    Configuring the Livy Server
        Configuring SSL for the Livy Server
        Configuring High Availability for the Livy Server
    Configuring the Spark History Server
    Configuring Dynamic Resource Allocation
        Customizing Dynamic Resource Allocation Settings on an Ambari-Managed Cluster
        Configuring Cluster Dynamic Resource Allocation Manually
        Configuring a Job for Dynamic Resource Allocation
        Dynamic Resource Allocation Properties
    Configuring Spark for Wire Encryption
        Configuring Spark 1 for Wire Encryption
        Configuring Spark2 for Wire Encryption
    Configuring Spark for a Kerberos-Enabled Cluster
        Configuring the Spark History Server
        Configuring the Spark Thrift Server
        Setting Up Access for Submitting Jobs
4. Running Spark
    Specifying Which Version of Spark to Run
    Running Sample Spark 1.x Applications
        Spark Pi
        WordCount
    Running Sample Spark 2.x Applications
        Spark Pi
        WordCount
5. Submitting Spark Applications Through Livy
    Using Livy with Spark Versions 1 and 2
    Using Livy with Interactive Notebooks
    Using the Livy API to Run Spark Jobs: Overview
    Running an Interactive Session With the Livy API
        Livy Objects for Interactive Sessions
        Setting Path Variables for Python
        Livy API Reference for Interactive Sessions
    Submitting Batch Applications Using the Livy API
        Livy Batch Object
        Livy API Reference for Batch Jobs
6. Running PySpark in a Virtual Environment
7. Automating Spark Jobs with Oozie Spark Action
    Configuring Oozie Spark Action for Spark 1
    Configuring Oozie Spark Action for Spark 2
8. Developing Spark Applications
    Using the Spark DataFrame API
    Using Spark SQL
        Accessing Spark SQL through the Spark Shell
        Accessing Spark SQL through JDBC or ODBC: Prerequisites
        Accessing Spark SQL through JDBC
        Accessing Spark SQL through ODBC
        Spark SQL User Impersonation
    Calling Hive User-Defined Functions
        Using Built-in UDFs
        Using Custom UDFs
    Using Spark Streaming
        Prerequisites
        Building and Running a Secure Spark Streaming Job
        Running Spark Streaming Jobs on a Kerberos-Enabled Cluster
        Sample pom.xml File for Spark Streaming with Kafka
    HBase Data on Spark with Connectors
        Selecting a Connector
        Using the Connector with Apache Phoenix
    Accessing HDFS Files from Spark
        Specifying Compression
        Accessing HDFS from PySpark
    Accessing ORC Data in Hive Tables
        Accessing ORC Files from Spark
        Enabling Predicate Push-Down Optimization
        Loading ORC Data into DataFrames by Using Predicate Push-Down
        Optimizing Queries Through Partition Pruning
        Enabling Vectorized Query Execution
        Reading Hive ORC Tables
        Additional Resources
    Using Custom Libraries with Spark
9. Using Spark from R: SparkR
    Prerequisites
    SparkR Example
10. Tuning Spark
    Provisioning Hardware
    Checking Job Status
    Checking Job History
    Improving Software Performance
        Configuring YARN Memory Allocation for Spark