Developing Apache Spark Applications
Introduction
Spark application model
Spark execution model
Developing and running an Apache Spark WordCount application
Using the Spark DataFrame API
Building Spark applications
Best practices for building Apache Spark applications
Building reusable modules in Apache Spark applications
Packaging different versions of libraries with an Apache Spark application
Using Spark SQL
SQLContext and HiveContext
Querying files into a DataFrame
Spark SQL example
Interacting with Hive views
Performance and storage considerations for Spark SQL DROP TABLE PURGE
TIMESTAMP compatibility for Parquet files
Accessing Spark SQL through the Spark shell
Calling Hive user-defined functions (UDFs)
Using Spark Streaming
Spark Streaming and dynamic allocation
Spark Streaming example
Enabling fault-tolerant processing in Spark Streaming
Configuring authentication for long-running Spark Streaming jobs
Building and running a Spark Streaming application
Sample pom.xml file for Spark Streaming with Kafka
Accessing external storage from Spark
Accessing data stored in Amazon S3 through Spark
Examples of accessing Amazon S3 data from Spark
Accessing Hive from Spark
Accessing HDFS files from Spark
Accessing ORC data in Hive tables
Accessing ORC files from Spark
Predicate push-down optimization
Loading ORC data into DataFrames using predicate push-down
Optimizing queries using partition pruning
Enabling vectorized query execution
Reading Hive ORC tables
Accessing Avro data files from Spark SQL applications
Accessing Parquet files from Spark SQL applications
Using Spark MLlib
Running a Spark MLlib example
Enabling native acceleration for MLlib
Using custom libraries with Spark
Using Apache Iceberg with Spark