Migrating Workloads to CDP One
Overview
Migrating Spark workloads to CDP
Spark 1.6 to Spark 2.4 Refactoring
Handling prerequisites
Spark 1.6 to Spark 2.4 changes
New Spark entry point SparkSession
DataFrame API registerTempTable deprecated
union replaces unionAll
Empty schema not supported
Referencing a corrupt JSON/CSV record
Dataset and DataFrame API explode deprecated
CSV header and schema match
Table properties support
CREATE OR REPLACE VIEW and ALTER VIEW not supported
Managed table location
Write to Hive bucketed tables
Rounding in arithmetic operations
Precedence of set operations
HAVING without GROUP BY
CSV bad record handling
Spark 2.4 CSV example
Configuring storage locations
Querying Hive managed tables from Spark
Compiling and running Spark workloads
Compiling and running a Java-based job
Compiling and running a Scala-based job
Running a Python-based job
Running a job interactively
Post-migration tasks
Spark 2.3 to Spark 2.4 Refactoring
Handling prerequisites
Spark 2.3 to Spark 2.4 changes
Empty schema not supported
CSV header and schema match
Table properties support
Managed table location
Precedence of set operations
HAVING without GROUP BY
CSV bad record handling
Spark 2.4 CSV example
Configuring storage locations
Querying Hive managed tables from Spark
Compiling and running Spark workloads
Post-migration tasks
Migrating Hive and Impala workloads to CDP One
Handling prerequisites
Hive 1 and 2 to Hive 3 changes
Reserved keywords
Spark-client JAR requires prefix
Hive warehouse directory
Replace Hive CLI with Beeline
PARTIALSCAN
Concatenation of an external table
INSERT OVERWRITE
Managed to external table
Property changes affecting ordered or sorted subqueries and views
Runtime configuration changes
Prepare Hive tables for migration
Impala changes from CDH to CDP
Impala configuration differences in CDH and CDP
Additional documentation