Spark to CDP

How to migrate Apache Spark data to CDP.

CDP Spark Versions

  • CDP Private Cloud Base ships with Spark 2.4 on YARN.
  • CDP Public Cloud has two options:
    • The DataHub Data Engineering template includes Spark 2.4 on YARN. An experimental Spark 3 template is also provided.
    • CML provides Spark 2.4 on Kubernetes.

Spark Migration

  • Migration primarily impacts clusters coming from CDH 5.x / 6.x or HDP 2.x, because of the Hive 3 changes, in particular ACID tables.
  • You can use HDFS and Hive replication to move Spark data from CDH and HDP to CDP.
    • Existing external tables are not impacted -- existing applications can continue using the Spark API with external tables without code changes.
    • Migrated managed tables become Hive 3 ACID tables, which breaks Spark compatibility.
    • You can use the Hive 3 metadata and table upgrade utilities to determine which HMS tables will become ACID.
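
The rule in the list above (external tables keep working through the Spark API, managed tables become ACID) can be sketched as a small planning helper. This is only an illustration under stated assumptions: the `TableInfo` type, the `access_path` and `plan_access` functions, and the sample table names are hypothetical, and the inventory is hand-written here rather than produced by the Hive upgrade utilities. It also applies the simplified rule from this page, that every migrated managed table is treated as ACID.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical record describing one table in the source metastore."""
    name: str
    table_type: str  # HMS table type, e.g. "MANAGED_TABLE" or "EXTERNAL_TABLE"


def access_path(table: TableInfo) -> str:
    """Return how Spark would reach the table after migration.

    Simplified rule taken from the text above:
      - managed tables become Hive 3 ACID tables -> "hwc"
      - external tables are not impacted         -> "spark"
    """
    if table.table_type == "MANAGED_TABLE":
        return "hwc"
    if table.table_type == "EXTERNAL_TABLE":
        return "spark"
    # Views and other table types are outside the scope of this sketch.
    raise ValueError(f"table type not covered by this sketch: {table.table_type}")


def plan_access(tables: Iterable[TableInfo]) -> Dict[str, str]:
    """Map each table name to its post-migration access path."""
    return {t.name: access_path(t) for t in tables}


# Hypothetical inventory; in practice this would come from the upgrade utilities.
inventory = [
    TableInfo("sales.orders", "MANAGED_TABLE"),
    TableInfo("sales.raw_events", "EXTERNAL_TABLE"),
]
plan = plan_access(inventory)
```

A plan like this gives a quick list of the applications that need code changes: only those reading or writing tables marked `hwc`.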

Access to Hive 3 managed (ACID) tables requires the Hive Warehouse Connector (HWC).
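
As a rough sketch of what this looks like in a PySpark application, the two helpers below read a managed table through HWC and an external table through the regular Spark API. The function names and table names are hypothetical. The sketch assumes the job is launched with the HWC jar and Python package on its path, and that HWC is already configured for the cluster; the exact launch options and configuration depend on the CDP release and are not shown here.

```python
def read_managed_table(spark, table_name):
    """Read a Hive 3 managed (ACID) table through the Hive Warehouse Connector.

    `spark` is an existing SparkSession. The import is done lazily because
    pyspark_llap is only available when the job is started with HWC attached.
    """
    from pyspark_llap import HiveWarehouseSession

    # Build an HWC session on top of the existing SparkSession.
    hive = HiveWarehouseSession.session(spark).build()
    # executeQuery returns the query result as a Spark DataFrame.
    return hive.executeQuery(f"SELECT * FROM {table_name}")


def read_external_table(spark, table_name):
    """Read an external table with the regular Spark API, unchanged by migration."""
    return spark.table(table_name)
```

An application would call `read_managed_table(spark, "sales.orders")` for a migrated managed table and keep `read_external_table(spark, "sales.raw_events")` (or its existing code) for external tables. Both table names are placeholders.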