Spark to CDP
How to migrate Apache Spark data to CDP.
CDP Spark Versions
- CDP Private Cloud Base ships with Spark 2.4 on YARN.
- CDP Public Cloud has two options:
  - The DataHub Data Engineering template includes Spark 2.4 on YARN. An experimental Spark 3 template is also provided.
  - CML provides Spark 2.4 on Kubernetes.
Spark Migration
- Migration primarily impacts workloads coming from CDH 5.x / 6.x or HDP 2.x because of Hive 3 changes, notably ACID tables.
- You can use HDFS and Hive replication to move Spark data from CDH and HDP to CDP.
- Existing external tables are not impacted -- existing applications can continue using the Spark API with external tables without code changes.
- Managed tables that are migrated become Hive 3 ACID tables, which breaks compatibility with existing Spark code that reads or writes them.
- You can use the Hive 3 metadata and table upgrade utilities to identify which Hive Metastore (HMS) tables will become ACID.
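As a rough first pass before running the upgrade utilities, the Spark catalog on the source cluster can be used to separate managed tables (candidates for ACID conversion) from external ones. The sketch below is an illustration only, not a substitute for the Hive 3 utilities; the helper name split_tables_by_type and the database name sales_db are hypothetical.

```python
from collections import namedtuple


def split_tables_by_type(tables):
    """Group catalog entries into managed and external table names.

    `tables` is any iterable of objects exposing `.name` and `.tableType`,
    such as the entries returned by spark.catalog.listTables(db).
    """
    managed, external = [], []
    for table in tables:
        table_type = (table.tableType or "").upper()
        if table_type == "MANAGED":
            managed.append(table.name)    # candidate for Hive 3 ACID conversion
        elif table_type == "EXTERNAL":
            external.append(table.name)   # expected to keep working unchanged
        # Views and temporary tables are ignored in this sketch.
    return managed, external


# Usage on the source cluster (needs a live SparkSession with Hive support;
# "sales_db" is a hypothetical database name):
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   managed, external = split_tables_by_type(spark.catalog.listTables("sales_db"))
```

The managed list only indicates where to look first; the upgrade utilities remain the authority on which tables are actually converted.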
Access to Hive 3 managed (ACID) tables requires the Hive Warehouse Connector (HWC).
- Supported applications: Spark Shell, PySpark, spark-submit.
- Include required HWC libraries and configurations.
- Read/Write operations use the HWC APIs.
- HiveServer2 and Apache Ranger provide fine-grained access control (FGAC).
- In CDP DC 7.0, read operations using HWC are limited.
- In CDP DC 7.1, read operations are improved for high volume.
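The HWC points above can be sketched in PySpark roughly as follows. This is a minimal illustration under stated assumptions: the pyspark_llap module, the HiveWarehouseSession builder, and the data source class name are taken from the HWC documentation as commonly published and should be checked against the HWC version shipped with your cluster. The helper function names and any table names passed to them are hypothetical.

```python
# Data source class name used by HWC for DataFrame writes (assumption: verify
# against the HWC release on your cluster).
HWC_FORMAT = "com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"


def build_hwc_session(spark):
    """Create an HWC session from an existing SparkSession.

    The import is deferred so this module loads even where the HWC Python
    package is not installed.
    """
    from pyspark_llap import HiveWarehouseSession
    return HiveWarehouseSession.session(spark).build()


def read_managed_table(hive, table):
    """Read a Hive 3 managed (ACID) table through HWC as a DataFrame."""
    return hive.executeQuery("SELECT * FROM " + table)


def append_to_managed_table(df, table):
    """Append a DataFrame to a Hive 3 managed table through HWC."""
    (df.write
       .format(HWC_FORMAT)
       .mode("append")
       .option("table", table)
       .save())
```

When launching with spark-submit or pyspark, the HWC assembly JAR is typically passed with --jars, the HWC Python zip with --py-files, and the HiveServer2 JDBC URL with --conf spark.sql.hive.hiveserver2.jdbc.url=...; exact paths and values depend on the installation.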