Frequently Asked Questions about Apache Spark in CDH
This Frequently Asked Questions (FAQ) page covers general information about CDS Powered By Apache Spark, along with questions relevant to early adopters of the latest Spark 2 features.
What happens to Spark 1.6, or an older Spark 2 parcel, during an upgrade from CDH 5 to CDH 6?
On CDH 5, you were running Spark 1.6, or Spark 1.6 and Spark 2.x side by side (that is, if you installed the separate parcel for CDS Powered By Apache Spark). In CDH 6, Spark 2.x becomes the default. All the default binary names, such as pyspark and spark-submit, refer to the Spark 2 commands. The history server uses port 18088, the same port that Spark 1.6 used in CDH 5.
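As a quick sanity check after the upgrade, you can confirm that the default binaries now resolve to Spark 2 by printing the version from a small PySpark script. A minimal sketch (the application name is illustrative):

    # Run with the default spark-submit or pyspark, which now point at Spark 2.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("version-check").getOrCreate()
    print(spark.version)  # expect a 2.x version string after the CDH 6 upgrade
    spark.stop()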
If you formerly had multiple Spark services on the cluster because you ran Spark 1.6 and 2.x side by side, you have the same number of Spark services after the upgrade, each with its own history server and logs. Any new jobs that are submitted use the history server and log directory of the first Spark service (which was Spark 1.6 and is now 2.x).
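If a particular job should log to a different Spark service's history server than the default, you can override the event log directory at the application level using the standard spark.eventLog.* properties. A minimal sketch, assuming a hypothetical HDFS path for the other service's log directory:

    # The HDFS path below is hypothetical; substitute the log directory
    # configured for the Spark service whose history server you want to use.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("explicit-event-log-dir")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "hdfs:///user/spark/otherServiceHistory")  # hypothetical path
        .getOrCreate()
    )
    # ... run the job as usual ...
    spark.stop()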
If Spark Standalone was running on the CDH 5 cluster, you must uninstall it before upgrading to CDH 6. All other instructions and background information therefore assume that Spark Standalone is not present on the CDH 6 cluster.
You might also have to take corrective action during the upgrade if the Spark 1.6 and Spark 2.x gateway roles reside on the same hosts and they cannot be merged due to differing priorities. For a smoother upgrade experience, keep these gateway roles on separate hosts.
The at-rest encryption mechanism for Spark 1.6 in CDH 5 is different from that in Spark 2, which uses the commons-crypto library. Any old configuration settings for at-rest encryption are transparently recognized by Spark 2 in CDH 6, without any action on your part. You receive warnings in your logs about the old settings and can update the setting names when convenient.
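Upstream Spark 2 exposes its at-rest (I/O) encryption controls under the spark.io.encryption.* property names. A minimal sketch of enabling it for a single application with the current names:

    # Enable Spark 2 I/O (at-rest) encryption, backed by commons-crypto,
    # using the current property names rather than the Spark 1.6-era ones.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("io-encryption-example")
        .config("spark.io.encryption.enabled", "true")
        .config("spark.io.encryption.keySizeBits", "256")  # valid sizes: 128, 192, 256
        .getOrCreate()
    )
    spark.stop()

In a CDH deployment, these settings are typically managed cluster-wide through Cloudera Manager rather than set per job.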
Why doesn't feature or library XYZ work?
A number of features, components, libraries, and integration points from Spark 1.6 are not supported with CDS Powered By Apache Spark. See Apache Spark Known Issues for details.