Troubleshooting CDS Powered by Apache Spark

Troubleshooting for CDS Powered by Apache Spark mainly involves checking configuration settings and application code to diagnose performance and scalability issues.

Commercial support for GA version

Cloudera customers with commercial support can now use their normal support channels for CDS Powered by Apache Spark.

Error instantiating Hive metastore class

A Hive compatibility issue in CDS 2.0 Release 1 affects CDH 5.10.1 and higher, CDH 5.9.2 and higher, CDH 5.8.5 and higher, and CDH 5.7.6 and higher. If you are using one of these CDH versions, you must upgrade to the CDS 2.0 Release 2 or higher parcel to avoid Spark 2 job failures when using Hive functionality.

When you encounter a problem due to the Hive compatibility issue, the error stack starts like this:

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1545)
  at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.

The solution is to upgrade to CDS 2.0 Release 2 or higher.
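
To confirm whether a cluster is affected, any Spark 2 job that exercises Hive functionality reproduces the failure. The following is a minimal PySpark sketch (the application name is arbitrary) that you can paste into pyspark2 or submit with spark2-submit:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("HiveSmokeTest")   # hypothetical name; any name works
    .enableHiveSupport()        # instantiates the Hive metastore client
    .getOrCreate())

# Any metastore access is enough to trigger the instantiation failure
# on an affected CDS 2.0 Release 1 / CDH combination.
spark.sql("SHOW DATABASES").show()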

Wrong version of Python

When you use CDS Powered by Apache Spark with Python 2.x, you must use Python 2.7 or higher. You might need to install a new version of Python on all hosts in the cluster, because some Linux distributions come with Python 2.6 by default. If the correct version of Python is not picked up by default, set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to point to the correct Python executable before running the pyspark2 command.
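
If you are not sure which interpreter PySpark is picking up, a quick check like the following (a minimal sketch) can be run through pyspark2, or directly with the interpreter that PYSPARK_PYTHON points to:

import sys

# Report the interpreter in use and stop early if it is older than 2.7.
print("Using %s (version %s)" % (sys.executable, sys.version.split()[0]))
if sys.version_info < (2, 7):
    raise SystemExit("Python 2.7 or higher is required for CDS")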

API changes that are not backward-compatible

Between Spark 1.6 and Spark 2.0, some APIs changed in ways that are not backward compatible. Recompile all applications to take advantage of Spark 2 capabilities. If you encounter compilation errors, check whether the corresponding function changed in Spark 2, and if so, update your code to use the latest function name, parameters, and return type.
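
For example, the SQLContext and HiveContext entry points from Spark 1.6 are consolidated into SparkSession in Spark 2, and some DataFrame methods were renamed. The following sketch shows the Spark 2 style, with the Spark 1.6 equivalents noted in comments (the application name is arbitrary):

from pyspark.sql import SparkSession

# Spark 2 entry point; replaces SQLContext(sc) and HiveContext(sc) from 1.6.
spark = SparkSession.builder.appName("MigrationExample").getOrCreate()

df = spark.range(10)

# registerTempTable() from Spark 1.6 is deprecated in favor of:
df.createOrReplaceTempView("numbers")
spark.sql("SELECT COUNT(*) AS n FROM numbers").show()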

A Spark component does not work or is unstable

Certain components from the Spark ecosystem are explicitly not supported with CDS Powered by Apache Spark. Check the compatibility matrix for Spark to make sure all the components you are using are intended to work with CDS Powered by Apache Spark and CDH.