Troubleshooting for Spark

Troubleshooting for Spark mainly involves checking configuration settings and application code to diagnose performance and scalability issues.

Wrong version of Python

Spark 2 requires Python 2.7 or higher. You might need to install a new version of Python on all hosts in the cluster, because some Linux distributions come with Python 2.6 by default. If the right level of Python is not picked up by default, set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to point to the correct Python executable before running the pyspark command.
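For example, on a host where a suitable interpreter is installed as /usr/bin/python2.7 (the path is an assumption; substitute the location on your own hosts), you might set the variables like this before launching pyspark:

```shell
# Assumed interpreter path; adjust for where Python 2.7+ lives on your hosts.
export PYSPARK_PYTHON=/usr/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7
```

Setting both variables keeps the driver and the executors on the same Python version, which avoids mismatched-version errors at startup.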

API changes that are not backward-compatible

Between Spark 1.6 (part of CDH 5) and Spark 2.x (part of CDH 6), some APIs have changed in ways that are not backward compatible. Recompile all CDH 5 Spark applications under CDH 6 to take advantage of Spark 2 capabilities. For any compilation errors, check if the corresponding function has changed in Spark 2, and if so, change your code to use the latest function name, parameters, and return type.
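As a rough first pass before recompiling, you can scan your source tree for Spark 1.6 APIs that were renamed or replaced in Spark 2, such as SQLContext and HiveContext (superseded by SparkSession), unionAll (renamed union), and registerTempTable (renamed createOrReplaceTempView). The directory layout and sample file below are illustrative only:

```shell
# Illustrative setup: a source file that still uses a Spark 1.x API.
mkdir -p src && printf 'val sqlContext = new SQLContext(sc)\n' > src/Old.scala

# List call sites that commonly need updating for Spark 2.
grep -rnE 'SQLContext|HiveContext|unionAll|registerTempTable' src/
```

Each match is a candidate call site to review against the Spark 2 API documentation; the pattern is a starting point, not an exhaustive list of changed APIs.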

A Spark component does not work or is unstable

Certain components from the Spark ecosystem are explicitly not supported with the Spark 2 included in CDH 6. Check the Spark compatibility matrix to make sure all of the components you use are intended to work with Spark in CDH 6.

Errors during pyspark startup

First-time Spark users, especially on small or newly installed clusters, might encounter intimidating errors during pyspark startup. The following are some errors you might see (typically followed by a lengthy Java stack trace), along with simple workarounds that you can perform, even as a non-administrator, to get at least as far as a pyspark command prompt.

ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (1024+384 MB) is
above the max threshold (1024 MB) of this cluster! Please check the values of
'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:319)

The preceding error might occur on a cluster that uses undersized virtual machines. If your goal is just to see pyspark running and it does not make sense to fine-tune memory settings for a non-production demonstration cluster, you can specify a lower memory limit by running pyspark with the --executor-memory option. For example:

pyspark --executor-memory=600M
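If you administer the cluster, the alternative is to raise the YARN container limits named in the error message so that containers of the requested size fit. A sketch of the relevant yarn-site.xml properties follows; the values are illustrative only and should be sized for your hosts:

```xml
<!-- yarn-site.xml: illustrative values only; size for your hardware -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value>
</property>
```

After changing these properties, restart the affected YARN services for the new limits to take effect.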

Another kind of error might occur on startup, indicating a permission problem with an HDFS directory under /user:

ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: Permission denied:
user=user_id, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:400)

To run pyspark, you must be logged in as a user that has a corresponding HDFS home directory, such as /user/user_id. If you are running as root or another user without HDFS privileges, you might not be able to create that directory in HDFS. If so, switch to one of the existing users that does have an HDFS home directory.

The following example shows how both root and a generic test user fail to run pyspark because they lack an HDFS home directory. After switching to a user that does have one, pyspark starts successfully and reaches the command prompt with no errors.

[root@myhost ~]# hdfs dfs -mkdir /user/root
mkdir: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

[root@myhost ~]# sudo su testuser
[testuser@myhost root]$ hdfs dfs -mkdir /user/testuser
mkdir: Permission denied: user=testuser, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

[testuser@myhost root]$ hdfs dfs -ls /user
Found 7 items
drwxrwxrwx   - mapred hadoop              0 2018-03-09 15:19 /user/history
drwxrwxr-t   - hive   hive                0 2018-03-09 15:19 /user/hive
drwxrwxr-x   - hue    hue                 0 2018-03-09 15:25 /user/hue
drwxrwxr-x   - impala impala              0 2018-03-09 15:17 /user/impala
drwxrwxr-x   - oozie  oozie               0 2018-03-09 15:18 /user/oozie
drwxr-x--x   - spark  spark               0 2018-03-09 15:18 /user/spark
drwxr-xr-x   - hdfs   supergroup          0 2018-03-09 15:18 /user/yarn

[testuser@myhost root]$ su impala
[impala@myhost root]$ pyspark --executor-memory=500M
Python 2.7.5 (default, Nov  6 2016, 00:28:07)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
    ____              __
   / __/__  ___ _____/ /__
  _\ \/ _ \/ _ `/ __/  '_/
 /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-cdh6.x-SNAPSHOT
    /_/

Using Python version 2.7.5 (default, Nov  6 2016 00:28:07)
SparkSession available as 'spark'.
>>>