Troubleshooting Issues with Workloads

This section describes some potential issues data scientists might encounter once the application is running workloads.

404 error in Workbench after starting an engine

This error typically occurs because a wildcard DNS subdomain was not set up before installation. While the application will largely work, engine consoles are served on subdomains and will not be routed correctly unless a wildcard DNS entry pointing to the master host is configured. Note that newly created DNS entries can take 30-60 minutes to propagate. For instructions, see Set Up a Wildcard DNS Subdomain.
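To confirm that the wildcard entry is in place, resolve an arbitrary subdomain of the Cloudera Data Science Workbench domain and check that it returns the IP address of the master host. In the following sketch, cdsw.example.com is a placeholder for your own domain (use nslookup if dig is not installed):

dig test.cdsw.example.com
dig cdsw.example.com

Both commands should return the same IP address, that of the master host.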


Engines cannot be scheduled due to lack of CPU or memory

A symptom of this is the following error message in the Workbench: "Unschedulable: No node in the cluster currently has enough CPU or memory to run the engine."

Either shut down some running sessions or jobs, or provision more hosts for Cloudera Data Science Workbench.
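To see how much CPU and memory each host currently has committed before deciding, you can run the following as root on the master host (a sketch; the exact kubectl output varies by Kubernetes version):

# Summarize cluster and per-node status
cdsw status

# Or inspect per-node CPU and memory commitments directly
kubectl describe nodes | grep -A 5 'Allocated resources'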

Workbench prompt flashes red and does not take input

The Workbench prompt flashing red indicates that the session is not currently ready to take input.

Cloudera Data Science Workbench does not currently support non-REPL interaction. One workaround is to skip the prompt by passing the required values as command-line arguments, as shown below. Otherwise, consider using the terminal to answer interactive prompts.
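For example, prefer invocations that supply their answers up front instead of stopping to prompt. pip is shown here as a common case; the same idea applies to your own scripts, which can read command-line arguments rather than calling input():

# Prompts "Proceed (y/n)?" and leaves the Workbench prompt waiting
pip uninstall pandas

# Supplies the answer as a command-line argument instead
pip uninstall -y pandas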

PySpark jobs fail due to HDFS permission errors

PySpark jobs fail with an error similar to the following:

org.apache.hadoop.security.AccessControlException: Permission denied: user=alice, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

(Required for CDH 5 and CDH 6) To be able to use Spark 2, each user must have their own home directory (/user/<username>) in HDFS. If you sign in to Hue first, these directories are created for you automatically. Alternatively, a cluster administrator (or any user with HDFS superuser privileges) can create them with the following commands:

hdfs dfs -mkdir /user/<username>
hdfs dfs -chown <username>:<username> /user/<username>
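You can then verify the ownership and permissions of the new directory:

hdfs dfs -ls /user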

PySpark jobs fail due to Python version mismatch

PySpark jobs fail with an error similar to the following:

Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions

One solution is to install the matching Python 2.7 version on all the cluster hosts. A better solution is to install the Anaconda parcel on all CDH cluster hosts: Cloudera Data Science Workbench Python engines use the version of Python included in the parcel, which ensures that the Python versions in the driver and the workers always match. Library paths in workloads sent from drivers to workers also match, because Anaconda is installed in the same location on every host. Once the parcel has been installed, set the PYSPARK_PYTHON environment variable in the Cloudera Data Science Workbench Admin dashboard, or use Cloudera Manager to set the path.
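For example, with the Anaconda parcel installed under the default parcel directory, the variable would point to the parcel's Python binary. The path below is an assumption; it depends on your parcel directory, parcel version, and whether you need Python 2 or 3:

PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python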

Jobs fail due to incorrect JAVA_HOME on HDP

Commands (such as hdfs commands) and jobs fail with an error similar to the following:
ERROR: JAVA_HOME /usr/lib/jvm/java does not exist.
The JAVA_HOME path you configure for Cloudera Data Science Workbench in cdsw.conf must match the JAVA_HOME configured in hadoop-env.sh for the HDP cluster. After you update JAVA_HOME in cdsw.conf, restart Cloudera Data Science Workbench. For more information, see Changes to cdsw.conf.
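For example, the following sketch mirrors the cluster's value into cdsw.conf. The paths shown are typical defaults and may differ in your environment:

# Find the JAVA_HOME the HDP cluster is using
grep JAVA_HOME /etc/hadoop/conf/hadoop-env.sh

# Set the same value in cdsw.conf on the Cloudera Data Science Workbench hosts
# (for example: JAVA_HOME="/usr/jdk64/jdk1.8.0_112")
vi /etc/cdsw/config/cdsw.conf

# Restart for the change to take effect
cdsw restart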

The user_events table is growing in size and affecting performance

The user_events table is used to monitor and audit user events. In long-running deployments it can grow large and degrade performance. To clean up the table manually:
  1. SSH to the Cloudera Data Science Workbench Master host and log in as root.

    ssh root@<cdsw_master_host_domain_name>
  2. Get the name of the database pod:
    kubectl get pods -l role=db
    The command returns information similar to the following example:
    NAME                  READY   STATUS    RESTARTS   AGE
    db-6d56584f76-phn2f   1/1     Running   0          4h46m
  3. Enter the following command to log in to the database as the sense user:
    kubectl exec -it <database pod> -- psql -U sense
  4. Run the following query to get the number of rows from the user_events table:
    select count(id) from user_events;
  5. Delete the records older than 30 days by running the following query:
    delete from user_events where created_at < NOW() - INTERVAL '30 days';
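Deleting rows marks the space as reusable inside PostgreSQL, but it does not shrink the user_events table on disk. If disk usage is also a concern, you can optionally rewrite the table afterwards with VACUUM FULL (standard PostgreSQL maintenance, not specific to Cloudera Data Science Workbench); note that it takes an exclusive lock on the table while it runs:

    vacuum full user_events;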