Troubleshooting Cloudera Data Science Workbench

Use one or more of the following courses of action to start debugging issues with Cloudera Data Science Workbench.

Check the status of the application.
```
cdsw status
```
SSH to your master host and run the following node validation command to check that the key services are running:
```
cdsw validate
```
Make sure your Cloudera Data Science Workbench configuration is correct.

CSD Deployments

Log in to Cloudera Manager and review the configuration for the CDSW service.

RPM Deployments
```
cat /etc/cdsw/config/cdsw.conf
```

The following sections describe solutions to potential problems and error messages you may encounter while installing, configuring or using Cloudera Data Science Workbench.

Understanding Installation Warnings
Failed to run Kernel memory slabs check
Error Encountered Trying to Load Images when Initializing Cloudera Data Science Workbench
404 Not Found Error
Kerberos Issues
TLS/SSL Issues
Issues with Workloads
Troubleshooting Issues with Models and Experiments

Understanding Installation Warnings

This section describes solutions to some warnings you might encounter during the installation process.

Preexisting iptables rules not supported
Remove the entry corresponding to /dev/xvdc from /etc/fstab
Linux sysctl kernel configuration errors
CDH parcels not found at /opt/cloudera/parcels

Preexisting iptables rules not supported

WARNING: Cloudera Data Science Workbench requires iptables, but does not support preexisting iptables rules.

Kubernetes makes extensive use of iptables. However, it’s hard to know how pre-existing iptables rules will interact with the rules inserted by Kubernetes. Therefore, Cloudera recommends you run the following commands to clear all pre-existing rules before you proceed with the installation.

sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -F
sudo iptables -X
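
Flushing rules cannot be undone, so it may be worth recording the existing rules first. The following is a minimal sketch layered on top of the recommendation above, not part of the Cloudera procedure. It relies on the standard `iptables-save` utility; the function names and the default backup path are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the backup location are illustrative assumptions.

# Count the rules in an iptables-save dump read from stdin.
# In that format, every appended rule is a line that starts with "-A".
count_saved_rules() {
  grep -c '^-A' || true   # grep exits non-zero when the count is 0
}

# Save the current rules to a file so they can be reviewed later
# (or restored with iptables-restore). Needs sudo; never called automatically.
backup_iptables_rules() {
  local backup_file="${1:-/tmp/iptables-backup.rules}"
  sudo iptables-save > "$backup_file"
  echo "Saved $(count_saved_rules < "$backup_file") rule(s) to $backup_file"
}
```

Running `backup_iptables_rules` before the flush commands leaves a copy of the old rules on disk; a count of 0 suggests there was nothing pre-existing to clear.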

Remove the entry corresponding to /dev/xvdc from /etc/fstab

Cloudera Data Science Workbench installs a custom filesystem on its Application and Docker block devices. These filesystems will be used to store user project files and Docker engine images respectively. Therefore, Cloudera Data Science Workbench requires complete access to the block devices. To avoid losing any existing data, make sure the block devices allocated to Cloudera Data Science Workbench are reserved only for the workbench.

Linux sysctl kernel configuration errors

Kubernetes and Docker require non-standard kernel configuration. Depending on the existing state of your kernel, this might result in sysctl errors such as:

sysctl net.bridge.bridge-nf-call-iptables must be set to 1

This is because the settings in /etc/sysctl.conf conflict with the settings required by Cloudera Data Science Workbench. Cloudera cannot make a blanket recommendation on how to resolve such errors because they are specific to your deployment. Cluster administrators may choose to either remove or modify the conflicting value directly in /etc/sysctl.conf, remove the value from the conflicting configuration file, or even delete the module that is causing the conflict.

To start diagnosing the issue, run the following command to see the list of configuration files that are overwriting values in /etc/sysctl.conf.

SYSTEMD_LOG_LEVEL=debug /usr/lib/systemd/systemd-sysctl

You will see output similar to:

Parsing /usr/lib/sysctl.d/00-system.conf
Parsing /usr/lib/sysctl.d/50-default.conf
Parsing /etc/sysctl.d/99-sysctl.conf
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.d/99-sysctl.conf'.
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.d/99-sysctl.conf'.
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.d/99-sysctl.conf'.
Parsing /etc/sysctl.d/k8s.conf
Overwriting earlier assignment of net/bridge/bridge-nf-call-iptables in file '/etc/sysctl.d/k8s.conf'.
Parsing /etc/sysctl.conf
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.conf'.
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.conf'.
Setting 'net/ipv4/conf/all/promote_secondaries' to '1'
Setting 'net/ipv4/conf/default/promote_secondaries' to '1'
...

/etc/sysctl.d/k8s.conf is the configuration added by Cloudera Data Science Workbench. Administrators must make sure that no other file is overwriting values set by /etc/sysctl.d/k8s.conf.
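
To narrow down which files are candidates for that conflict, the debug output can be filtered for assignments that are overwritten after `/etc/sysctl.d/k8s.conf` has been parsed. This is a rough sketch that assumes the output format shown above; `list_overrides_after_k8s` is a hypothetical helper name, and the result still needs to be compared by hand against the keys that k8s.conf actually sets.

```shell
#!/usr/bin/env bash
# Sketch only: assumes debug output shaped like the sample above.

# Read systemd-sysctl debug output on stdin and print "key file" pairs for
# every assignment that a file parsed after k8s.conf overwrites.
list_overrides_after_k8s() {
  awk -v q="'" -v k8s="/etc/sysctl.d/k8s.conf" '
    # Track whether k8s.conf has been parsed yet.
    $1 == "Parsing" { after_k8s = after_k8s || ($2 == k8s); next }
    after_k8s && $1 == "Overwriting" {
      key = $5
      file = $NF
      # Strip the surrounding quotes and the trailing period from the file name.
      gsub("^" q "|" q "\\.?$", "", file)
      # Skip k8s.conf itself and print each key/file pair only once.
      if (file != k8s && !seen[key SUBSEP file]++) print key, file
    }
  '
}
```

One possible invocation is `SYSTEMD_LOG_LEVEL=debug /usr/lib/systemd/systemd-sysctl 2>&1 | list_overrides_after_k8s`; the `2>&1` assumes the debug messages are written to standard error. For the sample output above, the helper would report `/etc/sysctl.conf` overwriting `net/bridge/bridge-nf-call-ip6tables`.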

CDH parcels not found at /opt/cloudera/parcels

There are two possible reasons for this warning:

If you are using a custom parcel directory, you can ignore the warning and proceed with the installation. Once Cloudera Data Science Workbench is running, set the path to the CDH parcel in the admin dashboard. See Configuring the Engine Environment.
This warning can be an indication that you have not added gateway roles to the Cloudera Data Science Workbench nodes. In this case, do not ignore the warning. Exit the installer and go to Cloudera Manager to add gateway roles to the cluster. See Configure Gateway Hosts Using Cloudera Manager.

Failed to run Kernel memory slabs check

Users might see the following error message in Cloudera Manager after upgrading to Cloudera Data Science Workbench 1.4.2.

Bad: Failed to run Kernel memory slabs check

This error is an indication that Cloudera Data Science Workbench hosts were not rebooted as part of the upgrade to version 1.4.2. The host reboot is required to fix a Red Hat kernel slab leak issue that was discovered in Cloudera Data Science Workbench 1.4.0. For more information, see: (Red Hat Only) Host Reboot Required for Upgrades from Cloudera Data Science Workbench 1.4.0.

To proceed, stop Cloudera Data Science Workbench and reboot all Cloudera Data Science Workbench hosts. As a precaution, you might want to consult your cluster/IT administrator before you start rebooting hosts. Once all hosts have rebooted, restart Cloudera Data Science Workbench.

If that does not fix the issue, contact Cloudera Support.

Error Encountered Trying to Load Images when Initializing Cloudera Data Science Workbench

Here are some sample error messages you might see when initializing Cloudera Data Science Workbench:

Error encountered while trying to load images.: 1

Unable to load images from [/etc/cdsw/images/cdsw_<version>.tar.gz].: 1

Error processing tar file(exit status 1): write /../..tar: no space left on device

These errors are an indication that the root volume is running out of space when trying to initialize Cloudera Data Science Workbench. During the initialization process, the Cloudera Data Science Workbench installer temporarily decompresses the engine image file located in /etc/cdsw/images to the /var/lib/docker/tmp/ directory.

If you have previously partitioned the root volume (which should be at least 100 GB), make sure you allocate at least 20 GB to /var/lib/docker/tmp so that the installer can proceed without running out of space.
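
To see how much room is actually available before running the installer, a small check along these lines may help. It is a sketch rather than part of the installer: `free_gb` and `has_free_gb` are hypothetical helper names, and the 20 GB threshold is simply the figure quoted above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# Print the free space, in whole GB, on the filesystem that holds the given path.
# "df -Pk" reports 1 KB blocks in a stable, one-line-per-filesystem format;
# the fourth column is the available space.
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed if the filesystem holding the path has at least min_gb free.
has_free_gb() {
  local path="$1" min_gb="$2"
  [ "$(free_gb "$path")" -ge "$min_gb" ]
}
```

For example, `has_free_gb /var/lib/docker/tmp 20 || echo "not enough space"`. The path must already exist, so point the check at the closest existing parent directory if /var/lib/docker/tmp has not been created yet.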

404 Not Found Error

The 404 Not Found error might appear in the browser when you try to reach the Cloudera Data Science Workbench web application.

This error indicates that your installation of Cloudera Data Science Workbench was successful, but the domain configured in cdsw.conf does not match the domain used in the browser. To fix the error, open /etc/cdsw/config/cdsw.conf and check that the value you supplied for the DOMAIN property matches the one you are using to reach the web application. This is the wildcard domain dedicated to Cloudera Data Science Workbench, not the hostname of the master node.

If you need to change cdsw.conf, save the file and then run cdsw reset followed by cdsw init.
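
The comparison described above can be scripted. The following sketch assumes that cdsw.conf stores the setting as a `DOMAIN="..."` line, in the same style as the TLS_CERT and TLS_KEY entries shown elsewhere in this guide; `get_cdsw_domain` and `domain_matches` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# Print the value of DOMAIN from a cdsw.conf-style file (KEY="value" lines).
# If DOMAIN appears more than once, the last assignment wins.
get_cdsw_domain() {
  local conf="${1:-/etc/cdsw/config/cdsw.conf}"
  sed -n 's/^DOMAIN=//p' "$conf" | tr -d "\"'" | tail -n 1
}

# Succeed if the host typed into the browser equals the configured domain.
domain_matches() {
  local host="$1" conf="${2:-/etc/cdsw/config/cdsw.conf}"
  [ "$host" = "$(get_cdsw_domain "$conf")" ]
}
```

For example, `domain_matches cdsw.company.com || echo "DOMAIN mismatch"` would flag the situation that produces the 404 page.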

Troubleshooting Kerberos Errors

HDFS commands fail with Kerberos errors even though Kerberos authentication is successful in the web application

If Kerberos authentication is successful in the web application, and the output of klist in the engine reveals a valid-looking TGT, but commands such as hdfs dfs -ls / still fail with a Kerberos error, it is possible that your cluster is missing the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File. The JCE policy file is required when Red Hat uses AES-256 encryption. This library should be installed on each cluster host and will live under $JAVA_HOME. For more information, see Using AES-256 Encryption.

Cannot find renewable Kerberos TGT

Cloudera Data Science Workbench runs its own Kerberos TGT renewer, which produces non-renewable TGTs. This confuses Hadoop's renewer, which looks for renewable TGTs. If the Spark 2 logging level is set to WARN or lower, you may see exceptions such as:

16/12/24 16:38:40 WARN security.UserGroupInformation: Exception encountered while running the renewal command. Aborting renew thread. ExitCodeException exitCode=1: kinit: Resource temporarily unavailable while renewing credentials

16/12/24 16:41:23 WARN security.UserGroupInformation: PriviledgedActionException as:user@CLOUDERA.LOCAL (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

This is not a bug. Spark 2 workloads will not be affected by this. Access to Kerberized resources should also work as expected.

Troubleshooting TLS/SSL Errors

This section describes some common issues with TLS configuration on Cloudera Data Science Workbench. Common errors include:

Cloudera Data Science Workbench initialization fails with an error such as:
```
Error preparing server: tls: failed to parse private key 
```
Your browser reports that the Cloudera Data Science Workbench web application is not secure even though you have enabled TLS settings as per Enabling TLS/SSL for Cloudera Data Science Workbench.

Possible Causes and Solutions

Certificate does not include the wildcard domain - Confirm that the TLS certificate issued by your CA lists both the Cloudera Data Science Workbench domain and a wildcard for all first-level subdomains. For example, if your Cloudera Data Science Workbench domain is cdsw.company.com, then the TLS certificate must include both cdsw.company.com and *.cdsw.company.com.
Path to the private key and/or certificate is incorrect - Confirm that the paths to the private key and certificate files are correct by comparing each path and file name to the values for TLS_KEY and TLS_CERT in cdsw.conf or Cloudera Manager. For example:
```
TLS_CERT="/path/to/cert.pem"
TLS_KEY="/path/to/private.key"
```
Private key file does not have the right permissions - The private key file must have read-only permissions. Set it as follows:
```
chmod 444 private.key
```
Private key is encrypted - Cloudera Data Science Workbench does not support encrypted private keys. Check to see if your private key is encrypted:
```
cat private.key
```
```
-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: DES-EDE3-CBC,11556F53E4A2824A
```
If the private key is encrypted as shown above, use the following steps to decrypt it:
1. Make a backup of the private key file.
```
mv private.key private.key.encrypted
```
2. Decrypt the backup private key and save the file to private.key. You will be asked to enter the private key password.
```
openssl rsa -in private.key.encrypted -out private.key
```
Private key and certificate are not related - Check to see if the private key matches the public key in the certificate.
1. Print a hash of the private key modulus.
```
openssl rsa -in private.key -noout -modulus | openssl md5
```
```
(stdin)= 7a8d72ed61bb4be3c1f59e4f0161c023
```
2. Print a hash of the public key modulus.
```
openssl x509 -in cert.pem -noout -modulus | openssl md5
```
```
(stdin)= 7a8d72ed61bb4be3c1f59e4f0161c023
```
  If the two MD5 hashes differ, the private key and the certificate are not related to each other and will not work together. You must revoke the old certificate, generate a new private key and Certificate Signing Request (CSR), and then apply for a new certificate.
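
The encryption and key-matching checks above can be wrapped in two small helpers. This is a sketch under a few assumptions: the function names are hypothetical, the key is an RSA key in PEM format, and the moduli are compared directly instead of through their MD5 hashes, which gives the same answer. The second marker checked by `key_is_encrypted` is the PKCS#8 header for encrypted keys, which does not appear in the example above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# Succeed if the PEM key file carries an encryption marker.
# "Proc-Type: 4,ENCRYPTED" is the traditional format shown above;
# "BEGIN ENCRYPTED PRIVATE KEY" is the PKCS#8 equivalent.
key_is_encrypted() {
  grep -q -e 'Proc-Type: 4,ENCRYPTED' -e 'BEGIN ENCRYPTED PRIVATE KEY' "$1"
}

# Succeed if an RSA private key and a certificate share the same modulus.
# Returns 2 if either file cannot be read by openssl.
# openssl prompts for a passphrase if the key is still encrypted.
key_matches_cert() {
  local key="$1" cert="$2" key_mod cert_mod
  key_mod="$(openssl rsa -in "$key" -noout -modulus)" || return 2
  cert_mod="$(openssl x509 -in "$cert" -noout -modulus)" || return 2
  [ -n "$key_mod" ] && [ "$key_mod" = "$cert_mod" ]
}
```

For example, `key_is_encrypted private.key && echo "decrypt the key first"`, followed by `key_matches_cert private.key cert.pem || echo "key and certificate do not match"`.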

Troubleshooting Issues with Workloads

This section describes some potential issues data scientists might encounter once the application is running workloads.

404 error in Workbench after starting an engine
Engines cannot be scheduled due to lack of CPU or memory
Workbench prompt flashes red and does not take input
PySpark jobs fail due to HDFS permission errors
PySpark jobs fail due to Python version mismatch

404 error in Workbench after starting an engine

This typically occurs because a wildcard DNS subdomain was not set up before installation. While the application will largely work, the engine consoles are served on subdomains and will not be routed correctly unless a wildcard DNS entry pointing to the master node is properly configured. You might need to wait 30-60 minutes for the DNS entries to propagate. For instructions, see Set Up a Wildcard DNS Subdomain.

Engines cannot be scheduled due to lack of CPU or memory

A symptom of this is the following error message in the Workbench: "Unschedulable: No node in the cluster currently has enough CPU or memory to run the engine."

Either shut down some running sessions or jobs or provision more nodes for Cloudera Data Science Workbench.

Workbench prompt flashes red and does not take input

The Workbench prompt flashing red indicates that the session is not currently ready to take input.

Cloudera Data Science Workbench does not currently support non-REPL interaction. One workaround is to skip the prompt using appropriate command-line arguments. Otherwise, consider using the terminal to answer interactive prompts.

PySpark jobs fail due to HDFS permission errors

: org.apache.hadoop.security.AccessControlException: Permission denied: user=alice, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

To be able to use Spark 2, each user must have their own home directory in HDFS, under /user. If you sign in to Hue first, these directories will automatically be created for you. Alternatively, you can have cluster administrators create these directories.

hdfs dfs -mkdir /user/<username>
hdfs dfs -chown <username>:<username> /user/<username>
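
When several users need directories, an administrator might generate the commands in one pass and review them before running anything. The sketch below only prints commands. It assumes the HDFS superuser is the local `hdfs` account reachable through `sudo`, which will not be enough on a Kerberized cluster where a superuser ticket is needed instead. `print_home_dir_commands` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: prints commands for review; it does not touch HDFS.

# Print the commands that create and hand over an HDFS home directory
# for each user name given as an argument.
print_home_dir_commands() {
  local user
  for user in "$@"; do
    echo "sudo -u hdfs hdfs dfs -mkdir -p /user/$user"
    echo "sudo -u hdfs hdfs dfs -chown $user:$user /user/$user"
  done
}
```

For example, `print_home_dir_commands alice bob` prints four commands, two per user, which can then be run by hand or piped to `sh` once they look right.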

PySpark jobs fail due to Python version mismatch

Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions

One solution is to install the matching Python 2.7 version on all the cluster hosts. The recommended alternative is to install the Anaconda parcel on all CDH cluster hosts. Cloudera Data Science Workbench Python engines will then use the version of Python included in the Anaconda parcel, which ensures that Python versions on the driver and the workers always match. Any library paths in workloads sent from drivers to workers will also match because Anaconda is present in the same location across all hosts. Once the parcel has been installed, set the PYSPARK_PYTHON environment variable in the Cloudera Data Science Workbench Admin dashboard. Alternatively, you can use Cloudera Manager to set the path.
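
PySpark compares only the major and minor parts of the version, so a quick way to confirm a mismatch is to collect the version string on the driver and on a worker host and compare those parts. The helpers below are a sketch with hypothetical names. They assume you have already obtained each version string, for example with `python -c 'import platform; print(platform.python_version())'`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# Reduce a version string such as "2.7.5" to its major.minor part ("2.7").
major_minor() {
  echo "$1" | cut -d. -f1,2
}

# Succeed if two Python version strings share the same major.minor version.
same_minor_version() {
  [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}
```

For example, `same_minor_version 2.6.6 2.7.5 || echo "driver and worker differ"` reproduces the condition reported in the exception above.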

Troubleshooting Issues with Models and Experiments

See the following topics:
