Troubleshooting Cloudera Data Science Workbench
- Check the status of the application:
cdsw status
- Make sure the contents of the configuration file are correct:
cat /etc/cdsw/config/cdsw.conf
- SSH to your master host and run the following node validation command to check that the key services are running:
cdsw validate
The following sections describe solutions to potential problems and error messages you may encounter while installing, configuring, or using Cloudera Data Science Workbench. There is also an example of the Cloudera Data Science Workbench configuration file for your reference.
Understanding Installation Warnings
This section describes solutions to some warnings you might encounter during the installation process.
Preexisting iptables rules not supported
WARNING: Cloudera Data Science Workbench requires iptables, but does not support preexisting iptables rules.
Kubernetes makes extensive use of iptables, and it is hard to predict how preexisting iptables rules will interact with the rules Kubernetes inserts. Therefore, Cloudera recommends that you disable all preexisting rules before you proceed with the installation.
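For example, a minimal way to clear all existing rules, assuming you have reviewed them first and your security policy permits it, is:
sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -F
sudo iptables -X
This resets the default policies to ACCEPT, flushes the nat, mangle, and filter tables, and deletes any user-defined chains.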
Please remove the entry corresponding to /dev/xvdc from /etc/fstab
Cloudera Data Science Workbench installs a custom filesystem on its Application and Docker block devices. These filesystems store user project files and Docker engine images, respectively. Therefore, Cloudera Data Science Workbench requires complete access to the block devices. To avoid losing any existing data, make sure the block devices allocated to Cloudera Data Science Workbench are reserved only for the workbench.
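As a sketch of the cleanup this warning asks for (assuming /dev/xvdc is the device named in your warning), you would confirm the device is referenced in /etc/fstab, unmount it if it is mounted, and then delete its line from the file:
grep xvdc /etc/fstab
sudo umount /dev/xvdc
sudo vi /etc/fstab
Only remove the entry after verifying that no other application depends on data stored on that device.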
Linux sysctl kernel configuration errors
Kubernetes and Docker require non-standard kernel configuration. Depending on the existing state of your kernel, this might result in sysctl errors such as:
sysctl net.bridge.bridge-nf-call-iptables must be set to 1
This is because the settings in /etc/sysctl.conf conflict with the settings required by Cloudera Data Science Workbench. Cloudera cannot make a blanket recommendation on how to resolve such errors because they are specific to your deployment. Cluster administrators can modify or remove the conflicting value directly in /etc/sysctl.conf, remove it from whichever other configuration file sets it, or remove the kernel module that is causing the conflict.
To start diagnosing the issue, run the following command to see the order in which the sysctl configuration files are parsed and which values overwrite earlier assignments:
SYSTEMD_LOG_LEVEL=debug /usr/lib/systemd/systemd-sysctl
You will see output similar to:
Parsing /usr/lib/sysctl.d/00-system.conf
Parsing /usr/lib/sysctl.d/50-default.conf
Parsing /etc/sysctl.d/99-sysctl.conf
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.d/99-sysctl.conf'.
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.d/99-sysctl.conf'.
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.d/99-sysctl.conf'.
Parsing /etc/sysctl.d/k8s.conf
Overwriting earlier assignment of net/bridge/bridge-nf-call-iptables in file '/etc/sysctl.d/k8s.conf'.
Parsing /etc/sysctl.conf
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.conf'.
Overwriting earlier assignment of net/bridge/bridge-nf-call-ip6tables in file '/etc/sysctl.conf'.
Setting 'net/ipv4/conf/all/promote_secondaries' to '1'
Setting 'net/ipv4/conf/default/promote_secondaries' to '1'
Setting 'net/ipv6/conf/default/disable_ipv6' to '0'
Setting 'kernel/sysrq' to '16'
...
/etc/sysctl.d/k8s.conf is the configuration added by Cloudera Data Science Workbench. Administrators must make sure that no other file is overwriting values set by /etc/sysctl.d/k8s.conf.
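After resolving the conflict, you can re-apply all sysctl configuration and confirm the required value directly (a minimal check; other values may also be required depending on your release):
sudo sysctl --system
sysctl net.bridge.bridge-nf-call-iptables
The second command should print net.bridge.bridge-nf-call-iptables = 1.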
CDH parcels not found at /opt/cloudera/parcels
- If you are using a custom parcel directory, you can ignore the warning and proceed with the installation. Once Cloudera Data Science Workbench is running, set the path to the CDH parcel in the admin dashboard. See Non-standard CDH Parcel Location.
- This warning can also indicate that you have not added gateway roles to the Cloudera Data Science Workbench nodes. In this case, do not ignore the warning. Exit the installer and go to Cloudera Manager to add gateway roles to the cluster. See Configure Gateway Hosts Using Cloudera Manager. A quick way to confirm whether the parcels have been distributed is shown below.
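To check whether CDH parcels are present at the standard location on a given host, list the directory (the exact parcel names vary with your CDH version):
ls -l /opt/cloudera/parcels
If the directory is missing or empty on a Cloudera Data Science Workbench host, the host most likely does not have a gateway role assigned.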
Java is not installed or is installed in a non-standard location.
If you have not already installed Java, exit the installer and install Oracle JDK on the cluster.
If Java is installed in a non-standard location, complete the installation, and then set JAVA_HOME in the Cloudera Data Science Workbench site administrator dashboard. See Setting JAVA_HOME.
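To find out where Java actually lives on a host, which is the directory JAVA_HOME should point to (minus the trailing /bin/java), you can resolve the java binary's real path:
readlink -f $(which java)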
404 Not Found Error
The 404 Not Found error might appear in the browser when you try to reach the Cloudera Data Science Workbench web console.
This error indicates that your installation of Cloudera Data Science Workbench was successful, but there is a mismatch between the domain configured in cdsw.conf and the domain referenced in the browser. To fix the error, go to /etc/cdsw/config/cdsw.conf and check that the URL you supplied for the DOMAIN property matches the one you are trying to use to reach the web application. This is the wildcard domain dedicated to Cloudera Data Science Workbench, not the hostname of the master node.
If this requires a change to cdsw.conf, save the changes, and then run cdsw reset followed by cdsw init.
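For example, to confirm the configured domain (cdsw.example.com below is a placeholder; substitute your own wildcard domain):
grep DOMAIN /etc/cdsw/config/cdsw.conf
The output should show the domain you expect to use in the browser, for example:
DOMAIN="cdsw.example.com"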
Troubleshooting Issues with Running Workloads
This section describes some potential issues data scientists might encounter once the application is running workloads.
404 error in Workbench after starting an engine
This typically happens because a wildcard DNS subdomain was not set up before installation. While the application will largely work, the engine consoles are served on subdomains and will not be routed correctly unless a wildcard DNS entry pointing to the master node is properly configured. You might need to wait 30-60 minutes for the DNS entries to propagate. For instructions, see Set Up a Wildcard DNS Subdomain.
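One way to check whether the wildcard entry is in place (cdsw.example.com is a placeholder for your domain) is to query an arbitrary subdomain and confirm that it resolves to the master node's IP address:
dig anything.cdsw.example.com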
Engines cannot be scheduled due to lack of CPU or memory
A symptom of this is the following error message in the Workbench: "Unschedulable: No node in the cluster currently has enough CPU or memory to run the engine."
Either shut down some running sessions or jobs, or provision more nodes for Cloudera Data Science Workbench.
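Before deciding which route to take, you can run the status command on the master host, which reports on the nodes and key services and can help you gauge the current load:
cdsw status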
Workbench prompt flashes red and does not take input
The Workbench prompt flashing red indicates that the session is not currently ready to take input.
Cloudera Data Science Workbench does not currently support non-REPL interaction, such as a script that stops and prompts for input. One workaround is to bypass the prompt by passing the required values as command-line arguments. Otherwise, consider using the terminal to answer interactive prompts.
PySpark jobs fail due to HDFS permission errors
This error appears when the user running the job does not have a home directory in HDFS, or does not have write access to it:
org.apache.hadoop.security.AccessControlException: Permission denied: user=alice, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
To fix it, create an HDFS home directory for the user and make the user its owner:
hdfs dfs -mkdir /user/<username>
hdfs dfs -chown <username>:<username> /user/<username>
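Note that creating directories under /user usually requires HDFS superuser privileges, so in practice these commands are typically run as the hdfs user. For the user alice from the error above, that would look like:
sudo -u hdfs hdfs dfs -mkdir /user/alice
sudo -u hdfs hdfs dfs -chown alice:alice /user/alice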
PySpark jobs fail due to Python version mismatch
Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions
One solution is to install a matching Python 2.7 on all the cluster hosts. A better solution is to install the Anaconda parcel on all CDH cluster hosts. Cloudera Data Science Workbench Python engines then use the version of Python included in the Anaconda parcel, which ensures that the Python versions on the driver and the workers always match. Library paths in workloads sent from drivers to workers also match, because Anaconda is present in the same location across all hosts. Once the parcel has been installed, set the PYSPARK_PYTHON environment variable in the Cloudera Data Science Workbench Admin dashboard. Alternatively, you can use Cloudera Manager to set the path.
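For example, assuming the Anaconda parcel is installed under the default parcel directory, PYSPARK_PYTHON would point at its Python binary (adjust the path if your parcel directory or parcel version differs):
PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python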
Cannot find renewable Kerberos TGT
16/12/24 16:38:40 WARN security.UserGroupInformation: Exception encountered while running the renewal command. Aborting renew thread. ExitCodeException exitCode=1: kinit: Resource temporarily unavailable while renewing credentials
16/12/24 16:41:23 WARN security.UserGroupInformation: PriviledgedActionException as:user@CLOUDERA.LOCAL (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
This is not a bug. Spark 2 workloads are not affected, and access to Kerberized resources should continue to work as expected.
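If you want to confirm that the session still holds a valid Kerberos ticket despite the warning, you can inspect the credential cache from a terminal:
klist
The output lists the cached tickets along with their validity and expiration times.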