Known Issues and Limitations in Cloudera Data Science Workbench 1.5.x

Upgrades

TSB-350: Permanent Fix for Data Loss Risk During Cloudera Data Science Workbench (CDSW) Shutdown and Restart

TSB-346 was released in the time frame of CDSW 1.4.2 to fix this issue, but it turned out to be only a partial fix. CDSW 1.4.3 fixes the issue permanently, and TSB-350 was released to document that fix. Note that the script provided with TSB-346 still prevents data loss and must be used to shut down or restart all of the affected CDSW releases listed below.

Affected Versions: Cloudera Data Science Workbench 1.0.x, 1.1.x, 1.2.x, 1.3.x, 1.4.0, 1.4.1, 1.4.2

Fixed Version: Cloudera Data Science Workbench 1.4.3 (and higher)

Cloudera Bug: DSE-5108

The complete text for TSB-350 is available in the 1.4.3 release notes and in the Cloudera Security Bulletins: TSB-350: Risk of Data Loss During Cloudera Data Science Workbench (CDSW) Shutdown and Restart.
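
As a rough illustration of the version ranges above, the following sketch decides whether a given CDSW version string predates the fixed release. The function name is hypothetical, the version string is assumed to be supplied by the operator, and `sort -V` assumes GNU coreutils.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 (true) if the given CDSW version is older
# than 1.4.3, the release listed above as the fixed version.
is_affected_by_tsb350() {
  local version="$1"
  local fixed="1.4.3"
  local oldest
  # The fixed release itself is not affected.
  if [ "$version" = "$fixed" ]; then
    return 1
  fi
  # sort -V orders version strings numerically; if the given version sorts
  # first, it is older than the fixed release.
  oldest=$(printf '%s\n%s\n' "$version" "$fixed" | sort -V | head -n 1)
  [ "$oldest" = "$version" ]
}

# Example:
# is_affected_by_tsb350 1.4.2 && echo "Run cdsw_protect_stop_restart.sh before stopping CDSW"
```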

TSB-346: Risk of Data Loss During Cloudera Data Science Workbench (CDSW) Shutdown and Restart

Stopping Cloudera Data Science Workbench involves unmounting the NFS volumes that store CDSW project directories and then cleaning up a folder where the kubelet stores its temporary state. However, due to a race condition, this NFS unmount process can take too long or fail altogether. If this happens, CDSW projects that remain mounted will be deleted by the cleanup step.

Products affected: Cloudera Data Science Workbench

Releases affected: Cloudera Data Science Workbench versions -
  • 1.0.x

  • 1.1.x

  • 1.2.x

  • 1.3.0, 1.3.1

  • 1.4.0, 1.4.1

Users affected: This potentially affects all CDSW users.

Detected by: Nehmé Tohmé (Cloudera)

Severity (Low/Medium/High): High

Impact: If the NFS unmount fails during shutdown, data loss can occur. All CDSW project files might be deleted.

CVE: N/A

Immediate action required: If you are running any of the affected Cloudera Data Science Workbench versions, you must run the following script on the CDSW master host every time before you stop or restart Cloudera Data Science Workbench. Failure to do so can result in data loss.

This script should also be run before initiating a Cloudera Data Science Workbench upgrade. As always, we recommend creating a full backup prior to beginning an upgrade.

cdsw_protect_stop_restart.sh - Available for download at: cdsw_protect_stop_restart.sh.

#!/bin/bash

set -e

cat << EXPLANATION


This script is a workaround for Cloudera TSB-346. It protects your
CDSW projects from a rare race condition that can result in data loss.
Run this script before stopping the CDSW service, irrespective of whether
the stop precedes a restart, upgrade, or any other task.

Run this script only on the master node of your CDSW cluster.

You will be asked to specify a target folder on the master node where the
script will save a backup of all your project files. Make sure the target
folder has enough free space to accommodate all of your project files. To
determine how much space is required, run 'du -hs /var/lib/cdsw/current/projects'
on the CDSW master node.

This script will first back up your project files to the specified target
folder. It will then temporarily move your project files aside to protect
against the data loss condition. At that point, it is safe to stop the CDSW
service. After CDSW has stopped, the script will move the project files back
into place.

Note: This workaround is not required for CDSW 1.4.2 and higher.



EXPLANATION

read -p "Enter target folder for backups: " backup_target

echo "Backing up to $backup_target..."
rsync -azp /var/lib/cdsw/current/projects "$backup_target"

read -n 1 -p "Backup complete. Press enter when you are ready to stop CDSW: "

echo "Deleting all Kubernetes resources..."
kubectl delete configmaps,deployments,daemonsets,replicasets,services,ingress,secrets,persistentvolumes,persistentvolumeclaims,jobs --all
kubectl delete pods --all

echo "Temporarily saving project files to /var/lib/cdsw/current/projects_tmp..."
mkdir /var/lib/cdsw/current/projects_tmp
mv /var/lib/cdsw/current/projects/* /var/lib/cdsw/current/projects_tmp

echo -e "Please stop the CDSW service."

read -n 1 -p "Press enter when CDSW has stopped: "

echo "Moving projects back into place..."
mv /var/lib/cdsw/current/projects_tmp/* /var/lib/cdsw/current/projects
rm -rf /var/lib/cdsw/current/projects_tmp

echo -e "Done. You may now upgrade or start the CDSW service."
echo -e "When CDSW is running, if desired, you may delete the backup data at $backup_target"

Addressed in release/refresh/patch: This issue is fixed in Cloudera Data Science Workbench 1.4.2.

Note that you are required to run the workaround script above when you upgrade from an affected version to a release with the fix. This helps guard against data loss when the affected version needs to be shut down during the upgrade process.
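
The script asks for a backup target with enough free space but does not verify it. A minimal pre-flight sketch, assuming a hypothetical helper name and a hypothetical backup path (`/mnt/backup`), could compare the size of the projects directory with the free space on the target before the script is run:

```shell
#!/bin/bash
# Hypothetical pre-flight check: succeed only if the filesystem holding
# backup_target has at least as much free space as source_dir occupies.
has_enough_space() {
  local source_dir="$1"
  local backup_target="$2"
  local needed_kb free_kb
  # du -sk: total size of the source tree in 1K blocks.
  needed_kb=$(du -sk "$source_dir" | awk '{print $1}')
  # df -Pk: POSIX output format; column 4 of line 2 is the available space in KB.
  free_kb=$(df -Pk "$backup_target" | awk 'NR==2 {print $4}')
  [ "$free_kb" -ge "$needed_kb" ]
}

# Example (/mnt/backup is a placeholder, not a path from this document):
# has_enough_space /var/lib/cdsw/current/projects /mnt/backup || echo "Not enough free space"
```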

For the latest update on this issue see the corresponding Knowledge article:

TSB 2018-346: Risk of Data Loss During Cloudera Data Science Workbench (CDSW) Shutdown and Restart

(Red Hat Only) Host Reboot Required for Upgrades from Cloudera Data Science Workbench 1.4.0

Cloudera Data Science Workbench 1.4.2 includes a fix for a Red Hat kernel slab leak issue that was found in Cloudera Data Science Workbench 1.4.0. However, to have this fix go into effect, Red Hat users must reboot all Cloudera Data Science Workbench hosts before proceeding with an upgrade from CDSW 1.4.0 to CDSW 1.4.2 (or higher).

Therefore, if you are a Red Hat user upgrading from Cloudera Data Science Workbench 1.4.0, your upgrade path will require the following steps:
  1. Run the cdsw_protect_stop_restart.sh script to safely stop CDSW.
  2. Back up all your application data.
  3. Reboot all Cloudera Data Science Workbench hosts. As a precaution, you should consult your cluster/IT administrator before you start rebooting hosts.
  4. Proceed with the upgrade to Cloudera Data Science Workbench 1.4.2 (or higher).
These steps have also been added to the upgrade documentation.

Cloudera Bug: DSE-4098

CDH Integration

CDH client configuration changes require a full Cloudera Data Science Workbench reset

Cloudera Data Science Workbench does not automatically detect configuration changes on the CDH cluster. Therefore, any changes made to CDH services, ranging from updates to service configuration properties to complete CDH or CDS parcel upgrades, must be followed by a full reset of Cloudera Data Science Workbench.

Workaround: Depending on your deployment, use one of the following sets of steps to perform a full reset of Cloudera Data Science Workbench. Note that this reset does not impact your data in any way.
  • CSD Deployments - To reset Cloudera Data Science Workbench using Cloudera Manager:
    1. Log into the Cloudera Manager Admin Console.
    2. On the Cloudera Manager homepage, open the actions menu to the right of the CDSW service and select Restart. Confirm your choice on the next screen and wait for the action to complete.
    OR
  • RPM Deployments - Run the following steps on the Cloudera Data Science Workbench master host:

    cdsw reset
    cdsw init

Cloudera Manager Integration

CSD distribution/activation fails on mixed-OS clusters when there are third-party parcels running on OSs that are not supported by Cloudera Data Science Workbench

For example, adding a new CDSW gateway host on a RHEL 6 cluster running RHEL 6-compatible parcels will fail. This is because Cloudera Manager will not allow distribution of the RHEL 6 parcels on the new host, which will likely be running a CDSW-compatible operating system such as RHEL 7.

Workaround: To ensure adding a new CDSW gateway host is successful, you must create a copy of the 'incompatible' third-party parcel files and give them the corresponding RHEL 7 names so that Cloudera Manager allows them to be distributed on the new gateway host. Use the following sample instructions to do so:
  1. SSH to the Cloudera Manager Server host.
  2. Navigate to the directory that contains all the parcels. By default, this is /opt/cloudera/parcels.
    cd /opt/cloudera/parcels
  3. Make a copy of the incompatible third-party parcel with the new name. For example, if you have a RHEL 6 parcel that cannot be distributed on a RHEL 7 CDSW host:
    cp <PARCELNAME.cdh5.x.x.p0.123>-el6.parcel <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
  4. Repeat the previous step for the parcel's SHA file.
    cp <PARCELNAME.cdh5.x.x.p0.123>-el6.parcel.sha <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
  5. Update the new files' owner and permissions to match those of existing parcels in the /opt/cloudera/parcels directory.
    chown cloudera-scm:cloudera-scm <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
    chown cloudera-scm:cloudera-scm <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
    chmod 640 <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
    chmod 640 <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
    
You should now be able to add new gateway hosts for Cloudera Data Science Workbench to your cluster.
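
If several parcels need the same treatment, steps 3 to 5 can be wrapped in a small function. This is only a sketch: the function name is hypothetical, and the ownership change is skipped when the script is not run as root or the cloudera-scm account does not exist.

```shell
#!/bin/bash
# Hypothetical helper that repeats steps 3-5 above for one parcel.
# $1: parcel directory, $2: parcel file name ending in -el6.parcel
copy_parcel_as_el7() {
  local parcel_dir="$1"
  local el6_name="$2"
  local el7_name suffix
  case "$el6_name" in
    *-el6.parcel) ;;
    *) echo "expected a file name ending in -el6.parcel" >&2; return 1 ;;
  esac
  el7_name="${el6_name%-el6.parcel}-el7.parcel"
  # Handle the parcel itself (empty suffix) and its .sha file.
  for suffix in "" ".sha"; do
    cp "$parcel_dir/$el6_name$suffix" "$parcel_dir/$el7_name$suffix" || return 1
    chmod 640 "$parcel_dir/$el7_name$suffix" || return 1
    # chown needs root and an existing cloudera-scm account.
    if [ "$(id -u)" -eq 0 ] && id cloudera-scm >/dev/null 2>&1; then
      chown cloudera-scm:cloudera-scm "$parcel_dir/$el7_name$suffix" || return 1
    fi
  done
}

# Example:
# copy_parcel_as_el7 /opt/cloudera/parcels "<PARCELNAME.cdh5.x.x.p0.123>-el6.parcel"
```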

Cloudera Bug: OPSAPS-42130, OPSAPS-31880

CDSW Service health status after a restart does not match the actual state of the application

After a restart, the Cloudera Data Science Workbench service in Cloudera Manager will display Good health even though the Cloudera Data Science Workbench web application might need a few more minutes to get ready to serve requests.

Cloudera Data Science Workbench diagnostics data might be missing from Cloudera Manager diagnostic bundles.

This occurs because the default timeout for Cloudera Manager data collection is currently set to 3 minutes. However, in the case of Cloudera Data Science Workbench, collecting metrics and logs using the cdsw logs command can take longer than 3 minutes.

Workaround: Use the following steps to modify the default timeout for Cloudera Data Science Workbench data collection:
  1. Log in to the Cloudera Manager Admin Console.
  2. Go to the CDSW service.
  3. Click Configuration.
  4. Search for the Docker Daemon Diagnostics Collection Timeout property and set it to 5 minutes.
  5. Click Save Changes.

Alternatively, you can generate a diagnostic bundle by running the cdsw logs command directly on the Master host.

Cloudera Bug: OPSAPS-44016, DSE-3160

CDS Powered By Apache Spark

Spark lineage collection is not supported with Cloudera Data Science Workbench

Lineage collection is enabled by default in Spark 2.3. This feature does not work with Cloudera Data Science Workbench because the lineage log directory is not automatically mounted into CDSW engines when a session/job is started.

Affected Versions: CDS 2.3 release 2 (and higher) Powered By Apache Spark

With Spark 2.3 release 3, if Spark cannot find the lineage log directory, it will automatically disable lineage collection for that application. Spark jobs will continue to execute in Cloudera Data Science Workbench, but lineage information will not be collected.

With Spark 2.3 release 2, Spark jobs will fail in Cloudera Data Science Workbench. Either upgrade to Spark 2.3 release 3, which includes a partial fix (as described above), or use one of the following workarounds to disable Spark lineage:

Workaround 1: Disable Spark Lineage Per-Project in Cloudera Data Science Workbench

To do this, set spark.lineage.enabled to false in a spark-defaults.conf file in your Cloudera Data Science Workbench project. This will need to be done individually for each project as required.
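
A minimal sketch of this per-project change, assuming the spark-defaults.conf file sits in the project root (the function name is hypothetical). It replaces any existing value of the property so that the file never ends up with duplicate entries:

```shell
#!/bin/bash
# Hypothetical helper: ensure spark-defaults.conf in the given project
# directory sets spark.lineage.enabled to false exactly once.
disable_spark_lineage() {
  local conf="$1/spark-defaults.conf"
  touch "$conf"
  # Keep every line except an existing spark.lineage.enabled setting.
  # grep exits non-zero when nothing is left to print, hence "|| true".
  grep -v '^[[:space:]]*spark\.lineage\.enabled' "$conf" > "$conf.tmp" || true
  echo "spark.lineage.enabled false" >> "$conf.tmp"
  mv "$conf.tmp" "$conf"
}

# Example, run from a terminal inside the project:
# disable_spark_lineage .
```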

Workaround 2: Disable Spark Lineage for the Cluster

  1. Log in to Cloudera Manager and go to the Spark 2 service.
  2. Click Configuration.
  3. Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection.
  4. Click Save Changes.
  5. Go back to the Cloudera Manager homepage and restart the CDSW service for this change to go into effect.

Cloudera Bug: DSE-3720, CDH-67643

Crashes and Hangs

  • High I/O utilization on the application block device can cause the application to stall or become unresponsive. Users should read and write data directly from HDFS rather than staging it in their project directories.

  • Installing ipywidgets or a Jupyter notebook into a project can cause Python engines to hang due to an unexpected configuration. The issue can be resolved by deleting the installed libraries from the R engine terminal.

Engines

  • Configuring duplicate mount points in the site admin panel (Admin > Engines > Mounts) results in sessions crashing in the workbench.

    Cloudera Bug: DSE-3308

  • Spawning remote workers fails in R when the env parameter is not set. For more details, see Spawning Workers.

    Cloudera Bug: DSE-3384

  • Autofs mounts are not supported with Cloudera Data Science Workbench.

    Cloudera Bug: DSE-2238

  • When using Conda to install Python packages, you must specify the Python version to match the Python versions shipped in the engine image (2.7.11 and 3.6.1). If not specified, the conda-installed Python version will not be used within a project. Pip (pip and pip3) does not face this issue.
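
To illustrate the Conda note above, the following sketch builds (but does not run) a conda command that pins Python to the version shipped in the engine image. The function name is hypothetical, and the version numbers are the ones quoted in the text.

```shell
#!/bin/bash
# Hypothetical helper: print a conda command that pins Python to the engine
# image version (2.7.11 or 3.6.1, as listed above).
# $1: Python major version (2 or 3); remaining arguments: packages to install
conda_install_command() {
  local major="$1"
  local pinned
  shift
  case "$major" in
    2) pinned="2.7.11" ;;
    3) pinned="3.6.1" ;;
    *) echo "unsupported Python major version: $major" >&2; return 1 ;;
  esac
  echo "conda install python=$pinned $*"
}

# Example: print the command for a Python 3 project, then review and run it.
# conda_install_command 3 numpy
```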

Custom Engine Images

  • Cloudera Data Science Workbench only supports customized engines that are based on the Cloudera Data Science Workbench base image.

  • Cloudera Data Science Workbench does not support creation of custom engines larger than 10 GB.

    Cloudera Bug: DSE-4420

  • Cloudera Data Science Workbench does not support pulling images from registries that require Docker credentials.

    Cloudera Bug: DSE-1521

  • The contents of certain pre-existing standard directories such as /home/cdsw, /tmp, /opt/cloudera, and so on, cannot be modified while creating customized engines. This means any files saved in these directories will not be accessible from sessions that are running on customized engines.

    Workaround: Create a new custom directory in the Dockerfile used to create the customized engine, and save your files to that directory. Alternatively, create a new custom directory on all the Cloudera Data Science Workbench gateway hosts, save your files there, and then mount that directory to the custom engine.

Experiments

  • Experiments do not store snapshots of project files. You cannot automatically restore code that was run as part of an experiment.

  • Experiments will fail if your project filesystem is too large for the Git snapshot process. As a general rule, any project files (code, generated model artifacts, dependencies, etc.) larger than 50 MB must be part of your project's .gitignore file so that they are not included in snapshots for experiment builds.

  • Experiments cannot be deleted. As a result, be conscious of how you use the track_metrics and track_file functions.
    • Do not track files larger than 50MB.
    • Do not track more than 100 metrics per experiment. Excessive metric calls from an experiment may cause Cloudera Data Science Workbench to hang.
  • The Experiments table will allow you to display only three metrics at a time. You can select which metrics are displayed from the metrics dropdown. If you are tracking a large number of metrics (100 or more), you might notice some performance lag in the UI.

  • Arguments are not supported with Scala experiments.

  • The track_metrics and track_file functions are not supported with Scala experiments.

  • The UI does not display a confirmation when you start an experiment or any alerts when experiments fail.
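
For the 50 MB snapshot guideline above, a sketch like the following can list oversized files as root-anchored .gitignore entries. The function name is hypothetical, 50 MB is read here as 50 * 1024 * 1024 bytes, and file names containing characters that are special in .gitignore patterns would need manual escaping.

```shell
#!/bin/bash
# Hypothetical helper: list files under a project directory that exceed a
# size limit in bytes (default 50 * 1024 * 1024), written as .gitignore
# entries anchored at the project root. The .git directory is skipped.
list_large_files() {
  local project_dir="$1"
  local limit_bytes="${2:-$((50 * 1024 * 1024))}"
  (
    cd "$project_dir" || exit 1
    find . -path ./.git -prune -o -type f -size +"${limit_bytes}c" -print \
      | sed 's|^\./|/|' | sort
  )
}

# Example: review the list, then append it to the project's .gitignore.
# list_large_files . >> .gitignore
```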

GPU Support

Only CUDA-enabled NVIDIA GPU hardware is supported

Cloudera Data Science Workbench only supports CUDA-enabled NVIDIA GPU cards.

Heterogeneous GPU hardware is not supported

You must use the same GPU hardware across a single Cloudera Data Science Workbench deployment.

GPUs are not detected after a machine reboot

This issue occurs because certain NVIDIA modules do not load automatically after a reboot.

Workaround: To work around this issue, use the following steps to manually load the required modules before Cloudera Data Science Workbench services start. The following commands load the nvidia.ko module, create the /dev/nvidiactl device, and create the list of devices at /dev/nvidia0. They will also create the /dev/nvidia-uvm and /dev/nvidia-uvm-tools devices, and assign execute privileges to /etc/rc.modules. Run these commands once on all the machines that have GPU hardware.

Manually load the required NVIDIA modules:
sudo tee -a /etc/rc.modules > /dev/null <<EOMSG
/usr/bin/nvidia-smi
/usr/bin/nvidia-modprobe -u -c=0
EOMSG
Set execute permission for /etc/rc.modules:
sudo chmod +x /etc/rc.modules

Cloudera Bug: DSE-2847

Jobs API

  • Cloudera Data Science Workbench does not support changing your API key, or having multiple API keys.

  • Currently, you cannot create a job, stop a job, or get the status of a job using the Jobs API.

Models

  • Known Issues with Model Builds and Deployed Models
    • Re-deploying or re-building models results in model downtime (usually brief).

    • Re-starting Cloudera Data Science Workbench does not automatically restart active models. These models must be manually restarted so they can serve requests again.

      Cloudera Bug: DSE-4950

    • Model deployment will fail if your project filesystem is too large for the Git snapshot process. As a general rule, any project files (code, generated model artifacts, dependencies, etc.) larger than 50 MB must be part of your project's .gitignore file so that they are not included in snapshots for model builds.

    • Model builds will fail if your project filesystem includes a .git directory (likely hidden or nested). Typical build stage errors include:
      Error: 2 UNKNOWN: Unable to schedule build: [Unable to create a checkpoint of current source: [Unable to push sources to git server: ...

      To work around this, rename the .git directory (for example, NO.git) and re-build the model.

      Cloudera Bug: DSE-4657

    • JSON requests made to active models should not be more than 5 MB in size. This is because JSON is not suitable for very large requests and has high overhead for binary objects such as images or video. Call the model with a reference to the image or video, such as a URL, instead of the object itself.

    • Any external connections, for example, a database connection or a Spark context, must be managed by the model's code. Models that require such connections are responsible for their own setup, teardown, and refresh.

    • Model logs and statistics are only preserved so long as the individual replica is active. Cloudera Data Science Workbench may restart a replica whenever it is deemed necessary (for example, after bad input to the model).

    • (Affects version 1.4.x, 1.5.x) The model deployment example (predict.py) in the in-built Python template project does not work anymore due to a change in dependencies in the sklearn package. A working replacement for the predict.py file has been provided here: Deploy the Model - Iris Dataset.

      Cloudera Bug: DSE-5314

  • Limitations
    • Scala models are not supported.

    • Spawning worker threads is not supported with models.

    • Models deployed using Cloudera Data Science Workbench are not highly-available.

    • Dynamic scaling and auto-scaling are not currently supported. To change the number of replicas in service, you will have to re-deploy the build.
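
Because the .git directory that breaks a model build may be hidden or nested, a quick search of the project can help before rebuilding. This is a sketch with a hypothetical function name; it only reports the directories and leaves the renaming to you.

```shell
#!/bin/bash
# Hypothetical helper: print every directory named .git under the given
# project directory without descending into the matches.
find_git_dirs() {
  find "$1" -type d -name .git -prune -print
}

# Example, run from a terminal inside the project:
# find_git_dirs .
```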

Networking

  • Custom /etc/hosts entries on Cloudera Data Science Workbench hosts do not propagate to sessions and jobs running in containers.

    Cloudera Bug: DSE-2598

  • Initialization of Cloudera Data Science Workbench (cdsw init) will fail if localhost does not resolve to 127.0.0.1.

  • Cloudera Data Science Workbench does not support DNS servers running on 127.0.0.1:53. This IP address resolves to the container localhost within Cloudera Data Science Workbench containers. As a workaround, use either a non-loopback address or a remote DNS server.
  • Kubernetes throws the following error when /etc/resolv.conf lists more than three domains:
    Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
    Due to a limitation in the libc resolver, only two DNS servers are supported in /etc/resolv.conf. Kubernetes uses one additional entry for the cluster DNS.
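
The two resolv.conf limits above can be checked ahead of time with a sketch like the following. The function name is hypothetical, and the thresholds (at most two nameservers and at most three search domains) are taken from the text above.

```shell
#!/bin/bash
# Hypothetical helper: report whether a resolv.conf-style file stays within
# the limits described above (at most 2 nameservers, at most 3 search domains).
check_resolv_conf() {
  local file="${1:-/etc/resolv.conf}"
  local nameservers domains
  local status=0
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  nameservers=$(grep -c '^[[:space:]]*nameserver[[:space:]]' "$file" || true)
  # Count the domains on the last search line in the file.
  domains=$(awk '$1 == "search" {n = NF - 1} END {print n + 0}' "$file")
  if [ "$nameservers" -gt 2 ]; then
    echo "too many nameservers: $nameservers"
    status=1
  fi
  if [ "$domains" -gt 3 ]; then
    echo "too many search domains: $domains"
    status=1
  fi
  return "$status"
}

# Example:
# check_resolv_conf /etc/resolv.conf && echo "resolv.conf is within the limits"
```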

Security

SSH access to Cloudera Data Science Workbench hosts must be disabled

The container runtime and application data storage are not fully secure from untrusted users who have SSH access to the gateway hosts. Therefore, SSH access to the gateway hosts for untrusted users should be disabled for security and resource utilization reasons.

LDAP

  • LDAP group search fails when Active Directory returns escape characters as part of the distinguished name (DN).

    Cloudera Bug: DSE-4898

TLS/SSL

  • Self-signed certificates where the Certificate Authority is not part of the user's trust store are not supported for TLS termination. For more details, see Enabling TLS/SSL - Limitations.

  • Cloudera Data Science Workbench does not support the use of encrypted private keys for TLS.

    Cloudera Bug: DSE-1708

  • A "certificate has expired" error displays when you log in to the Cloudera Data Science Workbench web UI. This issue can occur if Cloudera Data Science Workbench exceeds 365 days of continuous uptime because the internal certificate for Kubernetes expires after 1 year.

    Workaround: Restart the Cloudera Data Science Workbench deployment.
    • For CSD installations, restart the Cloudera Data Science Workbench service in Cloudera Manager.
    • For RPM installations, run the following commands on the Master host:
      #restart Cloudera Data Science Workbench
      cdsw reset
      #generate a new certificate for Kubernetes
      cdsw init

Kerberos

  • Using Kerberos plugin modules in krb5.conf is not supported.

  • Modifying the default_ccache_name parameter in krb5.conf does not work in Cloudera Data Science Workbench. Only the default path for this parameter, /tmp/krb5cc_${uid}, is supported.

  • PowerBroker-equipped Active Directory is not supported.

    Cloudera Bug: DSE-1838

  • Cloudera Data Science Workbench does not support the use of a FreeIPA KDC.

    Cloudera Bug: DSE-1482

  • When you upload a Kerberos keytab to authenticate yourself to the CDH cluster, Cloudera Data Science Workbench might display a fleeting error message ('cancelled') in the bottom right corner of the screen, even if authentication was successful. This error message can be ignored.

    Cloudera Bug: DSE-2344

Usability

  • iFrame visualizations do not render in the workbench. Cloudera Data Science Workbench versions 1.4.2 (and higher) added a new feature that allowed users to enable HTTP security headers for responses to Cloudera Data Science Workbench. This setting is enabled by default. However, the X-Frame-Options header added as part of this feature blocks rendering of iFrames injected by third-party data visualization libraries.

    Workaround: To work around this issue, a site administrator can go to the Admin > Security page and disable the Enable HTTP security headers property. Restart Cloudera Data Science Workbench for this change to take effect.

    Affected Version: Cloudera Data Science Workbench 1.4.2 (and higher)

    Cloudera Bug: DSE-5274

  • In a scenario where hundreds of users are logged in and creating processes, the nproc and nofile limits of the system may be reached. Use ulimit or other methods to increase the maximum number of processes and open files that can be created by a user on the system.

  • When rebooting, Cloudera Data Science Workbench hosts can take a significant amount of time (about 30 minutes) to become ready.

  • Long-running operations such as fork and clone can time out when projects are large or connections outlast the HTTP timeouts of reverse proxies.

  • The Scala kernel does not support autocomplete features in the editor.

  • Scala and R code can sometimes indent incorrectly in the workbench editor.

    Cloudera Bug: DSE-1218