Known Issues and Limitations in Cloudera Data Science Workbench 1.4.x

This topic lists the current known issues and limitations in Cloudera Data Science Workbench 1.4.x. For previous versions, see:

Upgrades

TSB-350: Permanent Fix for Data Loss Risk During Cloudera Data Science Workbench (CDSW) Shutdown and Restart

TSB-346 was released around the time of CDSW 1.4.2 to fix this issue, but it turned out to be only a partial fix. With CDSW 1.4.3, we have fixed the issue permanently and released TSB-350 to address this fix. Note that the script that was provided with TSB-346 still ensures that data loss is prevented and must be used to shut down or restart all the affected CDSW releases listed below.

Affected Versions: Cloudera Data Science Workbench 1.0.x, 1.1.x, 1.2.x, 1.3.x, 1.4.0, 1.4.1, 1.4.2

Fixed Version: Cloudera Data Science Workbench 1.4.3 (and higher)

Cloudera Bug: DSE-5108

The complete text for TSB-350 is available in the 1.4.3 release notes and in the Cloudera Security Bulletins: TSB-350: Risk of Data Loss During Cloudera Data Science Workbench (CDSW) Shutdown and Restart.

TSB-346: Risk of Data Loss During Cloudera Data Science Workbench (CDSW) Shutdown and Restart

Stopping Cloudera Data Science Workbench involves unmounting the NFS volumes that store CDSW project directories and then cleaning up a folder where the kubelet stores its temporary state. However, due to a race condition, this NFS unmount process can take too long or fail altogether. If this happens, CDSW projects that remain mounted will be deleted by the cleanup step.

Products affected: Cloudera Data Science Workbench

Releases affected: Cloudera Data Science Workbench versions -
  • 1.0.x

  • 1.1.x

  • 1.2.x

  • 1.3.0, 1.3.1

  • 1.4.0, 1.4.1

Users affected: This potentially affects all CDSW users.

Detected by: Nehmé Tohmé (Cloudera)

Severity (Low/Medium/High): High

Impact: If the NFS unmount fails during shutdown, data loss can occur. All CDSW project files might be deleted.

CVE: N/A

Immediate action required: If you are running any of the affected Cloudera Data Science Workbench versions, you must run the following script on the CDSW master node every time before you stop or restart Cloudera Data Science Workbench. Failure to do so can result in data loss.

This script should also be run before initiating a Cloudera Data Science Workbench upgrade. As always, we recommend creating a full backup prior to beginning an upgrade.

cdsw_protect_stop_restart.sh - Available for download at: cdsw_protect_stop_restart.sh.

#!/bin/bash

set -e

cat << EXPLANATION


This script is a workaround for Cloudera TSB-346. It protects your
CDSW projects from a rare race condition that can result in data loss.
Run this script before stopping the CDSW service, irrespective of whether
the stop precedes a restart, upgrade, or any other task.

Run this script only on the master node of your CDSW cluster.

You will be asked to specify a target folder on the master node where the
script will save a backup of all your project files. Make sure the target
folder has enough free space to accommodate all of your project files. To
determine how much space is required, run 'du -hs /var/lib/cdsw/current/projects'
on the CDSW master node.

This script will first back up your project files to the specified target
folder. It will then temporarily move your project files aside to protect
against the data loss condition. At that point, it is safe to stop the CDSW
service. After CDSW has stopped, the script will move the project files back
into place.

Note: This workaround is not required for CDSW 1.4.2 and higher.



EXPLANATION

read -p "Enter target folder for backups: " backup_target

echo "Backing up to $backup_target..."
rsync -azp /var/lib/cdsw/current/projects "$backup_target"

read -n 1 -p "Backup complete. Press enter when you are ready to stop CDSW: "

echo "Deleting all Kubernetes resources..."
kubectl delete configmaps,deployments,daemonsets,replicasets,services,ingress,secrets,persistentvolumes,persistentvolumeclaims,jobs --all
kubectl delete pods --all

echo "Temporarily saving project files to /var/lib/cdsw/current/projects_tmp..."
mkdir /var/lib/cdsw/current/projects_tmp
mv /var/lib/cdsw/current/projects/* /var/lib/cdsw/current/projects_tmp

echo -e "Please stop the CDSW service."

read -n 1 -p "Press enter when CDSW has stopped: "

echo "Moving projects back into place..."
mv /var/lib/cdsw/current/projects_tmp/* /var/lib/cdsw/current/projects
rm -rf /var/lib/cdsw/current/projects_tmp

echo -e "Done. You may now upgrade or start the CDSW service."
echo -e "When CDSW is running, if desired, you may delete the backup data at $backup_target"

Addressed in release/refresh/patch: This issue is fixed in Cloudera Data Science Workbench 1.4.2.

Note that you are required to run the workaround script above when you upgrade from an affected version to a release with the fix. This helps guard against data loss when the affected version needs to be shut down during the upgrade process.
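
The workaround script asks for a backup target with enough free space to hold all project files. The following is a minimal sketch of a pre-flight check you could run before the script; it is not part of the Cloudera-provided script. The function name has_enough_space is hypothetical, and the sketch assumes that the target folder already exists and that the POSIX du and df utilities are available.

```shell
# has_enough_space SRC DST
# Succeeds when the filesystem that holds DST has at least as many free
# kilobytes as SRC currently occupies. (Hypothetical helper.)
has_enough_space() {
  local src=$1 dst=$2 needed_kb free_kb
  # Total size of the source tree, in kilobytes
  needed_kb=$(du -sk "$src" | awk '{ print $1 }')
  # Available space on the target filesystem, in kilobytes (POSIX output format)
  free_kb=$(df -Pk "$dst" | awk 'NR == 2 { print $4 }')
  [ "$free_kb" -ge "$needed_kb" ]
}
```

For example, has_enough_space /var/lib/cdsw/current/projects /mnt/backup && echo "OK to back up", where /mnt/backup stands in for your own backup target.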

For the latest update on this issue see the corresponding Knowledge article:

TSB 2018-346: Risk of Data Loss During Cloudera Data Science Workbench (CDSW) Shutdown and Restart

(Red Hat Only) Host Reboot Required for Upgrades from Cloudera Data Science Workbench 1.4.0

Cloudera Data Science Workbench 1.4.2 includes a fix for a Red Hat kernel slab leak issue that was found in Cloudera Data Science Workbench 1.4.0. However, to have this fix go into effect, Red Hat users must reboot all Cloudera Data Science Workbench hosts before proceeding with an upgrade from CDSW 1.4.0 to CDSW 1.4.2 (or higher).

Therefore, if you are a Red Hat user upgrading from Cloudera Data Science Workbench 1.4.0, your upgrade path will require the following steps:
  1. Run the cdsw_protect_stop_restart.sh script to safely stop CDSW.
  2. Back up all your application data.
  3. Reboot all Cloudera Data Science Workbench hosts. As a precaution, you should consult your cluster/IT administrator before you start rebooting hosts.
  4. Proceed with the upgrade to Cloudera Data Science Workbench 1.4.2 (or higher).
These steps have also been added to the upgrade documentation here:

Cloudera Bug: DSE-4098

CDH Integration

Cloudera Data Science Workbench (1.4.x and lower) is not supported with Cloudera Manager 6.0.x and CDH 6.0.x.

Cloudera Data Science Workbench 1.5 (and higher) is supported with Cloudera Enterprise 6.1 (and higher).

CDH client configuration changes require a full Cloudera Data Science Workbench reset

Cloudera Data Science Workbench does not automatically detect configuration changes on the CDH cluster. Therefore, any changes made to CDH services, ranging from updates to service configuration properties to complete CDH or CDS parcel upgrades, must be followed by a full reset of Cloudera Data Science Workbench.

Workaround: Depending on your deployment, use one of the following sets of steps to perform a full reset of Cloudera Data Science Workbench. Note that this reset does not impact your data in any way.
  • CSD Deployments - To reset Cloudera Data Science Workbench using Cloudera Manager:
    1. Log into the Cloudera Manager Admin Console.
    2. On the Cloudera Manager homepage, click the drop-down arrow to the right of the CDSW service and select Restart. Confirm your choice on the next screen and wait for the action to complete.
    OR
  • RPM Deployments - Run the following steps on the Cloudera Data Science Workbench master node.

    cdsw reset
    cdsw init

Cloudera Manager Integration

Custom parcel directories are not properly mounted in Cloudera Data Science Workbench 1.4.3

Configuring the custom parcel directory using the CDH parcel directory property on the Admin > Engines page does not work as expected.

Workaround: To work around this issue, you must also specify the custom parcel directory as a mount to ensure that the required client configuration is mounted into all the sessions.
  1. Go to Admin > Engines.
  2. Under Environmental Variables, add the PARCEL_DIR variable and set it to the path of the custom parcel directory.
  3. Under Mounts, add the custom directory so that it is available to all new sessions.
Note that if you want to use spark2-submit commands in the engine, you will also need to set:
PATH=$PATH:$PARCEL_DIR/CDH/bin:$PARCEL_DIR/SPARK2/bin
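
After applying the workaround, you may want to confirm from a session terminal that the mounted parcel directory actually contains the two bin directories named in the PATH setting above. This is a rough sketch; the function name parcel_path_entries is hypothetical.

```shell
# parcel_path_entries PARCEL_DIR
# Prints each expected bin directory that exists under PARCEL_DIR and
# reports the ones that are missing on stderr. (Hypothetical helper.)
parcel_path_entries() {
  local parcel_dir=$1 sub
  for sub in CDH/bin SPARK2/bin; do
    if [ -d "$parcel_dir/$sub" ]; then
      printf '%s\n' "$parcel_dir/$sub"
    else
      printf 'missing: %s\n' "$parcel_dir/$sub" >&2
    fi
  done
}
```

Running parcel_path_entries "$PARCEL_DIR" in a new session should list both directories if the mount is in place.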

Affected Version: Cloudera Data Science Workbench 1.4.x

Fixed Version: Cloudera Data Science Workbench 1.5.x

Cloudera Bug: DSE-6062

Cloudera Data Science Workbench (1.4.x and lower) is not supported with Cloudera Manager 6.0.x and CDH 6.0.x.

Cloudera Data Science Workbench 1.5 (and higher) is supported with Cloudera Enterprise 6.1 (and higher).

HTTP/HTTPS Proxy settings in Cloudera Manager are erroneously escaped when propagated to Cloudera Data Science Workbench engines

The impact of this issue is that commands such as pip install that require network connections might fail when Cloudera Data Science Workbench proxy settings are enabled in Cloudera Manager.

Affected Version: Cloudera Data Science Workbench 1.4.0

Fixed Version: Cloudera Data Science Workbench 1.4.2. If you cannot update to version 1.4.2 (or higher), use the workaround described below.

Workaround: Configure the HTTP_PROXY and HTTPS_PROXY settings directly in the Cloudera Data Science Workbench UI. Perform these steps for each project on your deployment:
  1. Go to the project's Overview page.
  2. Click Settings > Engine.
  3. Under the Environmental Variables section, enter the name and value for your proxy settings.
  4. Click Add.
  5. Click Save Environment.
You should now be able to run all the commands as expected in your project engines.

Cloudera Bug: DSE-4421

CSD distribution/activation fails on mixed-OS clusters when there are third-party parcels running on OSs that are not supported by Cloudera Data Science Workbench

For example, adding a new CDSW gateway host on a RHEL 6 cluster running RHEL 6-compatible parcels will fail. This is because Cloudera Manager will not allow distribution of the RHEL 6 parcels on the new host, which will likely be running a CDSW-compatible operating system such as RHEL 7.

Workaround: To ensure adding a new CDSW gateway host is successful, you must create a copy of the 'incompatible' third-party parcel files and give them the corresponding RHEL 7 names so that Cloudera Manager allows them to be distributed on the new gateway host. Use the following sample instructions to do so:
  1. SSH to the Cloudera Manager Server host.
  2. Navigate to the directory that contains all the parcels. By default, this is /opt/cloudera/parcels.
    cd /opt/cloudera/parcels
  3. Make a copy of the incompatible third-party parcel with the new name. For example, if you have a RHEL 6 parcel that cannot be distributed on a RHEL 7 CDSW host:
    cp <PARCELNAME.cdh5.x.x.p0.123>-el6.parcel <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
  4. Repeat the previous step for the parcel's SHA file.
    cp <PARCELNAME.cdh5.x.x.p0.123>-el6.parcel.sha <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
  5. Update the new files' owner and permissions to match those of existing parcels in the /opt/cloudera/parcels directory.
    chown cloudera-scm:cloudera-scm <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
    chown cloudera-scm:cloudera-scm <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
    chmod 640 <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
    chmod 640 <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
    
You should now be able to add new gateway hosts for Cloudera Data Science Workbench to your cluster.
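
Steps 3 to 5 above can be wrapped in a small helper when several parcels need the same treatment. This is only a sketch: the function name copy_parcel_as_el7 is hypothetical, and it assumes that the existing el6 files already have the correct owner, which cp -p preserves when the command is run as root. If that is not the case, run the chown commands from step 5 afterwards.

```shell
# copy_parcel_as_el7 DIR BASENAME
# Copies BASENAME-el6.parcel and its .sha file to el7 names inside DIR.
# (Hypothetical helper; run it on the Cloudera Manager Server host.)
copy_parcel_as_el7() {
  local dir=$1 base=$2 suffix src dst
  for suffix in parcel parcel.sha; do
    src="$dir/$base-el6.$suffix"
    dst="$dir/$base-el7.$suffix"
    # -p keeps mode and timestamps, and the owner as well when run as root
    cp -p "$src" "$dst" || return 1
    # Same permissions as in step 5
    chmod 640 "$dst" || return 1
  done
}
```

For example: copy_parcel_as_el7 /opt/cloudera/parcels "<PARCELNAME.cdh5.x.x.p0.123>".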

Cloudera Bug: OPSAPS-42130, OPSAPS-31880

CDSW Service health status after a restart does not match the actual state of the application

After a restart, the Cloudera Data Science Workbench service in Cloudera Manager will display Good health even though the Cloudera Data Science Workbench web application might need a few more minutes to get ready to serve requests.

Cloudera Data Science Workbench diagnostics data might be missing from Cloudera Manager diagnostic bundles.

This occurs because the default timeout for Cloudera Manager data collection is currently set to 3 minutes. However, in the case of Cloudera Data Science Workbench, collecting metrics and logs using the cdsw logs command can take longer than 3 minutes.

Workaround: Use the following steps to modify the default timeout for Cloudera Data Science Workbench data collection:
  1. Log in to the Cloudera Manager Admin Console.
  2. Go to the CDSW service.
  3. Click Configuration.
  4. Search for the Docker Daemon Diagnostics Collection Timeout property and set it to 5 minutes.
  5. Click Save Changes.

Alternatively, you can generate a diagnostic bundle by running the cdsw logs command directly on the Master node.

Cloudera Bug: OPSAPS-44016, DSE-3160

CDS Powered By Apache Spark

Spark lineage collection is not supported with Cloudera Data Science Workbench

Lineage collection is enabled by default in Spark 2.3. This feature does not work with Cloudera Data Science Workbench because the lineage log directory is not automatically mounted into CDSW engines when a session/job is started.

Affected Versions: CDS 2.3 release 2 (and higher) Powered By Apache Spark

With Spark 2.3 release 3, if Spark cannot find the lineage log directory, it will automatically disable lineage collection for that application. Spark jobs will continue to execute in Cloudera Data Science Workbench, but lineage information will not be collected.

With Spark 2.3 release 2, Spark jobs will fail in Cloudera Data Science Workbench. Either upgrade to Spark 2.3 release 3 which includes a partial fix (as described above) or use one of the following workarounds to disable Spark lineage:

Workaround 1: Disable Spark Lineage Per-Project in Cloudera Data Science Workbench

To do this, set spark.lineage.enabled to false in a spark-defaults.conf file in your Cloudera Data Science Workbench project. This will need to be done individually for each project as required.
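
As an illustration of Workaround 1, the sketch below sets the property in a given spark-defaults.conf file without duplicating it on repeated runs. The function name disable_spark_lineage is hypothetical, and the sketch assumes the whitespace-separated "key value" form that spark-defaults.conf files use.

```shell
# disable_spark_lineage CONF_FILE
# Ensures CONF_FILE contains exactly one spark.lineage.enabled entry, set
# to false, and leaves all other settings untouched. (Hypothetical helper.)
disable_spark_lineage() {
  local conf_file=$1
  touch "$conf_file"
  # Copy every line except an existing lineage setting; grep exits non-zero
  # when nothing is left to copy, which is fine here
  grep -v '^[[:space:]]*spark\.lineage\.enabled' "$conf_file" > "$conf_file.tmp" || true
  printf 'spark.lineage.enabled false\n' >> "$conf_file.tmp"
  mv "$conf_file.tmp" "$conf_file"
}
```

From a session terminal, this could be called as disable_spark_lineage /home/cdsw/spark-defaults.conf, assuming the file lives in the project root.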

Workaround 2: Disable Spark Lineage for the Cluster

  1. Log in to Cloudera Manager and go to the Spark 2 service.
  2. Click Configuration.
  3. Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection.
  4. Click Save Changes.
  5. Go back to the Cloudera Manager homepage and restart the CDSW service for this change to go into effect.

Cloudera Bug: DSE-3720, CDH-67643

Crashes and Hangs

  • High I/O utilization on the application block device can cause the application to stall or become unresponsive. Users should read and write data directly from HDFS rather than staging it in their project directories.

  • Installing ipywidgets or a Jupyter notebook into a project can cause Python engines to hang due to an unexpected configuration. The issue can be resolved by deleting the installed libraries from the R engine terminal.

Engines

  • Environmental variables set in the Admin panel are not being propagated to projects (experiments, sessions, jobs) as expected.

    Affected Version: Cloudera Data Science Workbench 1.4.0

    Fixed Version: Cloudera Data Science Workbench 1.4.2

    Cloudera Bug: DSE-4422

  • Configuring duplicate mount points in the site admin panel (Admin > Engines > Mounts) results in sessions crashing in the workbench.

    Cloudera Bug: DSE-3308

  • Spawning remote workers fails in R when the env parameter is not set. For more details, see Spawning Workers.

    Cloudera Bug: DSE-3384

  • Autofs mounts are not supported with Cloudera Data Science Workbench.

    Cloudera Bug: DSE-2238

  • When using Conda to install Python packages, you must specify the Python version to match the Python versions shipped in the engine image (2.7.11 and 3.6.1). If not specified, the conda-installed Python version will not be used within a project. Pip (pip and pip3) does not face this issue.

Custom Engine Images

  • Cloudera Data Science Workbench only supports customized engines that are based on the Cloudera Data Science Workbench base image.

  • Cloudera Data Science Workbench does not support creation of custom engines larger than 10 GB.

    Cloudera Bug: DSE-4420

  • Cloudera Data Science Workbench does not support pulling images from registries that require Docker credentials.

    Cloudera Bug: DSE-1521

  • The contents of certain pre-existing standard directories such as /home/cdsw, /tmp, /opt/cloudera, and so on, cannot be modified while creating customized engines. This means any files saved in these directories will not be accessible from sessions that are running on customized engines.

    Workaround: Create a new custom directory in the Dockerfile used to create the customized engine, and save your files to that directory. Or, create a new custom directory on all the Cloudera Data Science Workbench gateway hosts and save your files to those directories. Then, mount this directory to the custom engine.

  • When an HTTP/HTTPS proxy is in use, Docker commands fail on Cloudera Data Science Workbench engines that are not available locally (such as custom engine images).

    Workaround: To work around this issue, log on to a non-CDSW cluster host and run the docker pull command to pull the image onto that host. Then, save the image to an archive with docker save, use scp to copy the archive to the CDSW host, and run docker load on that host to load the image.

    Cloudera Bug: DSE-4427
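
The proxy workaround above (pull on another host, copy, then load) might look like the following sketch. The helper names run and transfer_image, the DRY_RUN variable, and the default archive path are all hypothetical. The docker save step is an assumption: docker load reads an archive, so the image has to be exported before it can be copied.

```shell
# run CMD...: execute a command, or only print it when DRY_RUN=1
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# transfer_image IMAGE CDSW_HOST [ARCHIVE]
# Run on a host that can reach the registry. (Hypothetical helper.)
transfer_image() {
  local image=$1 cdsw_host=$2 archive=${3:-/tmp/engine-image.tar}
  run docker pull "$image" || return 1
  # Export the pulled image to a tar archive
  run docker save -o "$archive" "$image" || return 1
  # Copy the archive to the same path on the CDSW host
  run scp "$archive" "$cdsw_host:$archive" || return 1
  # Import the archive into Docker on the CDSW host
  run ssh "$cdsw_host" docker load -i "$archive"
}
```

Setting DRY_RUN=1 prints the four commands without running them, which is a convenient way to review them first.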

Experiments

  • Experiments do not store snapshots of project files. You cannot automatically restore code that was run as part of an experiment.

  • Experiments will fail if your project filesystem is too large for the Git snapshot process. As a general rule, any project files (code, generated model artifacts, dependencies, etc.) larger than 50 MB must be part of your project's .gitignore file so that they are not included in snapshots for experiment builds.

  • Experiments cannot be deleted. As a result, be conscious of how you use the track_metrics and track_file functions.
    • Do not track files larger than 50 MB.
    • Do not track more than 100 metrics per experiment. Excessive metric calls from an experiment may cause Cloudera Data Science Workbench to hang.
  • The Experiments table will allow you to display only three metrics at a time. You can select which metrics are displayed from the metrics dropdown. If you are tracking a large number of metrics (100 or more), you might notice some performance lag in the UI.

  • Arguments are not supported with Scala experiments.

  • The track_metrics and track_file functions are not supported with Scala experiments.

  • The UI does not display a confirmation when you start an experiment or any alerts when experiments fail.
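
To follow the 50 MB guideline for Git snapshots mentioned in the list above, you could scan a project for oversized files and add them to .gitignore. The sketch below is one way to do that; the function names list_large_files and ignore_large_files are hypothetical, and the size suffix M for find is assumed to be available (it is in GNU and BSD find).

```shell
# list_large_files PROJECT_DIR [LIMIT_MB]
# Prints project-relative paths of files larger than LIMIT_MB (default 50),
# skipping anything under a top-level .git directory. (Hypothetical helper.)
list_large_files() {
  local project_dir=$1 limit_mb=${2:-50}
  ( cd "$project_dir" &&
    find . -type f -size +"${limit_mb}M" ! -path './.git/*' |
    sed 's|^\./||' | sort )
}

# ignore_large_files PROJECT_DIR [LIMIT_MB]
# Appends each large file to PROJECT_DIR/.gitignore unless already listed.
ignore_large_files() {
  local project_dir=$1 limit_mb=${2:-50} path
  list_large_files "$project_dir" "$limit_mb" | while IFS= read -r path; do
    grep -qxF "/$path" "$project_dir/.gitignore" 2>/dev/null ||
      printf '/%s\n' "$path" >> "$project_dir/.gitignore"
  done
}
```

For example, ignore_large_files /home/cdsw run from a session terminal. Review the resulting .gitignore before starting an experiment or model build.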

GPU Support

Only CUDA-enabled NVIDIA GPU hardware is supported

Cloudera Data Science Workbench only supports CUDA-enabled NVIDIA GPU cards.

Heterogeneous GPU hardware is not supported

You must use the same GPU hardware across a single Cloudera Data Science Workbench deployment.

GPUs are not detected after a machine reboot

This issue occurs because certain NVIDIA modules do not load automatically after a reboot.

Workaround: To work around this issue, use the following steps to manually load the required modules before Cloudera Data Science Workbench services start. The following commands load the nvidia.ko module, create the /dev/nvidiactl device, and create the list of devices at /dev/nvidia0. They will also create the /dev/nvidia-uvm and /dev/nvidia-uvm-tools devices, and assign execute privileges to /etc/rc.modules. Run these commands once on all the machines that have GPU hardware.

Manually load the required NVIDIA modules:
sudo tee -a /etc/rc.modules > /dev/null <<EOMSG
/usr/bin/nvidia-smi
/usr/bin/nvidia-modprobe -u -c=0
EOMSG
Set execute permission for /etc/rc.modules:
sudo chmod +x /etc/rc.modules

Cloudera Bug: DSE-2847

Jobs API

  • Cloudera Data Science Workbench does not support changing your API key, or having multiple API keys.

  • Currently, you cannot create a job, stop a job, or get the status of a job using the Jobs API.

Models

  • Known Issues with Model Builds and Deployed Models
    • Re-deploying or re-building models results in model downtime (usually brief).

    • Model deployment will fail if your project filesystem is too large for the Git snapshot process. As a general rule, any project files (code, generated model artifacts, dependencies, etc.) larger than 50 MB must be part of your project's .gitignore file so that they are not included in snapshots for model builds.

    • (Affects versions 1.4.0, 1.4.2) Model deployment fails if your model imports code from other files/folders within the project. To work around this issue, add the following lines to your model code to append the project filesystem (mounted to /home/cdsw) to your system path.
      import sys
      sys.path.append("/home/cdsw")

      Alternatively, configure the PYTHONPATH environmental variable for your model. You can do this either when you create the model for the first time or when you redeploy an existing model with the new environmental variable setting.

      This issue has been fixed in Cloudera Data Science Workbench 1.4.3.

    • Model builds will fail if your project filesystem includes a .git directory (likely hidden or nested). Typical build stage errors include:
      Error: 2 UNKNOWN: Unable to schedule build: [Unable to create a checkpoint of current source: [Unable to push sources to git server: ...
      To work around this, rename the .git directory (for example, NO.git) and re-build the model.
    • JSON requests made to active models should not be more than 5 MB in size. This is because JSON is not suitable for very large requests and has high overhead for binary objects such as images or video. Call the model with a reference to the image or video, such as a URL, instead of the object itself.

    • Any external connections, for example, a database connection or a Spark context, must be managed by the model's code. Models that require such connections are responsible for their own setup, teardown, and refresh.

    • Model logs and statistics are preserved only as long as the individual replica is active. Cloudera Data Science Workbench may restart a replica whenever it deems it necessary (for example, after bad input to the model).

    • (Affects version 1.4.x) The model deployment example (predict.py) in the in-built Python template project does not work anymore due to a change in dependencies in the sklearn package. A working replacement for the predict.py file has been provided here: Deploy the Model - Iris Dataset.

  • Limitations
    • Scala models are not supported.

    • Spawning worker threads is not supported with models.

    • Models deployed using Cloudera Data Science Workbench are not highly-available.

    • Dynamic scaling and auto-scaling are not currently supported. To change the number of replicas in service, you will have to re-deploy the build.
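
For the .git directory issue listed under Known Issues with Model Builds and Deployed Models, the following sketch locates hidden or nested .git directories in a project and renames them as suggested there. The function names find_git_dirs and rename_git_dirs are hypothetical.

```shell
# find_git_dirs PROJECT_DIR
# Prints every directory named .git under PROJECT_DIR without descending
# into it. (Hypothetical helper.)
find_git_dirs() {
  find "$1" -type d -name .git -prune
}

# rename_git_dirs PROJECT_DIR
# Renames each .git directory found to NO.git in the same parent directory.
rename_git_dirs() {
  local dir
  find_git_dirs "$1" | while IFS= read -r dir; do
    mv "$dir" "$(dirname "$dir")/NO.git"
  done
}
```

Run find_git_dirs /home/cdsw first to see what would be renamed; rename NO.git back to .git if you need the repository metadata again.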

Networking

  • Custom /etc/hosts entries on Cloudera Data Science Workbench hosts do not propagate to sessions and jobs running in containers.

    Cloudera Bug: DSE-2598

  • Initialisation of Cloudera Data Science Workbench (cdsw init) will fail if localhost does not resolve to 127.0.0.1.

  • Cloudera Data Science Workbench does not support DNS servers running on 127.0.0.1:53. This IP address resolves to the container localhost within Cloudera Data Science Workbench containers. As a workaround, use either a non-loopback address or a remote DNS server.
  • Kubernetes throws the following error when /etc/resolv.conf lists more than three domains:
    Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
    Due to a limitation in the libc resolver, only two DNS servers are supported in /etc/resolv.conf. Kubernetes uses one additional entry for the cluster DNS.
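
A quick way to check a host against the limits described above is to count the nameserver entries and search domains in its resolv.conf. This is a minimal sketch with hypothetical function names; it treats the last search line as the effective one.

```shell
# count_nameservers FILE: number of nameserver lines in FILE
count_nameservers() {
  awk '$1 == "nameserver" { n++ } END { print n + 0 }' "$1"
}

# count_search_domains FILE: number of domains on the last search line
count_search_domains() {
  awk '$1 == "search" { n = NF - 1 } END { print n + 0 }' "$1"
}

# check_resolv_conf [FILE]
# Warns on stderr and returns non-zero when FILE (default /etc/resolv.conf)
# has more than two nameservers or more than three search domains.
check_resolv_conf() {
  local file=${1:-/etc/resolv.conf} status=0
  if [ "$(count_nameservers "$file")" -gt 2 ]; then
    echo "more than two nameserver entries in $file" >&2
    status=1
  fi
  if [ "$(count_search_domains "$file")" -gt 3 ]; then
    echo "more than three search domains in $file" >&2
    status=1
  fi
  return "$status"
}
```

Running check_resolv_conf on each CDSW host before cdsw init gives an early warning of these conditions.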

Security

TSB-349: SQL Injection Vulnerability in Cloudera Data Science Workbench

An SQL injection vulnerability was found in Cloudera Data Science Workbench. This would allow any authenticated user to run arbitrary queries against CDSW’s internal database. The database contains user contact information, bcrypt-hashed CDSW passwords (in the case of local authentication), API keys, and stored Kerberos keytabs.

Products affected: Cloudera Data Science Workbench (CDSW)

Releases affected: CDSW 1.4.0, 1.4.1, 1.4.2

Users affected: All

Date/time of detection: 2018-10-18

Detected by: Milan Magyar (Cloudera)

Severity (Low/Medium/High): Critical (9.9): CVSS:3.0/AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:H

Impact: An authenticated CDSW user can arbitrarily access and modify the CDSW internal database. This allows privilege escalation in CDSW, Kubernetes, and the Linux host; creation, deletion, modification, and exfiltration of data, code, and credentials; denial of service; and data loss.

CVE: CVE-2018-20091

Immediate action required:

  1. Strongly consider performing a backup before beginning. We advise you to have a backup before performing any upgrade and before beginning this remediation work.

  2. Upgrade to Cloudera Data Science Workbench 1.4.3 (or higher).

  3. In an abundance of caution, Cloudera recommends that you revoke credentials and secrets stored by CDSW. To revoke these credentials:

    1. Change the password for any account with a keytab or Kerberos credential that has been stored in CDSW. This includes the Kerberos principals for the associated CDH cluster if entered on the CDSW “Hadoop Authentication” user settings page.

    2. With Cloudera Data Science Workbench 1.4.3 running, run the following remediation script on each CDSW node, including the master and all workers: Remediation Script for TSB-349

      Note: Cloudera Data Science Workbench will become unavailable during this time.

    3. The script performs the following actions:
      1. If using local user authentication, logs out every user and resets their CDSW password.

      2. Regenerates or deletes various keys for every user.

      3. Resets secrets used for internal communications.

    4. Fully stop and start Cloudera Data Science Workbench (a restart is not sufficient).

      • For CSD-based deployments, restart the CDSW service in Cloudera Manager.

        OR

      • For RPM-based deployments, run cdsw stop followed by cdsw start on the CDSW master node.

    5. If using internal TLS termination: revoke and regenerate the CDSW TLS certificate and key.

    6. For each user, revoke the previous CDSW-generated SSH public key for git integration on the git side (the private key in CDSW has already been deleted). A new SSH key pair has already been generated and should be installed in the old key’s place.

    7. Revoke and regenerate any credential stored within a CDSW project, including any passwords stored in projects’ environment variables.

  4. Verify all CDSW settings to ensure they are unchanged (e.g. SMTP server, authentication settings, custom docker images, host mounts, etc).

  5. Treat all CDSW hosts as potentially compromised with root access. Remediate per your policy.

Addressed in release/refresh/patch: Cloudera Data Science Workbench 1.4.3

For the latest update on this issue see the corresponding Knowledge article:

TSB 2019-349: CDSW SQL Injection Vulnerability

TSB-328: Unauthenticated User Enumeration in Cloudera Data Science Workbench

Unauthenticated users can get a list of user accounts of Cloudera Data Science Workbench.

Affected Versions: Cloudera Data Science Workbench 1.4.0 (and lower)

Fixed Versions: Cloudera Data Science Workbench 1.4.2 (and higher)

Immediate action required: Upgrade to the latest version of Cloudera Data Science Workbench (1.4.2 or higher).

For more details, see the Security Bulletins - TSB-328.

SSH access to Cloudera Data Science Workbench nodes must be disabled

The container runtime and application data storage are not fully secure from untrusted users who have SSH access to the gateway nodes. Therefore, SSH access to the gateway nodes for untrusted users should be disabled for security and resource utilization reasons.

SSH tunnels do not work in Cloudera Data Science Workbench 1.4.0

Affected Version: Cloudera Data Science Workbench 1.4.0

Fixed Version: Cloudera Data Science Workbench 1.4.2

Cloudera Bug: DSE-4741

TLS/SSL

  • When an HTTP/HTTPS proxy is in use, Docker commands fail on Cloudera Data Science Workbench engines that are not available locally (such as custom engine images).

    Workaround: To work around this issue, log on to a non-CDSW cluster host and run the docker pull command to pull the image onto that host. Then, save the image to an archive with docker save, use scp to copy the archive to the CDSW host, and run docker load on that host to load the image.

    Cloudera Bug: DSE-4427

  • On TLS-enabled clusters, workers (in engines) and collection of usage metrics fail because values for the CDSW_PROJECT_URL and CDSW_DS_API_URL environmental variables are set incorrectly (they use https:// instead of http://). Use the following steps to fix the values for these variables.

    Affected Version: Cloudera Data Science Workbench 1.4.0

    Fixed Version: Cloudera Data Science Workbench 1.4.2. If you cannot update to version 1.4.2 (or higher), use the workaround described below.

    Workaround: Perform these steps for each project.
    1. Go to the project Overview page.
    2. Click Open Workbench and launch a new session.
    3. Use the workbench to print out the values for the CDSW_PROJECT_URL and CDSW_DS_API_URL environmental variables. Save these values somewhere. For examples, see Accessing Environmental Variables from Projects.
    4. Now go back to the project and click Settings > Engine.
    5. Add the two environmental variables, CDSW_PROJECT_URL and CDSW_DS_API_URL, on this page. For both variables, copy in the values saved previously but as you do so, replace https:// with http://.
    6. Click Save Environment.

    Cloudera Bug: DSE-4293, DSE-4572, DSE-4202

  • Self-signed certificates where the Certificate Authority is not part of the user's trust store are not supported for TLS termination. For more details, see Enabling TLS/SSL - Limitations.

  • Cloudera Data Science Workbench does not support the use of encrypted private keys for TLS.

    Cloudera Bug: DSE-1708

  • External TLS termination does not work with Cloudera Data Science Workbench 1.4.0.

    Affected Version: Cloudera Data Science Workbench 1.4.0

    Fixed Version: Cloudera Data Science Workbench 1.4.2

    Cloudera Bug: DSE-4640
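
For the CDSW_PROJECT_URL and CDSW_DS_API_URL workaround in the list above, the corrected values can be printed from a session terminal instead of being edited by hand. This is a small sketch with hypothetical function names; it only prints the values and does not change any project setting.

```shell
# to_http URL: prints URL with a leading https:// replaced by http://
to_http() {
  printf '%s\n' "$1" | sed 's|^https://|http://|'
}

# print_http_urls
# Prints NAME=value lines for the two variables, ready to copy into the
# project's Settings > Engine page. (Hypothetical helper.)
print_http_urls() {
  local name value
  for name in CDSW_PROJECT_URL CDSW_DS_API_URL; do
    # printenv fails for unset variables; treat those as empty
    value=$(printenv "$name" || true)
    printf '%s=%s\n' "$name" "$(to_http "$value")"
  done
}
```

Call print_http_urls in a running session and copy the two values into the environmental variables described in step 5 of the workaround.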

LDAP

  • LDAP group search fails when Active Directory returns escape characters as part of the distinguished name (DN).

    Cloudera Bug: DSE-4898

Kerberos

  • On non-kerberized clusters, HADOOP_USER_NAME defaults to cdsw. This is a change from previous versions (1.3.x and lower) where if no HADOOP_USER_NAME was entered for a user, HADOOP_USER_NAME would fall back to that user's Cloudera Data Science Workbench username.

    Affected Version: Cloudera Data Science Workbench 1.4.0

    Fixed Version: Cloudera Data Science Workbench 1.4.2

    Cloudera Bug: DSE-4240

  • Using Kerberos plugin modules in krb5.conf is not supported.

  • Modifying the default_ccache_name parameter in krb5.conf does not work in Cloudera Data Science Workbench. Only the default path for this parameter, /tmp/krb5cc_${uid}, is supported.

  • PowerBroker-equipped Active Directory is not supported.

    Cloudera Bug: DSE-1838

  • Cloudera Data Science Workbench does not support the use of a FreeIPA KDC.

    Cloudera Bug: DSE-1482

  • When you upload a Kerberos keytab to authenticate yourself to the CDH cluster, Cloudera Data Science Workbench might display a fleeting error message ('cancelled') in the bottom right corner of the screen, even if authentication was successful. This error message can be ignored.

    Cloudera Bug: DSE-2344

Usability

  • The Files > New Folder dialog box is unresponsive in Cloudera Data Science Workbench 1.4.0

    Even though the New Folder dialog box is unresponsive, you should be able to see the folder you've created once you refresh the page. If that does not work, use the workaround described below.

    Affected Version: Cloudera Data Science Workbench 1.4.0

    Fixed Version: Cloudera Data Science Workbench 1.4.2

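    Workaround: Open the workbench and use either the workbench command prompt or the Terminal to create a new folder instead. For example, if you are using the workbench command prompt in a Python session, run:
    !mkdir newdir
    If you are using the Terminal, omit the leading exclamation mark and run mkdir newdir.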

    Cloudera Bug: DSE-4807

  • Cloudera Data Science Workbench doesn't always persist file changes made in Workbench Editor

    Due to a filesystem bug, sometimes changes made to a project file in the Workbench do not persist when you have the same project Workbench opened in multiple browser windows.

    Affected Version: Cloudera Data Science Workbench 1.4.0

    Fixed Version: Cloudera Data Science Workbench 1.4.2. If you cannot update to version 1.4.2 (or higher), use the workaround described below.

    Workaround: Use only one browser window at a time to work on a project in the Workbench. Make sure no other browser windows are open to the Workbench for the same project.

    Cloudera Bug: DSE-4353

  • Creating a Project using Git Clone via SSH does not work

    Due to a bug in Cloudera Data Science Workbench 1.4.0, Git clone via SSH does not work out of the box. Either upgrade to Cloudera Data Science Workbench 1.4.2 (or higher), or use the workaround described below.

    Affected Version: Cloudera Data Science Workbench 1.4.0

    Fixed Version: Cloudera Data Science Workbench 1.4.2. If you cannot update to version 1.4.2 (or higher), use the workaround described below.

    Workaround:
    1. Create a blank project.
    2. Launch a new session.
    3. Click Terminal access.
    4. Run the following commands to initialize a Git repository and pull your project over SSH. Replace <Git Clone with SSH URL> with the SSH URL for your project, for example git@github.example.com:doc-examples/examples.git. If the repository's default branch is not master, use that branch name in the last command.
      git init
      git remote add origin <Git Clone with SSH URL>
      git pull origin master

    Cloudera Bug: DSE-4278

  • iFrame visualizations do not render in the workbench.

    Cloudera Data Science Workbench 1.4.2 (and higher) added a feature that adds HTTP security headers to responses from Cloudera Data Science Workbench. This setting is enabled by default. However, the X-Frame-Options header added as part of this feature blocks rendering of iFrames injected by third-party data visualization libraries.

    Workaround: A site administrator can go to the Admin > Security page and disable the Enable HTTP security headers property. Restart Cloudera Data Science Workbench for this change to take effect.

    Affected Version: Cloudera Data Science Workbench 1.4.2 (and higher)

    Cloudera Bug: DSE-5274

  • Scala sessions hang when running large scripts (longer than 100 lines) in the Workbench editor.

    Workaround 1:

    Execute the script in manually-selected chunks. For example, highlight the first 50 lines and select Run > Run Line(s).

    Workaround 2:

    Restructure your code by moving content into imported functions so that the script is shorter than 100 lines.

  • The R engine is unable to display multi-byte characters in plots. Multi-byte characters occur in languages such as Korean, Japanese, and Chinese.

    Workaround: Use the showtext R package to support more fonts and characters. For example, to display Korean characters:
    install.packages('showtext')
    library(showtext)
    font_add_google("Noto Sans KR", "noto")
    showtext_auto()

    Cloudera Bug: DSE-7308

  • When hundreds of users are logged in and creating processes, the system's nproc and nofile limits may be reached. Use ulimit or other methods to increase the maximum number of processes and open files that a user can create on the system.

  • When rebooting, Cloudera Data Science Workbench nodes can take a significant amount of time (about 30 minutes) to become ready.

  • Long-running operations such as fork and clone can time out when projects are large or connections outlast the HTTP timeouts of reverse proxies.

  • The Scala kernel does not support auto-complete features in the editor.

  • Scala and R code can sometimes indent incorrectly in the workbench editor.

    Cloudera Bug: DSE-1218

  • Installation of the XML package fails in the R kernel.

    Cloudera Bug: DSE-2201
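
For the nproc and nofile limits mentioned earlier in this list, the current values can be inspected with the shell's ulimit builtin. The following is a minimal sketch, assuming a Bash shell on a Linux host. The function name show_limits, the user name, and the numeric values in the comments are placeholders rather than recommendations from Cloudera.

```shell
#!/usr/bin/env bash
# Sketch: print the soft and hard limits that apply to the current shell.
# -u is the maximum number of user processes (nproc);
# -n is the maximum number of open file descriptors (nofile).
show_limits() {
  echo "nproc  soft=$(ulimit -Su) hard=$(ulimit -Hu)"
  echo "nofile soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
}

show_limits

# One common way to raise the limits persistently is pam_limits, through
# /etc/security/limits.conf (format: <domain> <type> <item> <value>).
# The user name and values below are placeholders:
#   exampleuser  soft  nproc   65536
#   exampleuser  hard  nproc   65536
#   exampleuser  soft  nofile  1048576
#   exampleuser  hard  nofile  1048576
```

Limits set this way apply to new login sessions, so the affected user typically needs to log in again before show_limits reflects the change.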