Known Issues and Limitations in Cloudera Data Science Workbench 1.7.2

Installation

During the Cloudera Data Science Workbench startup process, you might see timeout errors such as the following:
Pods not ready in cluster default ['role/<pod_name>'].
This happens when some pods take longer to start up and dependent processes time out. Restart the CDSW service to get past this issue.

Cloudera Bug: DSE-6855

Upgrades

Upgrades supported from CDSW 1.5.x (and higher) to CDSW 1.7.x

Cloudera Data Science Workbench only supports upgrades to version 1.7.x from version 1.5.x and 1.6.x. If you are using an earlier version, you must first upgrade to version 1.5.x or 1.6.x, and then upgrade to version 1.7.x.
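
The upgrade rule above can be sketched as a small check. This is an illustration only: the function name upgrade_path_to_1_7 is hypothetical and is not part of any CDSW tooling, and the sketch does not verify whether the intermediate step is itself a supported upgrade from your current release.

```python
def upgrade_path_to_1_7(current_version: str) -> list:
    """Return the version lines to pass through on the way to 1.7.x.

    Hypothetical helper: it encodes only the rule stated above, namely that
    direct upgrades to 1.7.x are supported from 1.5.x and 1.6.x.
    """
    major, minor = (int(part) for part in current_version.split(".")[:2])
    if (major, minor) >= (1, 7):
        return []  # already on 1.7.x or later
    if (major, minor) in ((1, 5), (1, 6)):
        return ["1.7.x"]  # direct upgrade
    # Earlier releases go through 1.5.x or 1.6.x first (1.6.x chosen here).
    return ["1.6.x", "1.7.x"]


print(upgrade_path_to_1_7("1.4.2"))  # ['1.6.x', '1.7.x']
print(upgrade_path_to_1_7("1.6.1"))  # ['1.7.x']
```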

CDSW restart issue on multi-node deployments; CDSW Web UI does not automatically come up after upgrading to CDSW 1.7.1

After upgrading multi-node deployments (1 CDSW Master, multiple Workers) to CDSW 1.7.1, the web application is not automatically accessible as expected. This happens because of a bug where the CDSW restart process does not open the HTTP/HTTPS port required by the web pod.

Affected Version: Cloudera Data Science Workbench 1.7.1

Fixed Version: Cloudera Data Science Workbench 1.7.2

Workaround: This is a one-time fix needed to solve the issue with the CDSW restart process.
  1. Download the following patch files:
  2. Copy the ingress-controller.yaml file to /etc/cdsw/patches/default/deployment/ingress-controller.yaml on the CDSW master node.
  3. Copy the tcp-ingress-controller.yaml file to /etc/cdsw/patches/default/deployment/tcp-ingress-controller.yaml on the CDSW master node.
  4. Restart Cloudera Data Science Workbench.

Cloudera Bug: DSE-9587, DSE-9663

Domain name resolution issues after upgrading to CDSW 1.7.x; Pods stuck in CrashLoopBackOff state

After upgrading to CDSW 1.7.x, certain application pods (s2i-registry and image-puller) get stuck in CrashLoopBackOff state. This is due to an issue with the DNS resolver.

Workaround: Remove or comment out the search entry from the /etc/resolv.conf file.

# cat /etc/resolv.conf
.....
# search example.com
nameserver 192.0.2.1
nameserver 192.0.2.2
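
If you prefer to script the change, the edit can be sketched as a text transformation. The function name comment_out_search is hypothetical; the sketch only transforms a string, so reading and writing /etc/resolv.conf (which requires root privileges) is left to the caller.

```python
def comment_out_search(resolv_conf_text: str) -> str:
    """Return resolv.conf content with every active 'search' line commented out."""
    result = []
    for line in resolv_conf_text.splitlines():
        # An active entry has 'search' as its first whitespace-separated field.
        if line.split()[:1] == ["search"]:
            line = "# " + line
        result.append(line)
    return "\n".join(result) + "\n"


sample = "search example.com\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"
print(comment_out_search(sample), end="")
```

Lines that are already commented out are left unchanged, so the function can be applied more than once safely.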

CDSW shows a warning message to update to a lower Base Image version after upgrading to CDSW 1.7.x

You may see the following warning message after upgrading to CDSW 1.7.x, asking you to update the Base Image version: There is a new version of Base Image available. Latest engine image is: “Base Image v9”.

You can ignore this message because CDSW 1.7.x comes with Base Image v10. However, if you choose to update to v9 and click Update version, the host system tries to download the Base Image from the online Docker repository: docker.repository.cloudera.com/cdsw/engine:9. Depending on how long the host takes to pull the v9 image, your session may get stuck in a "Scheduling" state.

CDSW does not display this message on a fresh installation of CDSW 1.7.x.

Cloudera Bug: DSE-10170

On a TLS-enabled cluster Cloudera Manager points the Cloudera Data Science Workbench web UI to http:// instead of https://

After upgrading the Cloudera Data Science Workbench parcel and CSD to 1.7.x, the link to the Cloudera Data Science Workbench web UI from Cloudera Manager redirects to http://cdsw.your-company.com instead of https://cdsw.your-company.com on a TLS-enabled cluster.

Workaround: You can manually enter the complete domain name with the https protocol in your web browser. Alternatively, contact Cloudera Support to obtain a hotfix and the instructions to apply the patch. Quote the following issue while raising the support request: ENGESC-199.

Cloudera Bug: ENGESC-199

CDH Integration

CDH client configuration changes require a full Cloudera Data Science Workbench restart

Cloudera Data Science Workbench does not automatically detect configuration changes on the CDH cluster. Therefore, any changes made to CDH services, ranging from updates to service configuration properties to complete CDH or CDS parcel upgrades, must be followed by a full reset of Cloudera Data Science Workbench.

Workaround: Depending on your deployment, use one of the following sets of steps to perform a full reset of Cloudera Data Science Workbench. Note that this reset does not impact your data in any way.
  • CSD Deployments - To reset Cloudera Data Science Workbench using Cloudera Manager:
    1. Log into the Cloudera Manager Admin Console.
    2. On the Cloudera Manager homepage, click the drop-down menu to the right of the CDSW service and select Restart. Confirm your choice on the next screen and wait for the action to complete.
    OR
  • RPM Deployments - Run the following steps on the Cloudera Data Science Workbench master host:

    cdsw stop
    cdsw start

Cloudera Manager Integration

CSD distribution/activation fails on mixed-OS clusters when there are third-party parcels running on OSs that are not supported by Cloudera Data Science Workbench

For example, adding a new CDSW gateway host on a RHEL 6 cluster running RHEL 6-compatible parcels will fail. This is because Cloudera Manager will not allow distribution of the RHEL 6 parcels on the new host, which will likely be running a CDSW-compatible operating system such as RHEL 7.

Workaround: To ensure adding a new CDSW gateway host is successful, you must create a copy of the 'incompatible' third-party parcel files and give them the corresponding RHEL 7 names so that Cloudera Manager allows them to be distributed on the new gateway host. Use the following sample instructions to do so:
  1. SSH to the Cloudera Manager Server host.
  2. Navigate to the directory that contains all the parcels. By default, this is /opt/cloudera/parcels.
    cd /opt/cloudera/parcels
  3. Make a copy of the incompatible third-party parcel with the new name. For example, if you have a RHEL 6 parcel that cannot be distributed on a RHEL 7 CDSW host:
    cp <PARCELNAME.cdh5.x.x.p0.123>-el6.parcel <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
  4. Repeat the previous step for the parcel's SHA file.
    cp <PARCELNAME.cdh5.x.x.p0.123>-el6.parcel.sha <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
  5. Update the new files' owner and permissions to match those of existing parcels in the /opt/cloudera/parcels directory.
    chown cloudera-scm:cloudera-scm <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
    chown cloudera-scm:cloudera-scm <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
    chmod 640 <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
    chmod 640 <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
    
You should now be able to add new gateway hosts for Cloudera Data Science Workbench to your cluster.
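
Steps 3 to 5 above can also be scripted. The following is a minimal sketch, not an official tool: the function names el7_name and copy_parcel_as_el7 are hypothetical, and the owner, group, and mode are taken from the sample commands above. It must run as root on the Cloudera Manager Server host for the ownership change to succeed.

```python
import shutil
from pathlib import Path


def el7_name(el6_file: str) -> str:
    """Map '<name>-el6.parcel' or '<name>-el6.parcel.sha' to its el7 name."""
    for suffix in ("-el6.parcel.sha", "-el6.parcel"):
        if el6_file.endswith(suffix):
            return el6_file[: -len(suffix)] + suffix.replace("el6", "el7")
    raise ValueError(f"not an el6 parcel file: {el6_file}")


def copy_parcel_as_el7(parcel_dir: str, el6_parcel: str) -> list:
    """Copy a parcel and its .sha file under el7 names; return the new paths."""
    created = []
    for name in (el6_parcel, el6_parcel + ".sha"):
        target = Path(parcel_dir) / el7_name(name)
        shutil.copy(Path(parcel_dir) / name, target)
        # Match the owner and permissions used in the sample commands above.
        shutil.chown(target, user="cloudera-scm", group="cloudera-scm")
        target.chmod(0o640)
        created.append(target)
    return created
```

For example, copy_parcel_as_el7("/opt/cloudera/parcels", "<PARCELNAME.cdh5.x.x.p0.123>-el6.parcel") would create the two el7 files next to the originals.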

Cloudera Bug: OPSAPS-42130, OPSAPS-31880

CDSW Service health status after a restart does not match the actual state of the application

After a restart, the Cloudera Data Science Workbench service in Cloudera Manager will display Good health even though the Cloudera Data Science Workbench web application might need a few more minutes to get ready to serve requests.

CDS Powered By Apache Spark

On TLS-enabled CDSW deployments, the embedded Spark UI does not work

If you have a TLS-enabled CDSW deployment, the embedded Spark UI tab does not render as expected.

Workaround: To work around this issue, launch the Spark UI in a separate tab and append '/jobs' after the URL. For example, if your engineID is tb0z9ydiua5q9v2d and the DOMAIN is example.com then view the Spark UI at: https://spark-tb0z9ydiua5q9v2d.example.com/jobs/
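
As a small illustration, the URL pattern from the example above can be built as follows. The function name spark_ui_url is hypothetical; the engine ID and domain are passed in explicitly rather than read from the environment.

```python
def spark_ui_url(engine_id: str, domain: str) -> str:
    """Build the standalone Spark UI URL, including the trailing '/jobs/' path."""
    return f"https://spark-{engine_id}.{domain}/jobs/"


print(spark_ui_url("tb0z9ydiua5q9v2d", "example.com"))
# https://spark-tb0z9ydiua5q9v2d.example.com/jobs/
```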

Alternative workaround: To view running Spark jobs, navigate to Spark History Server UI > Show Incomplete Applications > Application ID

Affected Versions: This issue affects CDSW 1.6.x and CDSW 1.7.x on the following platforms:
  • CDH 5: CDS 2.4 release 2 (and lower)
  • CDH 6: Versions of Spark that ship with CDH 6.0.x, CDH 6.1.x, CDH 6.2.1 (and lower), CDH 6.3.2 (and lower)
Solution: Upgrade to CDSW version 1.7.1 or higher, and either:
  • CDH version 6.4.0, 6.2.2, 6.3.3 or higher
  • CDH 5 with Spark 2.4 release 3

Spark lineage collection is not supported with Cloudera Data Science Workbench

Lineage collection is enabled by default in Spark 2.3. This feature does not work with Cloudera Data Science Workbench because the lineage log directory is not automatically mounted into CDSW engines when a session/job is started.

Affected Versions: CDS 2.3 release 2 (and higher) Powered By Apache Spark

With Spark 2.3 release 3 (or higher), if Spark cannot find the lineage log directory, it will automatically disable lineage collection for that application. Spark jobs will continue to execute in Cloudera Data Science Workbench, but lineage information will not be collected.

With Spark 2.3 release 2, Spark jobs will fail in Cloudera Data Science Workbench. Either upgrade to Spark 2.3 release 3 which includes a partial fix (as described above) or use one of the following workarounds to disable Spark lineage:

Workaround 1: Disable Spark Lineage Per-Project in Cloudera Data Science Workbench

To do this, set spark.lineage.enabled to false in a spark-defaults.conf file in your Cloudera Data Science Workbench project. This will need to be done individually for each project as required.
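
One way to apply this setting from a script is sketched below. The helper name set_spark_property is hypothetical, and the sketch assumes the whitespace-separated "key value" form of spark-defaults.conf; entries written as key=value are not detected. It works on the file content as a string, so reading and writing the project's spark-defaults.conf is left to the caller.

```python
def set_spark_property(conf_text: str, key: str, value: str) -> str:
    """Return spark-defaults.conf content with `key` set to `value`.

    An existing uncommented entry for the key is replaced; otherwise a new
    line is appended.
    """
    lines = conf_text.splitlines()
    new_line = f"{key} {value}"
    for i, line in enumerate(lines):
        if line.split()[:1] == [key]:
            lines[i] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


existing = "spark.executor.memory 2g\n"
print(set_spark_property(existing, "spark.lineage.enabled", "false"), end="")
```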

Workaround 2: Disable Spark Lineage for the Cluster

  1. Log in to Cloudera Manager and go to the Spark 2 service.
  2. Click Configuration.
  3. Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection.
  4. Click Save Changes.
  5. Go back to the Cloudera Manager homepage and restart the CDSW service for this change to go into effect.

Cloudera Bug: DSE-3720, CDH-67643

Crashes and Hangs

  • Third-party security and orchestration software (such as McAfee, Tanium, Symantec) can lead to CDSW crashing randomly

    Workaround: Disable all third-party security agents on CDSW hosts.

    Cloudera Bug: DSE-8550

  • High I/O utilization on the application block device can cause the application to stall or become unresponsive. Users should read and write data directly from HDFS rather than staging it in their project directories.

  • Installing ipywidgets or a Jupyter notebook into a project can cause Python engines to hang due to an unexpected configuration. The issue can be resolved by deleting the installed libraries from the R engine terminal.

Third-party Editors

  • Logs generated by a browser IDE do not appear within the IDE. They are displayed in the Logs tab for the session.

    Cloudera Bug: DSE-6570

  • Sessions with Browser IDEs running do not adhere to the limit set in IDLE_MAXIMUM_MINUTES. Session logs show a warning stating that the idle session will time out, but the timeout does not occur. The session continues to run and consume resources until the timeout set in SESSION_MAXIMUM_MINUTES is reached. Ensure that you manually stop a session after you are finished, so that the resources are available to other users.

    Cloudera Bug: DSE-6651

  • Sessions with Browser IDEs running time out with no warning after the time limit set in SESSION_MAXIMUM_MINUTES is reached, regardless of whether or not the session is idle. Periodically stop the browser IDE and session manually to avoid reaching SESSION_MAXIMUM_MINUTES.

    Cloudera Bug: DSE-6652

  • The lack of a ROOT CA certificate can cause issues with terminals and the Jupyter editor after upgrading CDSW.

    Problem: After upgrading from CDSW version 1.5 to version 1.7.1, the terminal does not open for any kernel, and the Jupyter notebook does not work.

    Workaround: In CDSW, go to Admin > Security, and paste the contents of the internal CA root certificate file directly into the Root CA configuration field. You should then be able to launch a new session and start the terminal or launch the Jupyter editor. It is not necessary to restart CDSW. This procedure is described in Configuring Custom Root CA Certificate.

Engines

  • Configuring duplicate mount points in the site admin panel (Admin > Engines > Mounts) results in sessions crashing in the workbench.

    Cloudera Bug: DSE-3308

  • Spawning remote workers fails in R when the env parameter is not set. For more details, see Distributed Computing with Workers.

    Cloudera Bug: DSE-3384

  • Autofs mounts are not supported with Cloudera Data Science Workbench.

    Cloudera Bug: DSE-2238

  • When using Conda to install Python packages, you must specify the Python version to match the Python versions shipped in the engine image (2.7.11 and 3.6.1). If not specified, the conda-installed Python version will not be used within a project. Pip (pip and pip3) does not face this issue.

  • When engine version 8 (or higher) is used, and the Allow containers to run as root property is disabled, the creation of containers that run with root privileges is prevented. Additionally, the elevation of privileges from the cdsw user to root (for example, using a setuid binary) is also prevented.

    As a result, running the ping command, which is a setuid binary, will fail in engine 8 (or higher) when the Allow containers to run as root property is disabled.

    $ ping www.google.com
    ping: icmp open socket: Operation not permitted
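
For the duplicate mount point issue listed first in this section, a list of mounts can be checked before it is entered under Admin > Engines > Mounts. This is a sketch under stated assumptions: the function name duplicate_mounts is hypothetical, and treating paths that differ only by a trailing slash as duplicates is an assumption about how the paths are compared.

```python
import posixpath
from collections import Counter


def duplicate_mounts(mounts: list) -> list:
    """Return the normalized mount paths that appear more than once."""
    normalized = [posixpath.normpath(path) for path in mounts]
    counts = Counter(normalized)
    return sorted(path for path, count in counts.items() if count > 1)


print(duplicate_mounts(["/data", "/data/", "/opt/shared"]))  # ['/data']
```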

Custom Engine Images

  • Cloudera Data Science Workbench only supports customized engines that are based on the Cloudera Data Science Workbench base image.

  • Cloudera Data Science Workbench does not support creation of custom engines larger than 10 GB.

    Cloudera Bug: DSE-4420

  • Cloudera Data Science Workbench does not support pulling images from registries that require Docker credentials.

    Cloudera Bug: DSE-1521

  • The contents of certain pre-existing standard directories such as /home/cdsw, /tmp, /opt/cloudera, and so on, cannot be modified while creating customized engines. This means any files saved in these directories will not be accessible from sessions that are running on customized engines.

    Workaround: Create a new custom directory in the Dockerfile used to create the customized engine, and save your files to that directory. Or, create a new custom directory on all the Cloudera Data Science Workbench gateway hosts and save your files to those directories. Then, mount this directory to the custom engine.

Experiments

  • (If quotas are enabled) Experiments that are stuck in the Scheduled state due to lack of resources do not automatically start even if you free up existing resources.

    Workaround: Stop the experiment that is stuck in the Scheduled state. Then manually reschedule the experiment.

    Cloudera Bug: DSE-8736

  • Experiments do not store snapshots of project files. You cannot automatically restore code that was run as part of an experiment.

  • Experiments will fail if your project filesystem is too large for the Git snapshot process. As a general rule, any project files (code, generated model artifacts, dependencies, etc.) larger than 50 MB must be part of your project's .gitignore file so that they are not included in snapshots for experiment builds.

  • Experiments cannot be deleted. As a result, be conscious of how you use the track_metrics and track_file functions.
    • Do not track files larger than 50MB.
    • Do not track more than 100 metrics per experiment. Excessive metric calls from an experiment may cause Cloudera Data Science Workbench to hang.
  • The Experiments table will allow you to display only three metrics at a time. You can select which metrics are displayed from the metrics dropdown. If you are tracking a large number of metrics (100 or more), you might notice some performance lag in the UI.

  • Arguments are not supported with Scala experiments.

  • The track_metrics and track_file functions are not supported with Scala experiments.

  • The UI does not display a confirmation when you start an experiment or any alerts when experiments fail.
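
To apply the 50 MB guideline for snapshots described above, a project can be scanned for oversized files. This is a minimal sketch: the function name large_files is hypothetical, and 50 MB is interpreted as 50 × 1024 × 1024 bytes, which is an assumption.

```python
import os

LIMIT_BYTES = 50 * 1024 * 1024  # 50 MB, interpreted here as 50 MiB (assumption)


def large_files(project_dir: str, limit: int = LIMIT_BYTES) -> list:
    """Return project-relative paths of files larger than `limit` bytes."""
    found = []
    for root, dirs, files in os.walk(project_dir):
        dirs[:] = [d for d in dirs if d != ".git"]  # skip Git metadata
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and os.path.getsize(path) > limit:
                found.append(os.path.relpath(path, project_dir))
    return sorted(found)


# Example: list candidate .gitignore entries for a project directory.
# for rel_path in large_files("/home/cdsw"):
#     print("/" + rel_path.replace(os.sep, "/"))
```

Review each reported path before adding it to .gitignore; file names that contain pattern characters such as * or [ need escaping.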

GPU Support

Only CUDA-enabled NVIDIA GPU hardware is supported

Cloudera Data Science Workbench only supports CUDA-enabled NVIDIA GPU cards.

Heterogeneous GPU hardware is not supported

You must use the same GPU hardware across a single Cloudera Data Science Workbench deployment.

Jobs

  • Job notification emails fail intermittently when attachments are included. Emails are delivered either with blank attachments or no attachments at all.

    Cloudera Bug: DSE-9469, DSE-8806

  • Cloudera Data Science Workbench does not support changing your API key, or having multiple API keys.

  • Currently, you cannot use the Jobs API to create a job, stop a job, or get the status of a job.

  • The jobs pipeline visualization in CML and CDSW no longer shows all dependent jobs; only the first step in a dependency chain is displayed. Before the upgrade to 1.7, the UI displayed the whole chain of jobs. Jobs still run in the correct order, but the UI no longer makes that order clear.

    Cloudera Bug: DSE-8003

Models

  • Known Issues with Model Builds and Deployed Models
    • Re-deploying or re-building models results in model downtime (usually brief).

    • Re-starting Cloudera Data Science Workbench does not automatically restart active models. These models must be manually restarted so they can serve requests again.

      Cloudera Bug: DSE-4950

    • Model deployment will fail if your project filesystem is too large for the Git snapshot process. As a general rule, any project files (code, generated model artifacts, dependencies, etc.) larger than 50 MB must be part of your project's .gitignore file so that they are not included in snapshots for model builds.

    • Model builds will fail if your project filesystem includes a .git directory (likely hidden or nested). Typical build stage errors include:
      Error: 2 UNKNOWN: Unable to schedule build: [Unable to create a checkpoint of current source: [Unable to push sources to git server: ...

      To work around this, rename the .git directory (for example, NO.git) and re-build the model.

      Cloudera Bug: DSE-4657

    • JSON requests made to active models should not be more than 5 MB in size. This is because JSON is not suitable for very large requests and has high overhead for binary objects such as images or video. Call the model with a reference to the image or video, such as a URL, instead of the object itself.

    • Any external connections, for example, a database connection or a Spark context, must be managed by the model's code. Models that require such connections are responsible for their own setup, teardown, and refresh.

    • Model logs and statistics are only preserved so long as the individual replica is active. Cloudera Data Science Workbench may restart a replica whenever it deems this necessary (for example, after bad input to the model).

  • Limitations
    • Scala models are not supported.

    • Spawning worker threads is not supported with models.

    • Models deployed using Cloudera Data Science Workbench are not highly-available.

    • Dynamic scaling and auto-scaling are not currently supported. To change the number of replicas in service, you will have to re-deploy the build.
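
The 5 MB guideline for JSON requests noted above can be enforced on the client side before a model is called. This is a sketch only: the function name encode_request, the image_url key, and the URL are placeholders, the request format of a deployed model is not shown here, and 5 MB is interpreted as 5 × 1024 × 1024 bytes.

```python
import json

MAX_REQUEST_BYTES = 5 * 1024 * 1024  # 5 MB, interpreted here as 5 MiB (assumption)


def encode_request(payload: dict) -> bytes:
    """Serialize a model request and refuse bodies above the size limit."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_REQUEST_BYTES:
        raise ValueError(
            f"request is {len(body)} bytes; pass a reference such as a URL instead"
        )
    return body


# Pass a reference to the image rather than the image bytes themselves.
body = encode_request({"image_url": "https://example.com/images/cat.png"})
print(len(body), "bytes")
```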

Networking

  • CDSW cannot launch sessions due to connection errors resulting from a segfault

    Sample error:
    transport: Error while dialing dial tcp 100.77.93.252:20051: connect: connection refused
    Workaround: Enable IPv6 on all CDSW hosts.
    1. Confirm that IPv6 is currently disabled at boot time, that is, that ipv6.disable is set to 1.
      $ dmesg 
      [ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-514.el7.x86_64 root=UUID=3e109aa3-f171-4614-ad07-c856f20f9d25 ro console=tty0 crashkernel=auto console=ttyS0,115200 ipv6.disable=1
      $ cat /proc/cmdline
      .....ipv6.disable=1
    2. Edit /etc/default/grub and delete the ipv6.disable=1 entry from GRUB_CMDLINE_LINUX. For example:
      GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/swap crashkernel=auto rd.lvm.lv=rhel/root"
    3. Run the grub2-mkconfig command to regenerate the grub.cfg file:
      grub2-mkconfig -o /boot/grub2/grub.cfg
      Alternatively, on UEFI systems, you would run the following command:
      grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
    4. Follow the above steps for both CDSW Master and Worker nodes.
    5. Stop the Cloudera Data Science Workbench service.
    6. Reboot all the Cloudera Data Science Workbench hosts to enable IPv6 support.
    7. Start the Cloudera Data Science Workbench service. Run dmesg on the CDSW hosts to confirm that no segfault errors appear.

    Cloudera Bug: DSE-7238, DSE-7455

  • Custom /etc/hosts entries on Cloudera Data Science Workbench hosts do not propagate to sessions and jobs running in containers.

    Cloudera Bug: DSE-2598

  • Initialization of Cloudera Data Science Workbench (cdsw init) will fail if localhost does not resolve to 127.0.0.1.

  • Cloudera Data Science Workbench does not support DNS servers running on 127.0.0.1:53. This IP address resolves to the container localhost within Cloudera Data Science Workbench containers. As a workaround, use either a non-loopback address or a remote DNS server.
  • Kubernetes throws the following error when /etc/resolv.conf lists more than three domains:
    Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
    Due to a limitation in the libc resolver, only two DNS servers are supported in /etc/resolv.conf. Kubernetes uses one additional entry for the cluster DNS.
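
The limits described above can be checked before starting CDSW. The following is a minimal sketch: the function name resolv_conf_problems is hypothetical, and the defaults (three search domains, two nameservers) are taken from the text above.

```python
def resolv_conf_problems(text: str, max_search: int = 3, max_nameservers: int = 2) -> list:
    """Return warnings for resolv.conf content that exceeds the given limits."""
    search_domains = []
    nameservers = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0][0] in "#;":
            continue  # blank line or comment
        if fields[0] == "search":
            search_domains = fields[1:]  # the last search line takes effect
        elif fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
    problems = []
    if len(search_domains) > max_search:
        problems.append(f"{len(search_domains)} search domains (limit {max_search})")
    if len(nameservers) > max_nameservers:
        problems.append(f"{len(nameservers)} nameservers (limit {max_nameservers})")
    return problems


print(resolv_conf_problems("search a.example b.example c.example d.example\nnameserver 192.0.2.1\n"))
# ['4 search domains (limit 3)']
```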

Quotas

  • If custom quota for a user is enabled, and the quotas feature is then disabled, the custom quota setting continues to remain in effect. That is, even if the quotas feature is disabled, the user will still see the cpu/memory limit reached error when they reach the previously set custom quota limit.

    Workaround: If you want to disable quotas, first manually delete each custom quota row and then switch the Quotas toggle to OFF. To remove a custom quota, click the vertical ellipsis at the end of its row and choose Remove.

    Cloudera Bug: DSE-9063

  • (If quotas are enabled) Experiments that are stuck in the Scheduled state due to lack of resources do not automatically start even if you free up existing resources.

    Workaround: Stop the experiment that is stuck in the Scheduled state. Then manually reschedule the experiment.

    Cloudera Bug: DSE-8736

Security

Working in the terminal or an editor should not count as idle session

If a user opens a workbench and is either working exclusively in the terminal or just editing files, Cloudera Data Science Workbench counts that time as idle time and the user gets kicked out after the configured max idle timeout.

Workaround:
  • Increase the idle session timeout by adding a new environment variable, IDLE_MAXIMUM_MINUTES. Click CDSW > Project > Settings > Environmental variables.

    You can set the value of the variables IDLE_MAXIMUM_MINUTES or SESSION_MAXIMUM_MINUTES to their maximum allowed value, which is 35000 (~3 weeks).

  • Alternatively, run a simple script inside a CDSW session to keep the session alive. Open the Cloudera Data Science Workbench, create a file as shown here (assuming a Python project), and then run it in the Workbench.
    import time
    # Sleep for 10,000 seconds (about 2.8 hours) so that the session stays busy.
    time.sleep(10000)

Cloudera Bug: DSE-3080

SSH access to Cloudera Data Science Workbench hosts must be disabled

The container runtime and application data storage are not fully secure from untrusted users who have SSH access to the gateway hosts. Therefore, SSH access to the gateway hosts should be disabled for untrusted users, for both security and resource utilization reasons.

TLS/SSL

  • Self-signed certificates where the Certificate Authority is not part of the user's trust store are not supported for TLS termination. For more details, see Enabling TLS/SSL - Limitations.

  • Cloudera Data Science Workbench does not support the use of encrypted private keys for TLS.

    Cloudera Bug: DSE-1708

  • A "certificate has expired" error displays when you log in to the Cloudera Data Science Workbench web UI. This issue can occur if Cloudera Data Science Workbench exceeds 365 days of continuous uptime because the internal certificate for Kubernetes expires after 1 year.

    Workaround: Restart the Cloudera Data Science Workbench deployment.
    • For CSD installations, restart the Cloudera Data Science Workbench service in Cloudera Manager.
    • For RPM installations, run the following command on the Master host:
      cdsw restart

Kerberos

  • Using Kerberos plugin modules in krb5.conf is not supported.

  • Modifying the default_ccache_name parameter in krb5.conf does not work in Cloudera Data Science Workbench. Only the default path for this parameter, /tmp/krb5cc_${uid}, is supported.

  • PowerBroker-equipped Active Directory is not supported.

    Cloudera Bug: DSE-1838

  • When you upload a Kerberos keytab to authenticate yourself to the CDH cluster, Cloudera Data Science Workbench might display a fleeting error message ('cancelled') in the bottom right corner of the screen, even if authentication was successful. This error message can be ignored.

    Cloudera Bug: DSE-2344

Usability

  • In some cases, the application switcher (grid icon) does not show any other applications, such as Hue or Ranger.

    Cloudera Bug: DSE-865

  • Scala sessions hang when running large scripts (longer than 100 lines) in the Workbench editor.

    Workaround 1:

    Execute the script in manually-selected chunks. For example, highlight the first 50 lines and select Run > Run Line(s).

    Workaround 2:

    Restructure your code by moving content into imported functions so as to bring the size down to under 100 lines.

  • The R engine is unable to display multi-byte characters in plots. Multi-byte characters are used in languages such as Korean, Japanese, and Chinese.

    Workaround: Use the showtext R package to support more fonts and characters. For example, to display Korean characters:
    install.packages('showtext')
    library(showtext)
    font_add_google("Noto Sans KR", "noto")
    showtext_auto()

    Cloudera Bug: DSE-7308

  • When hundreds of users are logged in and creating processes, the system's nproc and nofile limits may be reached. Use ulimit or other methods to increase the maximum number of processes and open files that a user can create on the system.

  • When rebooting, Cloudera Data Science Workbench hosts can take a significant amount of time (about 30 minutes) to become ready.

  • Long-running operations such as fork and clone can time out when projects are large or connections outlast the HTTP timeouts of reverse proxies.

  • The Scala kernel does not support auto-complete features in the editor.

  • Scala and R code can sometimes indent incorrectly in the workbench editor.

    Cloudera Bug: DSE-1218
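
For the nproc and nofile limits mentioned earlier in this section, the values that currently apply to a process can be inspected with Python's standard resource module on Linux hosts. This sketch only reads the limits for the current process and user; the function name describe_limit is hypothetical, and raising the limits is still done through ulimit or your system's limits configuration.

```python
import resource


def describe_limit(name: str, which: int) -> str:
    """Format the soft and hard values of one per-process resource limit."""
    def show(value: int) -> str:
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    soft, hard = resource.getrlimit(which)
    return f"{name}: soft={show(soft)} hard={show(hard)}"


print(describe_limit("nofile", resource.RLIMIT_NOFILE))  # open files
print(describe_limit("nproc", resource.RLIMIT_NPROC))    # processes per user
```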