Known Issues and Limitations in Cloudera Data Science Workbench 1.8.1

Upgrades

Upgrades supported from CDSW 1.6.x (and higher) to CDSW 1.8.x

Cloudera Data Science Workbench only supports upgrades to version 1.8.0 from version 1.6.x and 1.7.x. If you are using an earlier version, you must first upgrade to version 1.6.x or 1.7.x, and then upgrade to version 1.8.0.

Domain name resolution issues after upgrading to CDSW 1.7.x; Pods stuck in CrashLoopBackOff state

After upgrading to CDSW 1.7.x, certain application pods (s2i-registry and image-puller) get stuck in CrashLoopBackOff state. This is due to an issue with the DNS resolver.

Workaround: Remove or comment out the search entry from the /etc/resolv.conf file.

# cat /etc/resolv.conf
.....
# search example.com
nameserver 192.0.2.1
nameserver 192.0.2.2
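
The edit described in the workaround can be sketched in Python as follows. This is only an illustration: the function name comment_out_search is hypothetical, and the sketch works on the file content as a string rather than writing to /etc/resolv.conf directly.

```python
def comment_out_search(resolv_conf_text: str) -> str:
    """Return resolv.conf content with every active 'search' line commented out."""
    out = []
    for line in resolv_conf_text.splitlines():
        stripped = line.lstrip()
        # Only touch active 'search' directives; comments and other entries stay as-is.
        if stripped.split()[:1] == ["search"]:
            out.append("# " + stripped)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


sample = "search example.com\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"
print(comment_out_search(sample), end="")
```

Back up the original file before applying a change like this on a real host.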

Post-upgrade error: Pods cannot be deleted

After upgrading to CDSW 1.8.0, when trying to restart an application, you may see an error that pods cannot be deleted. The application may appear to be in the “Stopping” state, and clicking the Restart button has no effect.

Workaround: First, launch a session, job, model, or experiment. After the engine-based workload starts, you can restart the application. After the application is in a running state, the session, job, or experiment can be terminated.

CDSW Restart

After CDSW restart, application links redirect to default CDSW WebUI URL

After system restart, Application links redirect to default WebUI URL instead of showing Application details.

Workaround: After a logout and login, the Application links start to work properly again.

CDH Integration

CDH client configuration changes require a full Cloudera Data Science Workbench restart

Cloudera Data Science Workbench does not automatically detect configuration changes on the CDH cluster. Therefore, any changes made to CDH services, ranging from updates to service configuration properties to complete CDH or CDS parcel upgrades, must be followed by a full reset of Cloudera Data Science Workbench.

Workaround: Depending on your deployment, use one of the following sets of steps to perform a full reset of Cloudera Data Science Workbench. Note that this reset does not impact your data in any way.
  • CSD Deployments - To reset Cloudera Data Science Workbench using Cloudera Manager:
    1. Log into the Cloudera Manager Admin Console.
    2. On the Cloudera Manager homepage, open the drop-down menu to the right of the CDSW service and select Restart. Confirm your choice on the next screen and wait for the action to complete.
    OR
  • RPM Deployments - Run the following steps on the Cloudera Data Science Workbench master host:

    cdsw stop
    cdsw start

Cloudera Manager Integration

CSD distribution/activation fails on mixed-OS clusters when there are third-party parcels running on OSs that are not supported by Cloudera Data Science Workbench

For example, adding a new CDSW gateway host on a RHEL 6 cluster running RHEL-6 compatible parcels will fail. This is because Cloudera Manager will not allow distribution of the RHEL 6 parcels on the new host which will likely be running a CDSW-compatible operating system such as RHEL 7.

Workaround: To ensure adding a new CDSW gateway host is successful, you must create a copy of the 'incompatible' third-party parcel files and give them the corresponding RHEL 7 names so that Cloudera Manager allows them to be distributed on the new gateway host. Use the following sample instructions to do so:
  1. SSH to the Cloudera Manager Server host.
  2. Navigate to the directory that contains all the parcels. By default, this is /opt/cloudera/parcel-repo.
    cd /opt/cloudera/parcel-repo
  3. Make a copy of the incompatible third-party parcel with the new name. For example, if you have a RHEL 6 parcel that cannot be distributed on a RHEL 7 CDSW host:
    cp <PARCELNAME.cdh5.x.x.p0.123>-el6.parcel <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
  4. Repeat the previous step for the parcel's SHA file.
    cp <PARCELNAME.cdh5.x.x.p0.123>-el6.parcel.sha <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
  5. Update the new files' owner and permissions to match those of existing parcels in the /opt/cloudera/parcel-repo directory.
    chown cloudera-scm:cloudera-scm <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
    chown cloudera-scm:cloudera-scm <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
    chmod 640 <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel
    chmod 640 <PARCELNAME.cdh5.x.x.p0.123>-el7.parcel.sha
    
You should now be able to add new gateway hosts for Cloudera Data Science Workbench to your cluster.
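
The renaming pattern used in steps 3 and 4 can be sketched as a small Python helper. The function name el7_copies and the parcel name in the example are hypothetical; the sketch only computes the source and destination file names and does not copy anything.

```python
from typing import List, Tuple


def el7_copies(el6_parcel: str) -> List[Tuple[str, str]]:
    """Return (source, destination) name pairs for the parcel and its SHA file."""
    suffix = "-el6.parcel"
    if not el6_parcel.endswith(suffix):
        raise ValueError("expected a file name ending in " + suffix)
    # Swap the OS tag and keep the rest of the parcel name unchanged.
    el7_parcel = el6_parcel[: -len(suffix)] + "-el7.parcel"
    return [
        (el6_parcel, el7_parcel),
        (el6_parcel + ".sha", el7_parcel + ".sha"),
    ]


# Hypothetical parcel name, for illustration only.
for src, dst in el7_copies("EXAMPLE-1.0-el6.parcel"):
    print("cp", src, dst)
```

The owner and permission changes from step 5 still need to be applied to the copied files.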

Cloudera Bug: OPSAPS-42130, OPSAPS-31880

CDSW Service health status after a restart does not match the actual state of the application

After a restart, the Cloudera Data Science Workbench service in Cloudera Manager will display Good health even though the Cloudera Data Science Workbench web application might need a few more minutes to get ready to serve requests.

CDS Powered By Apache Spark

Scala sessions can fail if dependencies take longer than 15 minutes

If the dependencies in spark-defaults.conf (spark.jars, spark.packages, etc.) take longer than 15 minutes to resolve, then Scala sessions will fail the first time.

Workaround: Use one of the following workarounds:
  • Restart the session.
  • Mount the Spark dependency directory from the CDSW host machines.

Spark UI does not work on HDP and CDP

The Spark UI in CDSW does not work on HDP and CDP clusters.

On TLS-enabled CDSW deployments, the embedded Spark UI does not work

If you have a TLS-enabled CDSW deployment, the embedded Spark UI tab does not render as expected.

Workaround: To work around this issue, launch the Spark UI in a separate tab and append '/jobs' after the URL. For example, if your engineID is tb0z9ydiua5q9v2d and the DOMAIN is example.com then view the Spark UI at: https://spark-tb0z9ydiua5q9v2d.example.com/jobs/
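
As a minimal sketch, the URL pattern in this workaround can be built as follows. The function name spark_ui_jobs_url is hypothetical; the engine ID and domain are the example values used above.

```python
def spark_ui_jobs_url(engine_id: str, domain: str) -> str:
    """Build the standalone Spark UI URL for an engine, ending in /jobs/."""
    return f"https://spark-{engine_id}.{domain}/jobs/"


print(spark_ui_jobs_url("tb0z9ydiua5q9v2d", "example.com"))
```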

Alternative workaround: To view running Spark jobs, navigate to Spark History Server UI > Show Incomplete Applications > Application ID

Affected Versions: This issue affects CDSW 1.6.x and CDSW 1.7.x on the following platforms:
  • CDH 5: CDS 2.4 release 2 (and lower)
  • CDH 6: Versions of Spark that ship with CDH 6.0.x, CDH 6.1.x, CDH 6.2.1 (and lower), CDH 6.3.2 (and lower)
Solution: Upgrade to CDSW version 1.7.1 or higher, and either:
  • CDH version 6.4.0, 6.2.2, 6.3.3 or higher
  • CDH 5 with Spark 2.4 release 3

Spark lineage collection is not supported with Cloudera Data Science Workbench

Lineage collection is enabled by default in Spark 2.3. This feature does not work with Cloudera Data Science Workbench because the lineage log directory is not automatically mounted into CDSW engines when a session/job is started.

Affected Versions: CDS 2.3 release 2 (and higher) Powered By Apache Spark

With Spark 2.3 release 3 (or higher), if Spark cannot find the lineage log directory, it will automatically disable lineage collection for that application. Spark jobs will continue to execute in Cloudera Data Science Workbench, but lineage information will not be collected.

With Spark 2.3 release 2, Spark jobs will fail in Cloudera Data Science Workbench. Either upgrade to Spark 2.3 release 3 which includes a partial fix (as described above) or use one of the following workarounds to disable Spark lineage:

Workaround 1: Disable Spark Lineage Per-Project in Cloudera Data Science Workbench

To do this, set spark.lineage.enabled to false in a spark-defaults.conf file in your Cloudera Data Science Workbench project. This will need to be done individually for each project as required.
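
A sketch of how this setting could be added to a project's spark-defaults.conf content is shown below. The helper name set_spark_property is hypothetical, and the sketch assumes the file uses whitespace-separated "key value" lines. It operates on the file content as a string.

```python
def set_spark_property(conf_text: str, key: str, value: str) -> str:
    """Return conf content with `key` set to `value`, replacing an existing entry."""
    lines = []
    replaced = False
    for line in conf_text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] == key:
            # Overwrite the existing entry for this key.
            lines.append(f"{key} {value}")
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


existing = "spark.executor.memory 2g\n"
print(set_spark_property(existing, "spark.lineage.enabled", "false"), end="")
```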

Workaround 2: Disable Spark Lineage for the Cluster

  1. Log in to Cloudera Manager and go to the Spark 2 service.
  2. Click Configuration.
  3. Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection.
  4. Click Save Changes.
  5. Go back to the Cloudera Manager homepage and restart the CDSW service for this change to go into effect.

Cloudera Bug: DSE-3720, CDH-67643

Monitoring Spark Applications invoked from CDSW

To monitor spark_on_yarn applications invoked from CDSW, an embedded Spark UI is displayed right next to the session/job. This was achieved by disabling RM proxy. However, with this change, attempts to access the same Spark application using the RM UI will result in Error 500 (connection refused).

Affected Versions: CDSW 1.6 and higher.

Workaround: If the Administrator wants to troubleshoot a running spark-on-yarn application invoked by an end-user from the workbench, the user must share their session using the Share button on the right side of the console. An alternate workaround which will not provide realtime updates is to access the Spark Application UI from the Spark History Server UI > Incomplete Applications.

Cloudera Bug: DSE-4979

Crashes and Hangs

  • If CDSW is using HTTP_PROXY or HTTPS_PROXY settings, the following will occur:
    • The Terminal will spin forever.
    • The applications will not reach a started state.
    You can access the Terminal by going to: tty-{engine id}.cdsw-url.com. There is no workaround.

    Cloudera Bug: DSE-12898

  • The CDSW terminal remains active even after the CDSW web session times out.

    Cloudera Bug: DSE-12064

  • The CDSW web UI might freeze while updating the job schedule to Dependant if there is no other job to be depended on.

    Workaround: If there are no jobs to be dependent on, then refresh the page and select the appropriate setting from the Schedule field before saving other changes on the page.

    Cloudera Bug: DSE-12032

  • Restarting an application from Cloudera Manager crashes the kubelet process when CDSW 1.8, which runs on Kubernetes 1.14.x, is installed on a host with a kernel older than 4.3. This is caused by an existing Kubernetes bug and happens because the kernel version does not support the cgroups pids controller.

    Workaround: Install CDSW 1.8 only on nodes whose kernel has the cgroups pids feature enabled. To check whether the kernel has the pids feature enabled, run the following command:
    cat /proc/cgroups

    If the subsystem names (subsys_name) in the output do not include pids, then do not install CDSW 1.8 on that node.

    Sample output showing subsystem does not list pids:
    subsys_name  hierarchy  num_cgroups  enabled
    cpuset       8          17           1
    cpu          9          17           1
    cpuacct      9          17           1
    memory       10         17           1
    devices      2          30           1
    freezer      6          17           1
    net_cls      3          17           1
    blkio        4          17           1
    perf_event   5          17           1
    hugetlb      7          17           1

    Sample output showing subsystem that lists pids:
    #subsys_name  hierarchy  num_cgroups  enabled
    cpuset        5          91           1
    cpu           9          91           1
    cpuacct       9          91           1
    memory        2          91           1
    devices       8          91           1
    freezer       7          91           1
    net_cls       3          91           1
    blkio         10         91           1
    perf_event    4          91           1
    hugetlb       11         91           1
    pids          6          91           1
    net_prio      3          91           1
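
The check described above can also be automated. The following is a minimal sketch, assuming the four-column layout (subsys_name, hierarchy, num_cgroups, enabled) shown in the sample output; the function name pids_controller_enabled is hypothetical.

```python
def pids_controller_enabled(cgroups_text: str) -> bool:
    """Return True if the 'pids' subsystem is listed and enabled."""
    for line in cgroups_text.splitlines():
        # Skip blank lines and the '#subsys_name ...' header.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Columns: subsys_name, hierarchy, num_cgroups, enabled
        if len(fields) == 4 and fields[0] == "pids":
            return fields[3] == "1"
    return False


with_pids = "#subsys_name hierarchy num_cgroups enabled\ncpuset 5 91 1\npids 6 91 1\n"
without_pids = "subsys_name hierarchy num_cgroups enabled\ncpuset 8 17 1\nhugetlb 7 17 1\n"
print(pids_controller_enabled(with_pids), pids_controller_enabled(without_pids))
```

On a real host, pass the content of /proc/cgroups to the function, for example open("/proc/cgroups").read().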

    Cloudera Bug: DSE-11750

  • Third-party security and orchestration software (such as McAfee, Tanium, Symantec) can lead to CDSW crashing randomly.

    Workaround: Disable all third-party security agents on CDSW hosts.

    Cloudera Bug: DSE-8550

  • High I/O utilization on the application block device can cause the application to stall or become unresponsive. Users should read and write data directly from HDFS rather than staging it in their project directories.

  • Installing ipywidgets or a Jupyter notebook into a project can cause Python engines to hang due to an unexpected configuration. The issue can be resolved by deleting the installed libraries from the R engine terminal.

Third-party Editors

  • You cannot disable file upload and download when using the Jupyter Notebook.

    Cloudera Bug: DSE-12065

  • You may see the following message if you try to open a file in CDSW that contains Chinese characters in the filename: An error occurred while trying to open the file /filepath/filename.py.info. The file could not be found.

    Workaround: From the CDSW UI, click the file that you are trying to open and then click Open in Workbench.

    Cloudera Bug: DSE-11891

  • Logs generated by a browser IDE do not appear within the IDE. They are displayed in the Logs tab for the session.

    Cloudera Bug: DSE-6570

  • Sessions with Browser IDEs running do not adhere to the limit set in IDLE_MAXIMUM_MINUTES. Session logs show the warning message that states that the idle session will timeout, but the timeout does not occur. The session continues to run and consume resources until the timeout set in SESSION_MAXIMUM_MINUTES is reached. Ensure that you manually stop a session after you are finished, so that the resources are available to other users.

    Cloudera Bug: DSE-6651

  • Sessions with Browser IDEs running time out with no warning after the time limit set in SESSION_MAXIMUM_MINUTES is reached, regardless of whether or not the session is idle. Periodically stop the browser IDE and session manually to avoid reaching SESSION_MAXIMUM_MINUTES.

    Cloudera Bug: DSE-6652

  • The lack of a ROOT CA certificate can cause issues with terminals and the Jupyter editor after upgrading CDSW.

    Problem: After upgrading from CDSW version 1.5 to version 1.7.1, the terminal does not open for any kernel, and the Jupyter notebook does not work.

    Workaround: In CDSW, go to Admin > Security, and paste the internal CA root certificate file contents directly into the Root CA configuration field. You should be able to launch a new session and start the terminal or launch the Jupyter editor. It is not necessary to restart CDSW. This procedure is described at Configuring Custom Root CA Certificate.

Engines

  • The CDSW web UI does not display any acknowledgment message when you update the shared memory and save the change.

    Cloudera Bug: DSE-12034

  • When you create a job with a non-default engine profile, the job overview page displays the Engine Profile value as 1 vCPU / 2 GiB Memory instead of the actual engine profile that was selected while creating a CDSW project.

    Cloudera Bug: DSE-12033

  • The output of the terminal commands including the curl command may wrap incorrectly, resulting in a terminal output that is difficult to read.

    Workaround: Resize the terminal to a bigger size.

    Cloudera Bug: DSE-11956

  • Configuring duplicate mount points in the site admin panel (Admin > Engines > Mounts) results in sessions crashing in the workbench.

    Cloudera Bug: DSE-3308

  • Spawning remote workers fails in R when the env parameter is not set. For more details, see Distributed Computing with Workers.

    Cloudera Bug: DSE-3384

  • Autofs mounts are not supported with Cloudera Data Science Workbench.

    Cloudera Bug: DSE-2238

  • When using Conda to install Python packages, you must specify the Python version to match the Python versions shipped in the engine image (2.7.11 and 3.6.1). If not specified, the conda-installed Python version will not be used within a project. Pip (pip and pip3) does not face this issue.

  • When engine version 8 (or higher) is used, and the Allow containers to run as root property is disabled, the creation of containers that run with root privileges is prevented. Additionally, the elevation of privileges from the cdsw user to root (for example, using a setuid binary) is also prevented.

    As a result, running the ping command, which is actually a setuid binary, will fail in engine 8 (or higher) when Allow containers to run as root property is disabled.

    $ ping www.google.com
    Ping: icmp open socket: Operation not permitted.

Custom Engine Images

  • Cloudera Data Science Workbench only supports customized engines that are based on the Cloudera Data Science Workbench base image.

  • Cloudera Data Science Workbench does not support creation of custom engines larger than 10 GB.

    Cloudera Bug: DSE-4420

  • Cloudera Data Science Workbench does not support pulling images from registries that require Docker credentials.

    Cloudera Bug: DSE-1521

  • The contents of certain pre-existing standard directories such as /home/cdsw, /tmp, /opt/cloudera, and so on, cannot be modified while creating customized engines. This means any files saved in these directories will not be accessible from sessions that are running on customized engines.

    Workaround: Create a new custom directory in the Dockerfile used to create the customized engine, and save your files to that directory. Or, create a new custom directory on all the Cloudera Data Science Workbench gateway hosts and save your files to those directories. Then, mount this directory to the custom engine.

Experiments

  • Experiments do not store snapshots of project files. You cannot automatically restore code that was run as part of an experiment.

  • Experiments will fail if your project filesystem is too large for the Git snapshot process. As a general rule, any project files (code, generated model artifacts, dependencies, etc.) larger than 50 MB must be part of your project's .gitignore file so that they are not included in snapshots for experiment builds.

  • Experiments cannot be deleted. As a result, be conscious of how you use the track_metrics and track_file functions.
    • Do not track files larger than 50MB.
    • Do not track more than 100 metrics per experiment. Excessive metric calls from an experiment may cause Cloudera Data Science Workbench to hang.
  • The Experiments table will allow you to display only three metrics at a time. You can select which metrics are displayed from the metrics dropdown. If you are tracking a large number of metrics (100 or more), you might notice some performance lag in the UI.

  • Arguments are not supported with Scala experiments.

  • The track_metrics and track_file functions are not supported with Scala experiments.

  • The UI does not display a confirmation when you start an experiment or any alerts when experiments fail.
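
To help keep oversized files out of experiment snapshots, the following sketch lists project files above a size limit so that they can be added to .gitignore. The function name files_over_limit is hypothetical, and treating 50 MB as 50 * 1024 * 1024 bytes is an assumption. The demo uses a temporary directory and a tiny limit so that it runs quickly.

```python
import os
import tempfile


def files_over_limit(project_dir, limit_bytes=50 * 1024 * 1024):
    """Return project-relative paths of regular files larger than limit_bytes."""
    oversized = []
    for root, _dirs, files in os.walk(project_dir):
        for name in files:
            path = os.path.join(root, name)
            # Ignore broken symlinks and other non-regular entries.
            if os.path.isfile(path) and os.path.getsize(path) > limit_bytes:
                rel = os.path.relpath(path, project_dir)
                oversized.append(rel.replace(os.sep, "/"))
    return sorted(oversized)


# Demo with a throwaway project directory and a 32-byte limit.
with tempfile.TemporaryDirectory() as project:
    with open(os.path.join(project, "small.txt"), "w") as f:
        f.write("ok")
    os.makedirs(os.path.join(project, "data"))
    with open(os.path.join(project, "data", "big.bin"), "wb") as f:
        f.write(b"x" * 64)
    demo_result = files_over_limit(project, limit_bytes=32)

print(demo_result)
```

Each returned path can be appended as a line to the project's .gitignore file. The same check applies to the model build snapshots described in the Models section.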

GPU Support

Only CUDA-enabled NVIDIA GPU hardware is supported

Cloudera Data Science Workbench only supports CUDA-enabled NVIDIA GPU cards.

Heterogeneous GPU hardware is not supported

You must use the same GPU hardware across a single Cloudera Data Science Workbench deployment.

GPU image for CDSW does not work with TensorFlow

The technical preview GPU image (docker.repository.cloudera.com/cdsw/cuda-engine:10) does not set the LD_LIBRARY_PATH environment variable properly. The TensorFlow framework needs this variable to be set correctly in order to work.

Workaround: To use the technical preview GPU image (docker.repository.cloudera.com/cdsw/cuda-engine:10) with TensorFlow:
  1. Install TensorFlow by running the following command:
    pip3 install tensorflow
  2. Add the following to the LD_LIBRARY_PATH environment variable:
    LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/lib/hadoop/lib/native

Jobs

  • When you start a job that has a dependent job, CDSW does not log the job start event for the dependent job in the user_events table.

    Cloudera Bug: DSE-11855

  • Cloudera Data Science Workbench does not support changing your API key, or having multiple API keys.

  • Currently, you cannot use the Jobs API to create a job, stop a job, or get the status of a job.

Models

  • Known Issues with Model Builds and Deployed Models
    • (If quotas are enabled) Models that are stuck in the Scheduled state due to lack of resources do not automatically start even if you free up existing resources.

      Workaround: Stop the Model that is stuck in the Scheduled state. Then manually reschedule that Model.

      Cloudera Bug: DSE-6886

    • Unable to create a model with the name of a deleted Model.

      Workaround: For now, Models must have unique names across the lifespan of the cluster installation.

      Cloudera Bug: DSE-4237

    • Re-deploying or re-building models results in model downtime (usually brief).

    • Model deployment will fail if your project filesystem is too large for the Git snapshot process. As a general rule, any project files (code, generated model artifacts, dependencies, etc.) larger than 50 MB must be part of your project's .gitignore file so that they are not included in snapshots for model builds.

    • Model builds will fail if your project filesystem includes a .git directory (likely hidden or nested). Typical build stage errors include:
      Error: 2 UNKNOWN: Unable to schedule build: [Unable to create a checkpoint of current source: [Unable to push sources to git server: ...

      To work around this, rename the .git directory (for example, NO.git) and re-build the model.

      Cloudera Bug: DSE-4657

    • JSON requests made to active models should not be more than 5 MB in size. This is because JSON is not suitable for very large requests and has high overhead for binary objects such as images or video. Call the model with a reference to the image or video, such as a URL, instead of the object itself.

    • Any external connections, for example, a database connection or a Spark context, must be managed by the model's code. Models that require such connections are responsible for their own setup, teardown, and refresh.

    • Model logs and statistics are only preserved so long as the individual replica is active. Cloudera Data Science Workbench may restart a replica at any time it is deemed necessary (such as bad input to the model).

  • The use_model_metrics.py file available in the CDSW Templates is out of date and is missing the code that sets user_api_key. Use the following code instead:
    import cdsw
    import time
    from sklearn import datasets
    import numpy as np
    
    # This script demonstrates the usage of several model metrics-
    # related functions:
    # - call_model: Calls a model deployed on CDSW as an HTTP endpoint.
    # - read_metrics: Reads metrics tracked for all model predictions
    #   made within a time window. This is useful for  doing analytics 
    #   on the tracked metrics.
    # - track_delayed_metrics: Adds metrics for a given prediction 
    #   retrospectively, after the prediction has already been made.
    #   Common examples of such metrics are ground truth and various
    #   per-prediction accuracy metrics.
    # - track_aggregate_metrics: Adds metrics for a set or batch of
    #   predictions within a given time window, not an individual 
    #   prediction. Common examples of such metrics are mean or 
    #   median accuracy, and various measures of drift.
    
    # This script can be used in a local development mode, or in
    # deployment mode. To use it in deployment mode, please: 
    # - Set dev = False
    # - Create a model deployment from the function 'predict' in
    #   predict_with_metrics.py 
    # - Obtain the model deployment's CRN from the model's overview
    #   page and the model's access key from its settings page and 
    #   paste them below.
    # - If you selected "Enable Authentication" when creating the
    #   model, then create a model API key from your user settings 
    #   page and paste it below as well.
    
    dev = True
    
    # Import the predict function only if we are in dev mode.
    if dev:
        from predict_with_metrics import predict
    
    if dev:
        model_deployment_crn=cdsw.dev_model_deployment_crn # update modelDeploymentCrn
        model_access_key=None
    else: 
        # The model deployment CRN can be obtained from the model overview
        # page.
        model_deployment_crn=None 
        if model_deployment_crn is None:
            raise ValueError("Please set a valid model deployment Crn")
    
        # The model access key can be obtained from the model settings page.
        model_access_key=None
        if model_access_key is None:
            raise ValueError("Please set the model's access key")
    
        # You can create a models API key from your user settings page.
        # Not required if you did not select "Enable Authentication"
        # when deploying the model. In that case, anyone with the
        # model's access key can call the model.
        user_api_key = None
    
    # First, we use the call_model function to make predictions for 
    # the held-out portion of the dataset in order to populate the 
    # metrics database.
    iris = datasets.load_iris()
    test_size = 20
    
    # This is the input data for which we want to make predictions.
    # Ground truth is generally not yet known at prediction time.
    score_x = iris.data[:test_size, 2].reshape(-1, 1) # Petal length
    
    # Record the current time so we can retrieve the metrics
    # tracked for these calls.
    start_timestamp_ms=int(round(time.time() * 1000))
    
    uuids = []
    predictions = []
    for i in range(len(score_x)):
        if model_access_key is not None:
            output = cdsw.call_model(model_access_key, {"petal_length": score_x[i][0]}, api_key=user_api_key)["response"]
        else:
            output = predict({"petal_length": score_x[i][0]})
        # Record the UUID of each prediction for correlation with ground truth.
        uuids.append(output["uuid"])
        predictions.append(output["prediction"])
    
    # Record the current time.
    end_timestamp_ms=int(round(time.time() * 1000))
    
    # We can now use the read_metrics function to read the metrics we just
    # generated into the current session, by querying by time window.
    data = cdsw.read_metrics(model_deployment_crn=model_deployment_crn,
                start_timestamp_ms=start_timestamp_ms,
                end_timestamp_ms=end_timestamp_ms, dev=dev)
    data = data['metrics']
    
    # Now, ground truth is known and we want to track the true value
    # corresponding to each prediction above.
    score_y = iris.data[:test_size, 3].reshape(-1, 1) # Observed petal width
    
    # Track the true values alongside the corresponding predictions using
    # track_delayed_metrics. At the same time, calculate the mean absolute
    # prediction error.
    mean_absolute_error = 0
    n = len(score_y)
    for i in range(n):
        ground_truth = score_y[i][0]
        cdsw.track_delayed_metrics({"actual_result":ground_truth}, uuids[i], dev=dev)
    
        absolute_error = np.abs(ground_truth - predictions[i])
        mean_absolute_error += absolute_error / n
    
    # Use the track_aggregate_metrics function to record the mean absolute
    # error within the time window where we made the model calls above.
    cdsw.track_aggregate_metrics(
        {"mean_absolute_error": mean_absolute_error}, 
        start_timestamp_ms, 
        end_timestamp_ms, 
        model_deployment_crn=model_deployment_crn,
        dev=dev
    )
  • Limitations
    • Scala models are not supported.

    • Spawning worker threads is not supported with models.

    • Models deployed using Cloudera Data Science Workbench are not highly-available.

    • Dynamic scaling and auto-scaling are not currently supported. To change the number of replicas in service, you will have to re-deploy the build.
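
The 5 MB limit on JSON requests noted earlier in this section can be checked on the client side before a model is called. The following is a minimal sketch: the function name request_size_ok is hypothetical, and interpreting 5 MB as 5 * 1024 * 1024 bytes of UTF-8 encoded JSON is an assumption.

```python
import json

MAX_REQUEST_BYTES = 5 * 1024 * 1024  # assumed interpretation of "5 MB"


def request_size_ok(payload, limit=MAX_REQUEST_BYTES):
    """Return True if the JSON-encoded payload fits within the limit."""
    encoded = json.dumps(payload).encode("utf-8")
    return len(encoded) <= limit


# Passing a reference (such as a URL) keeps the request small.
by_reference = {"image_url": "https://example.com/images/sample.png"}
# Embedding a large object inline does not.
inline = {"image_data": "x" * (6 * 1024 * 1024)}

print(request_size_ok(by_reference), request_size_ok(inline))
```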

Applications

The subdomain names on the Application page have some constraints, but the CDSW UI may not display a complete error message when these are violated.

Workaround: Use the characters from the set of ASCII letters, digits, and hyphens (a-z, 0-9, -) to form the subdomain name. Ensure that the subdomain name does not start or end with a hyphen.
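
The naming rule in this workaround can be checked before a subdomain is entered in the UI. This is a sketch only: the function name is_valid_subdomain is hypothetical, it encodes just the two rules stated above, and it assumes lowercase letters as in the listed set (a-z). CDSW may enforce further constraints that are not covered here.

```python
import re

# Letters, digits, and hyphens only; must not start or end with a hyphen.
_SUBDOMAIN_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")


def is_valid_subdomain(name: str) -> bool:
    """Return True if name satisfies the documented subdomain rules."""
    return _SUBDOMAIN_RE.fullmatch(name) is not None


for candidate in ["my-app1", "-app", "app-", "my_app"]:
    print(candidate, is_valid_subdomain(candidate))
```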

Cloudera Bug: DSE-11883

Platform

  • In the monitoring view of the Grafana UI, the values displayed in the Used CPU Usage and the Total CPU Usage are the same when both the Node and the Pod filters are set to All.

    Workaround: Use a specific filter other than All to overcome this issue.

    Cloudera Bug: DSE-12072

  • Deleting the projects through the UI might leave residual files on disk.

    Workaround: Please contact Cloudera support to remove residual files in case of disk usage issues.

    Cloudera Bug: DSE-393

Networking

  • CDSW cannot launch sessions due to connection errors resulting from a segfault

    Sample error:
    transport: Error while dialing dial tcp 100.77.93.252:20051: connect: connection refused
    Workaround: Enable IPv6 on all CDSW hosts.
    1. Double-check that IPv6 is currently disabled during boot time, i.e. ipv6.disable should be equal to 1.
      $ dmesg 
      [ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-514.el7.x86_64 root=UUID=3e109aa3-f171-4614-ad07-c856f20f9d25 ro console=tty0 crashkernel=auto console=ttyS0,115200 ipv6.disable=1
      $ cat /proc/cmdline
      .....ipv6.disable=1
    2. Edit /etc/default/grub and delete the ipv6.disable=1 entry from GRUB_CMDLINE_LINUX. For example:
      GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/swap crashkernel=auto rd.lvm.lv=rhel/root"
    3. Run the grub2-mkconfig command to regenerate the grub.cfg file:
      grub2-mkconfig -o /boot/grub2/grub.cfg
      Alternatively, on UEFI systems, you would run the following command:
      grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
    4. Follow the above steps for both CDSW Master and Worker nodes.
    5. Stop the Cloudera Data Science Workbench service.
    6. Reboot all the Cloudera Data Science Workbench hosts to enable IPv6 support.
    7. Start the Cloudera Data Science Workbench service. Run dmesg on the CDSW hosts to ensure there are no segfault errors seen.
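
The check in step 1 can be scripted. The following is a minimal sketch that inspects a kernel command line string for the ipv6.disable=1 parameter; the function name ipv6_disabled_at_boot is hypothetical.

```python
def ipv6_disabled_at_boot(cmdline: str) -> bool:
    """Return True if the kernel command line contains ipv6.disable=1."""
    # Kernel parameters are separated by whitespace.
    return "ipv6.disable=1" in cmdline.split()


before = "ro console=tty0 crashkernel=auto console=ttyS0,115200 ipv6.disable=1"
after = "rd.lvm.lv=rhel/swap crashkernel=auto rd.lvm.lv=rhel/root"
print(ipv6_disabled_at_boot(before), ipv6_disabled_at_boot(after))
```

On a real host, pass the content of /proc/cmdline to the function. After the reboot in step 6, the check should return False.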

    Cloudera Bug: DSE-7238, DSE-7455

  • Custom /etc/hosts entries on Cloudera Data Science Workbench hosts do not propagate to sessions and jobs running in containers.

    Cloudera Bug: DSE-2598

  • Initialisation of Cloudera Data Science Workbench (cdsw init) will fail if localhost does not resolve to 127.0.0.1.

  • Cloudera Data Science Workbench does not support DNS servers running on 127.0.0.1:53. This IP address resolves to the container localhost within Cloudera Data Science Workbench containers. As a workaround, use either a non-loopback address or a remote DNS server.
  • Kubernetes throws the following error when /etc/resolv.conf lists more than three domains:
    Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
    Due to a limitation in the libc resolver, only two DNS servers are supported in /etc/resolv.conf. Kubernetes uses one additional entry for the cluster DNS.
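
The limits described above can be checked with a short script. This is a sketch under stated assumptions: the function names are hypothetical, the thresholds are taken from the text above (two DNS servers, three search domains), and only the last search line is counted, which is how the resolver treats multiple search directives.

```python
from typing import List, Tuple

MAX_NAMESERVERS = 2      # from the note above: only two DNS servers are supported
MAX_SEARCH_DOMAINS = 3   # from the Kubernetes error message quoted above


def parse_resolv_conf(text: str) -> Tuple[List[str], List[str]]:
    """Return (nameservers, search_domains) from resolv.conf content."""
    nameservers: List[str] = []
    search: List[str] = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith(("#", ";")):
            continue
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "search":
            search = fields[1:]  # the last search line replaces earlier ones
    return nameservers, search


def resolv_conf_warnings(text: str) -> List[str]:
    """Return human-readable warnings for entries beyond the documented limits."""
    nameservers, search = parse_resolv_conf(text)
    warnings = []
    if len(nameservers) > MAX_NAMESERVERS:
        warnings.append(f"{len(nameservers)} nameserver entries found")
    if len(search) > MAX_SEARCH_DOMAINS:
        warnings.append(f"{len(search)} search domains found")
    return warnings


ok_conf = "search example.com\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"
bad_conf = (
    "search a.example.com b.example.com c.example.com d.example.com\n"
    "nameserver 192.0.2.1\nnameserver 192.0.2.2\nnameserver 192.0.2.3\n"
)
print(resolv_conf_warnings(ok_conf))
print(resolv_conf_warnings(bad_conf))
```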

Security

Working in the terminal or an editor should not count as idle session

If a user opens a workbench and is either working exclusively in the terminal or just editing files, Cloudera Data Science Workbench counts that time as idle time and the user gets kicked out after the configured max idle timeout.

Workaround:
  • Increase the idle session timeout by adding a new environment variable, IDLE_MAXIMUM_MINUTES. Click CDSW > Project > Settings > Environmental variables.

    You can set the value of the variables IDLE_MAXIMUM_MINUTES or SESSION_MAXIMUM_MINUTES to their maximum allowed value, which is 35000 (~3 weeks).

  • Alternatively, run a simple script inside the CDSW session to keep the session alive. Open the Workbench, create a file as shown here (assuming a Python project), and then run it in the Workbench.
    import time
    time.sleep(10000)

Cloudera Bug: DSE-3080

SSH access to Cloudera Data Science Workbench hosts must be disabled

The container runtime and application data storage are not fully secure from untrusted users who have SSH access to the gateway hosts. Therefore, SSH access to the gateway hosts for untrusted users should be disabled for security and resource utilization reasons.

TLS/SSL

  • Self-signed certificates where the Certificate Authority is not part of the user's trust store are not supported for TLS termination. For more details, see Enabling TLS/SSL - Limitations.

  • Cloudera Data Science Workbench does not support the use of encrypted private keys for TLS.

    Cloudera Bug: DSE-1708

  • A "certificate has expired" error displays when you log in to the Cloudera Data Science Workbench web UI. This issue can occur if Cloudera Data Science Workbench exceeds 365 days of continuous uptime because the internal certificate for Kubernetes expires after 1 year.

    Workaround: Restart the Cloudera Data Science Workbench deployment.
    • For CSD installations, restart the Cloudera Data Science Workbench service in Cloudera Manager.
    • For RPM installations, run the following command on the Master host:
      cdsw restart
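
To anticipate the expiry described above instead of waiting for the error, the remaining lifetime of a certificate can be computed from its notAfter timestamp. The sketch below is illustrative only. This document does not state where the internal Kubernetes certificate is stored, so the code starts from a notAfter string that you have already obtained (for example, from the output of openssl x509 -enddate -noout, with the notAfter= prefix removed). The function name days_until_expiry is hypothetical.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    not_after must use the format "Mon DD HH:MM:SS YYYY GMT".
    A negative result means the certificate has already expired.
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400.0


if __name__ == "__main__":
    # Hypothetical notAfter value, used only for illustration.
    sample = "Jan  5 09:34:43 2030 GMT"
    remaining = days_until_expiry(sample, datetime.now(timezone.utc))
    print(f"{remaining:.1f} days until expiry")
```

If the result approaches zero, schedule the restart described in the workaround during a maintenance window.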

Kerberos

  • Using Kerberos plugin modules in krb5.conf is not supported.

  • Modifying the default_ccache_name parameter in krb5.conf does not work in Cloudera Data Science Workbench. Only the default path for this parameter, /tmp/krb5cc_${uid}, is supported.

  • PowerBroker-equipped Active Directory is not supported.

    Cloudera Bug: DSE-1838

  • When you upload a Kerberos keytab to authenticate yourself to the CDH cluster, Cloudera Data Science Workbench might display a fleeting error message ('cancelled') in the bottom right corner of the screen, even if authentication was successful. This error message can be ignored.

    Cloudera Bug: DSE-2344

Usability

  • Environment variables that contain the dollar ($) character are not parsed correctly by CDSW. For example, if you set PASSWORD="pass$123" in the project environment variables and then read it with the echo command, the output is: pass23

    Workaround: Use one of the following commands to print the $ sign:
    echo 24 | xxd -r -p
    or
    echo JAo= | base64 -d
    Then embed that command in the value of the environment variable using command substitution, either $() or backticks (``). For example, to set the environment variable to ABC$123, specify:
    ABC$(echo 24 | xxd -r -p)123
    or
    ABC`echo 24 | xxd -r -p`123
  • After you update the project name on the Project Settings page, navigating to other tabs within the CDSW web UI returns an HTTP 404 error. This happens because the backend API still uses the old project name.

    Workaround: After you save the new project name, manually reload the webpage before navigating to the other tabs or pages within the CDSW web UI.

    Cloudera Bug: DSE-11911

  • In some cases, the application switcher (grid icon) does not show any other applications, such as Hue or Ranger.

    Cloudera Bug: DSE-865

  • Scala sessions hang when running large scripts (longer than 100 lines) in the Workbench editor.

    Workaround 1:

    Execute the script in manually-selected chunks. For example, highlight the first 50 lines and select Run > Run Line(s).

    Workaround 2:

    Restructure your code by moving content into imported functions so as to bring the size down to under 100 lines.

  • The R engine is unable to display multi-byte characters in plots. Examples include characters from languages such as Korean, Japanese, and Chinese.

    Workaround: Use the showtext R package to support more fonts and characters. For example, to display Korean characters:
    install.packages('showtext')
    library(showtext)
    font_add_google("Noto Sans KR", "noto")
    showtext_auto()

    Cloudera Bug: DSE-7308

  • When hundreds of users are logged in and creating processes, the system's nproc and nofile limits may be reached. Use ulimit or other methods to increase the maximum number of processes and open files that a user can create on the system.

  • When rebooting, Cloudera Data Science Workbench hosts can take a significant amount of time (about 30 minutes) to become ready.

  • Long-running operations such as fork and clone can time out when projects are large or connections outlast the HTTP timeouts of reverse proxies.

  • The Scala kernel does not support auto-complete features in the editor.

  • Scala and R code can sometimes indent incorrectly in the workbench editor.

    Cloudera Bug: DSE-1218
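
For the dollar-sign limitation at the top of this list, the substituted value can be generated instead of typed by hand, which helps when a value contains several $ characters. This is a minimal sketch: the function name escape_dollars is hypothetical, and it simply applies the $(echo 24 | xxd -r -p) substitution from the workaround to every $ in the input.

```python
# Command substitution from the workaround above; it prints a single "$".
DOLLAR_SUBSTITUTION = "$(echo 24 | xxd -r -p)"


def escape_dollars(value: str) -> str:
    """Replace each "$" with the command substitution that prints "$"."""
    return value.replace("$", DOLLAR_SUBSTITUTION)


if __name__ == "__main__":
    # Produces the string to paste into the project environment variables.
    print(escape_dollars("ABC$123"))
```

Run the function once on the original value only. Applying it to an already substituted string would replace the $ that begins each command substitution.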