Known Issues and Limitations (Private Cloud)

You might run into some known issues while using Cloudera Machine Learning on Private Cloud.

DSE-12329: Email invitation feature

Inviting new users by email does not work in Public Cloud or Private Cloud, although the option still appears in the UI.

DSE-12289: Proxies are not supported in CML Private Cloud 1.0

Use of a proxy server, for example to provide external internet connectivity for an air-gapped cluster, is not supported. Transparent proxies, however, should work normally.

DSE-11979: Known issue with S2I

Due to a Red Hat issue with OpenShift Container Platform 4.3.x, the image registry cluster operator configuration must be set to Managed.
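One way to apply this setting is to patch the image registry operator's cluster configuration. The following is a minimal sketch, assuming cluster-admin access and the standard configs.imageregistry.operator.openshift.io resource named cluster. The function name and the DRY_RUN switch are illustrative and not part of CML or OpenShift.

```shell
# Sketch: set the image registry operator's managementState to Managed.
# DRY_RUN (default 1) only prints the command; set DRY_RUN=0 to run it.
# The function name and DRY_RUN are hypothetical helpers for this example.
set_registry_managed() {
  patch='{"spec":{"managementState":"Managed"}}'
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '$patch'"
  else
    oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch "$patch"
  fi
}
```

Calling set_registry_managed prints the command for review. Running it with DRY_RUN=0 applies the patch.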

DSE-11837: Timeout limitation for Project API

Settings Fix

  • A timeout of 5 minutes was chosen after experimentation with Portworx 2-way replication and the NFS server provisioner.
  • Ideally, the external load balancer timeout is equal to or larger than the automated or suggested OpenShift router timeout.

Prerequisites

  • Set any external load balancer server timeout to 5 min.

TLS Enabled Workspace

  • Set the annotation haproxy.router.openshift.io/timeout=300s on each route in the deployed CML workspace’s namespace:
    oc annotate route --all=true --overwrite=true -n <cml-namespace> haproxy.router.openshift.io/timeout=300s

Non-TLS Enabled Workspace

  • Automatic post-install operations add the annotation haproxy.router.openshift.io/timeout: 300 for all routes for a workspace.
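To confirm that the routes in a workspace namespace carry the timeout annotation, you can list each route next to its annotation value. This is a minimal sketch. The function name and the DRY_RUN switch are illustrative, and the namespace argument is whatever your workspace uses.

```shell
# Sketch: list each route in a namespace with its haproxy timeout annotation.
# DRY_RUN (default 1) only prints the command; set DRY_RUN=0 to run it.
# The function name and DRY_RUN are hypothetical helpers for this example.
show_route_timeouts() {
  ns="$1"
  # In jsonpath, dots inside the annotation key are escaped with backslashes.
  jp='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.haproxy\.router\.openshift\.io/timeout}{"\n"}{end}'
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "oc get routes -n $ns -o jsonpath='$jp'"
  else
    oc get routes -n "$ns" -o "jsonpath=$jp"
  fi
}
```

For example, DRY_RUN=0 show_route_timeouts <cml-namespace> prints one line per route. A route with an empty second column has no timeout annotation.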

Workaround: Project creation still occurs. Check the Projects page after a few minutes; project creation should be complete.

DSE-10890: Scala session causes engine to run out of memory

Launching a Scala session may cause the following error:
Engine ran out of memory, please consider increasing engine size.
To resolve this error, increase the memory for the requested engine.

DSE-9549: TLS requires manual steps

To provision a TLS-enabled workspace, the customer needs to perform several manual steps. This procedure is described in Deploy an ML Workspace with Support for TLS.

DSE-12090: User displays as unknown in Event History

In the Event History on the workspace Events tab, a user may display as unknown if they are authenticated by LDAP.

Fix: The user needs to be assigned the IamViewer role to view these details.

DSE-11979: Certificate failure when pulling images from S2I container registry

DSE-12238: Create Project request takes longer than timeout

If a create project request takes longer than a certain timeout, a second request might be submitted. If this happens, multiple projects with similar names might be created.

As a workaround, create an empty project, create a session inside the project, then git clone your project inside a workbench terminal. Additionally, you can upload a zip file or a folder using the file preview table.

If multiple forks are created, delete the extra ones.

DSE-11870: Hung File, Stale File, and Fork issues with NFS

Hung File Operations: Certain file operations, such as stat(2)/stat(1), might hang, and if the file operation was performed through the CML web UI, the web operation might time out. This indicates that the NFS server is not reachable. On the web UI, the error might appear as an HTTP 500 error when you try to open an ML project. Check the logs for error messages similar to the following:

 2020-07-13 22:42:23.914 1 ERROR AppServer.Lib.Utils Finish grpc, failed data =
        [{"rpc":"1","service":"2","reqId":"3","err":"4"},"stat","VFS","18a07980-c55a-11ea-9bb9-a35829b422d9",
        {"message":"5","stack":"6","code":4,"metadata":"7","details":"8","futureStack":"6"},
        "4 DEADLINE_EXCEEDED: Deadline Exceeded",
        "Error: 4 DEADLINE_EXCEEDED: Deadline Exceeded\n at Object.exports.createStatusError
        (/home/cdswint/services/web/node_modules/grpc/src/common.js:91:15)\n at Object.onReceiveStatus
        (/home/cdswint/services/web/node_modules/grpc/src/client_interceptors.js:1209:28)\n at
        InterceptingListener._callNext
        (/home/cdswint/services/web/node_modules/grpc/src/client_interceptors.js:568:42)\n at
        InterceptingListener.onReceiveStatus
        (/home/cdswint/services/web/node_modules/grpc/src/client_interceptors.js:618:8)\n at callback
        (/home/cdswint/services/web/node_modules/grpc/src/client_interceptors.js:847:24)",
        {"_internal_repr":"9","flags":0},"Deadline Exceeded",{}]
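If the logs are long, a small filter can pick out entries like the one above. This is a sketch that reads log text on standard input and keeps only ERROR lines that mention DEADLINE_EXCEEDED. The function name is illustrative, and how you collect the log (for example with oc logs on the relevant pod) depends on your deployment.

```shell
# Sketch: keep only ERROR log lines that report a gRPC deadline timeout.
# Reads log text on stdin and writes matching lines to stdout.
# The function name is a hypothetical helper for this example.
filter_deadline_errors() {
  grep -E 'ERROR .*DEADLINE_EXCEEDED'
}
```

For example, pipe a saved log file into it: filter_deadline_errors < web.log, where web.log is a placeholder file name.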

Solution: Check your NFS server and make sure it is running. You will need to restart the NFS clients in your ML workspace’s namespace. These are the “ds-vfs” and “s2i-client” pods. Simply delete the Kubernetes pods whose names start with “ds-vfs” and “s2i-client”.
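The pod deletion described above can be scripted. The sketch below assumes the pods are recreated automatically after deletion, as the solution implies. The function names and the DRY_RUN switch are illustrative, and the namespace argument is your ML workspace's namespace.

```shell
# Sketch: delete the NFS client pods (names starting with ds-vfs or s2i-client).
# DRY_RUN (default 1) only prints the delete commands; set DRY_RUN=0 to run them.
# Function names and DRY_RUN are hypothetical helpers for this example.

# Keep only "pod/<name>" entries whose name starts with ds-vfs or s2i-client.
select_nfs_client_pods() {
  grep -E '^pod/(ds-vfs|s2i-client)'
}

restart_nfs_clients() {
  ns="$1"
  # "oc get pods -o name" prints one "pod/<name>" entry per line.
  oc get pods -n "$ns" -o name | select_nfs_client_pods |
    while read -r pod; do
      if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "oc delete -n $ns $pod"
      else
        oc delete -n "$ns" "$pod"
      fi
    done
}
```

Run restart_nfs_clients <cml-namespace> first to review the pods that would be deleted, then repeat with DRY_RUN=0 to delete them.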

Stale File Handles: When opening a project from the ML web UI, an error message like “NFS: Stale file handle” shows up on the UI.

Solution: This indicates that the NFS server and a client are out of sync, probably because the server restarted and file system content changed on the server without the client being aware of it. Restart the NFS client pods in your ML workspace’s namespace. These are the “ds-vfs” and “s2i-client” pods, and any user sessions that are affected by the “Stale file handle” error.

Project Fork Creating Multiple Copies: When creating a new project from an existing project using the “Fork” feature, you might see the operation seemingly fail on the UI, but it still ends up creating multiple copies of the source project.

Solution: This issue happens when forking a project takes longer than the idle connection timeout set on the external load balancer or in the HAProxy policy settings on OpenShift. Increase the idle connection timeout to at least 5 minutes. Depending on the performance of the NFS server, a higher timeout may be necessary.