Known Issues and Limitations

You might run into some known issues while using Cloudera Machine Learning on Private Cloud.

DSE-12778: Check if Storage is set up in Cluster Image Registry Operator

If storage is not set up in the Cluster Image Registry Operator, the operator does not spin up the daemonset that writes certificates to each node. To confirm that storage is set up, and to configure it if necessary:

  1. Run oc get configs.imageregistry.operator.openshift.io -o yaml

    If the result is storage: {}, then storage is not set up.

  2. If storage is not set up, open the configuration for editing by running this command (a non-interactive alternative is sketched after these steps):
    oc edit configs.imageregistry.operator.openshift.io
  3. Navigate to the storage section under spec: and set it to emptyDir: {}, as shown here:
    storage:
         emptyDir: {}
  4. Verify that the status section shows output similar to the following:
     - lastTransitionTime: "2020-07-22T00:34:51Z"
       message: EmptyDir storage successfully created
       reason: Creation Successful
       status: "True"
       type: StorageExists
     observedGeneration: 9
     readyReplicas: 0
     storage:
       emptyDir: {}
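
As a non-interactive alternative to steps 2 and 3, you can patch the configuration directly and then confirm the result. This is a minimal sketch; it assumes the image registry configuration resource is named cluster, which is the usual name:

     # Set the storage to emptyDir: {} without opening an editor
     oc patch configs.imageregistry.operator.openshift.io cluster \
       --type merge -p '{"spec":{"storage":{"emptyDir":{}}}}'
     # Confirm the setting; the output should contain "emptyDir"
     oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.storage}'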

DSE-12541: Self-Signed Certificates for Container Registry cause Models and Experiments to fail

If you configured the container registry with untrusted or self-signed certificates, the S2I builder needs the client certificate and key to talk to the S2I registry. To resolve this issue:

  1. Create a ConfigMap as shown in this example, where <namespace> is the workspace namespace. (An equivalent oc command is sketched after these steps.)

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: external-registry
      namespace: <namespace> 
    data:
      registry.crt: |
        -----BEGIN CERTIFICATE-----
        < certificate data here >
        -----END CERTIFICATE-----
        -----BEGIN CERTIFICATE-----
        < certificate data here >
        -----END CERTIFICATE-----
  2. Mount the certificates in the s2i-builder deployment as shown here. The mountPath entry goes under volumeMounts, and the configMap entry goes under volumes:

    - mountPath: /etc/docker/certs.d/172.30.139.122:5000
      name: external-registry
    - configMap:
         defaultMode: 420
         name: external-registry
      name: external-registry
  3. Pull the engine image again. From now on, Models and Experiments should work.
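
As an alternative to writing the ConfigMap manifest in step 1 by hand, a command like the following should also work; the certificate file path is a placeholder and <namespace> is the workspace namespace:

     # Create the ConfigMap from a local file containing the registry certificate chain
     oc create configmap external-registry \
       --from-file=registry.crt=/path/to/registry.crt \
       -n <namespace>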

DSE-12329: Email invitation feature

The feature to invite new users by email does not work in Public Cloud or Private Cloud, but it still appears in the UI.

DSE-12289: Proxies are not supported in CML Private Cloud 1.0

Use of a proxy server (for example, for external internet connectivity from an air-gapped cluster) is not supported. Transparent proxies, however, should work normally.

DSE-12238: Create Project request takes longer than timeout

If a create project request takes longer than a certain timeout, a second request might be submitted. If this happens, multiple projects with similar names might be created.

As a workaround, create an empty project, create a session inside the project, then git clone your project inside a workbench terminal. Additionally, you can upload a zip file or a folder using the file preview table.
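
A minimal sketch of the clone workaround above; the repository URL is hypothetical, and the command is run in a workbench terminal inside the new, empty project:

    # "." clones into the current (empty) project directory instead of a subdirectory
    git clone https://git.example.com/team/my-project.git .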

If multiple forks are created, delete the extra ones.

DSE-12090: User displays as unknown in Event History

In the Event History on the workspace Events tab, a user may display as unknown if they are authenticated by LDAP.

Fix: The user needs to be assigned the IamViewer role to view these details.

DSE-11979: Known issue with S2I

Due to a Red Hat issue with OpenShift Container Platform 4.3.x, the image registry cluster operator configuration must be set to Managed.
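
A minimal sketch of checking and, if necessary, setting that state; it assumes the image registry configuration resource is named cluster:

    # Check the current management state
    oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.managementState}'
    # Set it to Managed if it is not already
    oc patch configs.imageregistry.operator.openshift.io cluster \
      --type merge -p '{"spec":{"managementState":"Managed"}}'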

DSE-11870: Hung File, Stale File, and Fork issues with NFS

Hung File Operations: Certain file operations, such as stat(2)/stat(1), might hang, and if the file operation was performed through the CML web UI, the web operation might time out. This indicates that the NFS server is not reachable for some reason. The error might manifest itself on the web UI as an HTTP error, code 500, when you try to open an ML project. Check the logs for error messages similar to the following:

 2020-07-13 22:42:23.914 1 ERROR AppServer.Lib.Utils Finish grpc, failed data =
        [{"rpc":"1","service":"2","reqId":"3","err":"4"},"stat","VFS","18a07980-c55a-11ea-9bb9-a35829b422d9",
        {"message":"5","stack":"6","code":4,"metadata":"7","details":"8","futureStack":"6"},
        "4 DEADLINE_EXCEEDED: Deadline Exceeded",
        "Error: 4 DEADLINE_EXCEEDED: Deadline Exceeded\n
          at Object.exports.createStatusError (/home/cdswint/services/web/node_modules/grpc/src/common.js:91:15)\n
          at Object.onReceiveStatus (/home/cdswint/services/web/node_modules/grpc/src/client_interceptors.js:1209:28)\n
          at InterceptingListener._callNext (/home/cdswint/services/web/node_modules/grpc/src/client_interceptors.js:568:42)\n
          at InterceptingListener.onReceiveStatus (/home/cdswint/services/web/node_modules/grpc/src/client_interceptors.js:618:8)\n
          at callback (/home/cdswint/services/web/node_modules/grpc/src/client_interceptors.js:847:24)",
        {"_internal_repr":"9","flags":0},"Deadline Exceeded",{}]

Solution: Check your NFS server and make sure it is running. You will need to restart the NFS clients in your ML workspace’s namespace. These are the “ds-vfs” and “s2i-client” pods. Simply delete the Kubernetes pods whose names start with “ds-vfs” and “s2i-client”.
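
A minimal sketch of deleting those pods, assuming a <workspace-namespace> placeholder; their controllers should recreate them automatically:

    # Delete the NFS client pods so that they are recreated
    oc get pods -n <workspace-namespace> -o name \
      | grep -E 'ds-vfs|s2i-client' \
      | xargs oc delete -n <workspace-namespace>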

Stale File Handles: When opening a project from the ML web UI, an error message like “NFS: Stale file handle” shows up on the UI.

Solution: This indicates that the NFS server and a client are out of sync, probably caused by a server restart along with a file system content change on the server that the client is not aware of. Restart the NFS client pods in your ML workspace’s namespace: these are the “ds-vfs” and “s2i-client” pods, plus any user session pods that are affected by the “Stale file handle” error.

Project Fork Creating Multiple Copies: When creating a new project from an existing project using the “Fork” feature, you might see the operation seemingly fail on the UI, but it still ends up creating multiple copies of the source project.

Solution: This issue happens when forking a project takes longer than the idle connection timeout set on the external load balancer, as well as in HA Proxy policy settings on OpenShift. Increase the idle connection timeout to at least 5 minutes. Depending on the performance of the NFS server, a higher timeout may be necessary.

DSE-11837: Timeout limitation for Project API

Settings Fix

  • A timeout of 5 minutes was chosen after experimentation on Portworx 2-way replication with the NFS server provisioner.
  • Ideally, the external load balancer timeout is equal to or larger than the automated or suggested OpenShift router timeout.

Prerequisites

  • Set any external load balancer server timeout to 5 min.

TLS Enabled Workspace

  • Set the annotation haproxy.router.openshift.io/timeout=300s on each route in a deployed CML workspace's namespace (a verification command is sketched after this step):
    oc annotate route --all=true --overwrite=true -n <cml-namespace> haproxy.router.openshift.io/timeout=300s
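
To confirm that the annotation is present on the routes, a check like the following should work:

    # List the timeout annotation for every route in the workspace namespace
    oc get routes -n <cml-namespace> -o yaml | grep "haproxy.router.openshift.io/timeout"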

Non-TLS Enabled Workspace

  • Automatic post-install operations add the annotation haproxy.router.openshift.io/timeout: 300 for all routes for a workspace.

Workaround: Project creation still occurs. Check the Projects page after a few minutes; project creation should be complete.

DSE-10890: Scala session causes engine to run out of memory

Launching a Scala session may cause the following error:

    Engine ran out of memory, please consider increasing engine size.

To resolve this error, increase the memory for the requested engine.

DSE-9549: TLS requires manual steps

To provision a TLS-enabled workspace, you need to perform several manual steps. This procedure is described in Deploy an ML Workspace with Support for TLS.