Known Issues and Limitations

  • Known Issues with Model Builds and Deployed Models
    • Re-deploying or re-building models results in model downtime (usually brief).

    • Re-starting Cloudera Machine Learning does not automatically restart active models. These models must be manually restarted so they can serve requests again.

      Cloudera Bug: DSE-4950

    • Model deployment will fail if your project filesystem is too large for the Git Engines for Experiments and Models process. As a general rule, any project files (code, generated model artifacts, dependencies, etc.) larger than 50 MB must be part of your project's .gitignore file so that they are not included in snapshots for model builds.

    • Model builds will fail if your project filesystem includes a .git directory (likely hidden or nested). Typical build stage errors include:
      Error: 2 UNKNOWN: Unable to schedule build: [Unable to create a checkpoint of current source: [Unable to push sources to git server: ...

      To work around this, rename the .git directory (for example, NO.git) and re-build the model.

      Cloudera Bug: DSE-4657

    • JSON requests made to active models should not be more than 5 MB in size. This is because JSON is not suitable for very large requests and has high overhead for binary objects such as images or video. Call the model with a reference to the image or video, such as a URL, instead of the object itself.

    • Any external connections, for example, a database connection or a Spark context, must be managed by the model's code. Models that require such connections are responsible for their own setup, teardown, and refresh.

    • Model logs and statistics are only preserved so long as the individual replica is active. Cloudera Machine Learning may restart a replica at any time it is deemed necessary (such as bad input to the model).

    • (MLLib) The MLLib model.save() function fails with the following sample error. This occurs because the Spark executors on CML all share a mount of /home/cdsw which results in a race condition as multiple executors attempt to write to it at the same time.
      Caused by:
                        java.io.IOException: Mkdirs failed to create
                        file:/home/cdsw/model.mllib/metadata/_temporary ....
      Recommended workarounds:
      • Save the model to /tmp, then move it into /home/cdsw on the driver/session.
      • Save the model to either an S3 URL or any other explicit external URL.
  • Limitations
    • Scala models are not supported.

    • Spawning worker threads are not supported with models.

    • Models deployed using Cloudera Machine Learning Private Cloud are highly available subject to the following limitations:
      • Model high availability is dependent on the high availability of the Kubernetes service. If using a third-party Kubernetes service to host CDP Private Cloud, please refer to your chosen provider for precise SLAs.
      • In the event that the Kubernetes pod running either the model proxy service or the load balancer becomes unavailable, the Model may be unavailable for multiple seconds during failover.
    • Dynamic scaling and auto-scaling are not currently supported. To change the number of replicas in service, you will have to re-deploy the build.