Known issues with Cloudera AI Registry standalone API
These are some of the known issues you might run into while using the Cloudera AI Registry standalone API.
- NGC model download timeout
- The NGC model import might time out, in which case the corresponding model version status is shown as “failed”. You can access the logs in the API v2 pod by following the steps in the Debugging the model import failure troubleshooting section.
- Cloudera AI Inference service is unable to discover the Cloudera AI Registry
- In certain cases, the Cloudera AI Inference service is unable to discover the Cloudera AI Registry.
- Model import failure
- You can download models concurrently only if their combined size is below approximately 400 GB. Exceeding this limit might result in import failures and unexpected behavior.
- Request throttling
- No request throttling mechanism is currently implemented, so excessive concurrent requests might lead to model import failures. To minimize the risk, limit concurrent requests to a maximum of 5, which is considered a safe threshold.
- Model import progress indicator
- A progress bar is not available for model imports. For reference, importing a 70 GB model typically takes approximately 1 hour. Plan accordingly and, if necessary, monitor the process through alternative options.
- Lack of model import failure details
- Currently, the UI and API do not provide specific reasons for model import failures. You have to use Kubernetes logs to diagnose the issues.
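
Because the service does not throttle requests itself, the two limits described above (at most 5 concurrent requests, and a combined size below approximately 400 GB) have to be enforced on the client side. The following Python sketch shows one possible way to do that. It is illustrative only: `plan_batches`, `run_batches`, and the `import_model` callable are hypothetical names that are not part of the Cloudera AI Registry API, and you would need to supply your own function that actually submits an import request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple

# Limits taken from the known issues listed above.
MAX_CONCURRENT_IMPORTS = 5
MAX_COMBINED_SIZE_GB = 400.0


def plan_batches(
    models: Iterable[Tuple[str, float]],
    max_count: int = MAX_CONCURRENT_IMPORTS,
    max_size_gb: float = MAX_COMBINED_SIZE_GB,
) -> List[List[str]]:
    """Group (name, size_gb) pairs into batches that respect both limits.

    Models are kept in their original order. A model that is larger than
    the size limit on its own is placed in a batch by itself.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0.0
    for name, size_gb in models:
        too_many = len(current) >= max_count
        too_big = current_size + size_gb >= max_size_gb
        if current and (too_many or too_big):
            # Close the current batch before it would break a limit.
            batches.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size_gb
    if current:
        batches.append(current)
    return batches


def run_batches(
    batches: List[List[str]],
    import_model: Callable[[str], object],
) -> list:
    """Run one batch at a time; models inside a batch run concurrently.

    `import_model` is a hypothetical caller-supplied function that
    submits a single import and blocks until it finishes.
    """
    results = []
    for batch in batches:
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results.extend(pool.map(import_model, batch))
    return results
```

For example, `plan_batches([("a", 70), ("b", 70), ("c", 150), ("d", 150)])` places the first three models in one batch and the fourth in a second batch, because adding the fourth would bring the combined size to 440 GB. Running batches strictly one after another is a conservative choice; it trades some throughput for a simple guarantee that neither limit is exceeded.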