Key features for Cloudera AI Inference service

The key features of Cloudera AI Inference service include:

  • Easy-to-use interface: Simplifies deployment and infrastructure management, reducing time to value for AI use cases.
  • Real-time predictions: Allows users to serve AI models in real time, providing low-latency predictions for client requests.
  • Monitoring and logging: Provides monitoring and logging capabilities that make it easier to troubleshoot issues and optimize workload performance.
  • Advanced deployment patterns: Supports advanced deployment patterns, such as canary deployments, which let users roll out new model versions gradually and compare their performance before promoting them to production.
  • Optimized performance: Integrates with NVIDIA NIM microservices and NVIDIA Triton Inference Server to accelerate inference performance on NVIDIA accelerated infrastructure.
  • Model access: Offers access to NVIDIA foundation models, tailored for NVIDIA hardware to increase inference throughput and to reduce latency.
  • REST API: Provides APIs for deploying, managing, and monitoring model endpoints. These APIs enable integration with continuous integration and continuous deployment (CI/CD) pipelines and with other tools used in Machine Learning Operations (MLOps) and Large Language Model Operations (LLMOps) workflows.
  • Fine-grained access control: When enabled, administrators can configure precise permission levels for individual users and groups.
  • Multiple Cloudera AI Registries connected to a single Cloudera AI Inference service: In this architecture, Cloudera AI Inference service acts as the unified inference layer that consumes models from any connected Cloudera AI Registry instance. Each Cloudera AI Registry has its own model catalog, versioning, access controls, and lifecycle management, but all models selected for serving are pushed to the same inference endpoint environment. This simplifies operational consistency because monitoring, autoscaling, and serving configurations are defined once within Cloudera AI Inference service, regardless of how many registries feed into it.
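
The canary deployment pattern listed above amounts to splitting request traffic between model versions by weight. The following is a minimal, product-independent sketch of that idea in Python. It does not use or represent the Cloudera AI Inference service API; the version names, the 90/10 split, and the helper functions are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def pick_version(weights, rng):
    """Pick a model version with probability proportional to its traffic weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(weights, n_requests, seed=0):
    """Route n_requests simulated requests and count how many each version receives."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return Counter(pick_version(weights, rng) for _ in range(n_requests))


# Hypothetical split: 90% of traffic to the stable version, 10% to the canary.
TRAFFIC = {"model-v1": 90, "model-v2-canary": 10}

counts = simulate(TRAFFIC, 10_000)
for name, count in sorted(counts.items()):
    print(f"{name}: {count} requests")
```

In a real rollout, the weight on the canary version would typically be raised in steps while its latency and accuracy metrics are compared against the stable version, and set back to zero if the canary underperforms.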