Creating a Model Endpoint using UI

The Create Model Endpoint page allows you to select a specific Cloudera AI Inference service instance and a model version from Cloudera AI Registry to create a new model endpoint.

The following steps illustrate how to create a Llama 3.1 model endpoint.

  1. In the Cloudera console, click the Cloudera AI tile.

    The Cloudera AI Workbenches page is displayed.

  2. Click Model Endpoints on the left navigation menu.

    The Model Endpoints landing page is displayed.

  3. Step 1: Endpoint Details
    1. Select Environment & Inference Service: Select your Cloudera environment and the Cloudera AI Inference service instance within which you want to create the model endpoint.
    2. Name: Enter a unique name for the model endpoint.
    3. Description (Optional): Provide a short description of the model endpoint. The description must be under 5,000 characters.
    4. Click Next.
  4. Step 2: Model Builder
    1. Model Name: Select the registered model you wish to deploy.
    2. Version: Choose the specific version of the model.
    3. Traffic Allocation (%): Specify the traffic split between the different model versions that you deploy. For the first model version, this is fixed at 100% and cannot be changed.
    4. Task: Select a specific task for this model, such as text generation, embedding, or reranking. If left empty, the model will perform its default task.
    5. Click Next.
  5. Step 3: Resource Profile
    1. Instance Type: Select the compute node instance type on which to run your model replicas from the Instance Type dropdown list. For NVIDIA NIM models this field is mandatory; the right choice depends on the capabilities of the available instance types and on what the NVIDIA NIM requires. The field is optional for standard predictive model endpoints.
    2. Resource Allocation: Specify the required CPU (vCPU) and Memory (GiB). If using a GPU instance, also specify the GPU count.
    3. Endpoint Autoscale Range: Use the Minimum and Maximum fields to specify the minimum and maximum number of replicas for the model endpoint. The system scales the number of replicas within this range, based on the autoscaling metric you choose, to meet the incoming load.
    4. Autoscale Metric Type: Select one of the following:
      • Request Per Second (RPS): Scales based on the number of requests per second per replica.
      • Concurrency: Scales based on the number of concurrent requests per replica.
      For example, if you scale on RPS and the Target Metric Value is set to 200, the system adds a new replica when it sees a single replica handling 200 or more requests per second, and scales the endpoint down by terminating a replica when the RPS falls below 200.
    5. Target Metric Value: Enter the threshold that triggers a scaling event.
    6. Click Next.
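The scale-up rule described above can be sketched as a simple replica calculation. This is a minimal illustration of the behavior only, not the actual autoscaler implementation:

```python
import math

def desired_replicas(observed_load: float, target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so that each replica stays at or below the
    Target Metric Value (RPS or concurrency), clamped to the
    endpoint's autoscale range."""
    needed = math.ceil(observed_load / target) if target > 0 else min_replicas
    return max(min_replicas, min(max_replicas, needed))

# With a Target Metric Value of 200 RPS per replica, 450 RPS of
# incoming traffic, and an autoscale range of 1-5 replicas:
print(desired_replicas(450, 200, 1, 5))  # 3
```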
  6. Step 4: Advanced Options
    1. Environment Variables: Add any required variables for the model.
      Use the Name dropdown to select a key (for example, NIM_LOG_LEVEL) and enter the corresponding Value in the text field.
    2. Access Control: If your administrator has enabled Fine-grained Access Control, you must define access levels for users or groups assigned the MLUser or MLAdmin resource roles during endpoint creation. Three access levels can be specified for Model Endpoints:
      • View: The model endpoint appears in the Model Endpoints list and the listEndpoints API. Users can access model endpoint metadata.
      • Access: The user or group can run inference on the model endpoint.
      • Manage: The user or group can view the endpoint, run inference, and modify or delete the endpoint.
    3. Tags: Add any custom key and value pairs to help organize your resources.
    4. Click Next.
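The three access levels form a strict hierarchy. A minimal sketch of the operations each level permits follows; the level names come from the UI, but the operation names here are illustrative, not an actual API:

```python
# Access levels for Model Endpoints, from least to most privileged.
# Operation names are illustrative placeholders.
ACCESS_LEVELS = {
    "View":   {"list", "read_metadata"},
    "Access": {"list", "read_metadata", "run_inference"},
    "Manage": {"list", "read_metadata", "run_inference", "modify", "delete"},
}

def can(level: str, operation: str) -> bool:
    """Return True if the given access level permits the operation."""
    return operation in ACCESS_LEVELS.get(level, set())

print(can("Access", "run_inference"))  # True
print(can("View", "run_inference"))    # False
```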
  7. Step 5: Review and Create
    1. Verify all selected details, including the environment, model version, resource allocation, and scaling range.
    2. Click Create Endpoint to begin the deployment.
      It can take tens of minutes for the model endpoint to become ready. The time taken depends on the following factors:
      • whether a new node must be added to the cluster to host the replicas.
      • the time taken to pull the necessary container images.
      • the time taken to download the model artifacts from the Cloudera AI Registry to the cluster nodes.

      In this example, we are creating a model endpoint for the instruct variant of Llama 3.1 8B, which has been optimized to run on two NVIDIA A10G GPUs per replica. The cluster we are deploying into runs on AWS and has a node group of g5.12xlarge instances. Because no other node group in this cluster has instance types with A10G GPUs, g5.12xlarge is the only choice. For NVIDIA NIM models that specify GPU models and counts, the UI fills in the GPU field on the resource configuration page automatically.
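Once the endpoint is ready, LLM endpoints such as this one can typically be called over an OpenAI-compatible chat completions API. The sketch below only builds a request body; the base URL, token, and model ID are placeholders to be copied from the endpoint's details page, not literal values:

```python
import json

# Placeholders: copy the real endpoint URL, model ID, and auth token
# from the Model Endpoints details page in the UI.
BASE_URL = "https://<inference-service-domain>/<endpoint-path>"
TOKEN = "<CDP_TOKEN>"

# OpenAI-style chat completions request body.
payload = {
    "model": "<model-id>",  # e.g. the Llama 3.1 8B instruct model ID
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

print(json.dumps(payload, indent=2))

# Send with any HTTP client, for example:
# import requests
# r = requests.post(f"{BASE_URL}/v1/chat/completions",
#                   headers={"Authorization": f"Bearer {TOKEN}"},
#                   json=payload)
```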