Launching Synthetic Data Studio within a project

You can launch Synthetic Data Studio on the Cloudera AI Platform to generate datasets and evaluate them.

Agent Studio integrates with two major enterprise inference services:

  • Cloudera AI Inference Service: It offers enterprise-grade deployment options.

    To enable Cloudera AI Inference service for Synthetic Data Studio, ensure the followings:
    • The environment variable responsible for enabling Cloudera AI Inference service is CDP_TOKEN. By default CDP_TOKEN is set to null. If left as null, the application will use the JWT stored at /tmp/jwt to run Cloudera AI Inference service. Alternatively, if you provide a value for CDP_TOKEN during the pre-installation configuration of environment variables, it will override the default and be used for authentication.
    • Ensure that the Cloudera AI Inference service endpoints and model IDs are readily available. You will be prompted to provide these details if you choose Cloudera AI Inference service as the AI inference option in Synthetic Data Studio (SDS).
    • All endpoints used must conform to the OpenAI API standard.
    • For Cloudera AI on premises, you must use CDP_TOKEN for authentication. Auto-generated tokens stored in /tmp/jwt/ are not yet available in the Cloudera AI on premises version.

    For more details, see Authenticating Cloudera AI Inference service.

  • AWS Bedrock: It provides scalable cloud-based inference.
Environment Variables: Before installation, Synthetic Data Studio must be configured with the necessary environment variables - CDP_TOKEN- to enable the Cloudera AI Inference service.
  • AWS_DEFAULT_REGION: Defaults to the us-east-1 region.*
  • AWS_ACCESS_KEY_ID: Your AWS access key ID.*
  • AWS_SECRET_ACCESS_KEY: Your AWS secret access key.*
  • Hf_token: Your Hugging Face token for exporting datasets.
  • Hf_username: Your Hugging Face username.
  • CDP_TOKEN: Overrides the JWT token for Cloudera AI Inference service.
  1. In the Cloudera console, click the Cloudera AI tile.

    The Cloudera AI Workbenches page displays.

  2. Click on the name of the workbench.

    The workbenches Home page displays.

  3. Click Projects, and then click New Project to create a new project.

    In the left navigation pane, the new AI Studios option is displayed.

  4. Click AI Studios.
  5. Click the Launch button in the Synthetic Data Studio box. The Configure Studio: Synthetic Data Studio page is displayed.
  6. Set the environment variables for the Synthetic Data Studio, using the details mentioned in the prerequisites.
  7. Select the Runtime version.
  8. Click Launch AI Studio.

    The Synthetic Data Studio page is displayed.

    After launching, you can view the list of tasks being executed as part of the AI studio deployment.

  9. After configuration, Synthetic Data Studio is displayed in the left navigation page under AI Studios.
  10. Click Synthetic Data Studio and click Get Started.
    You can generate synthetic datasets for training models and evaluate the generated datasets for fine-tuning LLMs on this page.