Launching Synthetic Data Studio within a project

You can launch Synthetic Data Studio on the Cloudera AI Platform to generate datasets and evaluate them.

Synthetic Data Studio integrates with two enterprise inference services:

Cloudera AI Inference service: Offers enterprise-grade deployment options.
AWS Bedrock: Provides scalable cloud-based inference.

Environment variables: All environment variables are optional if you use only Cloudera AI Inference service (CAII) endpoints hosted in the same cluster as the Synthetic Data Studio (SDS) application.

AWS_DEFAULT_REGION: The AWS region to use. Defaults to us-east-1.
AWS_ACCESS_KEY_ID: Your AWS access key ID.
AWS_SECRET_ACCESS_KEY: Your AWS secret access key.
Hf_token: Your Hugging Face token for exporting datasets.
Hf_username: Your Hugging Face username.
CDP_TOKEN: Overrides the JWT token for Cloudera AI Inference service.
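
Before launching, it can help to check which of these optional variables are actually set. The following is a minimal sketch, not part of Synthetic Data Studio: the function name summarize_sds_env and the grouping of variables are hypothetical, and the variable names are copied from the list above (confirm the exact spelling and casing on the Configure Studio page). It assumes that both AWS keys are needed together for AWS Bedrock and that both Hugging Face values are needed together for exporting datasets.

```python
import os

# Variable names are copied from the list above. The grouping into
# "AWS" and "Hugging Face" sets is an assumption made for illustration.
AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
HF_VARS = ("Hf_token", "Hf_username")


def summarize_sds_env(env=None):
    """Report which optional Synthetic Data Studio settings are present.

    `env` is any mapping of variable names to values; it defaults to
    the current process environment.
    """
    env = os.environ if env is None else env
    missing_aws = [name for name in AWS_VARS if not env.get(name)]
    missing_hf = [name for name in HF_VARS if not env.get(name)]
    return {
        # The documented default region is us-east-1.
        "aws_region": env.get("AWS_DEFAULT_REGION") or "us-east-1",
        "aws_credentials_complete": not missing_aws,
        "missing_aws": missing_aws,
        "hugging_face_complete": not missing_hf,
        "missing_hugging_face": missing_hf,
        # CDP_TOKEN is only an override, so its absence is not an error.
        "cdp_token_override": bool(env.get("CDP_TOKEN")),
    }


if __name__ == "__main__":
    for key, value in summarize_sds_env().items():
        print(f"{key}: {value}")
```

If you use only CAII endpoints in the same cluster, every "missing" entry in this report is expected and can be ignored.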

Host names: For air-gapped installations that use a proxy setup, you must whitelist the required URLs in your firewall rules. For a list of hostnames to whitelist, see Host names and endpoints required for AI Studios.

In the Cloudera console, click the Cloudera AI tile.

The Cloudera AI Workbenches page is displayed.
Click the name of the workbench.

The workbench Home page is displayed.
Click Projects, and then click New Project to create a new project.

In the left navigation pane, the new AI Studios option is displayed.
Click AI Studios.
Click the Launch button in the Synthetic Data Studio box. The Configure Studio: Synthetic Data Studio page is displayed.
Set the environment variables for the Synthetic Data Studio, using the details mentioned in the prerequisites.
Select the Runtime version.
Click Launch AI Studio.

The Synthetic Data Studio page is displayed.

After launching, you can view the list of tasks being executed as part of the AI studio deployment.
After configuration, Synthetic Data Studio is displayed in the left navigation pane under AI Studios.
Click Synthetic Data Studio, and then click Get Started.
On this page, you can generate synthetic datasets for training models and evaluate the generated datasets for fine-tuning LLMs.