Generating synthetic data for a ticketing use case using the Supervised Fine-Tuning workflow
Follow the steps below to generate synthetic data for a ticketing use case using the Supervised Fine-Tuning workflow in Synthetic Data Studio.
- In the Cloudera console, click the Cloudera AI tile. The Cloudera AI Workbenches page displays.
- Click the name of the workbench. The workbench Home page displays.
- Click AI Studios.
- Launch Synthetic Data Studio: under AI Studios, click Synthetic Data Studio, and then click Get Started.
- Under Create Datasets, click Getting Started. The Synthetic Dataset Studio page displays.
- In the Configure tab, specify the following:
  - In Dataset Display Name, enter Ticketing dataset.
  - In Model Provider, select AWS Bedrock.
  - In Model ID, enter us.anthropic.claude-3-5-sonnet-20241022-v2:0.
  - In Workflow, select Supervised Fine-Tuning.
  - In Template, select Custom.
- In the Prompt tab, specify the following:
  - In Prompt, write a prompt that instructs the LLM to create both user queries and system responses. The workflow generates both the user prompt and the completion in a single step.
    Example prompt: The following example first gives the LLM general guidelines on creating a prompt and a completion, then lists requirements for the data (such as the use of respectful language and the level of detail), and finally explains the possible completion values and how the system creates the user prompts (queries) along with the system completions (system responses).
    Generate authentic customer support ticket interactions with a user query and system response. For each user query, the system generates a keyword used to forward the user to the appropriate subsystem.
    Requirements for user queries:
    - Use professional, respectful language.
    - Avoid assumptions about demographics or identity.
    - Include realistic technical details when relevant.
    - Provide clear, actionable solutions.
    - Use inclusive terminology.
    - Maintain a helpful, solution-focused tone.
    - Include relevant troubleshooting steps where applicable.
    - Follow standard customer service best practices.
    Each response should be a single id from the following list: cancel_ticket, customer_service, pay, report_payment_issue
    Here are the explanations of the responses:
    cancel_ticket means that the customer wants to cancel the ticket.
    customer_service means that the customer wants to talk to customer service.
    pay means that the customer wants to pay the bill.
    report_payment_issue means that the customer is facing payment issues and wants to be forwarded to the billing department to resolve the issue.
  - In Seeds Instructions, define seed topics to diversify the generated dataset:
    Cancellation & Refunds
    Event Inquiries
    Billing Inquiries
    General Inquiries
  - In Entries Per Seed, specify 5 as the number of entries to generate for each seed defined in Seeds Instructions.
  - Under Parameters, adjust the following model parameters (the sketch after this step shows how these settings map to a direct call to the same model):
    - Temperature: Set to 1.0 to allow the LLM to generate diverse synthetic data.
    - Top K: Set to 100 to explore a wide range of possible solutions.
    - Top P: Set to 1.0 for broader exploration of outputs.
    - Max Tokens: Set to 2048, or adjust based on the size of the generated text. For use cases that generate longer text, consider increasing Max Tokens.
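Synthetic Data Studio invokes the model for you, so no code is required at this step. If you want to sanity-check the prompt and parameter settings against the same Claude model outside the studio, the following minimal sketch shows how they could map onto a direct AWS Bedrock call with boto3. The region, client setup, and the abbreviated message text are assumptions for illustration; only the model ID and the Temperature, Top K, Top P, and Max Tokens values come from the steps above.

```python
import json

import boto3  # assumes AWS credentials with Bedrock access are already configured

# Hypothetical standalone check of the prompt and parameters configured above;
# Synthetic Data Studio performs the equivalent call internally.
client = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,   # Max Tokens
    "temperature": 1.0,   # Temperature
    "top_k": 100,         # Top K
    "top_p": 1.0,         # Top P
    "messages": [
        {
            "role": "user",
            "content": (
                "Generate authentic customer support ticket interactions with a user "
                "query and system response. [...full prompt from the Prompt field...] "
                "Seed topic: Cancellation & Refunds"
            ),
        }
    ],
}

response = client.invoke_model(
    modelId="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["content"][0]["text"])
```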
- In the Examples tab, view the details of example prompts. Using examples, you can teach the LLM how to structure the prompts and completions of the generated data (a sketch of such prompt/completion pairs follows this step).
  - Under Actions, click Add Example and define prompts and completions so that the LLM knows the format of the data to be generated. Click Add Example to add the following prompts and completions:

    Table 1.
    | Field name | Value |
    | --- | --- |
    | Example 1 - Prompt | I have received a message that I owe $300 and I was instructed to pay the bill online. I already paid this amount and I am wondering why I received this message. |
    | Example 1 - Completion | report_payment_issue |
    | Example 2 - Prompt | I have received two payment invoices and need to pay my bills using a credit card. |
    | Example 2 - Completion | pay |
    | Example 3 - Prompt | I will not be able to attend the presentation and would like to cancel my RSVP. |
    | Example 3 - Completion | cancel_ticket |
    | Example 4 - Prompt | I am having questions regarding the exact time, location, and requirements of the event and would like to talk to customer service. |
    | Example 4 - Completion | customer_service |
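Each example in this tab reduces to a prompt/completion pair that the model learns to imitate. As a rough sketch (the field names "prompt" and "completion" are an assumption for illustration, not the studio's documented schema), the four examples above correspond to records like these:

```python
# The four examples expressed as prompt/completion pairs; field names are assumed.
examples = [
    {
        "prompt": (
            "I have received a message that I owe $300 and I was instructed to pay the "
            "bill online. I already paid this amount and I am wondering why I received "
            "this message."
        ),
        "completion": "report_payment_issue",
    },
    {
        "prompt": "I have received two payment invoices and need to pay my bills using a credit card.",
        "completion": "pay",
    },
    {
        "prompt": "I will not be able to attend the presentation and would like to cancel my RSVP.",
        "completion": "cancel_ticket",
    },
    {
        "prompt": (
            "I am having questions regarding the exact time, location, and requirements "
            "of the event and would like to talk to customer service."
        ),
        "completion": "customer_service",
    },
]
```

Providing one example per completion id gives the model at least one demonstration of each routing target.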
- In the Summary tab, review all the data generation parameters to confirm that everything is as expected. Click Generate to initiate dataset generation. Alternatively, click Previous to return to previous tabs and make any necessary changes.
- In the Finish tab, view the status of the dataset creation. The generated Prompts and Completions are displayed, and the output dataset is saved in the Project File System within the Cloudera environment, where you can inspect it as sketched below.
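Once generation finishes, you may want to confirm that every generated completion is one of the four routing ids defined in the prompt. The sketch below assumes the dataset was exported as JSON Lines with "prompt" and "completion" fields and uses a hypothetical file name; adjust both to match the file actually written to your Project File System.

```python
import json

# The four routing ids allowed by the prompt.
ALLOWED_COMPLETIONS = {"cancel_ticket", "customer_service", "pay", "report_payment_issue"}

def validate_dataset(path: str) -> None:
    """Report generated records whose completion is not a valid routing id."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            record = json.loads(line)
            completion = str(record.get("completion", "")).strip()
            if completion not in ALLOWED_COMPLETIONS:
                print(f"Line {line_number}: unexpected completion {completion!r}")

# Hypothetical file name; replace with the dataset file in your project.
validate_dataset("ticketing_dataset.jsonl")
```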