About the SQL AI Assistant in CDW
Learn about the AI models and services that Hue uses to run the SQL AI Assistant and its limitations. Review what data is shared with the LLM models before you start using the SQL AI Assistant with Hue in Cloudera Data Warehouse (CDW).
A SQL AI Assistant has been integrated into Hue with the capability to leverage the power of Large Language Models (LLMs) for various SQL tasks. It helps you to create, edit, optimize, fix, and succinctly summarize queries using natural language and makes SQL development faster, easier, and less error-prone. Both Hive and Impala dialects are supported.
AI models and services that Hue uses
Service Provider | Model | Model Versions |
---|---|---|
OpenAI | OpenAI GPT |
Current GPT version is 3.5 turbo. You can configure GPT 4 for better results. |
Microsoft Azure | OpenAI GPT |
|
Amazon Bedrock | Anthropic Claude |
Newer models such as Claude 3 are not yet supported on CDW Private Cloud. |
Amazon Bedrock | Amazon Titan |
|
The SQL AI Assistant uses a Retrieval Augmented Generation (RAG)-based architecture
for augmenting results. It uses the sentence-transformer library for semantic search, and Hue
can be configured with any of the pre-trained models for better multi-lingual support. By
default, all-MiniLM-L6-v2
models are used.
Embedding Model | Language Support |
---|---|
all-MiniLM-L6-v2 | English |
distiluse-base-multilingual-cased-v1 | Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, and Turkish. |
What data is shared with the LLM models
The following details are shared with the LLMs:
- Everything that a user inputs
- Dialect in use
- Table details such as table name, column names, column data types and related keys, partitions, and constraints that the logged-in user has access to.
- Three sample rows from the tables (as per the best practices specified in Evaluating the Text-to-SQL Capabilities of Large Language Models)
Limitations
- Non-deterministic nature
- LLMs are non-deterministic, which means you cannot guarantee the same output for the same input every time, and it can lead to different responses to similar queries.
- Ambiguity
- LLMs may struggle to handle ambiguous queries or contexts. SQL queries often rely on specific and unambiguous language, but LLMs can misinterpret or generate ambiguous SQL queries, leading to incorrect results.
- Hallucinations
- In the context of LLMs, hallucination refers to a phenomenon where these models generate text or responses that are incorrect, nonsensical, or fabricated. Occasionally you might see incorrect identifiers or literals in the response.