About the SQL AI Assistant in CDW

Learn about the AI models and services that Hue uses to run the SQL AI Assistant and its limitations. Review what data is shared with the LLM models before you start using the SQL AI Assistant with Hue in Cloudera Data Warehouse (CDW).

A SQL AI Assistant has been integrated into Hue with the capability to leverage the power of Large Language Models (LLMs) for various SQL tasks. It helps you to create, edit, optimize, fix, and succinctly summarize queries using natural language and makes SQL development faster, easier, and less error-prone. Both Hive and Impala dialects are supported.

AI models and services that Hue uses

The SQL AI Assistant supports various LLMs and hosting services. The models run on cloud infrastructure, and the AI Assistant can be configured to use them remotely. Cloudera has tested with GPT running in Open AI, Microsoft Azure, and Amazon Bedrock. The following service-model combinations are supported:
Service Provider Model Model Versions
OpenAI OpenAI GPT
  • gpt-3.5-turbo
  • gpt-3.5-turbo-16k

Current GPT version is 3.5 turbo. You can configure GPT 4 for better results.

Microsoft Azure OpenAI GPT
  • gpt-3.5-turbo
  • gpt-3.5-turbo-16k
Amazon Bedrock Anthropic Claude
  • anthropic.claude-v1
  • anthropic.claude-v2

Newer models such as Claude 3 are not yet supported on CDW Private Cloud.

Amazon Bedrock Amazon Titan
  • amazon.titan-text-express-v1

The SQL AI Assistant uses a Retrieval Augmented Generation (RAG)-based architecture for augmenting results. It uses the sentence-transformer library for semantic search, and Hue can be configured with any of the pre-trained models for better multi-lingual support. By default, all-MiniLM-L6-v2 models are used.

Embedding Model Language Support
all-MiniLM-L6-v2 English
distiluse-base-multilingual-cased-v1 Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.

What data is shared with the LLM models

The following details are shared with the LLMs:

  • Everything that a user inputs
  • Dialect in use
  • Table details such as table name, column names, column data types and related keys, partitions, and constraints that the logged-in user has access to.
  • Three sample rows from the tables (as per the best practices specified in Evaluating the Text-to-SQL Capabilities of Large Language Models)

Limitations

Non-deterministic nature
LLMs are non-deterministic, which means you cannot guarantee the same output for the same input every time, and it can lead to different responses to similar queries.
Ambiguity
LLMs may struggle to handle ambiguous queries or contexts. SQL queries often rely on specific and unambiguous language, but LLMs can misinterpret or generate ambiguous SQL queries, leading to incorrect results.
Hallucinations
In the context of LLMs, hallucination refers to a phenomenon where these models generate text or responses that are incorrect, nonsensical, or fabricated. Occasionally you might see incorrect identifiers or literals in the response.