Query history table

Learn how the query history service in Hive stores past query data in structured tables using Iceberg and ORC for efficient querying and long-term data storage.

The query history service in Hive is designed to store historical query data in a scalable and queryable format. It uses an Iceberg table backed by the ORC file format, enabling efficient SQL queries for analyzing past workloads.

Why Iceberg and ORC were chosen?

To support scalability and queryability, the service avoids storing one file per query or session. Instead, it uses the Iceberg table format, which provides:
  1. Schema evolution
  2. Partitioning support
  3. Compatibility with SQL engines
  4. Efficient metadata management

The ORC format is used for physical storage, offering compact, columnar storage and fast performance for Hive workloads.

The query history service is designed to be pluggable, allowing flexibility to support other storage formats in the future. Hive handles all writing operations, so the service itself does not need to manage storage-level complexity.

Benefits of using Iceberg

The query history service stores data in an Iceberg table instead of traditional file-based logs. Iceberg offers the following advantages:
  • Efficient storage and query performance.
  • Schema evolution support.
  • Partitioning and metadata handling for faster analysis.
  • Integration with query engines like Hive and Impala.

Schema structure

The schema of the query history table is organized into fields and partitioning to efficiently store and query historical data.
  1. Fields:

    The query history service gathers a wide range of query-related data, categorized into basic and runtime fields. These fields address challenges related to data extraction and integration within HiveServer2. Basic fields are extracted during the service's integration into HiveServer2, leveraging encapsulating objects such as QueryProperties, QueryInfo, QueryDisplay, and DriverContext. Each of these objects contains overlapping fields relevant to query tracking.While these objects provide valuable query-related information, the service needed to identify a central access point for all necessary data.DriverContext was ultimately chosen as it offers comprehensive access to all required query details, ensuring efficient data extraction and storage.

  2. Partitioning:

    The table is partitioned by cluster_id, enabling users to filter queries based on the cluster or compute group that executed them. This partitioning strategy improves query performance and ensures data separation across different environments. For more information, see HIVE-28324