Scheduling queries

If you need a simple yet powerful and secure way to create, manage, and monitor scheduled jobs, you can use Apache Hive scheduled queries. Scheduled queries can replace an OS-level scheduler such as cron, or a workflow scheduler such as Apache Oozie or Apache Airflow.

Using SQL statements, you can schedule Hive queries to run on a recurring basis, monitor query progress, and optionally disable a query schedule. You can run queries to ingest data periodically, refresh materialized views, replicate data, and perform other repetitive tasks. For example, you can insert data from a stream into a transactional table every 10 minutes, refresh a materialized view used for BI reporting every hour, and replicate data from one cluster to another on a daily basis.

A Hive scheduled query consists of the following parts:
  • A unique name for the schedule
  • The SQL statement to be executed
  • The execution schedule defined by a Quartz cron expression
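
The three parts listed above map onto a single statement. The following is a minimal sketch, assuming a materialized view that should be refreshed hourly; the schedule name refresh_bi_mv and the view name bi_sales_mv are hypothetical placeholders, not names from this document.

```sql
-- Hypothetical example: refresh_bi_mv and bi_sales_mv are placeholder names.
-- Part 1: the unique schedule name follows CREATE SCHEDULED QUERY.
-- Part 2: the Quartz cron expression '0 0 * * * ? *' fires at the start of every hour.
-- Part 3: the SQL statement to execute follows AS.
CREATE SCHEDULED QUERY refresh_bi_mv
CRON '0 0 * * * ? *'
AS ALTER MATERIALIZED VIEW bi_sales_mv REBUILD;
```

Depending on how your deployment is configured, a newly created schedule may be disabled until you enable it explicitly.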

Quartz cron expressions are expressive and flexible. For instance, an expression can describe a simple schedule such as every 10 minutes, but also an execution at 10 AM on the first Sunday of January and February in 2021 and 2022. For common schedules, you can use a more readable format instead of a cron expression, for example every 20 minutes or every day at '3:25:00'.
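
As an illustration of the schedules described above, the sketch below shows how they might be written. A Quartz cron expression has seven fields: second, minute, hour, day of month, month, day of week, and year. The schedule and table names are hypothetical, and the exact syntax should be checked against the Hive version you run.

```sql
-- Hypothetical names: ingest_events, events_txn, and events_stream are placeholders.
-- Quartz field order: second minute hour day-of-month month day-of-week year

-- Every 10 minutes, written as a Quartz cron expression.
CREATE SCHEDULED QUERY ingest_events
CRON '0 */10 * * * ? *'
AS INSERT INTO events_txn SELECT * FROM events_stream;

-- 10 AM on the first Sunday of January and February in 2021 and 2022
-- would be expressed as: '0 0 10 ? 1-2 SUN#1 2021-2022'

-- Every 20 minutes, written in the more readable EVERY form.
CREATE SCHEDULED QUERY ingest_events_20
EVERY 20 MINUTES
AS INSERT INTO events_txn SELECT * FROM events_stream;
```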

Operation

A scheduled query belongs to a namespace, which is a collection of HiveServer (HS2) instances that are responsible for executing the query. The Hive metastore stores scheduled queries, the status of ongoing and previously executed statements, and other information. HiveServer periodically polls the metastore to retrieve scheduled queries that are due to be executed. If you run multiple HiveServer instances within a single deployment, the metastore guarantees that only one of them executes a given scheduled query at any given time.
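
Because the metastore keeps both the schedule definitions and the execution status, you can inspect them with ordinary queries. The sketch below assumes the information_schema views that Apache Hive provides for scheduled queries; the view names, the schedule_name column, and the schedule name refresh_bi_mv are assumptions that you should verify in your environment.

```sql
-- Assumption: these information_schema views exist in your Hive version.
-- List the defined schedules, including whether each is enabled.
SELECT * FROM information_schema.scheduled_queries;

-- Show the execution history and state for one schedule.
-- refresh_bi_mv is a hypothetical schedule name.
SELECT *
FROM information_schema.scheduled_executions
WHERE schedule_name = 'refresh_bi_mv';
```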

You create, alter, and drop scheduled queries using dedicated SQL statements.
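
The following sketch shows how a schedule might be altered and dropped. It assumes a scheduled query named refresh_bi_mv already exists; the name is a hypothetical placeholder.

```sql
-- Assumption: a scheduled query named refresh_bi_mv already exists.
-- Change the schedule to run every 20 minutes.
ALTER SCHEDULED QUERY refresh_bi_mv EVERY 20 MINUTES;

-- Temporarily stop executions without deleting the schedule.
ALTER SCHEDULED QUERY refresh_bi_mv DISABLE;

-- Resume executions.
ALTER SCHEDULED QUERY refresh_bi_mv ENABLE;

-- Remove the schedule entirely.
DROP SCHEDULED QUERY refresh_bi_mv;
```

Disabling rather than dropping keeps the definition in the metastore, so the schedule can be resumed later without being recreated.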