Scheduling Spark jobs

Spark jobs can optionally be scheduled so that they run automatically at a defined interval. Cloudera Data Engineering uses the Apache Airflow scheduler to create the scheduled job instances.

Before scheduling, make sure that the Spark job exists and that all necessary resources have been created and uploaded.

  1. Define a running interval for your Spark job:

    The schedule interval is defined by a cron expression. Intervals can be regular, such as daily at 3 a.m., or irregular, such as hourly but only between 2 a.m. and 6 a.m. and only on weekdays. You can provide the cron expression directly or you can generate it using flags.

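    A cron expression consists of five space-separated fields: minute, hour, day of month, month, and day of week. For example, the expression 0 3 * * * runs daily at 3:00 a.m.
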
    Available schedule interval flags are:

    --cron-expression
    A cron expression that is provided directly to the scheduler. For example, 0 */1 * * * runs at minute 0 of every hour.
    --every-minutes
    Running frequency in minutes. Valid values are 0-59. Only a single value is allowed.
    --every-hours
    Running frequency in hours. Valid values are 0-23. Only a single value is allowed.
    --every-days
    Running frequency in days. Valid values are 1-31. Only a single value is allowed.
    --every-months
    Running frequency in months. Valid values are 1-12. Only a single value is allowed.
    --for-minutes-of-hour
    The minutes of the hour to run on. Valid values are 0-59. A single value, a range (for example, 1-5), or a list (for example, 5,10) is allowed.
    --for-hours-of-day
    The hours of the day to run on. Valid values are 0-23. A single value, a range (for example, 1-5), or a list (for example, 5,10) is allowed.
    --for-days-of-month
    The days of the month to run on. Valid values are 1-31. A single value, a range (for example, 1-5), or a list (for example, 5,10) is allowed.
    --for-months-of-year
    The months of the year to run on. Valid values are 1-12 and JAN-DEC. A single value, a range (for example, 1-5), or a list (for example, APR,SEP) is allowed.
    --for-days-of-week
    The days of the week to run on. Valid values are SUN-SAT and 0-6. A single value, a range (for example, 1-5), or a list (for example, TUE,THU) is allowed.
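
    Taken together, these flags generate a single cron expression: each --for-* value fills the corresponding cron field, and each --every-* value becomes a step value (/N), as the pair of equivalent commands below illustrates. On its own, for instance, --every-minutes 20 would correspond to the cron expression */20 * * * *.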

    For example, to set the interval as hourly but only between 2 a.m. and 6 a.m. and only on weekdays, use the command:

    cde job create --name test_job --schedule-enabled=true --every-hours 1 --for-minutes-of-hour 0 --for-hours-of-day 2-6 --for-days-of-week MON-FRI --schedule-start 2021-03-09T00:00:00Z
    

    Or, equivalently, using a single cron expression:

    cde job create --name test_job --schedule-enabled=true --cron-expression '0 2-6/1 * * MON-FRI' --schedule-start 2021-03-09T00:00:00Z
    
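    To verify the resulting schedule, you can inspect the job definition, for example with the CLI's describe subcommand:

    cde job describe --name test_job
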
  2. Define a time range for your Spark job:

    The schedule also defines the range of time during which job instances can be created. The mandatory --schedule-start flag specifies the date and time at which scheduling begins. The optional --schedule-end flag specifies the last date and time at which the schedule is active. If --schedule-end is not specified, the job runs at the scheduled interval until it is stopped manually.
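
    For instance, to schedule a job that runs daily at 3 a.m. indefinitely, provide only a start timestamp and omit --schedule-end:

    cde job create --name test_job --schedule-enabled=true --cron-expression '0 3 * * *' --schedule-start 2021-03-09T00:00:00Z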

    For example, to create a schedule that runs at midnight for each day of a single week, use the following command:

    cde job create --name test_job --schedule-enabled=true --every-days 1 --for-minutes-of-hour 0 --for-hours-of-day 0 --schedule-start 2021-03-09T00:00:00Z --schedule-end 2021-03-15T00:00:00Z
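
    To stop a schedule manually before its end date, disable scheduling on the job. Assuming your CLI version accepts the same schedule flags on cde job update, a minimal sketch:

    cde job update --name test_job --schedule-enabled=false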