Scheduling Spark jobs
Spark jobs can optionally be scheduled so that they are automatically run on an interval. Cloudera Data Engineering uses the Apache Airflow scheduler to create the schedule instances.
Make sure that the Spark job has been created and all necessary resources have been created and uploaded.
-
Define a running interval for your Spark job:
The schedule interval is defined by a cron expression. Intervals can be regular, such as daily at 3 a.m., or irregular, such as hourly but only between 2 a.m. and 6 a.m. and only on weekdays. You can provide the cron expression directly or you can generate it using flags.
Available schedule interval flags are:
--cron-expression
- A cron expression that is provided directly to the scheduler. For
example,
0 */1 * * *
--every-minutes
- Running frequency in minutes. Valid values are 0-59. Only a single value is allowed.
--every-hours
- Running frequency in hours. Valid values are 0-23. Only a single value is allowed.
--every-days
- Running frequency in days. Valid values are 1-31. Only a single value is allowed.
--every-months
- Running frequency in months. Valid values are 1-12. Only a single value is allowed.
--for-minutes-of-hour
- The minutes of the hour to run on. Valid values are 0-59. Single value, range (e.g.: 1-5), or list (e.g.: 5,10) are allowed.
--for-hours-of-day
- The hours of the day to run on. Valid values are 0-23. Single value, range (e.g.: 1-5), or list (e.g.: 5,10) are allowed.
--for-days-of-month
- The days of the month to run on. Valid values are 1-31. Single value, range (e.g.: 1-5), or list (e.g.: 5,10) are allowed.
--for-months-of-year
- The months of the year to run on. Valid values are 1-12 and JAN-DEC. Single value, range (e.g.: 1-5), or list (e.g.: APR,SEP) are allowed.
--for-days-of-week
- The days of the week to run on. Valid values are SUN-SAT and 0-6. Single value, range (e.g.: 1-5), or list (e.g. TUE,THU) are allowed.
For example, to set the interval as hourly but only between 2 a.m. and 6 a.m. and only on weekdays, use the command:
cde job create --name test_job --schedule-enabled=true --every-hours 1 --for-minutes-of-hour 0 --for-hours-of-day 2-6 --for-days-of-week MON-FRI --schedule-start 2021-03-09T00:00:00Z
Or, equivalently, using a single cron expression:
cde job create --name test_job --schedule-enabled=true --cron-expression '0 2-6/1 * * MON-FRI' --schedule-start 2021-03-09T00:00:00Z
-
Define a time range for your Spark job:
The schedule also defines the range of time that instances can be created for. The mandatory
--schedule-start
flag timestamp tells the scheduler the date and time from which the scheduling begins. The optional--schedule-end
flag timestamp tells the scheduler the last date and time at which the schedule is active. If--schedule-end
is not specified, the job runs at the scheduled interval until it is stopped manually.For example, to create a schedule that runs at midnight for each day of a single week, use the following command:
cde job create --name test_job --schedule-enabled=true --every-days 1 --for-minutes-of-hour 0 --for-hours-of-day 0 --schedule-start 2021-03-09T00:00:00Z --schedule-end 2021-03-15T00:00:00Z