Job Retry parameters

The Job Retry feature is available in Cloudera AI 1.5.5 SP1 and higher releases. Use the Job Retry parameters to configure the Job Retry settings.

If the Administrator configures the Job Retry settings, the specified values automatically populate the fields in the new job creation form when a user creates a job. In this case, the Job Retry settings act as default values that the Administrator recommends to users.

If the Administrator does not configure the Job Retry settings, the fields remain blank in the new job creation form.

In both cases, users have the flexibility to customize the values during job creation or update them later through the job settings page.
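The relationship between Administrator defaults and user-supplied values can be pictured with a minimal Python sketch. It is an illustration only, not the product's implementation: the RetrySettings class, its field names, and both helper functions are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RetrySettings:
    """Hypothetical model of the Job Retry form fields; None means a blank field."""
    max_retry: Optional[int] = None
    retry_delay_minutes: Optional[int] = None


def prefill_form(admin_defaults: Optional[RetrySettings]) -> RetrySettings:
    """Return the values shown when the new job creation form opens."""
    # Without Administrator defaults, every field stays blank.
    return admin_defaults if admin_defaults is not None else RetrySettings()


def apply_user_edits(form: RetrySettings, **edits: Optional[int]) -> RetrySettings:
    """Return a copy of the form with the fields the user changed."""
    return replace(form, **edits)


form = prefill_form(RetrySettings(max_retry=3, retry_delay_minutes=5))
job_settings = apply_user_edits(form, max_retry=1)
```

In this sketch the user overrides only Maximum Retry, so the recommended Retry Delay is carried over unchanged.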

Table 1. Job Retry parameters
Parameter Description
Maximum Retry

The Maximum Retry parameter defines the maximum number of retries that can be performed for a single failed job run. Retries continue until either a job run succeeds or the total number of retries reaches the maximum retry count specified by the user.

Setting the Maximum Retry option to a high value can result in higher resource usage.

The value must always be greater than 0.
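The stopping rule described for Maximum Retry can be sketched in Python as follows. This is a simplified model under stated assumptions: run_with_retries and run_once are hypothetical names, and the sketch ignores the Retry Delay and the Retry Conditions. It shows why a high value increases resource usage: a job that keeps failing is executed up to 1 + Maximum Retry times.

```python
from typing import Callable, Tuple


def run_with_retries(run_once: Callable[[], bool], max_retry: int) -> Tuple[bool, int]:
    """Run a job once, then retry while it fails, up to max_retry times.

    Returns (succeeded, retries_performed). Hypothetical helper for illustration.
    """
    if max_retry <= 0:
        raise ValueError("Maximum Retry must be greater than 0")

    if run_once():  # the original job run
        return True, 0

    for retries_performed in range(1, max_retry + 1):
        if run_once():  # a retry succeeded, so retrying stops
            return True, retries_performed

    # The retry count reached the maximum without a successful run.
    return False, max_retry
```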

Retry Delay

The Retry Delay value defines the time, in minutes, between consecutive retry attempts.

The Retry Delay period gives transient errors (for example, a temporary network or resource outage) time to clear without overloading the system with job run requests.

If you encounter transient issues, set the Retry Delay parameter to a higher value.

If you are addressing script failures and transient issues are not a concern, you can configure a lower Retry Delay value.

If you have time-sensitive jobs, set the Retry Delay parameter to a lower value so that retries are triggered sooner.

The value must always be greater than 0; the minimum Retry Delay value is 1 minute.
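The trade-off between a higher and a lower Retry Delay can be estimated with simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that the delay is expressed in whole minutes and is applied before every retry, including the first one.

```python
def validate_retry_delay(retry_delay_minutes: int) -> None:
    """Reject Retry Delay values below the 1-minute minimum."""
    if retry_delay_minutes < 1:
        raise ValueError("Retry Delay must be at least 1 minute")


def total_delay_minutes(max_retry: int, retry_delay_minutes: int) -> int:
    """Waiting time added if every allowed retry is used.

    Assumption: the delay also precedes the first retry.
    """
    if max_retry <= 0:
        raise ValueError("Maximum Retry must be greater than 0")
    validate_retry_delay(retry_delay_minutes)
    return max_retry * retry_delay_minutes


# 3 retries spaced 10 minutes apart add 30 minutes of waiting in the worst case.
worst_case_wait = total_delay_minutes(max_retry=3, retry_delay_minutes=10)
```

Under these assumptions, the estimate excludes the run time of each attempt, so the real elapsed time of a job that exhausts its retries is longer.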

Retry Conditions

The Retry Conditions parameter controls the terminal states of a job run that trigger a retry.

Select at least one of the criteria if Retry is enabled; you can select any combination of the Retry Conditions options. A retry is triggered as soon as any one of the selected criteria is met.

The following Retry Conditions options can be enabled:

  • Script Failure – It runs the Retry process for user script failures, that is, when the user script exits with a non-zero exit code.

  • System Failure – It runs the Retry process for any kind of system- or engine-related failure, not including user script failures.

  • Timed-out Runs – It runs the Retry process for timed-out job runs.

    The timeout value must be set to a reasonable duration. If the value is too short, each retry will encounter the same limit, potentially resulting in a continuous timeout → retry → timeout loop.

  • Skipped Runs – It runs the Retry process for skipped job runs.
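The way the selected Retry Conditions combine can be sketched as a simple membership check in Python. This is a hedged illustration: the RetryCondition enum and the should_retry function are hypothetical names, and a terminal state of None stands for a successful run.

```python
from enum import Enum
from typing import Optional, Set


class RetryCondition(Enum):
    """Hypothetical labels for the four Retry Conditions options."""
    SCRIPT_FAILURE = "Script Failure"
    SYSTEM_FAILURE = "System Failure"
    TIMED_OUT = "Timed-out Runs"
    SKIPPED = "Skipped Runs"


def should_retry(terminal_state: Optional[RetryCondition],
                 selected: Set[RetryCondition]) -> bool:
    """Return True if the job run ended in one of the selected states."""
    if not selected:
        raise ValueError("Select at least one Retry Condition")
    # Any single matching condition is enough to trigger a retry.
    return terminal_state in selected


selected = {RetryCondition.SCRIPT_FAILURE, RetryCondition.TIMED_OUT}
retry_timed_out = should_retry(RetryCondition.TIMED_OUT, selected)
retry_system_failure = should_retry(RetryCondition.SYSTEM_FAILURE, selected)
```

With this selection, a timed-out run is retried, while a system failure is not, because System Failure was not selected.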

Limit Concurrent Retries*

*This parameter can only be set by the Administrator.

The Limit Concurrent Retries option enables the Maximum Concurrent Retry Limit parameter. The Maximum Concurrent Retry Limit parameter specifies the maximum number of job retry runs that can execute concurrently across the entire workbench, regardless of the total number of jobs running.

If the Maximum Concurrent Retry Limit is reached, any additional job retry runs are rescheduled and do not start until the number of active retry runs falls below the limit.

Enable a hard limit only if job retry runs are consuming excessive resources; otherwise, avoid setting a hard limit.

Administrators can set the Maximum Concurrent Retry Limit value if the Limit Concurrent Retries option is enabled.
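The effect of the Maximum Concurrent Retry Limit can be modeled with a small counter-based gate in Python. This is a sketch under stated assumptions, not the workbench scheduler: the RetryGate class and its methods are hypothetical, and a limit of None stands for the Limit Concurrent Retries option being disabled.

```python
from typing import Optional


class RetryGate:
    """Hypothetical workbench-wide gate for concurrently running retry runs."""

    def __init__(self, max_concurrent_retries: Optional[int]) -> None:
        # None models the Limit Concurrent Retries option being disabled.
        self.limit = max_concurrent_retries
        self.active = 0

    def try_start(self) -> bool:
        """Admit a retry run, or return False so the caller reschedules it."""
        if self.limit is not None and self.active >= self.limit:
            return False
        self.active += 1
        return True

    def finish(self) -> None:
        """Record that an active retry run has ended, freeing one slot."""
        if self.active == 0:
            raise RuntimeError("no active retry run to finish")
        self.active -= 1


gate = RetryGate(max_concurrent_retries=2)
started = [gate.try_start() for _ in range(3)]  # the third retry run is deferred
gate.finish()                                   # one retry run ends
started_after_finish = gate.try_start()         # the deferred retry run can start
```

Only retry runs pass through the gate in this model; regular job runs are not counted, which mirrors the statement that the limit applies regardless of the total number of jobs running.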