Configuring Job Retry settings

The Job Retry feature is available in Cloudera AI 1.5.5 SP1 and higher releases. Job retry runs operate asynchronously, so they do not disrupt the normal flow of a job run. Retries are executed concurrently with normal job runs to maintain efficiency.

The Administrator can define default values for the Job Retry parameters. Only the Administrator can configure a hard limit on the maximum number of job retry runs that can be executed alongside normal job runs. Enable this limit only if you want to manage and restrict the resources used by job retry runs.

  1. In the Cloudera console, click the Cloudera AI tile.

    The Cloudera AI Workbenches page displays.

  2. Click the name of the workbench.
    The workbench Home page displays.
  3. Select Site Administration in the left Navigation pane.
  4. Select the Settings tab.
  5. Select Job Retry Configuration > Limit Concurrent Retries.
  6. Enable Limit Concurrent Retries by selecting the checkbox.

    Enabling this option limits how many job retry runs can be active at the same time.

  7. Define the limit value for Maximum Concurrent Retry Limit.

    The Maximum Concurrent Retry Limit specifies the maximum number of job retry runs that can execute concurrently across the entire workbench, regardless of the total number of jobs running.

    If the Maximum Concurrent Retry Limit is reached, any additional job retry runs are rescheduled until the number of active retry runs falls below the limit.

    Enable this hard limit only if job retry runs are consuming excessive resources. Otherwise, avoid setting a hard limit.

    Administrators can set this value if the Limit Concurrent Retries option is enabled.

  8. Under Default Settings for all jobs, select Enable Retry to enable a retry run for the job.

    Define the following parameters for Job Retry:

    • Maximum Retry – The maximum number of retry attempts that can be triggered for a single job run if the retry runs keep failing.

      The minimum value is 1.

    • Retry Delay (minutes) – The delay between two consecutive retry job runs for a failed instance of the run.

      The minimum value is 1 minute.

    • Retry Conditions – The terminal states of a job run that trigger a retry. A retry is triggered when a job run ends in any one of the selected terminal states.

      If Retry is enabled, select at least one of the following Retry Conditions options. You can select any combination of them:

      • Script Failure – Runs the Retry process when the user script exits with a non-zero exit code.

      • System Failure – Runs the Retry process for any system- or engine-related failure, excluding user script failures.

      • Timed-out Runs – Runs the Retry process for timed-out job runs.

      • Skipped Runs – Runs the Retry process for skipped job runs.

  9. Click Update to save the settings.
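
To make the interaction of these settings easier to reason about, the following Python sketch models them in a simplified form. It is not Cloudera AI code or its API: the names RetryPolicy, RetryCondition, should_retry, and can_start_retry are hypothetical and exist only for illustration. The sketch assumes that a retry is attempted when the terminal state of a run matches one of the selected Retry Conditions and the Maximum Retry count is not yet exhausted, and that a retry run can start only while the number of active retry runs is below the Maximum Concurrent Retry Limit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class RetryCondition(Enum):
    """Hypothetical names for the four Retry Conditions options."""
    SCRIPT_FAILURE = "script_failure"
    SYSTEM_FAILURE = "system_failure"
    TIMED_OUT = "timed_out"
    SKIPPED = "skipped"


@dataclass(frozen=True)
class RetryPolicy:
    """Illustrative model of the Job Retry parameters (not a product API)."""
    max_retry: int
    retry_delay_minutes: int
    retry_conditions: FrozenSet[RetryCondition]

    def __post_init__(self) -> None:
        # Minimum values as stated in the parameter descriptions.
        if self.max_retry < 1:
            raise ValueError("Maximum Retry must be at least 1")
        if self.retry_delay_minutes < 1:
            raise ValueError("Retry Delay must be at least 1 minute")
        # At least one Retry Condition must be selected when Retry is enabled.
        if not self.retry_conditions:
            raise ValueError("Select at least one Retry Condition")


def should_retry(policy: RetryPolicy,
                 terminal_state: Optional[RetryCondition],
                 retries_so_far: int) -> bool:
    """Return True if a run that ended in terminal_state qualifies for a retry.

    terminal_state is None for a run that ended without matching any
    retry condition (for example, a successful run).
    """
    return (terminal_state in policy.retry_conditions
            and retries_so_far < policy.max_retry)


def can_start_retry(active_retries: int, limit: Optional[int]) -> bool:
    """Return True if a retry run may start now.

    limit is None when Limit Concurrent Retries is disabled; otherwise it is
    the Maximum Concurrent Retry Limit for the whole workbench.
    """
    return limit is None or active_retries < limit


policy = RetryPolicy(
    max_retry=2,
    retry_delay_minutes=5,
    retry_conditions=frozenset({RetryCondition.SCRIPT_FAILURE,
                                RetryCondition.TIMED_OUT}),
)
```

With this example policy, a run that ends in a script failure qualifies for a retry until two retries have been attempted, while a system failure never does because that condition is not selected. When the concurrent limit is set to 3 and three retry runs are already active, can_start_retry returns False, which corresponds to the retry run being rescheduled.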