Managing Jobs and Pipelines in Cloudera Data Science Workbench

Cloudera Data Science Workbench allows you to automate analytics workloads with a built-in job and pipeline scheduling system that supports real-time monitoring, job history, and email alerts. A job automates the action of launching an engine, running a script, and tracking the results, all in one batch process. Jobs are created within the purview of a single project and can be configured to run on a recurring schedule. You can customize the engine environment for a job, set up email alerts for successful or failed job runs, and email the output of the job to yourself or a colleague.

As data science projects mature beyond ad hoc scripts, you might want to break them up into multiple steps. For example, a project may include one or more data acquisition, data cleansing, and finally, data analytics steps. For such projects, Cloudera Data Science Workbench allows you to schedule multiple jobs to run one after another in what is called a pipeline, where each job is dependent on the output of the one preceding it.

Creating a Job

Jobs are created within the scope of a project. When you create a job, you will be asked to select a script to execute as part of the job, and create a schedule for when the job should run. Optionally, you can configure a job to be dependent on another existing job, thus creating a pipeline of tasks to be accomplished in a sequence. Note that the script files and any other job dependencies must exist within the scope of the same project.
  1. Navigate to the project for which you want to create a job.
  2. On the left-hand sidebar, click Jobs.
  3. Click New Job.
  4. Enter a Name for the job.
  5. Select a script to execute for this job by clicking on the folder icon. You will be able to select a script from a list of files that are already part of the project. To upload more files to the project, see Managing Files.
  6. (Optional) In the Arguments field, specify any command-line arguments needed by the script that runs as part of your job.
  7. Depending on the code you are running, select an Engine Kernel for the job from one of the following options: Python 2, Python 3, R, or Scala.
  8. Select a Schedule for the job runs from one of the following options.
    • Manual - Select this option if you plan to run the job manually each time.
    • Recurring - Select this option if you want the job to run in a recurring pattern every X minutes, or on an hourly, daily, weekly or monthly schedule.
    • Dependent - Use this option when you are building a pipeline of jobs to run in a predefined sequence. From a dropdown list of existing jobs in this project, select the job that this one should depend on. Once you have configured a dependency, this job will run only after the preceding job in the pipeline has completed a successful run.
  9. Select an Engine Profile to specify the number of cores and memory available for each session.
  10. Enter an optional timeout value in minutes.
  11. Click Set environment variables if you want to set any values to override the overall project environment variables.
  12. Specify a list of Job Report Recipients who should receive email notifications with detailed job reports on job success, failure, or timeout. You can send these reports to yourself, your team (if the project was created under a team account), or any other external email addresses.
  13. Add any Attachments such as the console log to the job reports that will be emailed.
  14. Click Create Job.
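The Arguments field (step 6) and any job-level environment variables (step 11) are visible to the script at run time through the standard Python mechanisms. The sketch below is purely illustrative: the script name, argument, and LOG_LEVEL variable are invented for the example and are not part of Cloudera Data Science Workbench itself.

```python
# Hypothetical job script (analyze.py). The argument meaning and the
# LOG_LEVEL environment variable are assumptions for this example.
import os
import sys


def main(argv):
    # Values typed in the job's Arguments field arrive in sys.argv,
    # exactly as they would for a script run from the command line.
    input_path = argv[1] if len(argv) > 1 else "data/weblogs.csv"

    # Job-level environment variables override the project-level values
    # and are read like any other environment variable.
    log_level = os.environ.get("LOG_LEVEL", "INFO")

    print("Processing %s at log level %s" % (input_path, log_level))
    return input_path, log_level


if __name__ == "__main__":
    main(sys.argv)
```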

    Starting with version 1.1.x, you can use the Jobs API to schedule jobs from third-party workflow tools. For details, see Cloudera Data Science Workbench Jobs API.
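A job created this way can then be triggered programmatically. The sketch below shows one way a workflow tool might call the Jobs API from Python; the endpoint path and the API-key-as-basic-auth scheme are assumptions based on typical deployments, so confirm both against the Jobs API documentation for your version before relying on them.

```python
# Hedged sketch: builds a "start job" request against an assumed endpoint
# of the form /api/v1/projects/<user>/<project>/jobs/<id>/start.
import base64
import urllib.request


def job_start_request(host, username, project, job_id, api_key):
    # Assumed endpoint layout; verify against the Jobs API documentation.
    url = "%s/api/v1/projects/%s/%s/jobs/%s/start" % (
        host, username, project, job_id)
    req = urllib.request.Request(url, data=b"{}", method="POST")
    # Assumed auth scheme: the API key as the basic-auth username with an
    # empty password.
    token = base64.b64encode(("%s:" % api_key).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    req.add_header("Content-Type", "application/json")
    return req


if __name__ == "__main__":
    req = job_start_request("http://cdsw.example.com", "alice",
                            "weblogs", "42", "MY_API_KEY")
    # urllib.request.urlopen(req) would submit the request.
    print(req.full_url)
```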

Creating a Pipeline

The Jobs overview presents a list of all existing jobs created for a project along with a dependency graph to display any pipelines you've created. Job dependencies do not need to be configured at the time of job creation. Pipelines can be created after the fact by modifying the jobs to establish dependencies between them. From the job overview, you can modify the settings of a job, access the history of all job runs, and view the session output for individual job runs.

Let's take an example of a project that has two jobs, Read Weblogs and Write Weblogs. Because the data must be read before any analysis can be run and its results written, the Write Weblogs job should be triggered only after the Read Weblogs job completes a successful run. To create such a two-step pipeline:
  1. Navigate to the project where the Read Weblogs and Write Weblogs jobs were created.
  2. Click Jobs.
  3. From the list of jobs, select Write Weblogs.
  4. Click the Settings tab.
  5. Click on the Schedule dropdown and select Dependent. Select Read Weblogs from the dropdown list of existing jobs in the project.
  6. Click Update Job.

Viewing Job History

  1. Navigate to the project where the job was created.
  2. Click Jobs.
  3. Select the relevant job.
  4. Click the History tab. You will see a list of all the job runs with some basic information such as who created the job, run duration, and status. Click individual runs to see the session output for each run.