Creating jobs in Cloudera Data Engineering

A job in Cloudera Data Engineering (CDE) consists of defined configurations and resources (including application code). Jobs can be run on demand or scheduled.

In CDE, jobs are associated with virtual clusters. Before you can create a job, you must create a virtual cluster that can run it. For more information, see Creating virtual clusters.

The following steps are required to allow users to submit jobs. Perform these steps for each user who needs to submit jobs to the virtual cluster.

  1. If you already downloaded the utility script and uploaded it to an ECS or HDFS gateway cluster host as documented in Creating virtual clusters, you can skip to step 8.
  2. Download cdp-cde-utils.sh.gz to your local machine.
  3. Create a directory to store the files, and change to that directory:
    mkdir -p /tmp/cde-1.3.1 && cd /tmp/cde-1.3.1
  4. Extract the file:
    gunzip /path/to/cdp-cde-utils.sh.gz
  5. Copy the utility script to a cluster host, following the instructions for your platform:
    Embedded Container Service (ECS)
    Copy the extracted utility script (cdp-cde-utils.sh) to one of the Embedded Container Service (ECS) cluster hosts. To identify the ECS cluster hosts:
    1. Log in to the Cloudera Manager web interface.
    2. Go to Clusters > Experience Cluster > ECS > Hosts.
    3. Select one of the listed hosts, and copy the script to that host.
    Red Hat OpenShift Container Platform (OCP)
    Copy the extracted utility script (cdp-cde-utils.sh) and the OpenShift kubeconfig file to one of the HDFS service gateway hosts, and install the kubectl utility:
    1. Log in to the Cloudera Manager web interface.
    2. Go to Clusters > Base Cluster > HDFS > Instances.
    3. Select one of the Gateway hosts, and copy the script to that host.
    4. Copy the OCP kubeconfig file to the same host.
    5. On that host, install the kubectl utility following the instructions in the Kubernetes documentation.
  6. On the cluster host that you copied the script to, set the script permissions to be executable:
    chmod +x /path/to/cdp-cde-utils.sh
  7. Identify the virtual cluster endpoint:
    1. In the Cloudera Manager web UI, go to the Experiences page, and then click Open CDP Private Cloud Experiences.
    2. Click the Data Engineering tile.
    3. Select the CDE service containing the virtual cluster that users will submit jobs to.
    4. Click Cluster Details.
    5. Click JOBS API URL to copy the URL to your clipboard.
    6. Paste the URL into a text editor to identify the endpoint host. For example, the URL is similar to the following:
      https://dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com/dex/api/v1

      The endpoint host is dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com.
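
If you script this step, the endpoint host can be extracted from the JOBS API URL with shell parameter expansion. The following is a minimal sketch; the variable names are illustrative, and the URL is the example value shown above.

```shell
# JOBS API URL copied from the Cluster Details page (example value).
jobs_api_url="https://dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com/dex/api/v1"

# Remove the scheme (everything up to and including "://").
endpoint_host="${jobs_api_url#*://}"

# Remove the path (everything from the first "/" onward).
endpoint_host="${endpoint_host%%/*}"

echo "$endpoint_host"
```

The resulting value is what you pass to the -h option of cdp-cde-utils.sh in a later step.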

  8. On the ECS or HDFS gateway host, create a file containing the user principal, and generate a keytab. If you do not have the ktutil utility, you might need to install the krb5-workstation package. The following example commands assume the user principal is psherman@EXAMPLE.COM.
    1. Create a file named <username>.principal (for example, psherman.principal) containing the user principal:
      psherman@EXAMPLE.COM
    2. Generate a keytab named <username>.keytab for the user using ktutil:
      sudo ktutil
      ktutil:  addent -password -p psherman@EXAMPLE.COM -k 1 -e aes256-cts
      Password for psherman@EXAMPLE.COM: 
      ktutil:  addent -password -p psherman@EXAMPLE.COM -k 1 -e aes128-cts
      Password for psherman@EXAMPLE.COM: 
      ktutil:  addent -password -p psherman@EXAMPLE.COM -k 1 -e rc4-hmac
      Password for psherman@EXAMPLE.COM: 
      ktutil:  wkt psherman.keytab
      ktutil:  q
  9. Validate the keytab using klist and kinit:
    klist -ekt psherman.keytab 
    Keytab name: FILE:psherman.keytab
    KVNO Timestamp           Principal
    ---- ------------------- ------------------------------------------------------
       1 08/01/2021 10:29:47 psherman@EXAMPLE.COM (aes256-cts-hmac-sha1-96) 
       1 08/01/2021 10:29:47 psherman@EXAMPLE.COM (aes128-cts-hmac-sha1-96) 
       1 08/01/2021 10:29:47 psherman@EXAMPLE.COM (arcfour-hmac) 
    
    kinit -kt psherman.keytab psherman@EXAMPLE.COM

    Make sure that the keytab is valid before continuing. If the kinit command fails, the user will not be able to run jobs in the virtual cluster. After verifying that the kinit command succeeds, you can destroy the Kerberos ticket by running kdestroy.
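
When you prepare keytabs for several users, the kinit check can be wrapped in a small function so that a failed validation stops the process early. This is only a sketch: validate_keytab is a hypothetical helper name, and it assumes that kinit and kdestroy are on the PATH. Note that kdestroy removes the tickets in your default credential cache.

```shell
# Hypothetical helper: succeeds only if kinit can obtain a ticket
# for the given principal using the given keytab.
validate_keytab() {
  keytab="$1"
  principal="$2"
  if kinit -kt "$keytab" "$principal"; then
    # Discard the ticket obtained for the check.
    kdestroy
    return 0
  fi
  echo "kinit failed for $principal using $keytab" >&2
  return 1
}
```

For example, run validate_keytab psherman.keytab psherman@EXAMPLE.COM and continue only if it returns success.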

  10. Use the cdp-cde-utils.sh script to copy the user keytab to the virtual cluster hosts:
    ./cdp-cde-utils.sh init-user-in-virtual-cluster -h <endpoint_host> -u <user> -p <principal_file> -k <keytab_file>
    For example, using the psherman user, for the dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com endpoint host:
    ./cdp-cde-utils.sh init-user-in-virtual-cluster -h dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com -u psherman -p psherman.principal -k psherman.keytab
  11. Repeat these steps for all users that need to submit jobs to the virtual cluster.
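
For more than a few users, it can help to generate the init-user-in-virtual-cluster commands in a loop and review them before running anything. The sketch below only prints the commands. print_init_commands is a hypothetical helper, nemo is an illustrative second user, and the sketch assumes that each user has a <username>.principal and <username>.keytab file in the current directory, as in the steps above.

```shell
# Hypothetical helper: print (do not run) one init command per user.
# Usage: print_init_commands <endpoint_host> <user> [<user> ...]
print_init_commands() {
  endpoint_host="$1"
  shift
  for user in "$@"; do
    printf './cdp-cde-utils.sh init-user-in-virtual-cluster -h %s -u %s -p %s.principal -k %s.keytab\n' \
      "$endpoint_host" "$user" "$user" "$user"
  done
}

# "nemo" is an illustrative second user, not taken from the steps above.
print_init_commands dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com psherman nemo
```

Printing the commands first lets you confirm the endpoint host and file names before any keytab is copied to the virtual cluster hosts.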

After the users are set up, create the job:

  1. Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (CDP) management console.
  2. In the Environments column, select the environment containing the virtual cluster where you want to create the job.
  3. In the Virtual Clusters column on the right, click the View Jobs icon on the virtual cluster where you want to create the job.
  4. In the left-hand menu, click Jobs.
  5. Click the Create Job button.
  6. Provide the Job Details:
    1. Select Spark for the job type.
    2. Specify the Name.
    3. Select File, Resource, or URL for your application file. You can upload or specify a URL to a Python or JAR file, or select from available resources. For JAR files, specify the Main Class.
    4. Specify arguments if required. You can click the Add Argument button to add multiple command arguments as necessary.
    5. Enter Configurations if needed. You can click the Add Configuration button to add multiple configuration parameters as necessary.
    6. Toggle Advanced Options to display additional customizations, such as driver and executor cores and memory.
    7. Toggle Schedule to define a schedule.
      You can schedule the application to run periodically using the provided Basic options, or you can specify a Cron expression.
  7. If you provided a schedule, click Schedule to create the job and its schedule. If you did not specify a schedule, and you do not want the job to run immediately, click the drop-down arrow on Create and Run and select Create. Otherwise, click Create and Run to run the job immediately.
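
If you choose a Cron expression for the schedule, it can be useful to check its fields before pasting it into the form. The sketch below assumes the standard five-field cron syntax (minute, hour, day of month, month, day of week); confirm the accepted syntax in the job form for your CDE version. describe_cron is a hypothetical helper name.

```shell
# Hypothetical helper: split a five-field cron expression and label each field.
describe_cron() {
  set -f          # disable globbing so "*" is kept literally
  set -- $1       # split the expression into positional parameters
  set +f
  if [ "$#" -ne 5 ]; then
    echo "expected 5 fields, got $#" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Every day at 02:00.
describe_cron "0 2 * * *"
```

An expression with too few or too many fields makes the function return a non-zero status, which is a quick way to catch a malformed schedule.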