Creating jobs in Cloudera Data Engineering
A job in Cloudera Data Engineering (CDE) consists of defined configurations and resources (including application code). Jobs can be run on demand or scheduled.
In Cloudera Data Engineering (CDE), jobs are associated with virtual clusters. Before you can create a job, you must create a virtual cluster that can run it. For more information, see Creating virtual clusters.
The following steps are required to allow users to submit jobs. Perform these steps for each user that needs to submit jobs to the virtual cluster.
- If you already downloaded the utility script and uploaded it to an ECS or HDFS gateway cluster host as documented in Creating virtual clusters, you can skip to step 8.
- Download the utility script `cdp-cde-utils.sh` to your local machine.
- Create a directory to store the files, and change to that directory:

```
mkdir -p /tmp/cde-1.3.4 && cd /tmp/cde-1.3.4
```
- Embedded Container Service (ECS)
- Copy the extracted utility script (`cdp-cde-utils.sh`) to one of the Embedded Container Service (ECS) cluster hosts. To identify the ECS cluster hosts:
- Log in to the Cloudera Manager web interface.
- Go to the ECS cluster, and view its Hosts page.
- Select one of the listed hosts, and copy the script to that host.
- Red Hat OpenShift Container Platform (OCP)
- Copy the extracted utility script (`cdp-cde-utils.sh`) and the OpenShift `kubeconfig` file to one of the HDFS service gateway hosts, and install the `kubectl` utility.
- On the cluster host that you copied the script to, set the script permissions to be executable:

```
chmod +x /path/to/cdp-cde-utils.sh
```
- Identify the virtual cluster endpoint:
- In the Cloudera Manager web UI, go to the Experiences page, and then click Open CDP Private Cloud Experiences.
- Click the Data Engineering tile.
- Select the CDE service containing the virtual cluster you want to activate.
- Click Cluster Details.
- Click JOBS API URL to copy the URL to your clipboard.
- Paste the URL into a text editor to identify the endpoint host.
For example, the URL is similar to `https://dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com/dex/api/v1`. The endpoint host is `dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com`.
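If you prefer to do this from the command line, the endpoint host can be extracted with standard shell tools. A minimal sketch, assuming the JOBS API URL has been copied into a shell variable:

```
# Example URL from above; substitute the JOBS API URL you copied.
JOBS_API_URL="https://dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com/dex/api/v1"

# Strip the scheme, then everything after the first slash, leaving the host.
ENDPOINT_HOST=$(echo "$JOBS_API_URL" | sed -e 's|^[a-z]*://||' -e 's|/.*||')
echo "$ENDPOINT_HOST"   # dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com
```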
- On the ECS or HDFS gateway host, create a file containing the user principal, and generate a keytab. If you do not have the `ktutil` utility, you might need to install the `krb5-workstation` package. The following example commands assume the user principal is `psherman@EXAMPLE.COM`.
- Create a file named `<username>.principal` (for example, `psherman.principal`) containing the user principal:

```
psherman@EXAMPLE.COM
```
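One way to create this file from the shell, using the `psherman` example principal:

```
# Write the user principal into psherman.principal.
echo "psherman@EXAMPLE.COM" > psherman.principal
```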
- Generate a keytab named `<username>.keytab` for the user using `ktutil`:

```
sudo ktutil
ktutil:  addent -password -p psherman@EXAMPLE.COM -k 1 -e aes256-cts
Password for psherman@EXAMPLE.COM:
ktutil:  addent -password -p psherman@EXAMPLE.COM -k 2 -e aes128-cts
Password for psherman@EXAMPLE.COM:
ktutil:  addent -password -p psherman@EXAMPLE.COM -k 3 -e rc4-hmac
Password for psherman@EXAMPLE.COM:
ktutil:  wkt psherman.keytab
ktutil:  q
```
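If you need to script this step, MIT Kerberos `ktutil` also reads its commands from standard input, so the same keytab can be generated non-interactively. This is a sketch rather than part of the documented procedure; it assumes MIT Kerberos and that the user's password is available in a `KRB_PASSWORD` environment variable:

```
# Pipe the same addent/wkt/q commands into ktutil; each
# "addent -password" line is followed by the password it prompts for.
printf '%s\n' \
  "addent -password -p psherman@EXAMPLE.COM -k 1 -e aes256-cts" "$KRB_PASSWORD" \
  "addent -password -p psherman@EXAMPLE.COM -k 2 -e aes128-cts" "$KRB_PASSWORD" \
  "addent -password -p psherman@EXAMPLE.COM -k 3 -e rc4-hmac" "$KRB_PASSWORD" \
  "wkt psherman.keytab" \
  "q" | ktutil
```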
- Validate the keytab using `klist` and `kinit`:

```
klist -ekt psherman.keytab
Keytab name: FILE:psherman.keytab
KVNO Timestamp           Principal
---- ------------------- ------------------------------------------------------
   1 08/01/2021 10:29:47 psherman@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
   1 08/01/2021 10:29:47 psherman@EXAMPLE.COM (aes128-cts-hmac-sha1-96)
   1 08/01/2021 10:29:47 psherman@EXAMPLE.COM (arcfour-hmac)

kinit -kt psherman.keytab psherman@EXAMPLE.COM
```
Make sure that the keytab is valid before continuing. If the `kinit` command fails, the user will not be able to run jobs in the virtual cluster. After verifying that the `kinit` command succeeds, you can destroy the Kerberos ticket by running `kdestroy`.
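The whole validation can be condensed into a single shell check, again assuming the `psherman` example principal:

```
# Exits non-zero (and prints nothing) if the keytab cannot authenticate.
kinit -kt psherman.keytab psherman@EXAMPLE.COM \
  && echo "keytab OK" \
  && kdestroy
```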
- Use the `cdp-cde-utils.sh` script to copy the user keytab to the virtual cluster hosts:

```
./cdp-cde-utils.sh init-user-in-virtual-cluster -h <endpoint_host> -u <user> -p <principal_file> -k <keytab_file>
```

For example, using the `psherman` user, for the `dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com` endpoint host:

```
./cdp-cde-utils.sh init-user-in-virtual-cluster -h dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com -u psherman -p psherman.principal -k psherman.keytab
```
- Repeat these steps for all users that need to submit jobs to the virtual cluster. A scripted approach is sketched below.
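When several users need access, a shell loop over the same command keeps this manageable. A sketch, assuming each user already has a `<username>.principal` and `<username>.keytab` file in the current directory (the usernames here are hypothetical):

```
# Substitute your own endpoint host and user list.
ENDPOINT_HOST="dfdj6kgx.cde-2cdxw5x5.ecs-demo.example.com"

for user in psherman alice bob; do
  ./cdp-cde-utils.sh init-user-in-virtual-cluster \
    -h "$ENDPOINT_HOST" \
    -u "$user" \
    -p "${user}.principal" \
    -k "${user}.keytab"
done
```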
- Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (CDP) management console.
- In the Environments column, select the environment containing the virtual cluster where you want to create the job.
- In the Virtual Clusters column on the right, click the View Jobs icon on the virtual cluster where you want to create the application.
- In the left-hand menu, click Jobs.
- Click the Create Job button.
- Provide the Job Details:
- Select Spark for the job type.
- Specify the Name.
- Select File or URL for your application file, and provide or specify the file. You can upload a new file or select a file from an existing resource. If you select URL and specify an Amazon S3 URL, you must also add an additional configuration to the job.
- If your application code is a JAR file, specify the Main Class.
- Specify arguments if required. You can click the Add Argument button to add multiple command arguments as necessary.
- Enter Configurations if needed. You can click the Add Configuration button to add multiple configuration parameters as necessary.
- If your application code is a Python file, select the Python Version, and optionally select a Python Environment.
- Click Advanced Configurations to display more customizations, such as additional files, initial executors, executor range, and driver and executor cores and memory. By default, the executor range is set to match the range of CPU cores configured for the virtual cluster. This improves resource utilization and efficiency by allowing jobs to scale up to the maximum virtual cluster resources available, without manually tuning and optimizing the number of executors per job.
- Click Schedule to display scheduling options. You can schedule the application to run periodically using the Basic controls or by specifying a Cron Expression.
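Cron expressions typically use the standard five fields (minute, hour, day of month, month, day of week). For example, the following expression runs the job at 2:00 AM every day:

```
# minute hour day-of-month month day-of-week
0 2 * * *
```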
- If you provided a schedule, click Schedule to create the job. If you did not specify a schedule, and you do not want the job to run immediately, click the drop-down arrow on Create and Run and select Create. Otherwise, click Create and Run to run the job immediately.
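If you prefer to work outside the UI, the CDE command line interface can create the same kind of job against the virtual cluster's JOBS API URL. A sketch, assuming the CDE CLI is installed and configured, with hypothetical resource, file, and job names:

```
# Create a resource and upload the application file to it.
cde resource create --name my-job-files
cde resource upload --name my-job-files --local-path my-app.jar

# Create a Spark job that mounts the resource, then run it.
cde job create --name my-job --type spark \
  --application-file my-app.jar \
  --mount-1-resource my-job-files
cde job run --name my-job
```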