CDE example jobs and sample data

Cloudera Data Engineering provides a suite of example jobs that operate on sample data to showcase its core capabilities and make onboarding easier. The example jobs are a combination of Spark and Airflow jobs covering scenarios such as reading from and writing to object storage, running an Airflow DAG, and extending Python capabilities with custom virtual environments. Once loaded, these jobs can be run on demand or scheduled. The sample data is loaded into the environment's default Data Lake location.

In Cloudera Data Engineering (CDE), jobs are associated with virtual clusters. Before you can create a job, you must register a CDP environment and Data Lake, and create a CDE service and virtual cluster. For more information, see Environments, Enabling Cloudera Data Engineering service, and Creating virtual clusters.

You must run the example jobs as a user who is not the Local Administrator; that is, the user must have been granted DEUser or DEAdmin privileges in the environment associated with your DE workspace. Also ensure that you have enough resources to run these example jobs. The following table describes the example jobs:
Table 1. Example Jobs

example-load-data
    Loads the sample data into the environment's Data Lake. This job runs only once and is then deleted.

example-virtual-env
    Demonstrates a CDE job configuration that uses the Python environment resource type to extend PySpark features with a custom virtual environment. This example adds pandas support; a pandas sketch follows this table.

example-resources
    Demonstrates a CDE job configuration that uses the file-based resource type. Resources are mounted on the Spark driver and executor pods. This example uses an input file as the data source for a word-count Spark application; the driver stderr log contains the word counts. A word-count sketch follows this table.

example-resources-schedules
    Demonstrates scheduling functionality for a Spark job in CDE. This example schedules a job to run at 05:04 UTC each day.

example-spark-pi
    Demonstrates how to define a CDE job. It runs SparkPi using a Scala example JAR located in an S3 bucket. The driver stderr log contains the computed value of pi.

example-cdeoperator
    Demonstrates job orchestration using Airflow. This example uses a custom CDE operator to run two Spark jobs in sequence, mimicking a pipeline composed of data ingestion and data processing.

example-object-store
    Demonstrates how to access and write data to object storage on different form factors: S3, ADLS, and HDFS. This example reads data already staged in the object store, transforms it, and writes the transformed data back to the object store. The output of the query run on the object store table can be viewed in the driver stderr log. A read-transform-write sketch follows this table.

example-iceberg
    Demonstrates support for the Iceberg table format. This example reads raw data from the object store, saves it in Iceberg table format, and showcases Iceberg metadata, such as snapshots. An Iceberg sketch follows this table.
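The kind of pandas-backed PySpark code that a custom virtual environment enables can be sketched as follows. This is a minimal illustration, not the shipped example-virtual-env job; it assumes pandas and pyarrow are installed in the Python environment resource attached to the job.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.appName("virtual-env-sketch").getOrCreate()

    # Vectorized UDF: each batch of rows arrives as a pandas Series.
    # Requires pandas and pyarrow from the custom virtual environment.
    @pandas_udf("double")
    def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
        return (temp_f - 32) * 5.0 / 9.0

    df = spark.createDataFrame([(32.0,), (98.6,), (212.0,)], ["temp_f"])
    df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()

    spark.stop()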
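The word-count application that example-resources runs can be sketched roughly as below. The input path is hypothetical; by default CDE mounts file resources under /app/mount on the driver and executor pods, but verify the mount path in your resource configuration.

    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

    # Read the mounted input file; /app/mount/input.txt is a placeholder path.
    lines = spark.read.text("/app/mount/input.txt").rdd.map(lambda row: row[0])

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    # Printing on the driver is what makes the counts appear in the driver log.
    for word, count in counts.collect():
        print(word, count)

    spark.stop()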
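The read-transform-write pattern that example-object-store demonstrates looks roughly like this sketch. The bucket, paths, and the amount column are placeholders, not the job's actual data; substitute the scheme that matches your storage (s3a://, abfs://, or hdfs://).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("object-store-sketch").getOrCreate()

    # Read data already staged in the object store (placeholder location).
    df = spark.read.csv("s3a://your-bucket/staging/input.csv",
                        header=True, inferSchema=True)

    # Apply a transformation; the "amount" column is hypothetical.
    transformed = df.filter(col("amount") > 0)

    # Write the transformed data back to the object store.
    transformed.write.mode("overwrite").parquet("s3a://your-bucket/curated/output")

    spark.stop()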
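Writing an Iceberg table and inspecting its snapshot metadata, as example-iceberg does, can be sketched as follows. This assumes the Spark session is configured with an Iceberg-enabled catalog; the table and path names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

    # Read raw data from the object store (placeholder location).
    raw = spark.read.parquet("s3a://your-bucket/raw/events")

    # Save it in Iceberg table format (placeholder table name).
    raw.writeTo("default.events_iceberg").using("iceberg").createOrReplace()

    # Iceberg exposes metadata tables; the snapshots table lists each commit.
    spark.sql("SELECT snapshot_id, committed_at, operation "
              "FROM default.events_iceberg.snapshots").show(truncate=False)

    spark.stop()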
  1. In the Cloudera Data Platform (CDP) management console, click the Data Engineering tile and click Overview.
  2. In the CDE Services column, select the service containing the virtual cluster where you want to create the job.
  3. In the Virtual Clusters column on the right, click the View Jobs icon on the virtual cluster where you want to create the job.
  4. Select Load Example Jobs from the two options that appear.
  5. If you have existing jobs in the virtual cluster, this option does not appear; instead, click Load Example Jobs on the Jobs page.
  6. A dialog box appears explaining the example jobs and sample data. Click Confirm to load example jobs and sample data.
The example jobs are loaded into the virtual cluster, and the sample data is loaded into the environment's Data Lake location.