CDE example jobs and sample data

Cloudera Data Engineering provides a suite of example jobs that operate on sample data to showcase its core capabilities and make onboarding easier. The example jobs are a combination of Spark and Airflow jobs covering scenarios such as reading from and writing to object storage, running an Airflow DAG, and extending Python capabilities with custom virtual environments. Once loaded, these jobs can be run on demand or scheduled. The sample data is loaded into the environment's default Data Lake location.

In Cloudera Data Engineering (CDE), jobs are associated with virtual clusters. Before you can create a job, you must register a CDP environment and Data Lake, and create a CDE service and virtual cluster. For more information, see Environments, Enabling Cloudera Data Engineering service, and Creating virtual clusters.

You must run the example jobs as a user who is not the Local Administrator; that is, the user must have been granted DEUser or DEAdmin privileges in the environment associated with your DE workspace. Also ensure that you have enough resources to run these example jobs. The following table describes the example jobs:
Table 1. Example Jobs

example-load-data
    Loads the sample data into the environment's Data Lake. This job runs only once and is then deleted.

example-virtual-env
    Demonstrates a CDE job configuration that uses the Python environment resource type to extend PySpark features with a custom virtual environment. This example adds pandas support; a pandas sketch follows this table.

example-resources
    Demonstrates a CDE job configuration that uses the file-based resource type. Resources are mounted on the Spark driver and executor pods. This example uses an input file as the data source for a word-count Spark application; the driver stderr log contains the word counts. A word-count sketch follows this table.

example-resources-schedules
    Demonstrates scheduling functionality for a Spark job in CDE. This example schedules a job to run at 05:04 UTC each day.

example-spark-pi
    Demonstrates how to define a CDE job. It runs SparkPi using a Scala example JAR located in an S3 bucket. The driver stderr log contains the computed value of pi.

example-cdeoperator
    Demonstrates job orchestration using Airflow. This example uses a custom CDE operator to run two Spark jobs in sequence, mimicking a pipeline composed of data ingestion and data processing.

example-object-store
    Demonstrates how to access and write data to object storage on different form factors: S3, ADLS, and HDFS. This example reads data already staged in the object store, transforms it, and writes the transformed data back to the object store. The output of the query run on the object store table can be viewed in the driver stderr log. A read-transform-write sketch follows this table.

example-iceberg
    Demonstrates support for the Iceberg table format. This example reads raw data from the object store, saves it in Iceberg table format, and showcases Iceberg metadata, such as snapshots. An Iceberg sketch follows this table.
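The kind of pandas-backed PySpark code that a custom virtual environment enables can be sketched as follows. This is a minimal illustration, not the shipped example-virtual-env job; it assumes pandas and pyarrow are installed in the Python environment resource attached to the job.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.appName("virtual-env-sketch").getOrCreate()

    # Vectorized UDF: each batch of rows arrives as a pandas Series.
    # Requires pandas and pyarrow from the custom virtual environment.
    @pandas_udf("double")
    def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
        return (temp_f - 32) * 5.0 / 9.0

    df = spark.createDataFrame([(32.0,), (98.6,), (212.0,)], ["temp_f"])
    df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()

    spark.stop()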
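The word-count application that example-resources runs can be sketched roughly as below. The input path is hypothetical; by default CDE mounts file resources under /app/mount on the driver and executor pods, but verify the mount path in your resource configuration.

    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

    # Read the mounted input file; /app/mount/input.txt is a placeholder path.
    lines = spark.read.text("/app/mount/input.txt").rdd.map(lambda row: row[0])

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    # Printing on the driver is what makes the counts appear in the driver log.
    for word, count in counts.collect():
        print(word, count)

    spark.stop()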
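The read-transform-write pattern that example-object-store demonstrates looks roughly like this sketch. The bucket, paths, and the amount column are placeholders, not the job's actual data; substitute the scheme that matches your storage (s3a://, abfs://, or hdfs://).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("object-store-sketch").getOrCreate()

    # Read data already staged in the object store (placeholder location).
    df = spark.read.csv("s3a://your-bucket/staging/input.csv",
                        header=True, inferSchema=True)

    # Apply a transformation; the "amount" column is hypothetical.
    transformed = df.filter(col("amount") > 0)

    # Write the transformed data back to the object store.
    transformed.write.mode("overwrite").parquet("s3a://your-bucket/curated/output")

    spark.stop()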
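Writing an Iceberg table and inspecting its snapshot metadata, as example-iceberg does, can be sketched as follows. This assumes the Spark session is configured with an Iceberg-enabled catalog; the table and path names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

    # Read raw data from the object store (placeholder location).
    raw = spark.read.parquet("s3a://your-bucket/raw/events")

    # Save it in Iceberg table format (placeholder table name).
    raw.writeTo("default.events_iceberg").using("iceberg").createOrReplace()

    # Iceberg exposes metadata tables; the snapshots table lists each commit.
    spark.sql("SELECT snapshot_id, committed_at, operation "
              "FROM default.events_iceberg.snapshots").show(truncate=False)

    spark.stop()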
  1. In the Cloudera Data Platform (CDP) management console, click the Data Engineering tile and click Overview.
  2. In the CDE Services column, select the service containing the virtual cluster where you want to create the job.
  3. In the Virtual Clusters column on the right, click the View Jobs icon on the virtual cluster where you want to create the job.
  4. Select Load Example Jobs from the two options that appear.
  5. If you have existing jobs in the virtual cluster, this option does not appear; instead, click Load Example Jobs on the Jobs page.
  6. A dialog box appears explaining the example jobs and sample data. Click Confirm to load example jobs and sample data.
The example jobs are loaded into the virtual cluster, and the sample data is loaded into the environment's Data Lake location.