CDE example jobs and sample data

Cloudera Data Engineering provides a suite of example jobs that operate on sample data to showcase its core capabilities and make onboarding easier. The example jobs are a mix of Spark and Airflow jobs that cover scenarios such as reading from and writing to object storage, running an Airflow DAG, and extending Python capabilities with custom virtual environments. Once loaded, these jobs can be run on demand or scheduled. The sample data is loaded into the environment's default Data Lake location.

In Cloudera Data Engineering (CDE), jobs are associated with virtual clusters. Before you can create a job, you must register a CDP environment and Data Lake, and create a CDE Service and virtual cluster. For more information, see Environments, Enabling Cloudera Data Engineering service, and Creating virtual clusters.

The example jobs are described below:

  • example-load-data: loads the sample data into the environment's Data Lake.
  • example-virtual-env: demonstrates a CDE job configuration that uses the Python Environment resource type to extend PySpark with a custom virtual environment. This example adds pandas support.
  • example-resources: demonstrates a CDE job configuration that uses the file-based resource type. Resources are mounted on the Spark driver and executor pods. This example uses an input file as the data source for a word-count Spark application.
  • example-resources-schedules: demonstrates scheduling for Spark jobs in CDE. This example schedules a job to run at 5:04 AM UTC each day.
  • example-spark-pi: demonstrates how to define a CDE job. It runs SparkPi using a Scala example JAR located in an S3 bucket.
  • example-cdeoperator: demonstrates job orchestration using Airflow. This example uses a custom CDE operator to run two Spark jobs in sequence, mimicking a pipeline composed of data ingestion and data processing.
  • example-object-store: demonstrates how to read and write data in the object store on different form factors: S3, ADLS, and HDFS. This example reads data already staged in the object store, transforms it, and saves the transformed data back to the object store.
  • example-iceberg: demonstrates support for the Iceberg table format. This example reads raw data from the object store, saves it in Iceberg table format, and showcases Iceberg metadata such as snapshots.
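As an illustration of the daily 5:04 AM UTC schedule mentioned for example-resources-schedules, the following minimal Python sketch shows how such a schedule could be written as a standard five-field cron expression and how the next run time follows from it. The expression "4 5 * * *" and the helper name next_daily_run are assumptions made for this illustration; they are not taken from the example job's actual configuration, and the helper handles only simple daily schedules.

```python
from datetime import datetime, timedelta, timezone

# Assumed five-field cron expression: minute hour day-of-month month day-of-week.
# "4 5 * * *" means minute 4 of hour 5, every day (5:04 AM).
DAILY_0504_UTC = "4 5 * * *"


def next_daily_run(cron_expr: str, now: datetime) -> datetime:
    """Return the next UTC run time for a "<minute> <hour> * * *" expression.

    Hypothetical helper for illustration only; it is not a general cron parser.
    """
    minute, hour, dom, month, dow = cron_expr.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("only daily schedules are supported in this sketch")
    now_utc = now.astimezone(timezone.utc)
    candidate = now_utc.replace(
        hour=int(hour), minute=int(minute), second=0, microsecond=0
    )
    # If today's slot has already passed (or is right now), use tomorrow's.
    if candidate <= now_utc:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print(next_daily_run(DAILY_0504_UTC, datetime.now(timezone.utc)))
```

Evaluating the schedule in UTC keeps the run time independent of the time zone of whoever views or edits the job.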
To load the example jobs and sample data:

  1. In the Cloudera Data Platform (CDP) management console, click the Data Engineering tile and click Overview.
  2. In the CDE Services column, select the service containing the virtual cluster where you want to create the job.
  3. In the Virtual Clusters column on the right, click the View Jobs icon on the virtual cluster where you want to create the job.
  4. If the virtual cluster has no jobs yet, select Load Example Jobs from the two options that appear.
  5. If the virtual cluster already has jobs, click the hamburger icon on the Jobs page and select Load Example Jobs.
  6. A dialog box appears explaining the example jobs and sample data. Click Confirm to load example jobs and sample data.
The example jobs are loaded into the virtual cluster, and the sample data is loaded into the environment’s Data Lake location.