Migrating Spark Data to Cloudera Private Cloud

Using the Cloudera Data Engineering CLI

If at any time you run into issues with the CDE CLI, you can view the available options by adding the --help flag to any CLI command:

cde spark --help
cde spark submit --help
cde airflow --help
cde resource --help
If you are new to the CDE CLI, a common approach is to start with the following steps:
  1. Experimenting with CDE Spark Submit CLI
  2. Creating a CDE Resource
  3. Uploading all files to the CDE Resource
  4. Creating CDE jobs with files uploaded to the Resource
  5. Running CDE jobs
Running cde spark submit is the fastest way to launch a Spark workload in CDE. Note, however, that the job is not instantiated as a named Spark CDE job and therefore cannot be rescheduled from the CDE UI.
cde spark submit pysparkjob.py
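The contents of pysparkjob.py are not shown in this guide; any self-contained PySpark application can be submitted this way. A minimal sketch to experiment with (the application name and workload are illustrative assumptions, not part of this guide):
# pysparkjob.py - a minimal, illustrative PySpark application
from pyspark.sql import SparkSession

# On CDE, cluster settings come from the virtual cluster, so no master URL is needed.
spark = SparkSession.builder.appName("cde-cli-example").getOrCreate()

# Trivial workload: build a small DataFrame and count its rows.
df = spark.range(1000)
print(f"Row count: {df.count()}")

spark.stop()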
Cloudera recommends that you create one CDE Resource for every Spark pipeline or Airflow DAG.
cde resource create --name cde_cli_resource
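Before uploading files, you can confirm that the resource exists with the list and describe subcommands:
cde resource list
cde resource describe --name cde_cli_resource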

When uploading to a resource, the two important inputs are the name of the target CDE Resource and the local path to the files being uploaded.

Cloudera recommends using the --help flag to explore more options, such as uploading files in bulk.
cde resource upload --name cde_cli_resource --local-path "pysparkjob.py" --resource-path "pysparkjob.py"
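For bulk uploads, recent CLI versions include an upload-archive subcommand that extracts an archive's contents into the resource. A sketch, assuming that subcommand is available in your CLI version (the archive name is illustrative):
cde resource upload-archive --name cde_cli_resource --local-path "dependencies.zip"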

Once the files and dependencies have been uploaded, you can instantiate a CDE job with the job create command.

For example, you can create a CDE job with the CDE Resource file and run it on a schedule. In the command below, the cron expression 0 */1 * * * triggers the job at the start of every hour between the schedule start and end dates.
cde job create --name "cde_cli_job" --type "spark" \
               --application-file "pysparkjob.py" \
               --cron-expression "0 */1 * * *" \
               --schedule-enabled "true" \
               --schedule-start "2022-04-29" \
               --schedule-end "2022-05-02" \
               --mount-1-resource "cde_cli_resource"
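After creating the job, you can verify its configuration and schedule with the describe subcommand:
cde job describe --name "cde_cli_job"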
Defining a schedule is optional. Once the job has been created, you can trigger it manually at any time:
cde job run --name "cde_cli_job"

You have now completed a basic workflow to start experimenting with the CDE CLI. Below are some more useful examples:

More CDE CLI examples

  • Search for CDE jobs based on attributes
You can filter jobs by their attributes; in this example, the job name is matched against a pattern. More filter examples appear at the end of this list.
    cde job list --filter 'name[like]%name_pattern%'
  • List all CDE job runs
    cde run list
  • Describe CDE job run
Replace the integer with your job run ID. For example, 47 is the ID used in the command below.
    cde run describe --id 47
  • Create a CDE job with Custom Spark Log Level

A major advantage of using CDE is Spark observability: the logging level can be easily customized, and every log remains available to the CDE user.

Using the --log-level parameter, you can choose any of the following options: TRACE, DEBUG, INFO, WARN, ERROR, FATAL, or OFF.
    cde job create --name "cde_cli_job_custom_log_level" --type "spark" \
                   --application-file "pysparkjob.py" \
                   --log-level "DEBUG" \
                   --schedule-enabled "false" \
                   --mount-1-resource "cde_cli_resource"
  • Collect CDE Job Run Logs
You can download the Spark logs you have access to in CDE. Other log types, such as executor logs, are also available; an example of saving logs to a local file appears at the end of this list.
    cde run logs --type "driver/stdout" --id 47

You can set the log type to any of the tabs available on the corresponding CDE Job Run page. For example:

    • driver/stderr or driver/stdout
    • executor id/stdout
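  • Filter CDE jobs and runs by other attributes
    The --filter flag takes a field name, an operator, and a value. A sketch, assuming the eq operator and these field names are supported in your CLI version (confirm with --help):
    cde job list --filter 'type[eq]spark'
    cde run list --filter 'job[eq]cde_cli_job'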
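  • Save CDE job run logs to a file
    Because the CLI writes logs to standard output, ordinary shell redirection stores them locally (the file name is illustrative):
    cde run logs --type "driver/stderr" --id 47 > driver_stderr.log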
