Transforming and organizing data using Cloudera Data Engineering

You must define the Spark job configurations, resources, and schedule in the Cloudera Data Engineering (CDE) cluster.

In CDE, you perform the organization and transformation piece of the Business Intelligence at Scale pattern. A Spark JAR file (bi-workflow_2.11-0.1.jar) is provided as part of your pattern artifacts download package; you use it to create a Spark job. The Spark job converts the weather data that you put into the AWS S3 bucket in Avro format to Parquet, and makes the Parquet data available to Cloudera Data Warehouse.

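For orientation only, the following Scala sketch shows one minimal way such a conversion could be written: parse the job arguments, read the Avro input from the bucket, and rewrite it as Parquet under the weather and weather_forecast sub-folders. The object name, argument parsing, and input layout (the incoming/ prefix) are assumptions for illustration and are not taken from bi-workflow_2.11-0.1.jar.

    // Minimal Avro-to-Parquet conversion sketch (illustrative only; the packaged
    // bi-workflow_2.11-0.1.jar may be structured differently). Requires the
    // spark-avro module on the classpath when running on Spark 2.4.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object BiWorkflowSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical parsing of the "--flag value" pairs passed as job arguments.
        val opts = args.sliding(2, 2).collect { case Array(k, v) => k.stripPrefix("--") -> v }.toMap
        val dataRoot = opts("weatherDataDirRoot")

        val spark = SparkSession.builder().appName("bi-workflow").getOrCreate()

        // Rewrite each Avro dataset as Parquet under its own sub-folder
        // (weather and weather_forecast); the sub-folders are created on first write.
        Seq(opts("weatherTable"), opts("weatherForecastTable")).foreach { table =>
          spark.read
            .format("avro")                     // assumes spark-avro is available
            .load(s"$dataRoot/incoming/$table") // hypothetical input layout
            .write
            .mode(SaveMode.Overwrite)
            .parquet(s"$dataRoot/$table")
        }

        spark.stop()
      }
    }
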
  1. Log in to your AWS console and create a tmp sub-folder in your S3 bucket within the path that you are using for this pattern.
  2. Log in to the CDP web interface.
  3. Go to Data Engineering.
  4. In the Environments column, select the environment containing the virtual cluster where you want to create the job.
  5. In the Virtual Clusters column on the right, click the View Jobs icon on the virtual cluster where you want to create the application.
  6. Go to Jobs > Create Job.
  7. Provide the job details:
    1. Select Spark 2.4.7 as the Job Type.
    2. Specify the Name.
    3. Select File and upload the JAR file that you downloaded earlier.
    4. Specify the following in the Main Class field:
      com.cloudera.cde.workflow.BiWorkflow
    5. Specify the following in the Arguments field:
      Click (Add Argument) to add an entry for each of the following arguments, in this order:

        --weatherTable
        weather
        --weatherForecastTable
        weather_forecast
        --weatherDataDirRoot
        s3a://[***FULL-PATH-TO-YOUR-S3-BUCKET***]

      The weather and weather_forecast sub-folders are created in the S3 bucket automatically by the Spark job.
    6. Enter the following in the Configurations field. Click (Add Configuration) to add an entry for each parameter; the sketch after these steps shows how a job can use the Hive Warehouse Connector settings.

        Configuration                                      Value
        spark.datasource.hive.warehouse.load.staging.dir   [***PATH-TO-S3-BUCKET***]/tmp
        spark.driver.extraJavaOptions                      -Duser.timezone=UTC
        spark.executor.extraJavaOptions                    -Duser.timezone=UTC
        spark.sql.hive.hiveserver2.jdbc.url                jdbc:hive2://[***HIVE-SERVER-2-FQDN***]-fsio.int.cldr.work:2181/default;httpPath=cliservice;serviceDiscoveryMode=zooKeeper;ssl=true;transportMode=http
        spark.sql.session.timeZone                         UTC

      To obtain the [***HIVE-SERVER-2-FQDN***], go to CDP Home > Management Console > Data Hub Clusters > [***DATA-HUB-WITH-DATA-ENGINEERING-CLUSTER-DEFINITION***] > Cloudera Manager > Clusters > Hive service > Instances and copy the FQDN corresponding to HiveServer2.
    7. Click Schedule to display scheduling options.
  8. Schedule the application to run periodically using the Basic controls as follows:
    Every hour at 30 minute(s) past the hour.
    Select a Start Date and an End Date.
  9. Click Schedule to create the job.
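The Configurations that you entered in step 7 become Spark properties on the job: the two extraJavaOptions entries and spark.sql.session.timeZone keep the driver, the executors, and Spark SQL consistently in UTC, while the JDBC URL and the staging directory are used by the Hive Warehouse Connector when the job exchanges data with Hive. The following Scala sketch illustrates, under assumed table and path names, how a job can rely on those settings; it is not the implementation inside bi-workflow_2.11-0.1.jar.

    // Sketch of publishing the Parquet data to Hive through the Hive Warehouse
    // Connector, which reads spark.sql.hive.hiveserver2.jdbc.url and
    // spark.datasource.hive.warehouse.load.staging.dir from the job configuration.
    // The table name and path below are placeholders, not taken from the packaged job.
    import com.hortonworks.hwc.HiveWarehouseSession
    import org.apache.spark.sql.SparkSession

    object HwcWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("bi-workflow-hwc").getOrCreate()

        // Open an HWC session; connectivity details come from the job configuration.
        val hive = HiveWarehouseSession.session(spark).build()
        hive.showDatabases().show() // quick connectivity check

        // Expose the converted Parquet data as a Hive table that
        // Cloudera Data Warehouse can query.
        spark.read
          .parquet("s3a://[***FULL-PATH-TO-YOUR-S3-BUCKET***]/weather")
          .write
          .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
          .option("table", "weather")
          .save()
      }
    }

The Basic schedule of every hour at 30 minutes past the hour corresponds to the cron expression 30 * * * *.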