Transforming and organizing data using Cloudera Data Engineering
You must define the Spark job configurations, resources, and schedule in the Cloudera Data Engineering (CDE) cluster.
In CDE, you perform the organization and transformation piece of the Business Intelligence at Scale pattern. A Spark JAR file (bi-workflow_2.11-0.1.jar) is provided in your pattern artifacts download package, and you use it to create a Spark job. The Spark job converts the Avro-format weather data that you put into the AWS S3 bucket to Parquet, and makes the Parquet data available to Cloudera Data Warehouse.
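The source of the provided JAR is not shown in this document, so the following is only a minimal PySpark-style sketch of the kind of Avro-to-Parquet conversion described above, not the actual bi-workflow job. It assumes that an existing SparkSession with the external spark-avro module available is passed in. The function name, the paths, and the overwrite behavior are hypothetical.

```python
def avro_to_parquet(spark, avro_path, parquet_path):
    """Read Avro files and rewrite them as Parquet (illustrative only).

    `spark` is assumed to be an existing SparkSession that has the
    spark-avro module on its classpath. Both paths are placeholders.
    """
    # Load the Avro-format source data into a DataFrame.
    df = spark.read.format("avro").load(avro_path)
    # Write the same rows in Parquet format. Overwriting earlier output
    # is an assumption made for this sketch.
    df.write.mode("overwrite").parquet(parquet_path)
    return parquet_path
```

In a real session, this could be called as avro_to_parquet(spark, source_dir, target_dir), where both directories are s3a:// locations that you choose.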
- Log in to your AWS console and create a tmp sub-folder in your S3 bucket within the path that you are using for this pattern.
- Log in to the CDP web interface.
- Go to Data Engineering.
- In the Environments column, select the environment containing the virtual cluster where you want to create the job.
- In the Virtual Clusters column on the right, click the View Jobs icon on the virtual cluster where you want to create the application.
- Go to .
Provide the job details:
- Select Spark 2.4.7 as the Job Type.
- Specify the Name.
- Select File and upload the JAR file that you downloaded earlier.
Specify the following in the Main Class
Specify the following in Arguments:
  --weatherTable weather
  --weatherForecastTable weather_forecast
  --weatherDataDirRoot s3a://[***FULL-PATH-TO-YOUR-S3-BUCKET***]
Click (Add Argument) to add multiple arguments as necessary. The Spark job automatically creates the weather and weather_forecast sub-folders in the S3 bucket.
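To make the layout of the three arguments above concrete, here is a hedged Python sketch of how a job could parse them and derive the two sub-folder locations. The parsing code is illustrative and is not taken from the provided JAR. It assumes that each sub-folder is named after its table and sits directly under --weatherDataDirRoot. The bucket path in the example is made up.

```python
import argparse


def parse_job_args(argv):
    """Parse the three job arguments (hypothetical re-creation)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--weatherTable", required=True)
    parser.add_argument("--weatherForecastTable", required=True)
    parser.add_argument("--weatherDataDirRoot", required=True)
    return parser.parse_args(argv)


def table_dirs(args):
    """Map each table name to an assumed sub-folder under the root."""
    root = args.weatherDataDirRoot.rstrip("/")
    tables = (args.weatherTable, args.weatherForecastTable)
    return {name: f"{root}/{name}" for name in tables}


# Example values; the S3 path is a placeholder, not a real bucket.
example = parse_job_args([
    "--weatherTable", "weather",
    "--weatherForecastTable", "weather_forecast",
    "--weatherDataDirRoot", "s3a://example-bucket/bi-pattern",
])
example_dirs = table_dirs(example)
```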
Enter the following in Configurations, listed here as Configuration: Value.
- spark.datasource.hive.warehouse.load.staging.dir: [***PATH-TO-S3-BUCKET***]/tmp
- spark.driver.extraJavaOptions: -Duser.timezone=UTC
- spark.executor.extraJavaOptions: -Duser.timezone=UTC
- spark.sql.hive.hiveserver2.jdbc.url: jdbc:hive2://[***HIVE-SERVER-2-FQDN***]-fsio.int.cldr.work:2181/default;httpPath=cliservice;serviceDiscoveryMode=zooKeeper;ssl=true;transportMode=http
- spark.sql.session.timeZone: UTC
Obtain the [***HIVE-SERVER-2-FQDN***] by copying the FQDN corresponding to HiveServer2.
Click (Add Configuration) to add multiple configuration parameters as necessary.
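Because the configuration values above share two inputs (the S3 bucket path and the HiveServer2 FQDN), it can help to assemble them in one place before typing them into the form. The following is a small sketch under that assumption. The function names and the example inputs are hypothetical, and the URL template is copied from the values listed in this step.

```python
def hive_jdbc_url(hs2_fqdn):
    """Fill the HiveServer2 FQDN into the JDBC URL template from this step."""
    return (
        f"jdbc:hive2://{hs2_fqdn}-fsio.int.cldr.work:2181/default;"
        "httpPath=cliservice;serviceDiscoveryMode=zooKeeper;"
        "ssl=true;transportMode=http"
    )


def job_configurations(s3_bucket_path, hs2_fqdn):
    """Return the five configuration entries as a name-to-value mapping."""
    return {
        # Staging directory is the tmp sub-folder created in the first step.
        "spark.datasource.hive.warehouse.load.staging.dir":
            f"{s3_bucket_path.rstrip('/')}/tmp",
        # Driver, executors, and SQL session all use UTC.
        "spark.driver.extraJavaOptions": "-Duser.timezone=UTC",
        "spark.executor.extraJavaOptions": "-Duser.timezone=UTC",
        "spark.sql.hive.hiveserver2.jdbc.url": hive_jdbc_url(hs2_fqdn),
        "spark.sql.session.timeZone": "UTC",
    }


# Example inputs are placeholders only.
example_conf = job_configurations("s3a://example-bucket/bi-pattern/", "hs2-example")
```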
- Click Schedule to display scheduling options.
Schedule the application to run periodically using the Basic controls as follows: Every hour at 30 minute(s) past the hour. Select a Start Date and an End Date.
- Click Schedule to create the job.
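As a rough illustration of the schedule chosen above, "every hour at 30 minutes past the hour" corresponds to the standard five-field cron expression 30 * * * *. The sketch below lists the run times such a schedule would produce between a start and an end time. The helper name is hypothetical, and treating both boundaries as inclusive is an assumption, not documented CDE behavior.

```python
from datetime import datetime, timedelta

# Standard cron syntax: minute 30 of every hour, every day.
CRON_EXPRESSION = "30 * * * *"


def next_runs(start, end, count=3):
    """Return up to `count` run times at HH:30 between start and end."""
    # First candidate is HH:30 in the starting hour.
    candidate = start.replace(minute=30, second=0, microsecond=0)
    if candidate < start:
        # HH:30 has already passed, so move to the next hour.
        candidate += timedelta(hours=1)
    runs = []
    while candidate <= end and len(runs) < count:
        runs.append(candidate)
        candidate += timedelta(hours=1)
    return runs


# Example window; the dates are arbitrary.
example_runs = next_runs(datetime(2024, 1, 1, 9, 45), datetime(2024, 1, 1, 12, 0))
```

With this example window, the first run falls at 10:30 because 09:30 is already in the past at the start time.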