Tutorial: Clusters and Jobs on AWS
This tutorial walks you through using the Altus console and CLI to create Altus Data Engineering clusters and submit jobs. The tutorial uses publicly available data that show the usage of Medicare procedure codes.
Cloudera provides a publicly accessible S3 bucket that contains the data, scripts, and other artifacts used in the tutorial. You must create an S3 bucket in your AWS account to write output data.
- Prerequisites
- To use this tutorial, you must have an Altus user account and the roles required to create clusters and run jobs in Altus.
- Altus Console Login
- Log in to the Altus console to perform the exercises in this tutorial.
- Exercise 1: Installing the Altus Client
- Learn how to install the Altus client and register an access key to use the CLI.
- Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs
- Learn how to create a cluster with a Spark service and submit a Spark job using the Altus console and the CLI. This exercise also shows you how to create a SOCKS proxy, view the cluster and monitor the job in Cloudera Manager, and delete the cluster on the console.
- Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs
- Learn how to create a cluster with a Hive service and submit a group of Hive jobs using the Altus console and the CLI. This exercise also walks you through creating a SOCKS proxy, accessing Cloudera Manager, and deleting the cluster on the console.
Prerequisites
Before you start the tutorial, ensure that you have access to resources in your AWS account and an Altus user account with permission to create clusters and run jobs in Altus.
- Altus user account, environment, and roles. An Altus user account allows you to log in to the Altus console and perform the exercises in the tutorial. An Altus administrator must assign an Altus environment to your user account so that you have access to resources in your AWS account, and must also assign the roles that allow you to create clusters and run jobs in Altus.
For more information about getting an Altus user account, see Getting Started in Altus.
- Public key. You must provide a public key for Altus to use when creating and configuring clusters in your AWS account.
For more information about creating the SSH key in AWS, see Amazon EC2 Key Pairs. You can also create SSH keys with other tools, such as ssh-keygen (see the sketch after this list).
- S3 bucket for output. The tutorial provides read access to an S3 bucket that contains the jars, scripts, and input data used in the tutorial exercises. You must set up an S3 bucket in your AWS account for the output data generated by the jobs, and the bucket permissions must allow write access when you run the Altus jobs.
For more information about creating an S3 bucket in AWS, see Creating and Configuring an S3 Bucket.
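If you work from the command line, the following is a minimal sketch of both prerequisites using standard tools. The key file name and bucket name are placeholders, not values the tutorial requires:
# Generate an SSH key pair to provide to Altus when you create clusters:
ssh-keygen -t rsa -b 4096 -f ~/.ssh/altus-tutorial-key
# Create an S3 bucket for job output with the AWS CLI (bucket names must be globally unique):
aws s3 mb s3://your-altus-tutorial-output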
Altus Console Login
To access the Altus console, go to the following URL: https://console.altus.cloudera.com/.
Log in to Altus with your Cloudera user account. After you are authenticated, the Altus console displays your home page.
The Data Engineering section appears in the side navigation panel. If you have been assigned roles and an environment in Altus, you can click Clusters and Jobs to create clusters and submit jobs as you follow the tutorial exercises.
Exercise 1: Installing the Altus Client
To use the Altus CLI, you must install the Altus client and configure the client with an access key.
Altus manages access to the Altus services so that only users with a registered access key can run commands to create clusters, submit jobs, or use SDX namespaces. Generate an access key and register it with the Altus client to create a credentials file so that you do not need to submit your access key with each command.
This exercise provides instructions to download and install the Altus client on Linux, generate a key, and run the CLI command to register the key.
For instructions to install the Altus client on Windows, see Installing the Altus Client on Windows.
Step 1. Install the Altus Client
To avoid conflicts with older versions of Python or other packages, Cloudera recommends that you install the Cloudera Altus client in a virtual environment. Use the virtualenv tool to create a virtual environment and install the client.
mkdir ~/altusclienv
virtualenv ~/altusclienv --no-site-packages
source ~/altusclienv/bin/activate
~/altusclienv/bin/pip install altuscli
~/altusclienv/bin/pip install --upgrade altuscli
After the client installation process is complete, run the following command to confirm that the Altus client is working:
If virtualenv is activated: altus --version
If virtualenv is not activated: ~/altusclienv/bin/altus --version
Step 2. Configure the Altus Client with the API Access Key
You use the Altus console to generate the access key that you register with the client. Keep the console window that displays the access key open until you complete the key registration process.
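The registration step looks roughly like the following. This is a sketch that assumes the client follows an aws configure-style prompt flow; the prompts are paraphrased, and the credentials file location is an assumption:
altus configure
# When prompted, paste the Altus Access Key ID and the private key from the console.
# The client writes them to a local credentials file (for example, ~/.altus/credentials),
# so later commands do not need the key passed explicitly.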
Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs
This exercise shows you how to create a cluster with a Spark service on the Altus console and submit a Spark job on the console and the command line. It also shows you how to create a SOCKS proxy, access the cluster, and view the progress of the job in Cloudera Manager.
- Create a cluster with a Spark service on the console.
- Submit a Spark job on the console.
- Create a SOCKS proxy to access the Spark cluster on Cloudera Manager.
- View the Spark cluster and verify the Spark job output.
- Submit a Spark job using the CLI.
- Terminate the Spark cluster.
Creating a Spark Cluster on the Console
You must be logged in to the Altus console to perform this task.
Note that it can take a while for Altus to complete the process of creating a cluster.
- In the Data Engineering section of the side navigation panel, click Clusters.
- On the Clusters page, click Create Cluster.
- Create a cluster with the following configuration:
- Cluster Name: To help you easily identify your cluster, use your first initial and last name as a prefix for the cluster name. This tutorial uses the cluster name mjones-spark-tutorial as an example.
- Service Type: Spark 2.x
- CDH Version: CDH 5.13
- Environment: Name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.
- Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties at their default settings.
- Credentials: Configure your access credentials to Cloudera Manager:
  - SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting, select Direct Input and enter the full key.
  - Cloudera Manager User: Set both the user name and password to guest.
- Verify that all required fields are set and click Create Cluster.
The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters.
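If you prefer to script cluster creation, the CLI provides a create-aws-cluster command. The following is a sketch only: the flag names and enum values shown (for example, SPARK_2_X, CDH513, and the instance type) are assumptions to verify against the output of altus dataeng create-aws-cluster help:
altus dataeng create-aws-cluster \
  --cluster-name mjones-spark-tutorial \
  --service-type SPARK_2_X \
  --cdh-version CDH513 \
  --environment-name YourAltusEnvironment \
  --instance-type m4.xlarge \
  --workers-group-size 3 \
  --cloudera-manager-username guest \
  --cloudera-manager-password guest \
  --ssh-public-key "$(cat ~/.ssh/altus-tutorial-key.pub)"
# The environment name, instance type, and key path above are placeholders for your own values.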
Submitting a Spark Job
Submit a Spark job to run on the cluster you created in the previous task.
- In the Data Engineering section of the side navigation panel, click Jobs.
- Click Submit Jobs.
- On the Job Settings page, select Single job.
- Select the Spark job type.
- Create a Spark job with the following configuration:
- Job Name: Set the job name to Spark Medical Example.
- Main Class: Set the main class to com.cloudera.altus.sample.medicare.transform
- Jars: Use the tutorial jar file: s3a://cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar
- Application Arguments: Set the application arguments to the S3 buckets to use for job input and output. Add the tutorial S3 bucket for the job input: s3a://cloudera-altus-data-engineering-samples/spark/medicare/input/ Then click + and add the S3 bucket you created for the job output: s3a://Path/Of/The/Output/S3Bucket/
- Cluster Settings: Select Use an existing cluster and choose the cluster that you created in the previous task.
- Verify that all required fields are set and click Submit Jobs.
The Altus Data Engineering service submits the job to run on the selected cluster in your AWS account.
Creating a SOCKS Proxy for the Spark Cluster
Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the cluster and progress of the job.
- In the Data Engineering section of the side navigation panel, click Clusters.
- On the Clusters page, find the cluster on which you submitted the job and click the cluster name.
- On the cluster detail page, click View SOCKS Proxy CLI Command.
Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Spark cluster that you created.
- Click Copy.
- On a terminal window, paste the command.
- Modify the command to use the name of the cluster you created and your private key and run the command:
altus dataeng socks-proxy --cluster-name "YourClusterName" --ssh-private-key="YourPrivateKey" --open-cloudera-manager="yes"
The Cloudera Manager Admin console opens in a Chrome browser.
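The helper command wraps an ordinary SSH dynamic port forward. If you need to open the tunnel manually, a sketch follows; the login user, host, and port are placeholders, and the actual user depends on the cluster image:
ssh -i ~/.ssh/altus-tutorial-key -CND 1080 centos@<master-node-public-dns>
# -C compresses traffic, -N opens no remote shell, -D 1080 starts a SOCKS proxy on localhost:1080.
# Configure your browser to use localhost:1080 as a SOCKS v5 proxy, then open the Cloudera Manager URL.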
Viewing the Cluster and Verifying the Spark Job Output
Log in to Cloudera Manager with the guest user account that you set up when you created the cluster.
- Log in to Cloudera Manager using guest as the account name and password.
- On the Home page, click Clusters on the top navigation bar.
- On the cluster window, select YARN Applications.
Cloudera Manager displays the cluster services and the workload information for the job.
When the job completes, go to the S3 bucket you specified for your job output and verify the files created by the Spark job (you can also list them from the command line, as shown after this list):
- Success (0 bytes)
- part-00000 (65.5 KB)
- part-00001 (69.5 KB)
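To list the output files from the command line, you can use the AWS CLI with the same placeholder path as above:
aws s3 ls s3://Path/Of/The/Output/S3Bucket/ --recursive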
Submitting a Spark Job Using the CLI
You can submit the same Spark job to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.
altus dataeng submit-jobs \
  --cluster-name FirstInitialLastName-tutorialcluster \
  --jobs '{
    "sparkJob": {
      "jars": ["s3a://cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar"],
      "mainClass": "com.cloudera.altus.sample.medicare.transform",
      "applicationArguments": [
        "s3a://cloudera-altus-data-engineering-samples/spark/medicare/input/",
        "s3a://Path/Of/The/Output/S3Bucket/"
      ]
    }
  }'
To view the workload summary, go back to the YARN Applications page on the Cloudera Manager Admin console. Cloudera Manager displays the same workload summary for this job as for the job that you submitted through the console.
When the job completes, verify the job output in your S3 bucket. The job creates the same set of files:
- Success (0 bytes)
- part-00000 (65.5 KB)
- part-00001 (69.5 KB)
Terminating the Cluster
This task shows you how to terminate the cluster that you created for this tutorial.
- On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
- On the Clusters page, click the name of the cluster that you created for this tutorial.
- On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
- Click Actions and select Delete Cluster.
- Click OK to confirm that you want to terminate the cluster.
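You can perform the same cleanup from the command line. A sketch, assuming a delete-cluster subcommand (verify the subcommand name with altus dataeng help):
altus dataeng delete-cluster --cluster-name mjones-spark-tutorial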
Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs
This exercise shows you how to create a cluster with a Hive service on the Altus console and submit Hive jobs on the console and the command line. It also shows you how to create a SOCKS proxy, access the cluster, and view the progress of the jobs in Cloudera Manager.
- Create a cluster with a Hive service on the console.
- Submit a group of Hive jobs on the console.
- Create a SOCKS proxy to access the Hive cluster on Cloudera Manager.
- View the Hive cluster and verify the Hive job output.
- Submit a group of Hive jobs using the CLI.
- Terminate the Hive cluster.
Creating a Hive Cluster on the Console
You must be logged in to the Altus console to perform this task.
Note that it can take a while for Altus to complete the process of creating a cluster.
- In the Data Engineering section of the side navigation panel, click Clusters.
- On the Clusters page, click Create Cluster.
- Create a cluster with the following configuration:
- Cluster Name: To help you easily identify your cluster, use your first initial and last name as a prefix for the cluster name. This tutorial uses the cluster name mjones-hive-tutorial as an example.
- Service Type: Hive
- CDH Version: CDH 5.13
- Environment: Name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.
- Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties at their default settings.
- Credentials: Configure your access credentials to Cloudera Manager:
  - SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting, select Direct Input and enter the full key.
  - Cloudera Manager User: Set both the user name and password to guest.
- Verify that all required fields are set and click Create Cluster.
The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters.
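If you prefer the CLI, the create-aws-cluster sketch from Exercise 2 applies here as well; only the cluster name and the service type change (the HIVE enum value is likewise an assumption to verify against the CLI help):
altus dataeng create-aws-cluster --cluster-name mjones-hive-tutorial --service-type HIVE ...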
Submitting a Hive Job Group
Submit multiple Hive jobs as a group to run on the cluster that you created in the previous step.
- In the Data Engineering section of the side navigation panel, click Jobs.
- Click Submit Jobs.
- On the Job Settings page, select Group of jobs.
- Select the Hive job type.
- Set the Job Group Name to Hive Medical Example.
- Click Add Hive Job.
- Create a job with the following configuration:
- Job Name: Set the job name to Create External Tables.
- Script: Select Script Path and enter the following script name: s3a://cloudera-altus-data-engineering-samples/hive/program/med-part1.hql
- Hive Script Parameters: Select Hive Script Parameters and add the following variables and values:
  - HOSPITALS_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/hospitals/
  - READMISSIONS_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/
  - EFFECTIVECARE_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/effectiveCare/
  - GDP_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/GDP/
- Action on Failure: Select Interrupt Job Queue.
- Click OK to add the job to the group.
On the Submit Jobs page, Altus adds the Hive Medical Example job to the list of jobs in the group.
- Click Add Hive Job.
- Create a job with the following configuration:
- Job Name: Set the job name to Clean Data.
- Script: Select Script Path and enter the following script name: s3a://cloudera-altus-data-engineering-samples/hive/program/med-part2.hql
- Action on Failure: Select Interrupt Job Queue.
- Click OK.
On the Submit Jobs page, Altus adds the Clean Data job to the list of jobs in the group.
- Click Add Hive Job.
- Create a job with the following configuration:
- Job Name: Set the job name to Write Output.
- Script: Select Script Path and enter the following script name: s3a://cloudera-altus-data-engineering-samples/hive/program/med-part3.hql
- Hive Script Parameters: Select Hive Script Parameters and add the S3 bucket you created for the job output as a variable:
  - OUTPUT_DIR: s3a://Path/Of/The/Output/S3Bucket/
- Action on Failure: Select None.
- Click OK.
On the Submit Jobs page, Altus adds the Write Output job to the list of jobs in the group.
- In the Cluster Settings section, select Use existing and select the Hive cluster that you created for this exercise.
The list displays only clusters that can run Hive jobs.
- Click Submit Jobs to run the job group on your Hive cluster.
Creating a SOCKS Proxy for the Hive Cluster
Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the progress of the job.
- In the Data Engineering section of the side navigation panel, click Clusters.
- On the Clusters page, find the cluster on which you submitted the Hive job group and click the cluster name.
- On the cluster detail page, click View SOCKS Proxy CLI Command.
Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Hive cluster that you created.
- Click Copy.
- On a terminal window, paste the command.
- Modify the command to use the name of the cluster you created and your private key and then run the following command:
altus dataeng socks-proxy --cluster-name "YourClusterName" --ssh-private-key="YourPrivateKey" --open-cloudera-manager="yes"
The Cloudera Manager Admin console opens in a Chrome browser.
Viewing the Hive Cluster and Verifying the Hive Job Output
Log in to Cloudera Manager with the guest user account that you set up when you created the Hive cluster.
- Log in to Cloudera Manager using guest as the account name and password.
- On the Home page, click Clusters on the top navigation bar.
- On the cluster window, select YARN Applications.
Cloudera Manager displays the cluster services and the workload information for the running jobs.
- Click Clusters on the top navigation bar and select the default Hive service named HIVE-1. Then click HiveServer2 Web UI.
The HiveServer2 Web UI displays the workload information for the Hive queries.
- When the jobs complete, go to the S3 bucket you specified for your job output and verify the file created by the Hive jobs.
The Hive jobs create the following file in your output S3 bucket: 000000_0 (135.9 KB)
Submitting a Hive Job Group Using the CLI
You can submit the same group of Hive jobs to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.
To submit a group of Hive jobs using the CLI, run the submit-jobs command and provide the list of jobs in the jobs parameter. Run it on the same cluster and use the same job group name.
altus dataeng submit-jobs \
  --cluster-name FirstInitialLastName-tutorialcluster \
  --job-submission-group-name "Hive Medical Example" \
  --jobs '[
    {
      "name": "Create External Tables",
      "failureAction": "INTERRUPT_JOB_QUEUE",
      "hiveJob": {
        "script": "s3a://cloudera-altus-data-engineering-samples/hive/program/med-part1.hql",
        "params": [
          "HOSPITALS_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/hospitals/",
          "READMISSIONS_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/",
          "EFFECTIVECARE_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/effectiveCare/",
          "GDP_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/GDP/"
        ]
      }
    },
    {
      "name": "Clean Data",
      "failureAction": "INTERRUPT_JOB_QUEUE",
      "hiveJob": {
        "script": "s3a://cloudera-altus-data-engineering-samples/hive/program/med-part2.hql"
      }
    },
    {
      "name": "Write Output",
      "failureAction": "NONE",
      "hiveJob": {
        "script": "s3a://cloudera-altus-data-engineering-samples/hive/program/med-part3.hql",
        "params": ["OUTPUT_DIR=s3a://Path/Of/The/Output/S3Bucket/"]
      }
    }
  ]'
- To view the workload summary, go to the YARN Applications page on the Cloudera Manager Admin console.
- To view the job information, open the HiveServer2 Web UI for the Hive service.
When the jobs complete, go to the S3 bucket you specified for your job output and verify the file created by the Hive jobs. The Hive job group creates the following file in your output S3 bucket: 000000_0 (135.9 KB)
Terminating the Hive Cluster
This task shows you how to terminate the cluster that you created for this tutorial.
- On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
- On the Clusters page, click the name of the cluster that you created for this tutorial.
- On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
- Click Actions and select Delete Cluster.
- Click OK to confirm that you want to terminate the cluster.