2. Instructions

This section provides instructions on deploying Talend Open Studio, writing a Talend job for data import, and modifying that job to perform data analysis.

 2.1. Deploying Talend Open Studio

Use the following instructions to set up Talend Open Studio:

  1. Download and launch the application.

    1. Download the Talend Open Studio add-on for HDP.

    2. After the download is complete, unzip the contents into an install location.

    3. Invoke the executable file corresponding to your operating system. (For a terminal-based sketch of this download-and-launch sequence, see the example after this procedure.)

    4. Read and accept the end-user license agreement.

      Talend displays the "Welcome to Talend Open Studio" dialog.

  2. Create a new project.

    1. Ignore the Select a Demo Project field. Instead, create a new project. Provide a project name (for example, HDPIntro), then click Create.

      Talend displays the New Project dialog.

    2. Click Finish on the New Project dialog.

    3. Select the newly-created project, then click Open.

      The Connect To TalendForge dialog displays.

    4. Either choose to register now, or click Skip to continue.

      Talend displays the progress information bar. When progress is complete, Talend displays a Welcome window.

    5. Wait for the application to initialize, then click Start now! Talend displays the main Talend Open Studio (TOS) window.
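
If you prefer to stage the download and installation from a terminal, the following is a minimal sketch of the same sequence on a Linux machine. The download URL, archive name, install location, and executable name are placeholders rather than actual Talend artifact names; substitute the values from your own download.

    # download the Talend Open Studio add-on (URL and file name are placeholders)
    wget -O /tmp/talend-open-studio.zip "https://example.com/talend-open-studio-hdp.zip"

    # unzip the contents into an install location
    mkdir -p /opt/talend
    unzip /tmp/talend-open-studio.zip -d /opt/talend

    # invoke the executable file corresponding to your operating system
    # (the binary name below is an assumption; use the one shipped in the archive)
    /opt/talend/TOS-linux-gtk-x86_64 &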

 2.2. Writing a Talend Job for Data Import

This section describes how to design a simple job for importing a file into the Hadoop cluster.
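
At its core, the job built in this section copies a local file into HDFS. As a point of reference only, the equivalent manual operation from a shell on the cluster would look roughly like the following; the paths match the examples used later in this section.

    # roughly what the Talend job does: copy the local sample file into HDFS
    hadoop dfs -put /tmp/input.txt /user/testuser/data.txt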

  1. Create a new job.

    1. In the Repository tree view, right-click the Job Designs node, then select Create job.

    2. In the New Job wizard, provide a name for the new job (for example, HDPJob), then click Finish.

    3. An empty design workspace, named after the new job, opens.

  2. Create a sample input file.

    1. In the /tmp directory of your master deployment machine, create a text file (for example, input.txt) with the following contents:

      101;Adam;Wiley;Sales
      102;Brian;Chester;Service
      103;Julian;Cross;Sales
      104;Dylan;Moore;Marketing
      105;Chris;Murphy;Service
      106;Brian;Collingwood;Service
      107;Michael;Muster;Marketing
      108;Miley;Rhodes;Sales
      109;Chris;Coughlan;Sales
      110;Aaron;King;Marketing
              

  3. Build the job.

    Jobs are composed of components that are available in the Palette.

    1. Expand the Big Data tab in the Palette.

    2. Click the tHDFSPut component, then click the design workspace to drop the component there.

    3. To define the component in its Basic Settings view, double-click tHDFSPut.

    4. Set the values in the Basic Settings view to match your HDP cluster: at a minimum, the NameNode URI, the user name (testuser in the examples that follow), the local source file (/tmp/input.txt), and the HDFS target path (/user/testuser/data.txt).

  4. You now have a working job.

    To run the job, click Play (the play icon). Talend executes the job and reports its progress in the console.

  5. Verify the import operation. From the gateway machine or the HDFS client, open a console window and execute:

    hadoop dfs -ls /user/testuser/data.txt
    

    The terminal displays output similar to the following:

    Found 1 items
    -rw-r--r--   3 testuser testuser   252 2012-06-12 12:52 /user/testuser/data.txt
    

    This message indicates that the local file was successfully copied into your Hadoop cluster.
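
    In addition to listing the file, you can print its contents to confirm that the import preserved the data. This optional check uses the same HDFS client:

      # optionally, display the imported file to confirm its contents
      hadoop dfs -cat /user/testuser/data.txt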

 2.3. Modifying the Job to Perform Data Analysis

Use the following instructions to aggregate data using Apache Pig.
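
The components assembled in the steps below build a Pig job that groups the imported records by department and counts them. As a rough, illustrative sketch of what that job computes, an equivalent run from a shell would look like the following. The alias and most column names are assumptions; only the paths, the dept grouping column, and the people_count output follow the examples used in this guide, and Talend generates its own Pig code for the actual job.

    # an illustrative Pig Latin equivalent of the aggregation job, run from a shell
    pig -e "
      emp = LOAD '/user/testuser/data.txt' USING PigStorage(';')
            AS (id:int, first_name:chararray, last_name:chararray, dept:chararray);
      by_dept = GROUP emp BY dept;
      counts = FOREACH by_dept GENERATE group AS dept, COUNT(emp) AS people_count;
      STORE counts INTO '/user/testuser/output' USING PigStorage(';');
    "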

  1. Add the Pig component from the Big Data Palette.

    1. In the Big Data palette, select the Pig tab.

    2. Click the tPigLoad component and drag it into the design workspace.

  2. Define basic settings for the Pig component.

    1. To define the component's basic settings, double-click tPigLoad.

    2. Click Edit Schema. Define the schema of the input data, then click OK. The sample file has four columns: an integer id followed by three string columns (for example, first_name, last_name, and dept).

    3. Provide values for the mode, configuration, NameNode URI, JobTracker host, load function, and input file URI fields appropriate to your environment.

      Important: Ensure that the NameNode URI and the JobTracker host correspond to the actual values in your HDP cluster. (The Input File URI is the HDFS path of the previously imported file, /user/testuser/data.txt in this example.)

  3. Connect the Pig and HDFS components to define the workflow.

    1. Right-click the source component (tHDFSPut) on your design workspace.

    2. From the contextual menu, select Trigger -> On Subjob Ok.

    3. Click the target component (tPigLoad).

  4. Add and connect the Pig aggregate component.

    1. Add the component tPigAggregate next to tPigLoad.

    2. Right-click tPigLoad and, from the contextual menu, select Row -> Pig Combine.

    3. Click tPigAggregate to complete the connection.

  5. Define basic settings for the Pig Aggregate component.

    1. Double-click tPigAggregate to define the component in its Basic Settings.

    2. Click the Edit schema button and define the output schema. In this example the output has two columns: dept and people_count.

  6. Define the aggregation function for the data.

    1. Under Group by, add a column and select dept.

    2. In the Operations table, set the Additional Output column to people_count, the Function to count, and the Input Column to id.

  7. Add and connect the Pig data storage component.

    1. Add the component tPigStoreResult next to tPigAggregate.

    2. Right-click tPigAggregate and, from the contextual menu, select Row -> Pig Combine, then click tPigStoreResult.

  8. Define basic settings for the data storage component.

    1. Double-click tPigStoreResult to define the component in its Basic Settings view.

    2. Specify the result directory on HDFS (in this example, /user/testuser/output).

  9. Run the modified Talend job, which is now ready for execution.

    Save the job, then click the play icon to run it as described in Step 4 of the previous section.

  10. Verify the results.

    1. From the gateway machine or the HDFS client, open a console window and execute the following command:

      hadoop dfs -cat /user/testuser/output/part-r-00000
      
    2. You should see the following output:

      Sales;4
      Service;3
      Marketing;3
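
If you re-run the modified job, note that the Pig store step writes to a fixed HDFS directory and fails if that directory already exists. Removing the previous output first avoids the error; the path below is the result directory used in this guide.

    # remove the previous Pig output before re-running the job
    hadoop dfs -rmr /user/testuser/output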
      

