Tutorial: Using Impala, Hive and Hue with Virtual Private Clusters

Set Up an Environment

Set up your environment with Compute and Base clusters as follows: (See Adding a Compute Cluster and Data Context.)
  1. Create clusters where the Cloudera Manager and CDH version match, for example both are 6.2.0. The clusters must use Kerberos and TLS.
  2. If Base cluster has Sentry, then make sure the user executing cross cluster queries is added to the correct role that has all the necessary privileges to create/insert data into tables. (more in workflow #3).
  3. Configure a Regular cluster called Cluster 1 to be used as a Base cluster. This cluster must have high availability enabled.
  4. Create two Compute clusters called Compute 1 and Compute 2.
  5. Configure services for the three clusters as shown below:

Workflow #1: Using impala-shell and Hive

This workflow desribes how to create a table using Impala, how to insert sample data on Compute cluster 1, and how to access and modify the data using beeline from Compute cluster 2.

In a Virtural Private Cluster environment, Hue and the impala-shell can be used to setup databases, tables, insert and retrieve data using queries. This workflow focuses on running a few queries using impala-shell command line tool.

This tutorial uses a kerberized environment with TLS, so you must kinit the user first. If you want to add a new user, see Step 6: Get or Create a Kerberos Principal for Each User Account and Enabling Sentry Authorization for Impala for documentation to create and add the user to the Kerberos principal and the required Linux groups.

  1. Identify a host running Impala Daemon to launch impala-shell using this breadcrumb. In the Cloudera Manager Admin Console, go to Cloudera Manager > Compute Cluster 1 > IMPALA-1 > Instances.
  2. Note the hostname of a host that is running the Impala Daemon and open an ssh session to that host.
    ssh <hostname>



  3. Kinit the user (because this is a Kerberized environment):
    kinit -kt <keytab path of authorized user> <authorized user>
  4. Start the Impala-shell:

    In the Cloudera Manager Admin Console, go to the Impala service and click the Status tab.

    The Impala Shell command displays in the upper-left. Copy that string and use it as the command to open Impala shell. For example:
    impala-shell -i vpc_host-cqdbms-2.tut.myco.com:25003 -d default -k --ssl --ca_cert=/etc/cdep-ssl-conf/CA_STANDARD/truststore.pem
    

    The command uses the following form:
    impala-shell -i <Impala Daemon host URL>:25003 -d default -k --ssl --ca_cert=<path to truststore>/truststore.pem
    Query: use `default`
    [vpc_host-cqdbms-2.tut.myco.com:25003] default>
  5. Verify that impala-shell is in the connected status.
  6. Create a database and a table:
    [vpc_host-cqdbms-2.tut.myco.com:25003] default> create database test_data;
    Query: create database test_data
    +----------------------------+
    | summary                    |
    +----------------------------+
    | Database has been created. |
    +----------------------------+
    [vpc_host-cqdbms-2.tut.myco.com:25003] default> show databases;
    Query: show databases
    +------------------+----------------------------------------------+
    | name             | comment                                      |
    +------------------+----------------------------------------------+
    | _impala_builtins | System database for Impala builtin functions |
    | default          | Default Hive database                        |
    | test_data        |                                              |
    +------------------+----------------------------------------------+
    [vpc_host-cqdbms-2.tut.myco.com:25003] default> use test_data;
    [vpc_host-cqdbms-2.tut.myco.com:25003] test_data> create table test_table (year int, winner string);
    +-------------------------+
    | summary                 |
    +-------------------------+
    | Table has been created. |
    +-------------------------+
    Fetched 1 row(s) in 0.47s
  7. Verify that the table has been created on the Base cluster HDFS
    [vpc_host-cqdbms-2.tut.myco.com:25003] test_data> show create table test_table;
    Query: show create table test_table
    +-------------------------------------------------------------------+
    | result                                                            |
    +-------------------------------------------------------------------+
    | CREATE TABLE test_data.test_table (                               |
    |   year INT,                                                       |
    |   winner STRING                                                   |
    | )                                                                 |
    | STORED AS TEXTFILE                                                |
    | LOCATION 'hdfs://ns1/user/hive/warehouse/test_data.db/test_table' |
    |                                                                   |
    +-------------------------------------------------------------------+
  8. Insert test data into the table:
    [vpc_host-cqdbms-2.tut.myco.com:25003] test_data> insert into table test_table values (2018, 'France'), (2014, 'Germany'), (2010, 'Spain'), (2006, 'Italy');
    Query: insert into table test_table values (2018, 'France'), (2014, 'Germany'), (2010, 'Spain'), (2006, 'Italy')
    Query progress can be monitored at: https://vpc_host-cqdbms-2.tut.myco.com:25000/query_plan?query_id=334fc3bd7e421cce:540f171500000000
    [vpc_host-cqdbms-2.tut.myco.com:25003] test_data> select * from test_table;
    Query progress can be monitored at: https://vpc_host-cqdbms-2.tut.myco.com:25000/query_plan?query_id=be4c4c177520a5fd:7efba74700000000
    +------+---------+
    | year | winner  |
    +------+---------+
    | 2018 | France  |
    | 2014 | Germany |
    | 2010 | Spain   |
    | 2006 | Italy   |
    +------+---------+
  9. Log in using ssh to the host running HiveServer2 on the Compute cluster. You can find a host in the Cloudera Manager Admin Console by going to Clusters > Compute 2 > Hive Execution Service > Instances.
  10. Because this is a Kerberized environment, kinit the user:
    [root@vpc_host-nnwznq-1 ~]# kinit hive
    Password for hive@tut.myco.com:
  11. Launch beeline:
    • TLS is not enabled:
      [root@vpc_host-nnwznq-1 ~]# beeline -u 'jdbc:hive2://localhost:10000/default;principal=hive/vpc_host-nnwznq-1.tut.myco.com@tut.myco.com'
    • Auto-TLS is enabled:
      [root@vpc_host-nnwznq-1 ~]# beeline -u 'jdbc:hive2://localhost:10000/default;ssl=true;sslTrustStore=<CMCA_TRUSTORE_LOC>.jks;trustStorePassword=cloudera;principal=hive/vpc_host-nnwznq-1.tut.myco.com@tut.myco.com'
      
    • Regular TLS is enabled:
      [root@vpc_host-nnwznq-1 ~]# beeline -u 'jdbc:hive2://localhost:10000/default;ssl=true;sslTrustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks;trustStorePassword=cloudera;principal=hive/vpc_host-nnwznq-1.tut.myco.com@tut.myco.com'
      
      /CA_STANDARD/truststore.jks;trustStorePassword=cloudera;principal=hive/<HiveServer2 Host URL>@VPC.CLOUDERA.COM'
  12. Access the tables created through Impala in the previous section:
    0: jdbc:hive2://localhost:10000/default> show databases;
    INFO  : Compiling command(queryId=hive_20190309192547_09146fd4-58b9-4f60-ad40-c9de3f98d470); Time taken: 0.987 seconds
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hive_20190309192547_09146fd4-58b9-4f60-ad40-c9de3f98d470); Time taken: 0.041 seconds
    INFO  : OK
    +----------------+
    | database_name  |
    +----------------+
    | default        |
    | test_data      |
    +----------------+
    0: jdbc:hive2://localhost:10000/default> select * from test_table;
    INFO  : Compiling command(queryId=hive_20190309192621_701914ad-0417-4639-9209-335a63818b82): select * from test_table
    command(queryId=hive_20190309192621_701914ad-0417-4639-9209-335a63818b82); Time taken: 0.38 seconds
    +------------------+--------------------+
    | test_table.year  | test_table.winner  |
    +------------------+--------------------+
    | 2018             | France             |
    | 2014             | Germany            |
    | 2010             | Spain              |
    | 2006             | Italy              |
    +------------------+--------------------+
  13. Modify the table:
    0: jdbc:hive2://localhost:10000/default> insert into test_table values (2002, 'Brazil');
    INFO  : Compiling 
    INFO  : Executing command(queryId=hive_20190309192705_218b79aa-aa94-4102-95ab-a1d4bc7a0381): insert into test_table values (2002, 'Brazil')
    WARN  : 
    INFO  : Query ID = hive_20190309192705_218b79aa-aa94-4102-95ab-a1d4bc7a0381
    INFO  : Total jobs = 3
    INFO  : Launching Job 1 out of 3
    INFO  : Starting task [Stage-1:MAPRED] in serial mode
    INFO  : Submitting tokens for job: job_1552095496593_0001
    INFO  : The url to track the job: https://vpc_host-nnwznq-1.tut.myco.com:8090/proxy/application_1552095496593_0001/
    INFO  : Starting Job = job_1552095496593_0001, Tracking URL = https://vpc_host-nnwznq-1.tut.myco.com:8090/proxy/application_1552095496593_0001/
    hdfs://ns1/user/hive/warehouse/test_data.db/test_table/.hive-staging_hive_2019-03-09_19-27-05_193_3963732700280111926-1/-ext-10000 from hdfs://ns1/user/hive/warehouse/test_data.db/test_table/.hive-staging_hive_2019-03-09_19-27-05_193_3963732700280111926-1/-ext-10002
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Loading data to table test_data.test_table from hdfs://ns1/user/hive/warehouse/test_data.db/test_table/.hive-staging_hive_2019-03-09_19-27-05_193_3963732700280111926-1/-ext-10000
    INFO  : MapReduce Jobs Launched: 
    INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 2.4 sec   HDFS Read: 4113 HDFS Write: 88 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 2 seconds 400 msec
    INFO  : Completed executing command(queryId=hive_20190309192705_218b79aa-aa94-4102-95ab-a1d4bc7a0381); Time taken: 31.097 seconds
    INFO  : OK
    1 row affected (31.434 seconds)
    
  14. Verify and track the Yarn job submitted by the Hive Execution Service using the Cloudera Manager Admin Console by going to Clusters > Compute 2 > YARN 2 > Applications.

    YARN Job:


    Map Reduce job:


  15. Verify that new data was added to the table:
    0: jdbc:hive2://localhost:10000/default> select * from test_table;
    INFO  : Completed executing command(queryId=hive_20190309192752_38d2b5a0-bf12-46b2-9a55-62e06beb9fb9); Time taken: 0.0 seconds
    +------------------+--------------------+
    | test_table.year  | test_table.winner  |
    +------------------+--------------------+
    | 2002             | Brazil             |
    | 2018             | France             |
    | 2014             | Germany            |
    | 2010             | Spain              |
    | 2006             | Italy              |
    +------------------+--------------------+

Workflow #2: View HDFS directory structure of Compute clusters

In a Virtual Private Cluster environment, all Compute clusters have a local HDFS. But, crucial directories (JHS for spark, logging for yarn) for compute services are created in the Base cluster HDFS.

One of the design assumptions of Compute clusters is that they would be transient and so the user should still have a way to access important logs after the Compute clusters have been destroyed.

  1. Open the Cloudera Manager Admin Console and view the HDFS hierarchy on the Base cluster HDFS service by opening the File Browser: Cluster 1 > HDFS-1 > File Browser.



    All the logs pertaining to Compute clusters are under the “mc” directory.


    This Base cluster has 2 Compute clusters associated with it, Compute 1 and Compute 2.

    Each Compute cluster (based on its ID) gets a folder under this directory, so folder 2 belongs to Compute 1 and 3 belongs to Compute 2. The ID of the cluster can be identified from the URL used to access the cluster. Click on Compute 1 in the CM Cluster view and inspect the URL.

    http://quasar-wfrgnj-1.vpc.cloudera.com:7180/cmf/clusters/2/status

    The ID is the segment following /clusters in the URL. This is also the subfolder name under the /mc folder.

    This is the directory where all the logs for services in Compute 1 are stored.

  2. Navigate to the file browser of a Compute cluster.



    Note that folder 2 which is dedicated for the Compute 1 cluster is not visible to the Compute 2 cluster.

    Navigating to folders below this hierarchy, you can see the folders created for services present on the Compute 2 cluster.



Workflow #3: Insert data in test_table through Spark

  1. Identify a host to start a spark-shell on the Compute cluster, Compute 2. Open the Cloudera Manager Admin Console and go to Clusters > Compute 2 > SPARK_ON_YARN-1 > Instances.



  2. Open a terminal session host <HiveServer2 Host URL>
  3. Verify access for the user:
    Assign the user starting spark-shell to a Linux group that has create/insert access configured in Sentry. .e. hive.
    [root@vpc_host-nnwznq-1 ~]# usermod -aG hive systest

    The user will also need to be created and added to the group on all the hosts of the Base cluster.

  4. kinit the user:
    [root@vpc_host-nnwznq-1 ~]# kinit -kt /cdep/keytabs/systest.keytab systest
  5. Start spark-shell:
    [root@vpc_host-nnwznq-1 ~]# sudo -u systest spark-shell
    Setting default log level to "WARN".
    Spark context Web UI available at http://vpc_host-nnwznq-1.tut.myco.com:4043
    Spark context available as 'sc' (master = yarn, app id = application_1552095496593_0007).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
          /_/
    scala> 
    
  6. Insert data into the table:
    scala> val insertData = sqlContext.sql("insert into test_data.test_table values (1998, 'France')")
    19/03/10 00:21:26 WARN shims.HdfsUtils: Unable to inherit permissions for file hdfs://ns1/user/hive/warehouse/test_data.db/test_table/part-00000-f0fc5e0d-daa3-4fe2-9ded-d62e1ef45ca9-c000 from file hdfs://ns1/user/hive/warehouse/test_data.db/test_table
    insertData: org.apache.spark.sql.DataFrame = []
    scala> val tableTestData = sqlContext.sql("select * from test_data.test_table")
    tableTestData: org.apache.spark.sql.DataFrame = [year: int, winner: string]
     
    scala> tableTestData.show()
    +----+-------+                                                                  
    |year| winner|
    +----+-------+
    |2002| Brazil|
    |2018| France|
    |2014|Germany|
    |2010|  Spain|
    |2006|  Italy|
    |1998| France|
    +----+-------+
    
  7. Verify and track the queries in the Yarn service application on the Compute cluster:



Workflow #4: Hue in a Virtual Private Cluster Environment

  1. Login into Hue. Open the Cloudera Manager Admin Console and go to Clusters > Compute Clusters > Hue > Hue Web UI > Load Balanced Web UI.

    Note: If the Base cluster uses Sentry, ensure that the logged-in user is a part of a Linux group that has the necessary grants to access the Hive warehouse on the Base cluster.

  2. Select Editor > Hive.

    The tables displa in the left panel.

  3. Run a select query to display data in the test_table previously created:



  4. Insert data into the table:



  5. Change the editor to Impala:

    Go to Hue > Query > Impala.

  6. Run INVALIDATE METADATA query in the Impala editor.
  7. Verify the data added from the Hive editor in the test_table shows up in the Impala editor: