Importing Data into Hive with Sqoop Through HiveServer2

Importing Data Through HiveServer2

In addition to importing with the Hive CLI, Sqoop also supports importing into Hive through HiveServer2.

There are three HiveServer2-specific command options that the user can define for the sqoop import tool when importing data:
  • --hs2-url: The JDBC connection string to HiveServer2 as one would specify it for Beeline.
  • --hs2-user: Specifies the user for creating the JDBC connection to HiveServer2. If a user is not specified, the current OS user is used by default.
  • --hs2-keytab: The path to the keytab file of the user connecting to HiveServer2. If the --hs2-user option is specified, then the --hs2-keytab option has to be specified as well; otherwise, it can be omitted. The keytab has to be available on the machine where the Sqoop command is executed.
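
Because the --hs2-url value is the same JDBC connection string that Beeline uses, one way to validate it before running the import is to connect with Beeline directly. This is only a sketch; the host, principal, and truststore values are the placeholders used in the Steps section below and must be replaced with site-specific values:

  beeline -u "jdbc:hive2://hs2host:10000/default;principal=hive/hs2host@DOMAIN.COM;ssl=true;sslTrustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks;trustStorePassword=password" -e "SHOW DATABASES;"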

HiveServer2 imports can be initiated with the --hs2-url option. When the user specifies the --hs2-url option, commands are sent to HiveServer2 through a JDBC connection. The data itself is not transferred via the JDBC connection; it is written directly to HDFS and moved into the Hive warehouse using the LOAD DATA INPATH statement, just as with the default Hive import. When the --hs2-url option is not specified, Sqoop imports the data into Hive using the Hive CLI, which is the default method. For more information about the default import method, see the upstream documentation.
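
For illustration only, the statement that Sqoop issues over the JDBC connection after writing the data to HDFS is roughly equivalent to the following HiveQL; the HDFS path and table name are the placeholders used in the Steps section below:

  LOAD DATA INPATH '/user/username/importdir/employees_test' INTO TABLE employees_test;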

HiveServer2 provides proper Sentry authorization. As a result, Cloudera recommends importing data into Hive through HiveServer2 instead of the default method. Currently, Sqoop can authenticate to HiveServer2 using Kerberos only.
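
Because Kerberos is the only supported authentication mechanism, the user running the Sqoop command needs a valid Kerberos ticket unless the --hs2-user and --hs2-keytab options are supplied. A minimal sketch of obtaining a ticket from a keytab before running an import with the current OS user; the principal name is an assumption and the keytab path is the placeholder used in the examples below:

  # Obtain a Kerberos ticket from the keytab (principal name is an assumption)
  kinit -kt /path/to/sqooptestkeytab username@DOMAIN.COM
  # Verify that a valid ticket exists
  klist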

Importing Data

Prerequisites

Before importing data, make sure that the following prerequisites are satisfied:

  • A properly configured user with permissions to execute CREATE TABLE and LOAD DATA INPATH statements in Hive (a sketch of granting these privileges is shown after this list).
  • Default ACLs defined for the temporary import folder so that the new folder, when created, inherits the ACLs of the parent.
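
A minimal sketch of granting the required Hive privileges through Sentry in Beeline is shown below. The role name sqoop_import_role, the group username_group, the default database, and the NameNode URI are assumptions and must be adapted to the environment; in particular, LOAD DATA INPATH also requires a URI privilege on the temporary import folder:

  CREATE ROLE sqoop_import_role;
  GRANT ALL ON DATABASE default TO ROLE sqoop_import_role;
  GRANT ALL ON URI 'hdfs://namenode:8020/user/username/importdir' TO ROLE sqoop_import_role;
  GRANT ROLE sqoop_import_role TO GROUP username_group;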

Steps

  1. Create a temporary import folder with read, write, and execute permissions for the Hive user. For example:
    hdfs dfs -mkdir /user/username/importdir
    hdfs dfs -setfacl -m default:user:hive:rwx /user/username/importdir
    

    The LOAD DATA INPATH statement is executed by the Hive superuser; therefore, the temporary HDFS folder that Sqoop imports into has to have read, write, and execute permissions for the Hive user as well.
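
    Before running the import, the ACLs on the folder can be checked to confirm that the default ACL for the Hive user is in place (the path is the one created above):

      hdfs dfs -getfacl /user/username/importdir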

  2. Execute a Hive import. Use either of the following methods:
    1. Execute a Hive import with the current OS user:
      sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table "employees_test" --target-dir "/user/username/importdir/employees_test" --hive-import --hs2-url "jdbc:hive2://hs2host:10000/default;principal=hive/hs2host@DOMAIN.COM;ssl=true;sslTrustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks;trustStorePassword=password"
    2. Execute a Hive import with the username and keytab specified:
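
    After the import completes, the result can be verified through the same HiveServer2 endpoint, for example with Beeline; the connection string and table name below are the placeholders from the examples above:

      beeline -u "jdbc:hive2://hs2host:10000/default;principal=hive/hs2host@DOMAIN.COM;ssl=true;sslTrustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks;trustStorePassword=password" -e "SELECT COUNT(*) FROM employees_test;"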
      sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table "employees_test" --target-dir "/user/username/importdir/employees_test" --delete-target-dir --hive-import --hs2-url "jdbc:hive2://hs2host:10000/default;principal=hive/hs2host@DOMAIN.COM;ssl=true;sslTrustStore=/etc/cdep-ssl-conf/CA_STANDARD/truststore.jks;trustStorePassword=password" --hs2-user username --hs2-keytab "/path/to/sqooptestkeytab"