Setting Up Apache Sqoop Using the Command Line

Apache Sqoop 1 is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop 1 to import data from external structured datastores into the Hadoop Distributed File System (HDFS) or related systems such as Hive and HBase. Conversely, you can use Sqoop 1 to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

Upgrading Sqoop 1 from an Earlier CDH 5 release

These instructions assume that you are upgrading Sqoop 1 as part of an upgrade to the latest CDH 5 release, and have already performed the steps under Upgrading from an Earlier CDH 5 Release to the Latest Release.

To upgrade Sqoop 1 from an earlier CDH 5 release, install the new version of Sqoop 1 using the RPM or Debian packages.

Sqoop 1 Prerequisites

Sqoop 1 requires the following:

  • An operating system supported by CDH 5.
  • Oracle JDK.
  • Services that you want to use with Sqoop, such as HBase, Hive HCatalog, and Accumulo. When you run Sqoop, it checks to see if these services are installed and configured. It logs warnings for services it does not find. These warnings, shown below, are harmless. You can suppress these error messages by setting the variables $HBASE_HOME, $HCAT_HOME and $ACCUMULO_HOME to any existing directory.
    > Warning: /usr/lib/sqoop/../hbase does not exist! HBase imports will fail.
    > Please set $HBASE_HOME to the root of your HBase installation.
    > Warning: /usr/lib/sqoop/../hive-hcatalog does not exist! HCatalog jobs will fail.
    > Please set $HCAT_HOME to the root of your HCatalog installation.
    > Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
    > Please set $ACCUMULO_HOME to the root of your Accumulo installation. 

Installing the Sqoop 1 RPM or Debian Packages

The Sqoop 1 packages consist of:

  • sqoop — Complete Sqoop 1 distribution
  • sqoop-metastore — For installation of the Sqoop 1 metastore only

To install Sqoop 1 on a RHEL-compatible system:

$ sudo yum install sqoop

To install Sqoop 1 on an Ubuntu or other Debian system:

$ sudo apt-get install sqoop

To install Sqoop 1 on a SLES system:

$ sudo zypper install sqoop

If you have already configured CDH on your system, there is no further configuration necessary for Sqoop 1. You can start using Sqoop 1 by using commands such as:

$ sqoop help
$ sqoop version
$ sqoop import

Installing the JDBC Drivers for Sqoop 1

Sqoop 1 does not ship with third party JDBC drivers. You must download them separately and save them to the /var/lib/sqoop/ directory on the server. The following sections show how to install the most common JDBC Drivers.

Before you begin:

Make sure the /var/lib/sqoop directory exists and has the correct ownership and permissions:
mkdir -p /var/lib/sqoop
chown sqoop:sqoop /var/lib/sqoop
chmod 755 /var/lib/sqoop

This sets permissions to drwxr-xr-x.

For JDBC drivers for Hive, Impala, Teradata, or Netezza, see the Connectors documentation.

Installing the MySQL JDBC Driver

Download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/5.1.html. You will need to sign up for an account if you do not already have one, and log in, before you can download it. Then copy it to the /var/lib/sqoop/ directory. For example:

$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar /var/lib/sqoop/

Installing the Oracle JDBC Driver

You can download the JDBC Driver from the Oracle website, for example http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html. You must accept the license agreement before you can download the driver. Download the ojdbc6.jar file and copy it to the /var/lib/sqoop/ directory:

$ sudo cp ojdbc6.jar /var/lib/sqoop/

Installing the Microsoft SQL Server JDBC Driver

Download the Microsoft SQL Server JDBC driver from http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774 and copy it to the /var/lib/sqoop/ directory. For example:

$ curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz
$ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop/

Installing the PostgreSQL JDBC Driver

Download the PostgreSQL JDBC driver from http://jdbc.postgresql.org/download.html and copy it to the /var/lib/sqoop/ directory. For example:

$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar
$ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop/

Syntax for Configuring JDBC Connection Strings

These are the JDBC connection strings for supported databases.

MySql Connection String

Syntax:

jdbc:mysql://<HOST>:<PORT>/<DATABASE_NAME>

Example:

jdbc:mysql://my_mysql_server_hostname:3306/my_database_name

Oracle Connection String

Syntax:

jdbc:oracle:thin:@<HOST>:<PORT>:<DATABASE_NAME>

Example:

jdbc:oracle:thin:@my_oracle_server_hostname:1521:my_database_name

PostgreSQL Connection String

Syntax:

jdbc:postgresql://<HOST>:<PORT>/<DATABASE_NAME>

Example:

jdbc:postgresql://my_postgres_server_hostname:5432/my_database_name

Netezza Connection String

Syntax:

jdbc:netezza://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:netezza://my_netezza_server_hostname:5480/my_database_name

Teradata Connection String

Syntax:

jdbc:teradata://<HOST>/DBS_PORT=1025/DATABASE=<DATABASE_NAME>
Example:
jdbc:teradata://my_teradata_server_hostname/DBS_PORT=1025/DATABASE=my_database_name

Setting HADOOP_MAPRED_HOME for Sqoop 1

  • For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop 1 in a YARN installation, make sure that the HADOOP_MAPRED_HOME environment variable is set correctly, as follows:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Viewing the Sqoop 1 Documentation

For additional documentation see the Sqoop user guides.