Setting Up Apache Sqoop 2 Using the Command Line

Sqoop 2 is a server-based tool designed to transfer data between Hadoop and relational databases. You can use Sqoop 2 to import data from a relational database management system (RDBMS), such as MySQL or Oracle, into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export it back into an RDBMS.

Sqoop 2 has three packaging options for installation:

  • Tarball (.tgz) that contains both the Sqoop 2 server and the client.
  • Separate RPM packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
  • Separate Debian packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)

Feature Differences - Sqoop 1 and Sqoop 2

Feature Sqoop 1 Sqoop 2
Connectors for all major RDBMS Supported.

Not supported.

Workaround: Use the generic JDBC Connector which has been tested on the following databases: Microsoft SQL Server, PostgreSQL, MySQL and Oracle.

This connector should work on any other JDBC compliant database. However, performance might not be comparable to that of specialized connectors in Sqoop.

Kerberos Security Integration Supported.

Supported.

Data transfer from RDBMS to Hive or HBase Supported.

Not supported.

Workaround: Follow this two-step approach.
  1. Import data from RDBMS into HDFS
  2. Load data into Hive or HBase manually using appropriate tools and commands such as the LOAD DATA statement in Hive
Data transfer from Hive or HBase to RDBMS Not supported.
Workaround: Follow this two-step approach.
  1. Extract data from Hive or HBase into HDFS (either as a text or Avro file)
  2. Use Sqoop to export output of previous step to RDBMS

Not supported.

Follow the same workaround as for Sqoop 1.

Sqoop 2 Prerequisites

  • An operating system supported by CDH 5.
  • Oracle JDK.
  • Hadoop must be installed on the host that runs the Sqoop 2 server component.

  • Services that you want to use with Sqoop, such as HBase, Hive HCatalog, and Accumulo. Sqoop checks for these services when you run it, and finds services that are installed and configured. It logs warnings for services it does not find. These warnings, shown below, are harmless.
    > Warning: /usr/lib/sqoop/../hbase does not exist! HBase imports will fail.
    > Please set $HBASE_HOME to the root of your HBase installation.
    > Warning: /usr/lib/sqoop/../hive-hcatalog does not exist! HCatalog jobs will fail.
    > Please set $HCAT_HOME to the root of your HCatalog installation.
    > Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
    > Please set $ACCUMULO_HOME to the root of your Accumulo installation. 

Installing Sqoop 2

Sqoop 2 is distributed as two separate packages: a client package (sqoop2-client) and a server package (sqoop2-server). Install the server package on one host in the cluster; because the Sqoop 2 server acts as a MapReduce client, this host must have Hadoop installed and configured.

Install the client package on each host that acts as a client. A Sqoop 2 client always connects to the Sqoop 2 server to perform any actions, so Hadoop does not need to be installed on the client hosts.

Depending on what you are planning to install, choose the appropriate package and install it using your preferred package manager application.

To install the Sqoop 2 server package on a RHEL-compatible system:

$ sudo yum install sqoop2-server

To install the Sqoop 2 client package on a RHEL-compatible system:

$ sudo yum install sqoop2-client

To install the Sqoop 2 server package on a SLES system:

$ sudo zypper install sqoop2-server

To install the Sqoop 2 client package on a SLES system:

$ sudo zypper install sqoop2-client

To install the Sqoop 2 server package on an Ubuntu or Debian system:

$ sudo apt-get install sqoop2-server

To install the Sqoop 2 client package on an Ubuntu or Debian system:

$ sudo apt-get install sqoop2-client

Configuring Sqoop 2

This section explains how to configure the Sqoop 2 server.

Configuring which Hadoop Version to Use

The Sqoop 2 client does not interact directly with Hadoop MapReduce, and so it does not require any MapReduce configuration.

The Sqoop 2 server can work with either MRv1 or YARN. It cannot work with both simultaneously.You set the MapReduce version the Sqoop 2 server works with by means of the alternatives command (or update-alternatives, depending on your operating system):
  • To use YARN:
    alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.dist
  • To use MRv1:
    alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.mr1

Configuring Sqoop 2 to Use PostgreSQL instead of Apache Derby

Deciding which Database to Use

Sqoop 2 has a built-in Derby database, but Cloudera recommends that you use a PostgreSQL database instead, for the following reasons:
  • Derby runs in embedded mode and it is not possible to monitor its health.
  • Though it might be possible, Cloudera currently has no live backup strategy for the embedded Derby database.
  • Under load, Cloudera has observed locks and rollbacks with the embedded Derby database that do not happen with server-based databases.
See CDH and Cloudera Manager Supported Databases for tested database versions.
Install PostgreSQL 8.4.x or 9.0.x

See Install and Configure PostgreSQL for Cloudera Software.

Create the Sqoop User and Sqoop Database

For example, using the PostgreSQL psql command-line tool:

$ psql -U postgres
Password for user postgres: *****

postgres=# CREATE ROLE sqoop LOGIN ENCRYPTED PASSWORD 'sqoop'
 NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
CREATE ROLE

postgres=# CREATE DATABASE "sqoop" WITH OWNER = sqoop
 ENCODING = 'UTF8'
 TABLESPACE = pg_default
 LC_COLLATE = 'en_US.UTF8'
 LC_CTYPE = 'en_US.UTF8'
 CONNECTION LIMIT = -1;
CREATE DATABASE

postgres=# \q
Stop the Sqoop 2 Server
$ sudo /sbin/service sqoop2-server stop
Configure Sqoop 2 to use PostgreSQL

Edit the sqoop.properties file (normally /etc/sqoop2/conf) as follows:

org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.postgresql.PostgresqlRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=isolation level
org.apache.sqoop.repository.jdbc.maximum.connections=max connections
org.apache.sqoop.repository.jdbc.url=jdbc URL
org.apache.sqoop.repository.jdbc.driver=org.postgresql.Driver
org.apache.sqoop.repository.jdbc.user=username
org.apache.sqoop.repository.jdbc.password=password
org.apache.sqoop.repository.jdbc.properties.property=value
Restart the Sqoop 2 Server
$ sudo /sbin/service sqoop2-server start

Installing the JDBC Drivers

Sqoop 2 does not ship with third party JDBC drivers. You must download them separately and save them to the /var/lib/sqoop2/ directory on the server. The following sections show how to install the most common JDBC drivers. Once you have installed the JDBC drivers, restart the Sqoop 2 server so that the drivers are loaded.

Installing the MySQL JDBC Driver

Download the MySQL JDBC driver here. You must sign up for an account if you do not already have one, then log in before you can download the driver. Copy it to the /var/lib/sqoop2/ directory. For example:

$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar /var/lib/sqoop2/
At the time of publication, version was 5.1.31, but the version might change by the time you read this.

Installing the Oracle JDBC Driver

You can download the JDBC Driver from the Oracle website, for example here. You must accept the license agreement before you can download the driver. Download the ojdbc6.jar file and copy it to /var/lib/sqoop2/ directory:

$ sudo cp ojdbc6.jar /var/lib/sqoop2/

Installing the Microsoft SQL Server JDBC Driver

Download the Microsoft SQL Server JDBC driver here and copy it to the /var/lib/sqoop2/ directory. For example:

$ curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz
$ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop2/

Installing the PostgreSQL JDBC Driver

Download the PostgreSQL JDBC driver here and copy it to the /var/lib/sqoop2/ directory. For example:

$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar
$ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop2/

Syntax for Configuring JDBC Connection Strings

These are the JDBC connection strings for supported databases.

MySql Connection String

Syntax:

jdbc:mysql://<HOST>:<PORT>/<DATABASE_NAME>

Example:

jdbc:mysql://my_mysql_server_hostname:3306/my_database_name
Oracle Connection String

Syntax:

jdbc:oracle:thin:@<HOST>:<PORT>:<DATABASE_NAME>

Example:

jdbc:oracle:thin:@my_oracle_server_hostname:1521:my_database_name
PostgreSQL Connection String

Syntax:

jdbc:postgresql://<HOST>:<PORT>/<DATABASE_NAME>

Example:

jdbc:postgresql://my_postgres_server_hostname:5432/my_database_name
Netezza Connection String

Syntax:

jdbc:netezza://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:netezza://my_netezza_server_hostname:5480/my_database_name
Teradata Connection String

Syntax:

jdbc:teradata://<HOST>/DBS_PORT=1025/DATABASE=<DATABASE_NAME>
Example:
jdbc:teradata://my_teradata_server_hostname/DBS_PORT=1025/DATABASE=my_database_name

Starting, Stopping, and Accessing the Sqoop 2 Server

Starting the Sqoop 2 Server

After you have completed all of the required configuration steps, you can start Sqoop 2 server:

$ sudo /sbin/service sqoop2-server start

Stopping the Sqoop 2 Server

$ sudo /sbin/service sqoop2-server stop

Checking that the Sqoop 2 Server has Started

You can verify whether the server has started correctly by connecting to its HTTP interface. The simplest way is to get the server version using following command:

$ wget -qO - localhost:12000/sqoop/version

You should get a text fragment in JSON format similar to the following:

{"version":"1.99.2-cdh5.0.0",...}

Accessing the Sqoop 2 Server with the Sqoop 2 Client

Start the Sqoop 2 client:

sqoop2

Identify the host where your server is running (we will use localhost in this example):

sqoop:000> set server --host localhost

Test the connection by running the command show version --all to obtain the version number from server. You should see output similar to the following:

sqoop:000> show version --all
server version:
  Sqoop 1.99.2-cdh5.0.0 revision ...
  Compiled by jenkins on ...
client version:
  Sqoop 1.99.2-cdh5.0.0 revision ...
  Compiled by jenkins on ...
Protocol version:
  [1]

Viewing the Sqoop 2 Documentation

For more information about Sqoop 2, see Highlights of Sqoop 2 and https://archive.cloudera.com/cdh5/cdh/5/sqoop2.