You create a single Sqoop import command that imports data from diverse data sources, such as a relational database, into Hive using Apache Sqoop.
You enter the Sqoop import command on the command line of your Hive cluster to import data from a data source into HDFS and Hive. The import command can include the following information (a combined example follows this list):
- Database connection information: database URI, database name, and connection protocol, such as jdbc:mysql:
- The data to import
- Parallel processing directives for performant data transfer
- Destination for imported data
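For example, a single command that combines all four of these elements might look like the following. The host, database, user, and table names here (db.example.com, sales, sqoop_user, ORDERS, salesdb.orders) are placeholders, not values from your environment:
sqoop import \
--connect jdbc:mysql://db.example.com:3306/sales \
--username sqoop_user \
-P \
--table ORDERS \
--num-mappers 4 \
--hive-import \
--hive-table salesdb.orders
The --connect, --username, and -P options supply the connection information, --table names the data to import, --num-mappers sets the write parallelism, and --hive-import with --hive-table sets the Hive destination. Each of these parts is described in the steps that follow.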
- Create an import command that specifies the Sqoop connection to the RDBMS.
- To enter a password for the data source on the command line, use the -P option in the connection string.
- To specify a file where the password is stored, use the --password-file option (see the note on creating the file after the examples below).
Password on command line:
sqoop import --connect jdbc:mysql://db.foo.com:3306/bar \
<data to import> \
--username <username> \
-P
Specify password file:
sqoop import --connect jdbc:mysql://db.foo.com:3306/bar \
--table EMPLOYEES \
--username <username> \
--password-file ${user.home}/.password
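Sqoop reads the entire contents of the password file as the password, so the file should not contain a trailing newline and should be readable only by you. A minimal sketch for creating it on the local file system (the path and password value are placeholders):
echo -n '<password>' > $HOME/.password
chmod 400 $HOME/.password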
Note: Sqoop is tested to work with Connector/J 5.1. If you have upgraded to Connector/J 8.0 and want to use the zeroDateTimeBehavior property to handle values of '0000-00-00' in DATE columns, explicitly specify zeroDateTimeBehavior=CONVERT_TO_NULL in the connection string. For example, jdbc:mysql://<MySQL host>/<DB>?zeroDateTimeBehavior=CONVERT_TO_NULL
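For example, using the same placeholder database as above, the property is appended to the JDBC URL in the --connect argument (the quotes keep the shell from interpreting the URL):
sqoop import \
--connect "jdbc:mysql://db.foo.com:3306/bar?zeroDateTimeBehavior=CONVERT_TO_NULL" \
--table EMPLOYEES \
--username <username> \
-P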
- Specify the data to import in the command.
- Import an entire table.
- Import a subset of the columns.
- Import a subset of rows with a WHERE clause, or use a free-form query (a free-form query sketch follows the examples below).
Entire table:
sqoop import \
--connect jdbc:mysql://db.foo.com:3306/bar \
--table EMPLOYEES
Subset of columns:
sqoop import \
--connect jdbc:mysql://db.foo.com:3306/bar \
--table EMPLOYEES \
--columns "employee_id,first_name,last_name,job_title"
Subset of rows, selected with a WHERE clause to import the latest data:
sqoop import \
--connect jdbc:mysql://db.foo.com:3306/bar \
--table EMPLOYEES \
--where "start_date > '2018-01-01'"
- Optionally, specify write parallelism in the import statement to execute a number of map tasks in parallel, as shown in the example after this list:
- Set mappers: If the source table has a primary key, explicitly set the number of mappers using --num-mappers.
- Split by: If primary keys are not evenly distributed, provide a split key using --split-by.
- Sequential: If you do not have a primary key or split key, import the data sequentially using --num-mappers 1 or --autoreset-to-one-mapper in the query.
- Setting mappers evenly splits the primary key range of the source table.
- Split by evenly splits the data using the split key instead of a primary key.
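For example, assuming the EMPLOYEES primary key is unevenly distributed and a more uniformly distributed dept_id column exists (a hypothetical column for this sketch), the import can be split across eight mappers on that column:
sqoop import \
--connect jdbc:mysql://db.foo.com:3306/bar \
--table EMPLOYEES \
--num-mappers 8 \
--split-by dept_id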
- Import the data into Hive using Hive default delimiters by specifying the import option --hive-import.
- Specify the Hive destination of the data.
- If the table does not already exist in Hive, name the table using --hive-table <db>.<table_name> and use the --create-hive-table option.
- If you want to insert the imported data into an existing Hive external table, name the table using --hive-table <db>.<table_name>. Do not use the --create-hive-table option.
This command imports the MySQL EMPLOYEES table to a new Hive table stored under the default Hive warehouse location in HDFS, /user/hive/warehouse.
sqoop import --connect jdbc:mysql://db.foo.com:3306/corp \
--table EMPLOYEES \
--hive-import \
--create-hive-table \
--hive-table mydb.newtable
This command imports the MySQL EMPLOYEES table into an existing Hive external table in HDFS.
sqoop import --connect jdbc:mysql://db.foo.com:3306/corp \
--table EMPLOYEES \
--hive-import \
--hive-table mydb.myexternaltable
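After either import completes, one way to confirm the result is to query the table from Beeline; the JDBC URL below is a placeholder for your HiveServer2 endpoint, and the table name matches the first example above:
beeline -u jdbc:hive2://<hiveserver2-host>:10000 \
-e "SELECT * FROM mydb.newtable LIMIT 10;"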