Chapter 2. Hortonworks Connector for Teradata
Introduction
Hortonworks Connector for Teradata (powered by the Teradata Connector for Hadoop) is an implementation of a Sqoop connector that enables those conversant with the Apache Sqoop tool to transfer data between the Teradata MPP DBMS and Apache Hadoop environments.
Background
Sqoop provides facilities for bulk transfer of data between external data stores and the Hadoop environment using the MapReduce paradigm, and relies on JDBC interfaces to access external databases.
Most databases also provide specialized access methods for high-speed bulk data transfer to support efficient batch processing needs such as backups.
To accommodate these database-specific bulk transfer mechanisms, Sqoop provides extensible base implementations of the data transfer functions built on the JDBC interface that can optionally be enhanced to suit a database-specific method of data transfer.
Terminology
Sqoop has the notion of Connectors, which contain the specialized logic to read and write to external systems.
The Hortonworks Connector for Teradata ("Hortonworks Connector") is a Sqoop Connector implementation for Teradata.
It is built on the Teradata Connector for Hadoop, a Teradata product.
Supported Features
The Hortonworks Connector supports the following features:
Import/Export tools that run Hadoop MR jobs to transfer data.
Support for Text, Sequence, ORCFiles, Avro, and RCFiles as the source for export operations and target for import operations.
Note: If you plan to run Avro jobs, download avro-mapred-1.7.4-hadoop2.jar and place it under $SQOOP_HOME/lib (see the example after this list).
Import table data or query results from Teradata to:
An existing partitioned or non-partitioned Hive table.
A new partitioned or non-partitioned Hive table created by the connector.
An HCatalog table.
Export data from HDFS files, Hive or HCatalog tables to empty or non-empty Teradata tables.
Facilities for mapping schemas between Teradata and Hive/HCatalog, including necessary data type conversions.
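As an example of the Avro note above, the jar can be placed on Sqoop's classpath as follows; the path to the downloaded jar is a placeholder, and the avro-mapred 1.7.4 hadoop2 artifact must be obtained from your usual repository first.

# Copy the Avro MapReduce jar into Sqoop's lib directory so that Avro import/export jobs can find it.
cp /path/to/avro-mapred-1.7.4-hadoop2.jar $SQOOP_HOME/lib/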
Connector Feature Checklist
Import all tables: Supported.
Incremental import: The Sqoop incremental import options are not supported, but incremental imports can be emulated, as shown in the Incremental Import sample invocation.
BLOB and CLOB: Limited to 64 KB.
Import data to Sqoop
TextFormat, delimited: Supported.
SequenceFile: Supported.
RCFile: Supported.
ORCFile: Supported with HDP 2.3.2 or later.
Avro file: Supported.
Hive arguments: Support for all standard Hive arguments. All data types except Union are supported.
Export from / import to HCatalog table: Supported.
Automatic schema mapping to/from HCatalog: Supported.
Import using a query: Supported.
Update table: Not supported.
Compression: Not supported.
Software Versions and Installation
Connector Version
This document discusses the Hortonworks Connector for Teradata ("Hortonworks Connector") built on version 1.5.4 of the Teradata Connector for Hadoop.
Supported Product Versions
This section lists the product versions supported in the current release of the Hortonworks Connector.
HDP Version
HDP 2.6.0 or later
Teradata Database Versions
The following Teradata database versions are supported:
Teradata Database 16.00
Teradata Database 15.10
Teradata Database 15.00
Teradata Database 14.10
Teradata Database 14.00
Hive Version
Hive 1.2.1
Sqoop Versions
Sqoop 1.4.6
Requirements and Dependencies
System Requirements
The Hortonworks Connector requires JRE/JDK 1.7.x or 1.8.x.
Dependencies
Teradata GSS Client Driver 15.00 or later versions (tdgssconfig)
Teradata JDBC Driver 15.00 or later versions (terajdbc)
Teradata Connector for Hadoop 1.5.4
Installation
Installation Dependencies
Sqoop must be installed first.
Installing the Software
Download the tarball from the "Add-Ons" section for the latest version of Hortonworks Data Platform (HDP) at https://hortonworks.com/downloads/.
Extract the contents of the tar archive to $SQOOP_HOME/lib. Sqoop will then distribute the contents of the tar to the necessary nodes.
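A minimal sketch of the extraction step is shown below; the archive name is a placeholder and depends on the HDP version downloaded.

# Extract the connector jars into Sqoop's lib directory (archive name is illustrative).
tar -xzvf hdp-connector-for-teradata-<version>.tar.gz -C $SQOOP_HOME/lib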
Configuration
This section provides information about connection credentials and configuration options.
Database Connection Credentials
Refer to Sqoop documentation for the Teradata database connection credentials.
Configuration Options
The Hortonworks Connector defines many connector-specific options. Many of them are also available as Sqoop options (although not all Sqoop options translate directly to Hortonworks Connector options).
Configuration Option Precedence
Options can be specified using any of these techniques:

a configuration file
the -D command line option
Sqoop options (where applicable): apart from standard Sqoop options, a few connector-specific options are supported

The following precedence is therefore established:

1. Sqoop connector-specific extra arguments have the highest precedence. (Sqoop command line options must match, or execution will fail.)
2. If -D command line options are provided, they override the configuration file values.
3. The value in the configuration file is the default.

As an example, if the configuration file sets the number of input mappers to 4 and the -D command line option (-D teradata.db.input.num.mappers) sets it to 5, but the Sqoop option --num-mappers is set to 6, then the import job will use 6 mappers.

In some cases, option constraints and the relationships between options affect the configuration value used. For example, the import options job.type and file.format are interrelated. These options are described in Connector Import Options.
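The precedence can be seen in a single invocation. The sketch below is illustrative only: the connection details and table name are placeholders, and the property name follows the teradata.db.input. prefix described in the Appendix.

# A configuration file on the classpath is assumed to set teradata.db.input.num.mappers to 4.
# The -D option overrides that value with 5, and the Sqoop option --num-mappers overrides
# both, so this import job runs with 6 mappers.
$SQOOP_HOME/bin/sqoop import \
  -D teradata.db.input.num.mappers=5 \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --connect jdbc:teradata://td-host/Database=dbname \
  --username tduser \
  --password tduserpass \
  --table tablename \
  --target-dir /user/tduser/tablename \
  --num-mappers 6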
Sqoop Options
The Sqoop option --connection-manager must be set as follows to use the Hortonworks Connector for Teradata (see the Sample Invocations):
--connection-manager org.apache.sqoop.teradata.TeradataConnManager
Some Sqoop options are unsupported in the current release of the Hortonworks Connector for Teradata. See the Appendix for a list of unsupported Sqoop options.
Hortonworks Connector Options
The Appendix describes the Hortonworks Connector options, including Connector Import Options and Connector-specific Extra Arguments.
Data Type Support
The Hortonworks Connector data types depend on Teradata database types.
Support for Teradata Data Types
The following Teradata data types are supported:

BIGINT, BYTEINT, INTEGER, SMALLINT, DOUBLE PRECISION, FLOAT, REAL, DECIMAL (n,m), NUMERIC (n,m), CHAR (n), VARCHAR (n), LONG VARCHAR, DATE, TIME (n), TIMESTAMP (n), PERIOD (DATE), PERIOD (TIME (n)), PERIOD (TIMESTAMP (n)), INTERVAL YEAR (n), INTERVAL YEAR (n) TO MONTH, INTERVAL MONTH (n), INTERVAL DAY (n), INTERVAL DAY (n) TO HOUR, INTERVAL DAY (n) TO MINUTE, INTERVAL DAY (n) TO SECOND (m), INTERVAL HOUR (n), INTERVAL HOUR (n) TO MINUTE, INTERVAL HOUR (n) TO SECOND (m), INTERVAL MINUTE (n), INTERVAL MINUTE (n) TO SECOND (m), INTERVAL SECOND (n)

Additional Teradata data types are supported with some limitations; for example, BLOB and CLOB values are limited to 64 KB (see the Connector Feature Checklist).
Support for Hive Data Types
The following Hive data types are supported:

BIGINT, INT, SMALLINT, TINYINT, DOUBLE, FLOAT, STRING, BOOLEAN

Additional Hive data types are supported with some limitations.
Unsupported Data Types
Some Teradata data types are not supported by the connector.

The Hive UNION type is unsupported (as noted in the Connector Feature Checklist, all Hive data types except Union are supported).
Hive and HCatalog Support
Importing from Hive and HCatalog requires that HADOOP_CLASSPATH and LIB_JARS be specified before the sqoop command is run. This shows the environment variable setup:
export HADOOP_CLASSPATH=$(hcat -classpath)
HIVE_HOME=/usr/hdp/current/hive-client
HCAT_HOME=/usr/hdp/current/hive-webhcat
export LIB_JARS=$HCAT_HOME/share/hcatalog/hcatalog-core-<version>.jar,\
$HIVE_HOME/lib/hive-metastore-<version>.jar,\
$HIVE_HOME/lib/libthrift-<version>.jar,\
$HIVE_HOME/lib/hive-exec-<version>.jar,\
$HIVE_HOME/lib/libfb303-<version>.jar,\
$HIVE_HOME/lib/jdo2-api-<version>.jar,\
$HIVE_HOME/lib/slf4j-api-<version>.jar,\
$HIVE_HOME/lib/hive-cli-<version>.jar
Note: Change the HIVE_HOME and HCAT_HOME variables as needed, and change the jar versions to match those available under the directories mentioned.
Hive and HCatalog jobs can be run as shown in the next section.
Sample Invocations
The following examples assume that the SQOOP_HOME environment variable is set to the base directory of the Sqoop installation.
Important: When importing data from Teradata Connector for Hadoop version 1.5.1 or later into Hive in a non-TDE cluster, the user must be in the same group as the group with access to the Hive table warehouse directory.
Import Data from Teradata to Hadoop and Hive
$SQOOP_HOME/bin/sqoop import \
-libjars $LIB_JARS \
--connect jdbc:teradata://td-host/Database=dbname \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username tduser \
--password tduserpass \
--table tablename \
--hcatalog-table hcat_table
Import Data from Teradata into an HCatalog Table
$SQOOP_HOME/bin/sqoop import \
-libjars $LIB_JARS \
--connect jdbc:teradata://td-host/Database=dbname \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username tduser \
--password tduserpass \
--table tablename \
--hcatalog-table hcat_table
Incremental Import
Teradata incremental import emulates the Sqoop --check-column and --last-value options. The following example is for a table that has 'hire_date' as the date column to check against and 'name' as the column that can be used to partition the data.
export USER=dbc
export PASS=dbc
export HOST=<dbhost>
export DB=<dbuser>
export TABLE=<dbtable>
export JDBCURL=jdbc:teradata://$HOST/DATABASE=$DB
export IMPORT_DIR=<hdfs-dir to import>
export VERBOSE=--verbose
export MANAGER=org.apache.sqoop.teradata.TeradataConnManager
export CONN_MANAGER="--connection-manager $MANAGER"
export CONNECT="--connect $JDBCURL"
MAPPERS="--num-mappers 4"
DATE="'1990-12-31'"
FORMAT="'yyyy-mm-dd'"
LASTDATE="cast( $DATE as date format $FORMAT)"
SQOOPQUERY="select * from employees where hire_date < $LASTDATE AND \$CONDITIONS"
# TDQUERY, TDSPLITBY, and INPUTMETHOD are optional and left unset here.
$SQOOP_HOME/bin/sqoop import $TDQUERY $TDSPLITBY $INPUTMETHOD $VERBOSE $CONN_MANAGER $CONNECT \
--query "$SQOOPQUERY" --username $USER --password $PASS --target-dir $IMPORT_DIR --split-by name
Export Data to Teradata
$SQOOP_HOME/bin/sqoop export \
--connect jdbc:teradata://172.16.68.128/Database=employees \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username dbc \
--password dbc \
--table employees2 \
--export-dir /user/hrt_qa/test-sqoop/out \
--batch
Troubleshooting
Hive Import Error
Issue: Received the error message, "Failed with exception org.apache.hadoop.security.AccessControlException: User does not belong to <group-name>".
Cause: When a table is created in Hive, the Hive table directory inherits the warehouse permissions, including group ownership. To access the Hive table directory, a user must belong to a group with permission to access that directory. If a user tries to load data to a Hive directory without having proper permissions, the job fails with an error.
Workaround: When importing data from Teradata Connector for Hadoop version 1.5.1 or later into Hive in a non-TDE cluster, the user must be in the same group as the group with access to the Hive table warehouse directory.
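The group membership can be verified before running the job. The warehouse path below is the usual HDP default and may differ in your cluster; the user name is a placeholder.

# Show the group that owns the Hive warehouse directory (path is an assumption; adjust for your cluster).
hdfs dfs -ls -d /apps/hive/warehouse
# Confirm that the importing user is a member of that group.
id tduser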
Stage Tables
Issue: The export option --staging-table does not work.

Cause: The behavior of stage tables differs between the Hortonworks Connector and Sqoop, and this causes deadlocks during job cleanup if the Sqoop --staging-table option is used.

Workaround: Use the Hortonworks Connector option teradata.db.output.stage.table.name to specify the stage table name.
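A minimal sketch of the workaround follows; the stage table name, connection details, and paths are placeholders.

$SQOOP_HOME/bin/sqoop export \
  -D teradata.db.output.stage.table.name=employees2_stage \
  --connect jdbc:teradata://td-host/Database=dbname \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --username dbc \
  --password dbc \
  --table employees2 \
  --export-dir /user/hrt_qa/test-sqoop/out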
Fastload
Issue: The export option fastload.socket.host does not work.

Cause: The internal.fastload method used for Teradata exports can cause resource exhaustion (running out of database AMPs) if the number of reducers exceeds the number of available AMPs.

Workaround: Use the option teradata.db.output.num.reducers to restrict the resource usage.
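A minimal sketch of the workaround follows; the reducer count is illustrative and should not exceed the number of AMPs available on the target Teradata system, and the connection details and paths are placeholders.

$SQOOP_HOME/bin/sqoop export \
  -D teradata.db.output.num.reducers=2 \
  --connect jdbc:teradata://td-host/Database=dbname \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --username dbc \
  --password dbc \
  --table employees2 \
  --export-dir /user/hrt_qa/test-sqoop/out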
Appendix: Configuration Options
This appendix describes the Hortonworks Connector configuration options and lists the Sqoop options that are currently unsupported.
Sqoop Options
To use the Hortonworks Connector, you must set the Sqoop option --connection-manager to org.apache.sqoop.teradata.TeradataConnManager, as shown in the Sample Invocations.

Some Sqoop options are unsupported in the current release of the Hortonworks Connector for Teradata. The tables below list the unsupported import and export options.
Note: Imports and exports are defined from the Hadoop perspective; that is, an import brings data into Hadoop from the database, and an export moves data out of Hadoop into the database.
Unsupported Sqoop Import Options
Import Category | Unsupported Options
---|---
Control Options |
Incremental Options |
Output Formatting Options |
Hive Support Options |
HBase Support Options |
Data Mapping Options |
Unsupported Sqoop Export Options
Export Category | Unsupported Options
---|---
Control Options |
Input Parsing Options |
Data Mapping Options |
Hortonworks Connector Options
This section describes configuration options provided by the Hortonworks Connector.
For information about how the options can be specified, see Configuration Option Precedence.
Note: Imports and exports are defined from the Hadoop perspective; that is, an import brings data into Hadoop from the database, and an export moves data out of Hadoop into the database.
Connector Import Options
All option names below are prefixed by "teradata.db.input." when specified in the configuration files or in the -D command line option. For example, the job.type option is specified as teradata.db.input.job.type.
Connector Import Option (teradata.db.input.*) | Description | Overriding Sqoop Option
---|---|---
job.type | The type of import job. Required: no. Supported values: hcat, hive, hdfs. Default value: hdfs. | None for the 'hcat' and 'hive' settings; also none for 'hdfs' when the file format is 'textfile'. For file formats other than 'textfile', the 'hdfs' job type is reset to 'hive', and the corresponding Sqoop option therefore overrides a job.type of 'hdfs'.
file.format | The format of the to-be-imported data file in HDFS. An 'hcat' or 'hive' job type supports the 'rcfile', 'sequencefile', and 'textfile' file formats; an 'hdfs' job type supports only the 'textfile' format. Required: no. Supported values: orcfile, rcfile, sequencefile, textfile. Default value: textfile. |
target.paths | The directory in which to place the imported data. It is required for an 'hdfs' job, optional for a 'hive' job, and not valid for an 'hcat' job. For a 'hive' job, specify either this or the 'target.table' parameter, but not both. Required: no. Supported values: string. Default value: the value of the property 'mapred.output.dir'. |
num.mappers | The number of mappers for the import job. It is also the number of splits the Hortonworks Connector will attempt to create. Required: no. Supported values: an integer greater than 0. Default value: 2. |
source.query | The SQL query used to select data from the Teradata database; specify either this or the 'source.table' parameter, but not both. Required: no. Supported values: a SELECT SQL query supported by the Teradata database. |
source.table | The name of the source table in the Teradata system from which the data is imported. Specify either this or the 'source.query' parameter, but not both. Required: no. Supported values: string. |
source.field.names | The names of columns to import from the source table in the Teradata system, in comma-separated format. The order of the source field names must exactly match the order of the target field names for schema mapping. This parameter must be present when the 'target.field.names' parameter is specified. If not specified, all columns from the source table are retrieved. Required: no. Supported values: string. |
target.table | The name of the target table in Hive or HCatalog. It is required for an 'hcat' job, optional for a 'hive' job, and not valid for an 'hdfs' job. For a 'hive' job, specify either this parameter or the 'target.paths' parameter, but not both. Required: no. Supported values: string. |
target.field.names | The names of fields to write to the target file in HDFS, or to the target Hive or HCatalog table, in comma-separated format. The order of the target field names must exactly match the order of the source field names for schema mapping. This parameter must be provided when the 'source.field.names' parameter is specified. Required: no. Supported values: string. | Driven by the imported columns.
batch.size | The number of rows the Hortonworks Connector fetches each time from the Teradata system, up to a 1 MB buffer size limit. Required: no. Supported values: an integer greater than 0. Default value: 10000. |
separator | The field separator to use in the imported files. This parameter is only applicable with the 'textfile' file format. Required: no. Supported values: string. Default value: \t |
split.by.column | The name of a table column to be used for splitting import tasks. It is optional with the 'split.by.hash' and 'split.by.value' methods, and not valid with the 'split.by.partition' method. If this parameter is not specified, the first column of the table's primary key or primary index is used. Required: no. Supported values: a valid table column name. |
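As an illustration of combining these properties, the sketch below adjusts the fetch batch size and field separator for a text-format import; the connection details, table name, and target directory are placeholders.

$SQOOP_HOME/bin/sqoop import \
  -D teradata.db.input.batch.size=5000 \
  -D teradata.db.input.separator='|' \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --connect jdbc:teradata://td-host/Database=dbname \
  --username tduser \
  --password tduserpass \
  --table tablename \
  --target-dir /user/tduser/tablename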
Connector Export Options
All option names below are prefixed by "teradata.db.output." when specified in the configuration files or in the -D command line option. For example, target.table is specified as teradata.db.output.target.table.
Connector Export Option (teradata.db.output.*) | Description | Overriding Sqoop Option
---|---|---
target.table | The name of the target table in the Teradata system. Required: yes. Supported values: string. |
source.paths | The directory of the to-be-exported source files in HDFS. It is required for an 'hdfs' job, optional for a 'hive' job, and not valid for an 'hcat' job. For a 'hive' job, specify either this or the 'source.table' parameter, but not both. Required: no. Supported values: string. |
num.mappers | The maximum number of output mapper tasks. If the value is zero, the number of mappers is the same as the number of file blocks in HDFS. Use either this parameter or 'num.reducers', but not both. Required: no. Supported values: an integer greater than or equal to zero. Default value: 2. |
target.field.names | The names of fields to export to the target table in the Teradata system, in comma-separated format. The order of the target field names must match the order of the source field names for schema mapping. This parameter must be provided when the 'source.field.names' parameter is specified. Required: no. Supported values: string. |
separator | The separator of fields in the to-be-exported source files. This parameter is only valid with the 'textfile' file format. Required: no. Supported values: string. Default value: \t |
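Similarly, the export-side properties can be set with -D. The sketch below is illustrative only: the mapper count and separator are example values, and the connection details and paths are placeholders.

$SQOOP_HOME/bin/sqoop export \
  -D teradata.db.output.num.mappers=4 \
  -D teradata.db.output.separator='|' \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --connect jdbc:teradata://td-host/Database=dbname \
  --username dbc \
  --password dbc \
  --table employees2 \
  --export-dir /user/hrt_qa/test-sqoop/out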
Connector-specific Extra Arguments
The Hortonworks connector for Teradata has the following connector-specific extra arguments:
Type of Argument | Argument | Description
---|---|---
Common Options | jobtype | The job type: hdfs, hive, or hcat.
Common Options | fileformat | File format: sequencefile, textfile, avrofile, orcfile, or rcfile. Default is textfile.
Common Options | usexview | Use X views for metadata queries. (X views take security into consideration.)
Common Options | stagedatabase | Database to use for creating stage tables.
Common Options | stagetablename | Stage table name to use; if blank, a default name is generated.
Common Options | batchsize | Fetch size or insert batch size.
Common Options | queryband | Query band for the session.
Import-specific Options | numpartitionsinstaging | Number of partitions to be created in the staging table.
Import-specific Options | method | One of split.by.value, split.by.hash, split.by.partition, or split.by.amp.
Import-specific Options | accesslock | Row lock is used for fetching rows.
Import-specific Options | avroschemafile | Avro schema file for Avro imports.
Import-specific Options | targettableschema | Schema for the target table. Needed when the Hive table is to be created.
Import-specific Options | targetpartitionschema | Schema for the partitioning columns. Needed when the Hive table is to be created.
Import-specific Options | targetfieldnames | Field names for the target fields. Needed when the Hive table is to be created.
Export Options | sourcetableschema | Schema for the source Hive table.
Export Options | sourcepartitionschema | Schema for the partitioning columns.
Export Options | sourcefieldnames | Field names for the source fields to export.
Export Options | fastloadsockethost | Host for Fastload exports.
Export Options | fastloadsocketport | Port for Fastload exports.
Export Options | fastloadsockettimeout | Timeout for the Fastload export operation.
Export Options | errortablename | Error table name for use with Fastload.
Export Options | keepstagetable | Keep the stage table after export. (If not present, the stage table is dropped after export.)
Export Options | forcestage | Force creation of a stage table.
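Connector-specific extra arguments follow Sqoop's usual convention of being passed after a bare -- separator at the end of the command line. The sketch below is illustrative only: the connection details are placeholders and the split method shown is just an example value.

$SQOOP_HOME/bin/sqoop import \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --connect jdbc:teradata://td-host/Database=dbname \
  --username tduser \
  --password tduserpass \
  --table tablename \
  --target-dir /user/tduser/tablename \
  -- --method split.by.amp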