Using the Cloudera Connector Powered by Teradata
Connection string format and key commands
After you have installed the connector and copied the JDBC drivers for Teradata to the lib directory of the Sqoop installation, use this connector by invoking Sqoop tools with the appropriate connection string.
The connection string format is jdbc:teradata://<td-host>/DATABASE=<td-instance>, where:
- <td-host> is the hostname of the machine on which the Teradata server runs.
- <td-instance> is the Teradata database instance name.
For example, the following command invokes the Sqoop import tool with three mappers:
$ sqoop import --connect jdbc:teradata://localhost/DATABASE=sqooptest \
  --username sqooptest --password xxxxx --table MY_TABLE --num-mappers 3 \
  --target-dir /user/sqooptest/MY_TABLE
The following command invokes the Sqoop export tool with three mappers:
$ sqoop export --connect jdbc:teradata://localhost/DATABASE=sqooptest \
  --username sqooptest --password xxxxx --export-dir /user/sqooptest/MY_TABLE \
  --table MY_TABLE_TARGET --num-mappers 3
You can control the behavior of the connector by using extra arguments. Extra arguments must appear at the end of the command. Use a double-dash separator (--) to indicate the end of the standard arguments and the beginning of the extra arguments. For example, the following command uses the double-dash (--) separator to separate the standard arguments --table and --num-mappers from the extra arguments --input-method and --query-band:
$ sqoop ... --table MY_TABLE --num-mappers 3 -- --input-method split.by.amp --query-band DC=BP\;Location=Europe\;
Parameter | Tool | Description |
---|---|---|
--staging-table | import and export | Override the default staging table name. This parameter applies only if staging tables are used during data transfer. |
--staging-database | import and export | Override the default staging database name. This parameter applies only if staging tables are used during the data transfer. |
--staging-force | import and export | Force the connector to create the staging table if the input/output method supports staging tables. |
--input-method | import | Specify the input method used to transfer data from Teradata to Hadoop. |
--output-method | export | Specify the output method used to transfer data from Hadoop to Teradata. |
--batch-size | import and export | Specify the number of rows processed together in one batch. |
--access-lock | import | Improve concurrency. When used, the import job is not blocked by concurrent accesses to the same table. |
--query-band | import and export | Allow arbitrary query bands to be set for all queries that are run by the connector. The expected format is a semicolon-separated key=value pair list. A final semicolon is required after the last key=value pair. For example, Data_Center=XO;Location=Europe;. |
--error-table | export (only for internal.fastload) | Specify a prefix for created error tables. |
--error-database | export (only for internal.fastload) | Override the default error database name. |
--fastload-socket-hostname | export (only for internal.fastload) | The hostname or IP address of the host running Sqoop that is visible from the Hadoop cluster. The connector can autodetect the interface; this parameter overrides the autodetection routine. |
--keep-staging-table | import | By default, the connector drops all automatically created staging tables when the import job fails. This option leaves the staging tables with partially imported data in the database. |
--num-partitions-for-staging-table | import (only for split.by.partition) | Number of partitions to use for the automatically created staging table. The connector automatically generates the value based on the number of mappers used. |
--skip-xviews | import and export | By default, the connector uses Teradata XViews to obtain metadata. With this parameter, the connector uses the regular system views instead. |
--date-format | import and export | Use custom format for columns of date type. The parameter uses SimpleDateFormat formatting options. |
--time-format | import and export | Use custom format for columns of time type. The parameter uses SimpleDateFormat formatting options. |
--timestamp-format | import and export | Use custom format for columns of timestamp type. The parameter uses SimpleDateFormat formatting options. |
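For example, the following import combines several of these extra arguments after the double-dash separator. This is an illustrative sketch: the staging database name, batch size, and query band values are placeholders.
$ sqoop import --connect jdbc:teradata://localhost/DATABASE=sqooptest \
  --username sqooptest --password xxxxx --table MY_TABLE --num-mappers 3 \
  --target-dir /user/sqooptest/MY_TABLE \
  -- --staging-database sqoop_stage --batch-size 5000 \
  --query-band Data_Center=XO\;Location=Europe\;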
Input Methods
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
- split.by.amp
- split.by.value
- split.by.partition
- split.by.hash
split.by.amp Method
This is the optimal method for retrieving data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper retrieves the data stored on its own AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
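For example, an import using the split.by.amp input method might look like the following sketch (the table and target directory names are illustrative):
$ sqoop import --connect jdbc:teradata://localhost/DATABASE=sqooptest \
  --username sqooptest --password xxxxx --table MY_TABLE \
  --target-dir /user/sqooptest/MY_TABLE \
  -- --input-method split.by.amp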
split.by.value Method
This method creates input splits as ranges on the split by column (usually the table’s primary key). Each split is subsequently processed by a single mapper to transfer the data using SELECT queries. All splits can access all AMPs to retrieve data, so you should set the number of mappers between 20 and 30 because there is a limit for all-AMP concurrently running operations on the Teradata appliance. Ensure that users transferring data have sufficient spool space available for the SELECT queries.
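For example, a split.by.value import that splits on a numeric column and stays within the recommended mapper range might look like the following sketch (the column name ID is a placeholder):
$ sqoop import --connect jdbc:teradata://localhost/DATABASE=sqooptest \
  --username sqooptest --password xxxxx --table MY_TABLE --split-by ID \
  --num-mappers 20 --target-dir /user/sqooptest/MY_TABLE \
  -- --input-method split.by.value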
split.by.partition Method
This method is preferred for extracting large amounts of data from the Teradata system. The behavior of this method depends on whether the source table is partitioned.
If your input table is not partitioned, the connector creates a partitioned staging table and runs an INSERT INTO ... SELECT query to move data from the source table into the staging table. Each mapper then transfers data from one partition, resulting in single-AMP operations, so you can use a large number of mappers to obtain optimal performance. The amount of available permanent space must be as large as your source table, in addition to the spool space required to run the SELECT queries.
If your table is already partitioned, no extra staging table is created. However, you can force the connector to repartition your data by using the --staging-force parameter to achieve better performance. Without forced repartitioning, this method runs an all-AMP operation, so you should use between 20 and 30 mappers. If your source table is a PI table and your split by column is the table's primary key, the connector creates a single-AMP operation, and you can use a high number of mappers.
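For example, a split.by.partition import that forces the connector to create a staging table might look like the following sketch (the staging table name is a placeholder):
$ sqoop import --connect jdbc:teradata://localhost/DATABASE=sqooptest \
  --username sqooptest --password xxxxx --table MY_TABLE --num-mappers 30 \
  --target-dir /user/sqooptest/MY_TABLE \
  -- --input-method split.by.partition --staging-force --staging-table MY_TABLE_STAGE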
split.by.hash Method
This input method is similar to the split.by.partition method. Instead of directly operating on value ranges of one column, this method operates on the hash of the column. You can use importing by hash to extract data in situations where split.by.value and split.by.partition are not appropriate. Each mapper can access all AMPs available in the system, so set the number of mappers between 20 and 30 because there is a limit for all-AMP concurrent jobs on the Teradata appliance.
The following example shows an import using the split.by.hash input method:
$ sqoop import --connect jdbc:teradata://localhost/DATABASE=sqooptest \
  --username sqooptest --password xxxxx --table MY_TABLE --num-mappers 3 \
  --target-dir /user/sqooptest/MY_TABLE -- --input-method split.by.hash
Output Methods
Cloudera Connector Powered by Teradata supports the following output methods to export data from Hadoop to Teradata:
- batch.insert
- internal.fastload
batch.insert Method
This method uses JDBC batch jobs to export data to the Teradata appliance. Use it only when the other methods are not viable. The connector creates a partitioned staging table before the export job, and each mapper then transfers data from Hadoop into one partition. After all mappers finish, an INSERT INTO ... SELECT statement is run to move the data from the staging table to the final destination table. Ensure that you have sufficient permanent space for two copies of your data, as well as sufficient spool space for running the INSERT INTO ... SELECT query. The number of mappers that this method can use is limited only by the number of concurrent sessions allowed on the Teradata appliance.
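For example, an export using the batch.insert output method with an explicit batch size might look like the following sketch (the batch size value is illustrative):
$ sqoop export --connect jdbc:teradata://localhost/DATABASE=sqooptest \
  --username sqooptest --password xxxxx --export-dir /user/sqooptest/MY_TABLE \
  --table MY_TABLE_TARGET --num-mappers 3 \
  -- --output-method batch.insert --batch-size 10000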
internal.fastload Method
This method requires a partitioned staging table. Data is first exported by each mapper into a different partition and then moved to the target table, using an INSERT INTO ... SELECT statement. Make sure that you have sufficient permanent and spool space to store two copies of your data and to move them from the staging table to the target table. All mappers participate in one FastLoad job coordinated by an internal protocol. This is the fastest method for exporting data from Hadoop to a Teradata appliance. Because all mappers participate in the one FastLoad job, only one Teradata utility slot is used for the export job. The number of mappers is limited only by the total number of AMPs available in your system.
A coordination server for the FastLoad job is started on the machine where the sqoop command runs, and all mappers started by this sqoop command must connect to it. Because a mapper can run on any host in the cluster, every host must have network access to the machine running the sqoop command. If the connector cannot autodetect a suitable interface, use the --fastload-socket-hostname parameter to specify one.
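For example, an internal.fastload export that explicitly sets the coordination hostname and an error table prefix might look like the following sketch (the hostname and prefix are placeholders):
$ sqoop export --connect jdbc:teradata://localhost/DATABASE=sqooptest \
  --username sqooptest --password xxxxx --export-dir /user/sqooptest/MY_TABLE \
  --table MY_TABLE_TARGET --num-mappers 3 \
  -- --output-method internal.fastload \
  --fastload-socket-hostname edge-node.example.com --error-table MY_TABLE_ERR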