Apache Sqoop Known Issues

MySQL JDBC driver shipped with CentOS 6 systems does not work with Sqoop

CentOS 6 systems currently ship with version 5.1.17 of the MySQL JDBC driver. This version does not work correctly with Sqoop.

Bug: None

Affected Versions: MySQL JDBC 5.1.17, 5.1.4, 5.3.0

Cloudera Bug: CDH-23180

Workaround: Install version 5.1.31 of the JDBC driver as detailed in Installing the JDBC Drivers for Sqoop 1 (Sqoop 1) or Configuring Sqoop 2 (Sqoop 2).
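For example, on a Sqoop 1 host you can replace the driver manually. The following is a minimal sketch that assumes the Sqoop 1 lib directory is /var/lib/sqoop and that Connector/J 5.1.31 is downloaded from MySQL; adjust the URL and paths for your installation:

wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.31.tar.gz
tar -xzf mysql-connector-java-5.1.31.tar.gz
sudo cp mysql-connector-java-5.1.31/mysql-connector-java-5.1.31-bin.jar /var/lib/sqoop/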

MS SQL Server "integratedSecurity" option unavailable in Sqoop

The integratedSecurity option is not available in the Sqoop CLI.

Bug: None

Workaround: None

Sqoop 1

Sqoop does not support Hive metastore HA for Parquet format

If the Hive metastore (HMS) is configured for high availability (HA), the value of the hive.metastore.uris property in /etc/hive/conf.cloudera.HIVE-1/hive-site.xml contains a list of URIs instead of one URI. For example:

<value>thrift://hive-1.example.com:9083,thrift://hive-2.example.com:9083,thrift://hive-3.example.com:9083</value>

Sqoop uses Kite for Parquet file handling and Kite does not support HMS HA. Using Kite, Sqoop takes the first URI from the hive.metastore.uris list and uses that to connect to the HMS. If that address is not available, the Sqoop command fails.

This issue occurs when you use both --hive-import and --as-parquetfile with the sqoop import command.
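For example, a command of the following form fails when the first URI in hive.metastore.uris is unreachable (the connection string, database, and table names here are illustrative only):

sqoop import --connect jdbc:mysql://db.example.com:3306/sales --username sqoop -P --table orders --hive-import --hive-database default --hive-table orders --as-parquetfile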

Bug: None

Affected Versions: CDH 5.12.x and below.

Fixed in Versions: CDH 5.13 and higher.

Cloudera Bug: CDH-57895

Workaround: None.

Sqoop 1 Metastore job tool ignores overrides

Sqoop job command option overrides are ignored due to SQOOP-2896. The exec action should let you override arguments in a saved job by passing options after --, as in:
sqoop job --exec <job-name> -- <options_to_override>
However, the values in options_to_override are not passed to the saved job, so Sqoop jobs that use overrides can break or execute incorrectly, as shown in the example below. First, create a saved eval job:
sqoop job --create my_eval_query -- eval --connect jdbc:mysql://fqdn.example.com:3306/db --username user --password password --query "select * from t1"
This command executes the job defined in my_eval_query and returns the rows in the table:
sqoop job --exec my_eval_query
c1  c2           c3
1   2017-02-17   row data 1
2   2017-02-17   row data 2
3   2017-02-17   row data 3
In the next run, the override query is ignored. The job uses the original parameters stored for the job and returns the same rows as before, rather than only the single row where column c1 = 2:
sqoop job --exec my_eval_query -- --query "select * from t1 where c1 = 2"
c1  c2           c3
1   2017-02-17   row data 1
2   2017-02-17   row data 2 <--- Only this row should be returned
3   2017-02-17   row data 3

Affected Versions: CDH 5.8.0, 5.8.1, 5.8.3, 5.8.4; CDH 5.9.0, 5.9.1; CDH 5.10.0

Bug: SQOOP-2896

Cloudera Bug: CDH-51101

Fixed in Versions: CDH 5.8.5, CDH 5.9.2, CDH 5.9.3, CDH 5.10.1, CDH 5.10.2, and CDH 5.11.0 and higher.

Sqoop 1 import with --as-parquetfile fails with KMS/KTS Encryption at Rest

Due to a limitation in the Kite SDK, sqoop import --as-parquetfile cannot write into a KMS/KTS encryption zone. See the following example.
sqoop import --connect jdbc:db2://djaxludb1001:61035/DDBAT003 --username=dh810202 -P --target-dir /data/hive_scratch/ASDISBURSEMENT --delete-target-dir -m 1 --query "select disbursementnumber,disbursementdate,xmldata FROM DB2dba.ASDISBURSEMENT where DISBURSEMENTNUMBER = 2011113210000115311 AND \$CONDITIONS" --hive-import --hive-database adminserver --hive-table asdisbursement_dave --map-column-java XMLDATA=String --as-parquetfile

16/12/05 12:23:46 INFO mapreduce.Job: map 100% reduce 0%
16/12/05 12:23:46 INFO mapreduce.Job: Job job_1480530522947_0096 failed with state FAILED due to: Job commit failed: org.kitesdk.data.DatasetIOException: Could not move contents of hdfs://AJAX01-ns/tmp/adminserver/.temp/job_1480530522947_0096/mr/job_1480530522947_0096 to hdfs://AJAX01-ns/data/RetiredApps/INS/AdminServer/asdisbursement_dave
<SNIP>
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): /tmp/adminserver/.temp/job_1480530522947_0096/mr/job_1480530522947_0096/5ddcac42-5d69-4e46-88c2-17bbedac4858.parquet can't be moved into an encryption zone.

Cloudera Bug: CDH-40826 and SQOOP-2943

Workaround: Use an alternate data file type, for example text or avro.
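For example, the failing command above can be switched to a text-file import (a sketch; --as-textfile is the default format for Hive imports, and the rest of the command is unchanged):

sqoop import --connect jdbc:db2://djaxludb1001:61035/DDBAT003 --username=dh810202 -P --target-dir /data/hive_scratch/ASDISBURSEMENT --delete-target-dir -m 1 --query "select disbursementnumber,disbursementdate,xmldata FROM DB2dba.ASDISBURSEMENT where DISBURSEMENTNUMBER = 2011113210000115311 AND \$CONDITIONS" --hive-import --hive-database adminserver --hive-table asdisbursement_dave --map-column-java XMLDATA=String --as-textfile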

Sqoop import as Parquet files may result in out-of-memory errors

Out-of-memory (OOM) errors can occur in the following two cases:
  • Many very large rows (multiple megabytes per row) are written before the initial page-size check in ColumnWriter runs.
  • Row sizes vary widely, so the next-page-size check is calibrated on small rows, the threshold is set very high, and many large rows follow before the next check.

Bug: PARQUET-99

Cloudera Bug: CDH-50896

Workaround: None, other than restructuring the data.

Hive, Pig, and Sqoop 1 fail in MRv1 tarball due to incorrect HADOOP_MAPRED_HOME setting

This issue affects tarball installations only.

Bug: None

Cloudera Bug: CDH-6640. Fixed for packages only: CDH-6648.

Workaround: For MRv1 only, manually edit the Hadoop environment settings as follows:

  • Change the path for the Hadoop MapReduce home setting in /etc/default/hadoop from:
    export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

    to:

    export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
  • Remove /usr/lib/hadoop-mapreduce from the HADOOP_CLASSPATH, as shown in the sketch below.
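A minimal sketch of both edits in /etc/default/hadoop, assuming HADOOP_CLASSPATH is set in the same file (adjust the filtering to your environment):

# Point MRv1 clients at the correct MapReduce home
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
# Drop any /usr/lib/hadoop-mapreduce entries from the classpath
export HADOOP_CLASSPATH=$(echo "$HADOOP_CLASSPATH" | tr ':' '\n' | grep -v '^/usr/lib/hadoop-mapreduce' | paste -sd: -)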

Sqoop 2

Sqoop 2 client cannot be used with a different version of the Sqoop 2 server

The Sqoop 2 client and server must be running the same CDH version.

Cloudera Bug: CDH-11901

Workaround: Make sure all Sqoop 2 components are running the same version of CDH.
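To confirm that the versions match, compare the client and server build versions from the Sqoop 2 shell (the exact output format varies by release):

sqoop:000> show version --all

If the reported client and server versions differ, upgrade the mismatched side so that both come from the same CDH release.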

Sqoop 2 upgrade may fail if any job's source and destination links point to the same connector

For example, the links for the job shown in the following output both point to generic-jdbc-connector:

sqoop:000> show job --all
1 job(s) to show:
Job with id 1 and name job1 (Enabled: true, Created by null at 5/13/15 3:05 PM, Updated by null at 5/13/15 6:04 PM)
  Throttling resources
    Extractors:
    Loaders:
From link: 1
  From database configuration
    Schema name: schema1
    Table name: tab1
    Table SQL statement:
    Table column names: col1
    Partition column name:
    Null value allowed for the partition column: false
    Boundary query:
  Incremental read
    Check column:
    Last value:
To link: 2
  To database configuration
    Schema name: schema2
    Table name: tab2
    Table SQL statement:
    Table column names: col2
    Stage table name:
    Should clear stage table:

sqoop:000> show link --all
2 link(s) to show:
link with id 1 and name try1 (Enabled: true, Created by null at 5/13/15 2:59 PM, Updated by null at 5/13/15 5:47 PM)
Using Connector generic-jdbc-connector with id 2
  Link configuration
    JDBC Driver Class: com.mysql.jdbc.Driver
    JDBC Connection String: jdbc:mysql://mysql.server/database
    Username: nvaidya
    Password:
    JDBC Connection Properties:
link with id 2 and name try2 (Enabled: true, Created by null at 5/13/15 3:01 PM, Updated by null at 5/13/15 5:47 PM)
Using Connector generic-jdbc-connector with id 2
  Link configuration
    JDBC Driver Class: com.mysql.jdbc.Driver
    JDBC Connection String: jdbc:mysql://mysql.server/database
    Username: nvaidya
    Password:
    JDBC Connection Properties:

Cloudera Bug: CDH-27779

Workaround: Before upgrading, make sure no jobs have source and destination links that point to the same connector.