Apache Hadoop Incompatible Changes and Limitations

Hadoop Common

The following incompatible changes have been introduced in CDH 5:

  • HADOOP-12672 introduced a change to the behavior of RPC timeouts. Because of this change, a NoSuchMethodError exception can occur when an older hadoop-common.jar is present on the classpath alongside a newer hadoop-hdfs.jar. This can happen when deploying JARs to the distributed cache or when reordering the classpath with flags such as mapreduce.job.user.classpath.first or hbase.use.dynamic.jars.

    Cloudera Issue: CDH-69462

    Affected versions: Upgrades from CDH 5.11.0 to 5.11.1 or 5.11.2; upgrades from CDH 5.10.0 or 5.10.1 to 5.10.2

    Workaround: Ensure that all Hadoop client JARs carry the same maintenance release (dot) version. Cloudera uses the pattern major.minor.maintenance for versions; for example, the maintenance version of CDH 5.11.2 is 2.
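
    For example, you can compare the version suffixes of the Hadoop JARs on a client host; the parcel path below is illustrative and depends on your installation:

      # All hadoop-* JARs should share the same CDH maintenance version,
      # e.g. hadoop-common-2.6.0-cdh5.11.2.jar next to hadoop-hdfs-2.6.0-cdh5.11.2.jar.
      ls /opt/cloudera/parcels/CDH/jars/ | grep -E '^hadoop-(common|hdfs)'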

HDFS

The following incompatible changes have been introduced in CDH 5:

  • HADOOP-14014, HADOOP-14814: API incompatibility introduced in the FsServerDefaults class

    • Affected Versions: CDH 5.12.0, 5.12.1, 5.12.2, 5.13.0, and 5.13.1
    • Workaround: Upgrade to CDH 5.13.2 or 5.14
  • HDFS FUSE does not support ACLs.
    • Affected Versions: All CDH 5 versions
    • Cloudera Bug: CDH-60847
  • CDH-53734, HDFS-7933:

    In CDH 5.10.0 and higher, the output of the fsck command also contains information about decommissioned replicas. Parsers that rely on the previous fsck output format may therefore fail.
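
    For example, a parser that keys on the stable summary line is safer than one that consumes the whole report; the command below is a minimal sketch (the path is illustrative):

      # Extract only the overall status line from the fsck report
      # rather than parsing the full, version-dependent output:
      hdfs fsck / | grep 'Status:'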

  • CDH-54432, HADOOP-11385:

    In CDH 5.12.0 and higher, JMXJSONServlet no longer serves JSONP output; it supports Cross-Origin Resource Sharing (CORS) instead, to prevent potential cross-site scripting attacks. External applications that rely on the old output may need to be updated because the new output is not compatible.
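
    For example, a client that previously requested JSONP with a callback parameter should now fetch plain JSON and rely on CORS headers instead. A minimal sketch (host and port are illustrative):

      # Fetch NameNode metrics as plain JSON; JSONP-style callback
      # parameters are no longer honored in CDH 5.12.0 and higher.
      curl 'http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'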

  • CDH-26136, HDFS-6962: ACL inheritance conflicts with umask mode.
    • Affected Versions: CDH 5.5.6, CDH 5.7.5, CDH 5.7.6, CDH 5.8.4, CDH 5.8.5, and CDH 5.9.1 and higher
  • CDH-42931, HDFS-6434: Default permission for creating a file should be 644 for WebHDFS/HttpFS.
    • Affected Versions: CDH 5.5.5, CDH 5.5.6, CDH 5.7.3, CDH 5.7.4, CDH 5.7.5, CDH 5.7.6, CDH 5.8.2, CDH 5.8.3, CDH 5.8.4, CDH 5.8.5, and CDH 5.9.0 and higher
  • HDFS-9085 Show renewer information in DelegationTokenIdentifier#toString.
    • Affected Versions: CDH 5.4.11, CDH 5.5.5, CDH 5.5.6, CDH 5.7.2, CDH 5.7.3, CDH 5.7.4, CDH 5.7.5, CDH 5.7.6, CDH 5.8.2, CDH 5.8.3, CDH 5.8.4, CDH 5.8.5, and CDH 5.9.0 and higher
  • The getSnapshottableDirListing() method returns null when there are no snapshottable directories. This is a change from CDH 5 Beta 2, where the method returned an empty array instead.
  • Files named .snapshot or .reserved must not exist within HDFS.
  • HADOOP-10020: Disable symlinks temporarily.
    • Affected Versions: All CDH 5 versions
  • HDFS-2832 - The HDFS internal layout version has changed between CDH 5 Beta 1 and CDH 5 Beta 2, so a file system upgrade is required to move an existing Beta 1 cluster to Beta 2.
  • HDFS-4451: HDFS balancer command returns exit code 0 on success instead of 1.
    • Affected Versions: All CDH 5 versions
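    A quick way to observe the new behavior (an illustrative check):
      hdfs balancer
      echo $?   # prints 0 on success in CDH 5; earlier releases returned 1 on success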
  • HDFS-4594: WebHDFS open sets the Content-Length header to the value specified by the length parameter rather than to the amount of data actually returned.
    • Affected Versions: All CDH 5 versions
    • Impact: In CDH 5, the Content-Length header contains the number of bytes actually returned, rather than the requested length.
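    For example, the header can be observed with a direct WebHDFS request (host, port, and file path are illustrative):
      # In CDH 5 the Content-Length header reflects the bytes actually
      # returned, not the value of the length parameter.
      curl -i -L 'http://namenode.example.com:50070/webhdfs/v1/tmp/file.txt?op=OPEN&length=1024'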
  • HDFS-4659: Support setting execution bit for regular files.
    • Affected Versions: All CDH 5 versions
    • Impact: In CDH 5, files copied with copyToLocal may have the executable bit set if it was set when they were created in or copied into HDFS.
  • HDFS-4997 - libhdfs functions now return correct error codes in errno in case of an error, instead of always returning 255.
    • Affected Versions: All CDH 5 versions
  • HDFS-5138 - The -finalize NameNode startup option has been removed. To finalize an in-progress upgrade, you should instead use the hdfs dfsadmin -finalizeUpgrade command while your NameNode is running, or while both NameNodes are running in a High Availability setup.
    • Affected Versions: All CDH 5 versions
  • HDFS-7279 - In CDH 5.5.0 and higher, DataNode WebHDFS implementation uses Netty as an HTTP server instead of Jetty.
  • HADOOP-13508 - In CDH 5.11 and higher, the behavior of org.apache.hadoop.fs.permission.FsPermission#FsPermission(String mode) changed to fix a bug in parsing sticky bits. The new behavior may cause incompatible changes if an application depends on the original behavior.
  • HDFS-11056 - CDH 5.5.6 and 5.9.1 fixed a critical block corruption bug, but the fix introduced a new bug: DataNodes may accumulate too many open file descriptors for deleted meta files over time. The bug is fixed in newer versions, including CDH 5.9.2 and CDH 5.10.0.

    Workaround: Restart DataNodes.
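
    To check whether a DataNode is affected, you can count its open file descriptors; a minimal sketch that assumes a single DataNode process on the host:

      # Count open file descriptors held by the DataNode process:
      ls /proc/$(pgrep -f 'org.apache.hadoop.hdfs.server.datanode.DataNode')/fd | wc -l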

  • HADOOP-14104 - In CDH 5.12.0 and higher, the client always asks the NameNode for the KMS provider path. When HDFS high availability is used and the nameservices share a name, Cloudera Manager's Backup and Disaster Recovery replication fails.

    Workaround: Use unique names for the nameservices.

    • Affected versions: CDH 5.12.x, CDH 5.13.x, and CDH 5.14.x
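
    For example, give each cluster's nameservice a distinct logical name in hdfs-site.xml (the name shown is illustrative):

    <property>
      <name>dfs.nameservices</name>
      <value>nameservice-prod</value>
    </property>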

Change in High-Availability Support

In CDH 5, the only high-availability (HA) implementation is Quorum-based storage; shared storage using NFS is no longer supported.

MapReduce

To use MRv1 from a tarball installation, complete the following steps:
  1. Extract the files from the tarball.
  2. Create a symbolic link as follows:
    ln -s install_dir/bin-mapreduce1 install_dir/share/hadoop/mapreduce1/bin
  3. Create a second symbolic link as follows:
    ln -s install_dir/etc/hadoop-mapreduce1 install_dir/share/hadoop/mapreduce1/conf
  4. Set the HADOOP_HOME and HADOOP_CONF_DIR environment variables in your execution environment as follows:
    $ export HADOOP_HOME=install_dir/share/hadoop/mapreduce1
    $ export HADOOP_CONF_DIR=$HADOOP_HOME/conf 
  5. Copy your existing start-dfs.sh and stop-dfs.sh scripts to install_dir/bin-mapreduce1.
  6. For convenience, add install_dir/bin to the PATH variable in your execution environment.
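
To confirm that the environment now resolves to the MRv1 installation, a quick sanity check (output varies by release):

    $HADOOP_HOME/bin/hadoop version
    echo $HADOOP_CONF_DIR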

Apache MapReduce 2.0 (YARN) Incompatible Changes

The following incompatible changes occurred for Apache MapReduce 2.0 (YARN) between CDH 4.x and CDH 5 Beta 2:
  • The CATALINA_BASE variable no longer determines whether a component is configured for YARN or MRv1. Use the alternatives command instead, and make sure CATALINA_BASE is not set. See the Oozie and Sqoop2 configuration sections for instructions.
  • YARN-1288 - YARN Fair Scheduler ACL change. Root queue defaults to everybody, and other queues default to nobody.
  • YARN High Availability configurations have changed; among other changes, configuration keys have been renamed.
  • The YARN_HOME property has been changed to HADOOP_YARN_HOME.
  • Note the following changes to configuration properties in yarn-site.xml:
    • The value of yarn.nodemanager.aux-services should be changed from mapreduce.shuffle to mapreduce_shuffle.
    • yarn.nodemanager.aux-services.mapreduce.shuffle.class has been renamed to yarn.nodemanager.aux-services.mapreduce_shuffle.class
    • yarn.resourcemanager.resourcemanager.connect.max.wait.secs has been renamed to yarn.resourcemanager.connect.max-wait.secs
    • yarn.resourcemanager.resourcemanager.connect.retry_interval.secs has been renamed to yarn.resourcemanager.connect.retry-interval.secs
    • yarn.resourcemanager.am.max-retries has been renamed to yarn.resourcemanager.am.max-attempts
    • The YARN_HOME environment variable used in yarn.application.classpath has been renamed to HADOOP_YARN_HOME. Make sure you include $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/* in the classpath. For more information, see Step 2: Configure YARN daemons in the instructions for deploying CDH with YARN in the Cloudera Installation Guide.
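
For example, a yarn-site.xml updated for CDH 5 contains entries such as the following (these are the renamed forms from the list above; the ShuffleHandler class shown is the standard value):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>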

Changed default location of container executor binary

The default location of the container-executor binary and .cfg files changed to /var/lib/yarn-ce; it was previously /opt/cloudera/parcels/<CDH_parcel_version>.

Apache MapReduce 2.0 (YARN) Limitations

YARN rolling log aggregation not supported

Cloudera does not support YARN rolling log aggregation.

DockerContainerExecutor not supported in YARN

Cloudera does not support DockerContainerExecutor in YARN.

Node Manager configuration for YARN

A Spark on YARN job uses a ProcessBuilder shell executable to launch a Sqoop job with:

yarn jar /data/home/.../infaLib/sqoop-1.4.6-client.jar import -libjars \
  file:///data/home/.../lib/avro-mapred-1.7.5-hadoop2.jar --connect jdbc:oracle:thin:@**********:1521 \
  --username ******* -m 1 --as-avrodatafile --columns CUSTOMER_ID,ORDER_ID,.... --table ORDERS \
  --target-dir hdfs://0.0.0.0:8020/user/..../d9b49b82_0f2e_41d3_bb31_ad6ff28aa966 \
  --password-file ******

This command works as expected from the command line, where it uses the default configuration files from /etc/hadoop/conf. However, when it is launched from the Spark application (with a ProcessBuilder shell exec), the configuration from the Cloudera parcel directory is picked up instead, and the job fails with the following error:

Log Length: 88 
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

Workaround

To correct this issue:
  1. Open the default /etc/hadoop/conf/mapred-site.xml.
  2. Copy the excerpt with the mapreduce.application.classpath property.
  3. Go to the YARN NodeManager configuration page in Cloudera Manager.
  4. Paste the excerpt into the NodeManager Advanced Configuration Snippet field. For example:
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,
    $HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
    $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
  </value>
</property>
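
After saving the snippet, restart the affected NodeManager roles so that the new classpath takes effect.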