Apache Hadoop Known Issues
This page includes known issues and related topics.
Deprecated Properties
Several Hadoop and HDFS properties have been deprecated as of Hadoop 2.0.0 (Hadoop 0.23.1, CDH 4 Beta) and later. For details, see Deprecated Properties.
Hadoop Common
KMS Load Balancing Provider Fails to invalidate Cache on Key Delete
The KMS Load Balancing Provider does not correctly invalidate the cache on key delete operations. As a result, data can be leaked from the framework for a short period of time, bounded by the value of the hadoop.kms.current.key.cache.timeout.ms property (default: 30,000 ms). When the KMS is deployed in an HA pattern, the KMSLoadBalancingProvider class sends the delete operation to only one KMS role instance, chosen in round-robin fashion. The code lacks a call to invalidate the cache across all instances, so key information for the deleted key, including its metadata and key material, can remain in the cache on one or more KMS instances until the key cache timeout expires.
Products affected:
- CDH
- HDP
- CDP
Releases affected:
- CDH 5.x
- CDH 6.x
- CDP 7.0.x
- CDP 7.1.4 and earlier
- HDP 2.6 and later
Users affected: Customers with data-at-rest encryption enabled who have more than one KMS role instance and the service's key cache enabled.
Impact: Key metadata and key material may remain active within the service cache.
Severity: Medium
Immediate action required:
- CDH customers: Upgrade to CDP 7.1.5 or request a patch
- HDP customers: Request a patch
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-434: KMS Load Balancing Provider Fails to invalidate Cache on Key Delete
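The failure mode described above can be illustrated with a small, self-contained model. This is a toy sketch, not the KMS or KMSLoadBalancingProvider code: the class and function names are hypothetical, the key name and material are made up, and each instance's cache is a plain dictionary whose entries expire after the 30,000 ms default quoted above.

```python
import itertools

CACHE_TIMEOUT_MS = 30_000  # default key cache timeout quoted in the text


class ToyKmsInstance:
    """Hypothetical stand-in for one KMS role instance with a local key cache."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # key store shared by all instances
        self.cache = {}                     # key name -> (material, expiry in ms)

    def get_key(self, name, now_ms):
        cached = self.cache.get(name)
        if cached and cached[1] > now_ms:
            return cached[0]                # served from cache, even if deleted elsewhere
        material = self.backing_store.get(name)
        if material is not None:
            self.cache[name] = (material, now_ms + CACHE_TIMEOUT_MS)
        else:
            self.cache.pop(name, None)
        return material

    def delete_key(self, name):
        self.backing_store.pop(name, None)
        self.cache.pop(name, None)          # only this instance's cache is invalidated


def simulate_delete(instance_count=3):
    """Delete a key through one instance and report what every instance still serves."""
    store = {"zone_key": "secret-material"}
    instances = [ToyKmsInstance(store) for _ in range(instance_count)]
    for inst in instances:                  # warm every cache at t = 0
        inst.get_key("zone_key", now_ms=0)
    round_robin = itertools.cycle(instances)
    next(round_robin).delete_key("zone_key")  # the delete reaches one instance only
    before = [inst.get_key("zone_key", now_ms=1_000) for inst in instances]
    after = [inst.get_key("zone_key", now_ms=CACHE_TIMEOUT_MS + 1) for inst in instances]
    return before, after
```

In this model, the instances that did not receive the delete keep returning the deleted key until their cached entry expires, which mirrors the exposure window described for the real service.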
Zip Slip Vulnerability
“Zip Slip” is a widespread arbitrary file overwrite critical vulnerability, which typically results in remote command execution. It was discovered and responsibly disclosed by the Snyk Security team ahead of a public disclosure on June 5, 2018, and affects thousands of projects.
Cloudera has analyzed our use of zip-related software, and has determined that only Apache Hadoop is vulnerable to this class of vulnerability in CDH 5. This has been fixed in upstream Hadoop as CVE-2018-8009.
Products affected: Hadoop
Releases affected:
- CDH 5.12.x and all prior releases
- CDH 5.13.0, 5.13.1, 5.13.2, 5.13.3
- CDH 5.14.0, 5.14.2, 5.14.3
- CDH 5.15.0
Users affected: All
Date of detection: April 19, 2018
Detected by: Snyk
Severity: High
Impact: Zip Slip is a form of directory traversal that can be exploited by extracting files from an archive. The premise of the directory traversal vulnerability is that an attacker can gain access to parts of the file system outside of the target folder in which they should reside. The attacker can then overwrite executable files and either invoke them remotely or wait for the system or user to call them, thus achieving remote command execution on the victim’s machine. The vulnerability can also cause damage by overwriting configuration files or other sensitive resources, and can be exploited on both client (user) machines and servers.
CVE: CVE-2018-8009
Immediate action required: Upgrade to a version that contains the fix.
Addressed in release/refresh/patch: CDH 5.14.4, CDH 5.15.1
For the latest update on this issue, see the corresponding Knowledge article:
TSB: 2018-307: Zip Slip Vulnerability
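The directory traversal described under Impact comes down to an archive entry name that resolves outside the extraction directory. The following minimal sketch shows the general form of the check that defeats it. It is not the Hadoop fix for CVE-2018-8009; the function names are hypothetical and it uses only the Python standard library.

```python
import os
import zipfile


def resolve_member_path(dest_dir, member_name):
    """Return the absolute extraction path, or raise if it escapes dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    target = os.path.realpath(os.path.join(dest_root, member_name))
    # An entry such as "../../evil.sh" or "/etc/passwd" resolves outside dest_root.
    if os.path.commonpath([dest_root, target]) != dest_root:
        raise ValueError(f"blocked path traversal in archive entry: {member_name!r}")
    return target


def safe_extract(zip_path, dest_dir):
    """Validate every entry name first, then extract the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            resolve_member_path(dest_dir, info.filename)
        archive.extractall(dest_dir)
```

Validating all entries before writing anything means a malicious archive is rejected as a whole rather than partially extracted.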
Apache Hadoop MapReduce Job History Server (JHS) vulnerability CVE-2017-15713
A vulnerability in Hadoop’s Job History Server allows a cluster user to expose private files owned by the user running the MapReduce Job History Server (JHS) process. See http://seclists.org/oss-sec/2018/q1/79 for reference.
Products affected: Apache Hadoop MapReduce
Releases affected: All releases prior to CDH 5.12.0, as well as CDH 5.12.0, 5.12.1, 5.12.2, 5.13.0, 5.13.1, and 5.14.0
Users affected: Users running the MapReduce Job History Server (JHS) daemon
Date/time of detection: November 8, 2017
Detected by: Man Yue Mo of lgtm.com
Severity (Low/Medium/High): High
Impact: The vulnerability allows a cluster user to expose private files owned by the user running the MapReduce Job History Server (JHS) process. The malicious user can construct a configuration file containing XML directives that reference sensitive files on the MapReduce Job History Server (JHS) host.
CVE: CVE-2017-15713
Immediate action required: Upgrade to a release where the issue is fixed.
Addressed in release/refresh/patch: CDH 5.13.2, 5.14.2
Hadoop LdapGroupsMapping does not support LDAPS for self-signed LDAP server
Hadoop LdapGroupsMapping does not work with LDAP over SSL (LDAPS) if the LDAP server certificate is self-signed. This use case is currently not supported even if Hadoop User Group Mapping LDAP TLS/SSL Enabled, Hadoop User Group Mapping LDAP TLS/SSL Truststore, and Hadoop User Group Mapping LDAP TLS/SSL Truststore Password are filled properly.
Bug: HADOOP-12862
Affected Versions: All CDH 5 versions.
Workaround: None.
HDFS
CVE-2018-1296 Permissive Apache Hadoop HDFS listXAttr Authorization Exposes Extended Attribute Key/Value Pairs
HDFS exposes extended attribute key/value pairs during listXAttrs, verifying only path-level search access to the directory rather than path-level read permission to the referent.
Products affected: Apache HDFS
Releases affected:
- CDH 5.4.0 - 5.15.1, 5.16.0
- CDH 6.0.0, 6.0.1, 6.1.0
Users affected: Users who store sensitive data in extended attributes, such as users of HDFS encryption.
Date/time of detection: December 12, 2017
Detected by: Rushabh Shah, Yahoo! Inc., Hadoop committer
Severity (Low/Medium/High): Medium
Impact: HDFS exposes extended attribute key/value pairs during listXAttrs, verifying only path-level search access to the directory rather than path-level read permission to the referent. This affects features that store sensitive data in extended attributes.
CVE: CVE-2018-1296
Immediate action required:
- Upgrade: Update to a version of CDH containing the fix.
- Workaround: If a file contains sensitive data in extended attributes, users and admins need to change the permission to prevent others from listing the directory that contains the file.
Addressed in release/refresh/patch:
- CDH 5.15.2, 5.16.1
- CDH 6.1.1, 6.2.0
Clusters running CDH 5.16.1, 6.1.0, or 6.1.1 can lose some HDFS file permissions any time the NameNode is restarted
When a cluster is upgraded to 5.16.1, 6.1.0, or 6.1.1, roles with SELECT and/or INSERT privileges on an Impala database or table have the REFRESH privilege added as part of the upgrade process. HDFS ACLs for roles with the REFRESH privilege are set with empty permissions whenever the NameNode is restarted. This can cause any jobs or queries run by users within affected roles to fail, because those users can no longer access the affected Impala databases or tables.
Products Affected: HDFS and components that access files in HDFS
Affected Versions: CDH 5.16.1, 6.1.0, 6.1.1
Users Affected: Clusters with Impala and HDFS ACLs managed by Sentry that are upgrading from any release to CDH 5.16.1, 6.1.0, or 6.1.1.
Severity (Low/Medium/High): High
Root Cause and Impact: The new privilege REFRESH was introduced in CDH 5.16 and 6.1 and applies to Impala databases and tables. When a cluster is upgraded to 5.16.1, 6.1.0, or 6.1.1, roles with SELECT or INSERT privileges on an Impala database or table will have the REFRESH privilege added during the upgrade.
HDFS ACLs for roles with the REFRESH privilege get set with empty permissions whenever the NameNode is restarted. The NameNode is restarted during the upgrade.
For example, if a group appdev is in role appdev_role and has SELECT access to the Impala table "project", the HDFS ACLs prior to the upgrade would look similar to:
group: appdev group::r--
After the upgrade the HDFS ACLs will be set with no permissions and will look like this:
group: appdev group::---
Any jobs or queries run by users within affected roles will fail because they will no longer be able to access the affected Impala databases or tables. This impacts any SQL client accessing the affected databases and tables. For example, if a Hive client is used to access a table created in Impala, it will also fail. Jobs accessing the files directly through HDFS, for example via Spark, will also be impacted.
The HDFS ACLs will get reset whenever the NameNode is restarted.
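To check whether a path has been affected, one option is to scan ACL listings for named group entries that grant no permissions. The sketch below assumes the usual getfacl-style text layout (header lines starting with "#", then one type:name:permissions entry per line). The function name, the sample path, and the sample groups are hypothetical.

```python
def empty_group_acl_entries(getfacl_output):
    """Return the names of groups whose named ACL entry grants no permissions."""
    empty = []
    for line in getfacl_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# file / owner / group" header
        parts = line.split(":")
        # Named group entries look like "group:<name>:<perms>";
        # "group::<perms>" is the owning group and is ignored here.
        if len(parts) == 3 and parts[0] == "group" and parts[1] and parts[2] == "---":
            empty.append(parts[1])
    return empty


# Hypothetical listing for illustration only.
SAMPLE = """# file: /user/hive/warehouse/project
# owner: hive
# group: hive
user::rwx
group::---
group:appdev:---
group:analysts:r-x
other::---
"""

print(empty_group_acl_entries(SAMPLE))  # ['appdev']
```

A group reported by this check after a NameNode restart, but not before it, is a candidate for the behavior described above.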
Immediate action required: If possible, do not upgrade to releases CDH 5.16.1, 6.1.0, or 6.1.1 if Impala is used and Sentry manages HDFS ACLs within your environment. Subsequent CDH releases will resolve the problem with a product fix under SENTRY-2490.
If an upgrade is being considered, reach out to your account team to discuss other possibilities, and to receive additional insight into future product release schedules.
If an upgrade must be executed, contact Cloudera Support indicating the upgrade plan and why an upgrade is being executed. Options are available to assist with the upgrade if necessary.
Addressed in release/refresh/patch: Patches for 5.16.1, 6.1.0 and 6.1.1 are available for major supported operating systems. Customers are encouraged to contact Cloudera Support for a patch. The patch should be applied immediately after upgrade to any of the affected versions.
The fix for this TSB will be included in 6.1.2, 6.2.0, 5.16.2, and 5.17.0.
Potential data corruption due to race conditions between concurrent block read and write
Under rare conditions when an HDFS file is open for write, an application reading the same HDFS blocks might read up-to-date block data of the partially written file while reading a stale checksum that corresponds to the block data before the latest write. As a result, the block is incorrectly declared corrupt. Normally, when a replica is corrupted, the HDFS NameNode schedules an additional replica of the same block from the other replicas. However, if the frequency of concurrent writes and reads is high enough, there is a small probability that all replicas of a block are declared corrupt, and the file becomes corrupt and unrecoverable as well.
2017-10-18 11:23:46,627 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: ip-168-61-2-30:50010:DataXceiver error processing WRITE_BLOCK operation src: /168.61.2.32:48163 dst: /168.61.12.31:50010
java.io.IOException: Terminating due to a checksum error.java.io.IOException: Unexpected checksum mismatch while writing BP-1666924250-168.61.12.36-1494235758065:blk_1084584428_5057054 from /168.61.12.32:48163
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:604)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:894)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:794)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:745)
The bug is fixed by HDFS-11056, HDFS-11160 and HDFS-11229.
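To see which blocks have hit this error, the DataNode log can be searched for the message shown above. The following sketch is one way to do that. The regular expression is derived only from the sample log text in this section, and the function name is hypothetical.

```python
import re

# Pattern based on the "Unexpected checksum mismatch while writing" message above.
CHECKSUM_MISMATCH = re.compile(
    r"Unexpected checksum mismatch while writing "
    r"(?P<pool>BP-[^:\s]+):(?P<block>blk_\d+_\d+) from /(?P<peer>[^\s:]+)"
)


def find_checksum_mismatches(log_lines):
    """Yield (block pool, block id, peer address) for each matching log line."""
    for line in log_lines:
        match = CHECKSUM_MISMATCH.search(line)
        if match:
            yield match.group("pool"), match.group("block"), match.group("peer")
```

Passing an open log file to find_checksum_mismatches yields one tuple per occurrence, which makes it easy to count how many distinct blocks and peers are involved.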
Products Affected: HDFS
Releases affected:
- All CDH 5.4 releases and lower
- CDH 5.5.0, 5.5.1, 5.5.2, 5.5.4, 5.5.5
- CDH 5.6.0, 5.6.1
- CDH 5.7.0, 5.7.1, 5.7.2, 5.7.3, 5.7.4, 5.7.5
- CDH 5.8.0, 5.8.2, 5.8.3
- CDH 5.9.0, 5.9.1
Users Affected: Workloads that require reading a file while it’s being concurrently written to HDFS.
Severity (Low/Medium/High): Low
Impact: If the workload requires reading and writing the same file concurrently, there is a small probability that all replicas of a block can be declared corrupt, and the file becomes corrupt as well.
Immediate action required: Customers are advised to upgrade to a CDH version containing the fix if the workloads are susceptible to this bug.
Addressed in release/refresh/patch:
- CDH 5.5.6 and higher
- CDH 5.7.6 and higher
- CDH 5.8.4 and higher
- CDH 5.9.2 and higher
- CDH 5.10.0 and higher
Cannot re-encrypt an encryption zone if a previous re-encryption on it was canceled
When canceling a re-encryption on an encryption zone, the status of the re-encryption may continue to show "Processing". When this occurs, future re-encrypt commands for this encryption zone will fail inside the NameNode, and the re-encryption will never complete.
Cloudera Bug: CDH-59073
Affected Versions: CDH 5.13.0
Fixed in Versions: CDH 5.13.1 and higher
Workaround: To halt, or remove the "Processing" status for the encryption zone, re-issue the cancel re-encryption command on the encryption zone. If a new re-encryption command is required for this encryption zone, restart the NameNode before issuing the command.
Potential Block Corruption and Data Loss During Pipeline Recovery
A bug in the HDFS block pipeline recovery code can cause blocks to be unrecoverable due to miscalculation of the block checksum. On a busy cluster where data is written and flushed frequently, when a write pipeline recovery occurs, a node newly added to the write pipeline may calculate the checksum incorrectly. This miscalculation is very rare, but when it does occur, the replica becomes corrupted and data can be lost if all replicas are simultaneously affected.
java.io.IOException: Terminating due to a checksum error.java.io.IOException: Unexpected checksum mismatch while writing BP-1800173197-10.x.y.z-1444425156296:blk_1170125248_96458336 from /10.x.y.z
Releases affected:
- CDH 5.0.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.0.6
- CDH 5.1.0, 5.1.2, 5.1.3, 5.1.4, 5.1.5
- CDH 5.2.0, 5.2.1, 5.2.3, 5.2.4, 5.2.5, 5.2.6
- CDH 5.3.0, 5.3.1, 5.3.2, 5.3.3, 5.3.4, 5.3.5, 5.3.7, 5.3.8, 5.3.9, 5.3.10
- CDH 5.4.0, 5.4.1, 5.4.2, 5.4.3, 5.4.4, 5.4.5, 5.4.7, 5.4.8, 5.4.9, 5.4.10
- CDH 5.5.0, 5.5.1
Users affected: All users running the affected CDH versions and using the HDFS file system.
Severity (Low/Medium/High): High
Impact: Potential loss of block data.
Addressed in release/refresh/patch:
- CDH 5.4.11, CDH 5.5.2, CDH 5.6.0 and higher
DiskBalancer Occasionally Emits False Error Messages
DiskBalancer occasionally emits false error messages. For example:
2016-08-03 11:01:41,788 ERROR org.apache.hadoop.hdfs.server.datanode.DiskBalancer: Disk Balancer is not enabled.
You can safely ignore this error message if you are not using DiskBalancer.
Affected Versions: CDH 5.8.1 and below.
Fixed in Versions: CDH 5.8.2 and higher.
Bug: HDFS-10588
Workaround: Use the following command against all DataNodes to suppress DiskBalancer logs:
hadoop daemonlog -setlevel <host:port> org.apache.hadoop.hdfs.server.datanode.DiskBalancer FATAL
Another workaround is to suppress the warning by setting the log level of DiskBalancer to FATAL. Add the following to log4j.properties (DataNode Logging Advanced Configuration Snippet (Safety Valve)) and restart your DataNodes:
log4j.logger.org.apache.hadoop.hdfs.server.datanode.DiskBalancer = FATAL
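Because the first workaround has to be repeated against every DataNode, it can help to generate the command lines from a host list. The sketch below only builds the argument lists for the daemonlog command shown above and does not run anything. The host names are hypothetical, and port 50075 is an assumption about the DataNode HTTP port in your cluster.

```python
import shlex

DISK_BALANCER_LOGGER = "org.apache.hadoop.hdfs.server.datanode.DiskBalancer"


def daemonlog_commands(datanode_http_addresses, level="FATAL"):
    """Build one `hadoop daemonlog -setlevel` argument list per DataNode."""
    return [
        ["hadoop", "daemonlog", "-setlevel", address, DISK_BALANCER_LOGGER, level]
        for address in datanode_http_addresses
    ]


if __name__ == "__main__":
    # Hypothetical hosts; replace with each DataNode's HTTP host:port.
    for argv in daemonlog_commands(["dn1.example.com:50075", "dn2.example.com:50075"]):
        print(shlex.join(argv))
```

Each printed line can be reviewed and then run by hand, or the argument lists can be passed to a process runner of your choice.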
Upgrade Requires an HDFS Upgrade
Upgrading from any release earlier than CDH 5.2.0 to CDH 5.2.0 or later requires an HDFS Upgrade.
See Upgrading Unmanaged CDH Using the Command Line for further information.
Optimizing HDFS Encryption at Rest Requires Newer openssl Library on Some Systems
CDH 5.3 implements the Advanced Encryption Standard New Instructions (AES-NI), which provide substantial performance improvements. To get these improvements, you need a recent version of libcrypto.so on HDFS and MapReduce client hosts, that is, any host from which you originate HDFS or MapReduce requests. Many OS versions have an older version of the library that does not support AES-NI.
See HDFS Transparent Encryption in the Encryption section of the Cloudera Security guide for instructions for obtaining the right version.
Other HDFS Encryption Known Issues
Potentially Incorrect Initialization Vector Calculation in HDFS Encryption
A mathematical error in the calculation of the Initialization Vector (IV) for encryption and decryption in HDFS could cause data to appear corrupted when read. The IV is a 16-byte value input to encryption and decryption ciphers. The calculation of the IV implemented in HDFS was found to be subtly different from that used by Java and OpenSSL cryptographic routines. The result is that data could possibly appear to be corrupted when it is read from a file inside an Encryption Zone.
Fortunately, the probability of this occurring is extremely small. For example, the maximum size of a file in HDFS is 64 TB. This enormous file would have a 1-in-4-million chance of hitting this condition. A more typically sized file of 1 GB would have a roughly 1-in-274-billion chance of hitting the condition.
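The two figures quoted above scale linearly with file size. One simple model that reproduces both of them is a chance of one per 2^64 for every 16-byte block in the file. That model is an assumption made here for illustration; the source does not state the formula. The function name is hypothetical.

```python
AES_BLOCK_BYTES = 16     # IV size given in the text
COUNTER_SPACE = 2 ** 64  # assumption: one chance per 2**64 for each 16-byte block


def one_in(file_size_bytes):
    """Return N such that the modelled chance is 1-in-N for a file of this size."""
    blocks = file_size_bytes // AES_BLOCK_BYTES
    return COUNTER_SPACE // blocks


print(one_in(64 * 2 ** 40))  # 64 TB file -> 4194304 (about 4 million)
print(one_in(2 ** 30))       # 1 GB file  -> 274877906944 (about 274 billion)
```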
Affected Versions: CDH 5.2.1 and below
Fixed in Versions: CDH 5.3.0 and higher
Cloudera Bug: CDH-23618
Workaround: If you are using the experimental HDFS encryption feature in CDH 5.2, upgrade to CDH 5.3 and verify the integrity of all files inside an Encryption Zone.
DistCp between unencrypted and encrypted locations fails
By default, DistCp compares checksums provided by the filesystem to verify that data was successfully copied to the destination. However, when copying between unencrypted and encrypted locations, the filesystem checksums will not match since the underlying block data is different.
Affected Versions: CDH 5.2.1 and below.
Fixed in Versions: CDH 5.2.2 and higher.
Bug: HADOOP-11343
Workaround: Specify the -skipcrccheck and -update distcp flags to avoid verifying checksums.
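The reason the checksums cannot match can be shown in miniature: a checksum is computed over stored bytes, and encryption changes those bytes. The sketch below is a toy illustration only. It uses zlib's CRC-32 and a single-byte XOR as a stand-in for encryption, neither of which is claimed to be what HDFS uses.

```python
import zlib


def toy_encrypt(data, key_byte=0x5A):
    """Stand-in for encryption (XOR with one byte); NOT the cipher HDFS uses."""
    return bytes(b ^ key_byte for b in data)


plaintext = b"data"
ciphertext = toy_encrypt(plaintext)

# Same logical content, different stored bytes, so the checksums differ.
print(zlib.crc32(plaintext) == zlib.crc32(ciphertext))  # False
```

A copy between an encrypted and an unencrypted location is in the same position: the logical file is identical, but a checksum comparison over the underlying data reports a mismatch.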
Cannot move encrypted files to trash
With HDFS encryption enabled, you cannot move encrypted files or directories to the trash directory.
Affected Versions: All CDH 5 versions
Bug: HDFS-6767
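Workaround: Delete encrypted files and directories with the -skipTrash option, which bypasses the trash. For example:
hadoop fs -rm -r -skipTrash /testdir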
HDFS NFS gateway and CDH installation (using packages) limitation
HDFS NFS gateway works as shipped ("out of the box") only on RHEL-compatible systems, but not on SLES, Ubuntu, or Debian. Because of a bug in native versions of portmap/rpcbind, the HDFS NFS gateway does not work out of the box on SLES, Ubuntu, or Debian systems when CDH has been installed from the command-line, using packages. It does work on supported versions of RHEL-compatible systems on which rpcbind-0.2.0-10.el6 or later is installed, and it does work if you use Cloudera Manager to install CDH, or if you start the gateway as root. For more information, see supported versions.
Bug: 731542 (Red Hat), 823364 (SLES), 594880 (Debian)
- On Red Hat and similar systems, make sure rpcbind-0.2.0-10.el6 or later is installed.
- On SLES, Debian, and Ubuntu systems, do one of the following:
- Install CDH using Cloudera Manager; or
- As of CDH 5.1, start the NFS gateway as root; or
- Start the NFS gateway without using packages; or
- You can use the gateway by running rpcbind in insecure mode, using the -i option, but keep in mind that this allows anyone from a remote host to bind to the portmap.
HDFS does not currently provide ACL support for the NFS gateway
No error when changing permission to 777 on .snapshot directory
Snapshots are read-only; running chmod 777 on the .snapshot directory does not change this, but does not produce an error (though other illegal operations do).
Affected Versions: All CDH 5 versions
Bug: HDFS-4981
Cloudera Bug: CDH-13062
Workaround: None
Snapshot operations are not supported by ViewFileSystem
Affected Versions: All CDH 5 versions
Cloudera Bug: CDH-12600
Workaround: None
Snapshots do not retain directories' quotas settings
Permissions for dfs.namenode.name.dir incorrectly set.
Hadoop daemons should set permissions for the dfs.namenode.name.dir (or dfs.name.dir) directories to drwx------ (700), but in fact these permissions are set to the file-system default, usually drwxr-xr-x (755).
Affected Versions: All CDH 5 versions
Bug: HDFS-2470
Workaround: Use chmod to set permissions to 700. See Configuring Local Storage Directories for Use by HDFS for more information and instructions.
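The chmod workaround can also be scripted so that the resulting mode is verified. This is a minimal sketch using the Python standard library; the function name is hypothetical, and the demonstration runs on a temporary directory rather than a real NameNode storage directory.

```python
import os
import stat
import tempfile


def restrict_to_owner(directory):
    """Set a directory's mode to 700 (drwx------) and return the resulting mode bits."""
    os.chmod(directory, 0o700)
    return stat.S_IMODE(os.stat(directory).st_mode)


if __name__ == "__main__":
    # Demonstrated on a throwaway directory; on a NameNode host, pass each
    # directory listed in dfs.namenode.name.dir instead.
    with tempfile.TemporaryDirectory() as scratch:
        print(oct(restrict_to_owner(scratch)))
```

Run it as the user that owns the directories, and check that the returned value is 0o700 for each one.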
hadoop fsck -move does not work in a cluster with host-based Kerberos
Affected Versions: All CDH 5 versions
Cloudera Bug: CDH-7017
Workaround: Use hadoop fsck -delete
HttpFS cannot get delegation token without prior authenticated request
A request to obtain a delegation token cannot initiate an SPNEGO authentication sequence; it must be accompanied by an authentication cookie from a prior SPNEGO authentication sequence.
Affected Versions: CDH 5.1 and below
Fixed in Versions: CDH 5.2 and higher
Bug: HDFS-3988
Cloudera Bug: CDH-8144
Workaround: Make another WebHDFS request (such as GETHOMEDIR) to initiate an SPNEGO authentication sequence and then make the delegation token request.
DistCp does not work between a secure cluster and an insecure cluster in some cases
See the upstream bug reports for details.
Affected Versions: All CDH 5 versions
Bug: HDFS-7037, HADOOP-10016, HADOOP-8828
Cloudera Bug: CDH-14945, CDH-18779
Workaround: None
Port configuration required for DistCp to Hftp from secure cluster (SPNEGO)
To copy files using DistCp to Hftp from a secure cluster using SPNEGO, you must configure the dfs.https.port property on the client to use the HTTP port (50070 by default).
Affected Versions: All CDH 5 versions
Bug: HDFS-3983
Cloudera Bug: CDH-8118
Workaround: Configure dfs.https.port to use the HTTP port on the client
Non-HA DFS Clients do not attempt reconnects
This problem means that streams cannot survive a NameNode restart or network interruption that lasts longer than the time it takes to write a block.
Affected Versions: All CDH 5 versions
Bug: HDFS-4389
Cloudera Bug: CDH-10415
DataNodes may become unresponsive to block creation requests
DataNodes may become unresponsive to block creation requests from clients when the directory scanner is running.
Affected Versions: CDH 5.2.1 and below
Fixed in Versions: CDH 5.2.2 and higher
Bug: HDFS-7489
Workaround: Disable the directory scanner by setting dfs.datanode.directoryscan.interval to -1.
The active NameNode will not accept an fsimage sent from the standby during rolling upgrade
Affected Versions: CDH 5.3.7 and below
Fixed in Versions: CDH 5.3.8 and higher
Bug: HDFS-7185
Workaround: None.
Block report can exceed maximum RPC buffer size on some DataNodes
On a DataNode with a large number of blocks, the block report may exceed the maximum RPC buffer size.
Affected Versions: All CDH 5 versions
Bug: None
Workaround: Increase the value of ipc.maximum.data.length, for example:
<property>
  <name>ipc.maximum.data.length</name>
  <value>268435456</value>
</property>
Misapplied user-limits setting possible
The ulimits setting in /etc/security/limits.conf is applied to the wrong user when security is enabled.
Affected Versions: CDH 5.2.0 and below
Bug: DAEMON-192
Anticipated Resolution: None
Workaround: To increase the ulimits applied to DataNodes, you must change the ulimit settings for the root user, not the hdfs user.
LAZY_PERSIST storage policy is experimental and not supported
Using this storage policy could potentially lead to data loss.
Affected versions: All CDH 5 versions
Bug: HDFS-8229
Workaround: None
MapReduce2, YARN
NodeManager fails because of the changed default location of container executor binary
The default location of the container-executor binary and .cfg files was changed from /opt/cloudera/parcels/<CDH_parcel_version> to /var/lib/yarn-ce. Because of this change, the NodeManager can fail to start if the -noexec and -nosuid mount options are set on /var, even though it started previously because those options were not set on /opt.
Affected versions: CDH 5.16.1, all CDH 6 versions
Workaround: Either remove the -noexec and -nosuid mount options on /var or change the container-executor binary and .cfg path using the CMF_YARN_SAFE_CONTAINER_EXECUTOR_DIR environment variable.
YARN scheduler queue ACLs are not checked when performing MoveApplicationAcrossQueues operations
The YARN moveApplicationAcrossQueues operation does not check ACLs on the target queue. This allows a user to move an application to a queue that the user has no access to.
Affected Versions: All CDH 5 versions
Fixed Versions: CDH 6.0.0
Bug: YARN-5554
Cloudera Bug: CDH-43327
Workaround: N/A
Hadoop YARN Privilege Escalation CVE-2016-6811
A vulnerability in Hadoop YARN allows a user who can escalate to the yarn user to possibly run arbitrary commands as the root user.
Products affected: Hadoop YARN
Releases affected:
- CDH 5.12.x and all prior releases
- CDH 5.13.0, 5.13.1, 5.13.2, 5.13.3
- CDH 5.14.0, 5.14.2, 5.14.3
- CDH 5.15.0
Users affected: Users running the Hadoop YARN service.
Detected by: Freddie Rice
Severity: High
Impact: The vulnerability allows a user who has access to a node in the cluster running a YARN NodeManager, and who can escalate to the yarn user, to run arbitrary commands as the root user, even if the user is not allowed to escalate directly to the root user.
CVE: CVE-2016-6811
Upgrade: Upgrade to a release where the issue is fixed.
Workaround: The vulnerability can be mitigated by restricting access to the nodes where the YARN NodeManagers are deployed, and by removing su access to the yarn user and by making sure no one other than the yarn user is a member of the yarn group. Please consult with your internal system administration team and adhere to your internal security policy when evaluating the feasibility of the above mitigation steps.
Addressed in release/refresh/patch: CDH 5.14.4, 5.15.1
For the latest update on this issue, see the corresponding Knowledge article:
TSB: 2018-309: Hadoop YARN privilege escalation
Missing results in Hive, Spark, Pig, Custom MapReduce jobs, and other Java applications when filtering Parquet data written by Impala
Apache Hive and Apache Spark rely on Apache Parquet's parquet-mr Java library to perform filtering of Parquet data stored in row groups. Those row groups contain statistics that make the filtering efficient without having to examine every value within the row group.
Recent versions of the parquet-mr library contain a bug described in PARQUET-1217. This bug causes filtering to behave incorrectly if only some of the statistics for a row group are written. Starting in CDH 5.13, Apache Impala populates statistics in this way for Parquet files. As a result, Hive and Spark may incorrectly filter Parquet data that is written by Impala.
In CDH 5.13, Impala started writing Parquet's null_count metadata field without writing the min and max fields. This is valid, but it triggers the PARQUET-1217 bug in the predicate push-down code of the Parquet Java library (parquet-mr). If the null_count field is set to a non-zero value, parquet-mr assumes that min and max are also set and reads them without checking whether they are actually there. If those fields are not set, parquet-mr reads their default value instead.
For integer SQL types, the default value is 0, so parquet-mr incorrectly assumes that the min and max values are both 0. This causes the problem when filtering data. Unless the value 0 itself matches the search condition, all row groups are discarded due to the incorrect min/max values, which leads to missing results.
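The effect of the missing min and max fields can be reproduced with a toy model of statistics-based row group pruning. This is not the parquet-mr code: the class and function names are hypothetical, and the model covers a single integer column with an equality filter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowGroupStats:
    """Toy stand-in for the statistics of one integer column in one row group."""
    null_count: Optional[int] = None
    min: Optional[int] = None
    max: Optional[int] = None


def buggy_may_contain(stats, value):
    """Mimic the described bug: a non-zero null_count is taken to imply min/max."""
    has_min_max = stats.min is not None and stats.max is not None
    if not stats.null_count and not has_min_max:
        return True  # no statistics at all: the row group must be scanned
    lo = stats.min if stats.min is not None else 0  # unset field read as default 0
    hi = stats.max if stats.max is not None else 0
    return lo <= value <= hi


def safe_may_contain(stats, value):
    """Prune only when min and max are actually present."""
    if stats.min is None or stats.max is None:
        return True
    return stats.min <= value <= stats.max


# Statistics as described for Impala in CDH 5.13: null_count set, no min/max.
impala_stats = RowGroupStats(null_count=3)
print(buggy_may_contain(impala_stats, 42))  # False: row group wrongly discarded
print(buggy_may_contain(impala_stats, 0))   # True: only 0 survives the filter
print(safe_may_contain(impala_stats, 42))   # True: row group is scanned
```

The model shows why results go missing only when filtering, and only for row groups that contain nulls: those are the cases in which the absent min and max values are consulted.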
Products affected:
- Hive
- Spark
- Pig
- Custom MapReduce jobs
Releases affected:
- CDH 5.13.0, 5.13.1, 5.13.2, and 5.14.0
- CDS 2.2 Release 2 Powered by Apache Spark and earlier releases on CDH 5.13.0 and later
Who Is Affected: Anyone writing Parquet files with Impala and reading them back with Hive, Spark, or other Java-based components that use the parquet-mr libraries for reading Parquet files.
Severity (Low/Medium/High): High
Impact: Parquet files containing null values for integer fields written by Impala produce missing results in Hive, Spark, and other Java applications when filtering by the integer field.
Immediate action required:
- Upgrade: Upgrade to one of the fixed maintenance releases listed below.
- Workaround: This issue can be avoided, at the price of performance, by disabling predicate push-down optimizations:
- In Hive, use the following SET command:
SET hive.optimize.ppd = false;
- In Spark, disable the following configuration setting:
--conf spark.sql.parquet.filterPushdown=false
Addressed in release/refresh/patch:
- CDH 5.13.3 and higher
- CDH 5.14.2 and higher
- CDH 5.15.0 and higher
- CDS 2.3 Release 2 and higher
For the latest update on this issue, see the corresponding Knowledge Base article:
Apache Hadoop Yarn Fair Scheduler might stop assigning containers when preemption is on
This issue has two side effects:
- A race condition that results in the Fair Scheduler making duplicate reservations. The duplicate reservations are never released and can result in an integer overflow that stops container assignments.
- A possible deadlock in the event processing of the Fair Scheduler. This stops all updates in the ResourceManager.
Both side effects will ultimately cause the Fair Scheduler to stop processing resource requests.
Without the change from YARN-6432, the resources that are released after being preempted are not reserved for the starved application. This could result in the scheduler assigning the preempted container to any application, not just the starved application. If no reservations are made on the node for the starved application, preemption will be less effective in solving the resource starvation.
Products affected: YARN
Releases affected: CDH 5.11.1, 5.12.0
Users affected: Users who have YARN configured with the FairScheduler and have turned preemption on.
Severity (Low/Medium/High): Low
Impact: The ResourceManager accepts applications, but no application changes state or is assigned containers, so no application makes progress.
Immediate action required:
- If you have not upgraded to the affected releases and preemption in the FairScheduler is in use, avoid upgrading to the affected releases.
- If you have already upgraded to the affected releases, choose from the following options:
- Upgrade to CDH 5.11.2 or 5.12.1
- Turn off preemption
Fixed in Versions: CDH 5.11.2 and 5.12.1
Yarn's Continuous Scheduling can cause slowness in Oozie
When Continuous Scheduling is enabled in YARN, it can cause slowness in Oozie due to long delays in communicating with YARN. In Cloudera Manager 5.9.0 and higher, Enable Fair Scheduler Continuous Scheduling is turned off by default.
Affected Versions: All CDH 5 versions
Bug: None
Cloudera Bug: CDH-60788
Workaround: Turn off Enable Fair Scheduler Continuous Scheduling in Cloudera Manager's Yarn Configuration. To keep equivalent benefits of this feature, turn on Fair Scheduler Assign Multiple Tasks.
Rolling upgrades to 5.11.0 and 5.11.1 may cause application failures
Affected Versions: CDH versions that can be upgraded to 5.11.0 or 5.11.1
Fixed in Versions: CDH 5.11.2 and higher
Bug: None
Cloudera Bug: CDH-55284, TSB-241
Workaround: Upgrade to 5.11.2 or higher.
Name resolution issues can result in unresponsive Web UI and REST endpoints
Name resolution issues can cause the Web UI or the RM REST endpoints to consume all ResourceManager request handling threads, leaving the Web UI and REST endpoints unresponsive.
Fixed in Versions: CDH 5.10.0 and higher
Bug: YARN-4767
Cloudera Bug: CDH-45597
Workaround: Restart the ResourceManager, kill the application that is being accessed, or wait for the ResourceManager to complete the job.
Loss of connection to the Zookeeper cluster can cause problems with the ResourceManagers
If the YARN user is granted access to all keys in KMS, then files localized from an encryption zone can be world readable
If the YARN user is granted access to all keys in KMS, then files localized from an encryption zone can be world readable.
Fixed in Versions: CDH 5.7.7, 5.8.5, 5.9.2, 5.10.1, 5.11.0 and higher.
Bug: None
Cloudera Bug: CDH-47377
Workaround: Make sure files in an encryption zone do not have world-readable file modes if they are going to be localized.
Zookeeper outage can cause the ResourceManagers to exit
FairScheduler might not Assign Containers
Under certain circumstances, turning on Fair Scheduler Assign Multiple Tasks (yarn.scheduler.fair.assignmultiple) causes the scheduler to stop assigning containers to applications. Possible symptoms are that running applications show no progress, and new applications do not start, staying in an Assigned state, despite the availability of free resources on the cluster.
Affected Versions: CDH 5.5.0, 5.5.1, 5.5.2, 5.5.3, 5.5.4, 5.5.5, 5.5.6, 5.6.0, and 5.6.1
Fixed in Versions: CDH 5.7.0 and higher
Bug: YARN-4477
Cloudera Bug: CDH-36686
Workaround: Turn off Fair Scheduler Assign Multiple Tasks (yarn.scheduler.fair.assignmultiple) and restart the ResourceManager.
FairScheduler: AMs can consume all vCores leading to a livelock
When using the FAIR policy with the FairScheduler, Application Masters can consume all vCores, which may lead to a livelock.
Fixed in Versions: CDH 5.7.3 and higher, except for CDH 5.8.0 and CDH 5.8.1
Bug: YARN-4866
Cloudera Bug: CDH-37529
Workaround: Use Dominant Resource Fairness (DRF) instead of FAIR; or make sure that the cluster has enough vCores in proportion to the memory.
NodeManager mount point mismatch (YARN)
NodeManager may select a cgroups (Linux control groups) mount point that is not accessible to user yarn, resulting in failure to start up. The mismatch occurs because YARN uses cgroups in mount point /run/lxcfs/controllers, while Cloudera Manager typically configures cgroups at /sys/fs/cgroups. This issue has occurred on Ubuntu 16.04 systems.
Fixed in Versions: CDH 5.11.1 and higher
Bug: YARN-6433
Cloudera Bug: CDH-52263
Workaround: Unmount the errant mount point:
$ umount errant_mount_point
- apt-get remove lxcfs
- Reboot the node
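To find the mount point in question, it can help to list the cgroup mounts that live under /run/lxcfs/. The following is a minimal sketch, assuming the standard /proc/mounts column layout (device, mount point, filesystem type, options); the function name lxcfs_cgroup_mounts is hypothetical.

```shell
#!/bin/sh
# List cgroup mount points located under /run/lxcfs/.
# Reads /proc/mounts style lines on stdin: device mountpoint fstype options ...
# Hypothetical helper name; the /run/lxcfs/ prefix comes from the issue
# description above.
lxcfs_cgroup_mounts() {
  awk '$3 == "cgroup" && index($2, "/run/lxcfs/") == 1 { print $2 }'
}

# Example usage on a NodeManager host (not run here):
#   lxcfs_cgroup_mounts < /proc/mounts
```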
JobHistory URL mismatch after server relocation
After moving the JobHistory Server to a new host, the URLs listed for the JobHistory Server on the ResourceManager web UI still point to the old JobHistory Server. This affects existing jobs only. New jobs started after the move are not affected.
Affected Versions: All CDH 5 versions.
Workaround: For any existing jobs that have the incorrect JobHistory Server URL, there is no option other than to allow the jobs to roll off the history over time. For new jobs, make sure that all clients have the updated mapred-site.xml that references the correct JobHistory Server.
Starting an unmanaged ApplicationMaster may fail
Starting a custom Unmanaged ApplicationMaster may fail due to a race in getting the necessary tokens.
Affected Versions: CDH 5.1.5 and below.
Fixed in Versions: CDH 5.2 and higher.
Bug: YARN-1577
Cloudera Bug: CDH-17405
Workaround: Try to get the tokens again; the custom unmanaged ApplicationMaster should be able to fetch the necessary tokens and start successfully.
Moving jobs between queues not persistent after restart
CDH 5 adds the capability to move a submitted application to a different scheduler queue. This queue placement is not persisted across ResourceManager restart or failover, which resumes the application in the original queue.
Affected Versions: All CDH 5 versions.
Bug: YARN-1558
Cloudera Bug: CDH-17408
Workaround: After ResourceManager restart, re-issue previously issued move requests.
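If the earlier move requests were recorded, they can be replayed with the `yarn application -movetoqueue` command. The sketch below assumes a hypothetical record file, moves.txt, with one "applicationId targetQueue" pair per line; the function name move_commands is also illustrative. It prints the commands rather than running them so that they can be reviewed first.

```shell
#!/bin/sh
# Turn "applicationId targetQueue" lines on stdin into yarn CLI commands.
# Prints the commands instead of running them so they can be reviewed first.
move_commands() {
  while read -r app_id queue; do
    # Skip blank lines and incomplete records.
    [ -n "$app_id" ] && [ -n "$queue" ] || continue
    printf 'yarn application -movetoqueue %s -queue %s\n' "$app_id" "$queue"
  done
}

# Example usage (moves.txt is a hypothetical record of earlier move requests):
#   move_commands < moves.txt | sh
```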
Encrypted shuffle may fail (MRv2, Kerberos, TLS)
In MRv2, if the LinuxContainerExecutor is used (usually as part of Kerberos security), and hadoop.ssl.enabled is set to true (see Configuring Encrypted Shuffle, Encrypted Web UIs, and Encrypted HDFS Transport), then the encrypted shuffle does not work and the submitted job fails.
Affected Versions: All CDH 5 versions.
Bug: MAPREDUCE-4669
Cloudera Bug: CDH-8036
Workaround: Use encrypted shuffle with Kerberos security without encrypted web UIs, or use encrypted shuffle with encrypted web UIs without Kerberos security.
ResourceManager-to-Application Master HTTPS link fails
In MRv2 (YARN), if hadoop.ssl.enabled is set to true (use HTTPS for web UIs), then the link from the ResourceManager to the running MapReduce Application Master fails with an HTTP Error 500 because of a PKIX exception.
A job can still be run successfully, and, when it finishes, the link to the job history does work.
Affected Versions: CDH versions before 5.1.0.
Fixed in Versions: CDH 5.1.0
Bug: YARN-113
Cloudera Bug: CDH-8014
Workaround: Do not use encrypted web UIs.
History link in ResourceManager web UI broken for killed Spark applications
When a Spark application is killed, the history link in the ResourceManager web UI does not work.
Workaround: To view the history for a killed Spark application, see the Spark HistoryServer web UI instead.
Affected Versions: All CDH versions
Apache Issue: None
Cloudera Issue: CDH-49165
Routable IP address required by ResourceManager
ResourceManager requires routable host:port addresses for yarn.resourcemanager.scheduler.address, and does not support using the wildcard 0.0.0.0 address.
Bug: None
Cloudera Bug: CDH-6808
Workaround: Set the address, in the form host:port, either in the client-side configuration, or on the command line when you submit the job.
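A quick check before submitting a job can catch the unsupported wildcard address. This is a sketch only: the function name is hypothetical, it handles plain host:port values (not IPv6 literals), and the host name, JAR, and driver class in the usage comment are placeholders.

```shell
#!/bin/sh
# Return 0 if the argument has the form host:port with a numeric port and a
# host other than the 0.0.0.0 wildcard; return 1 otherwise.
# Hypothetical helper name.
is_usable_scheduler_address() {
  host=${1%:*}
  port=${1##*:}
  [ "$host" != "$1" ] || return 1                # no colon present
  [ -n "$host" ] && [ -n "$port" ] || return 1   # empty host or port
  [ "$host" != "0.0.0.0" ] || return 1           # wildcard is not supported
  case "$port" in *[!0-9]*) return 1 ;; esac     # port must be numeric
  return 0
}

# Example usage (not run here; the driver must parse generic options,
# for example through ToolRunner):
#   addr=rm1.example.com:8030
#   is_usable_scheduler_address "$addr" &&
#     hadoop jar my-job.jar MyDriver -Dyarn.resourcemanager.scheduler.address="$addr" in out
```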
Amazon S3 copy may time out
The Amazon S3 filesystem does not support renaming files, and performs a copy operation instead. If the file to be moved is very large, the operation can time out because S3 does not report progress to the TaskTracker during the operation.
Bug: MAPREDUCE-972
Cloudera Bug: CDH-17955
Workaround: Use -Dmapred.task.timeout=15000000 to increase the MR task timeout.
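The timeout value is expressed in milliseconds, so 15000000 corresponds to 250 minutes. The sketch below shows the conversion; the helper name minutes_to_ms is hypothetical, as are the JAR, driver class, and paths in the usage comment.

```shell
#!/bin/sh
# Convert a timeout in minutes to the millisecond value expected by
# mapred.task.timeout. Hypothetical helper name.
minutes_to_ms() {
  echo $(( $1 * 60 * 1000 ))
}

# 250 minutes corresponds to the value quoted in the workaround.
minutes_to_ms 250   # prints 15000000

# Example submission (not run here; my-job.jar, MyDriver, and the paths are
# placeholders, and the driver must parse generic options, e.g. via ToolRunner):
#   hadoop jar my-job.jar MyDriver -Dmapred.task.timeout="$(minutes_to_ms 250)" in out
```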
Out-of-memory errors may occur with Oracle JDK 1.8
The total JVM memory footprint for JDK8 can be larger than that of JDK7 in some cases. This may result in out-of-memory errors.
Bug: None
Workaround: Increase the maximum heap size (-Xmx). In the case of MapReduce, for example, increase Reduce Task Maximum Heap Size in Cloudera Manager (mapred.reduce.child.java.opts, or mapreduce.reduce.java.opts for YARN) to avoid out-of-memory errors during the shuffle phase.
MapReduce JAR file renamed (CDH 5.4.0)
As of CDH 5.4.0, hadoop-test.jar has been renamed to hadoop-test-mr1.jar. This JAR file contains the mrbench, TestDFSIO, and nnbench tests.
Bug: None
Cloudera Bug: CDH-26521
Workaround: None.
Jobs in pool with DRF policy will not run if root pool is FAIR
If a child pool using DRF policy has a parent pool using Fairshare policy, jobs submitted to the child pool do not run.
Affected Versions: All CDH 5 versions.
Bug: YARN-4212
Cloudera Bug: CDH-31358
Workaround: Change parent pool to use DRF.
Jobs with encrypted spills do not recover if the AM goes down
The fix for CVE-2015-1776 leaves too little information to recover a job if the Application Master fails. Releases with this security fix cannot tolerate Application Master failures.
Affected Versions: All CDH 5 versions.
Bug: MAPREDUCE-6638
Cloudera Bug: CDH-37412
Workaround: None. Fix to come in a later release.
Large TeraValidate data sets can fail with MapReduce
In a cluster using MapReduce, TeraValidate fails when run over large TeraGen/TeraSort data sets (1TB and larger) with an IndexOutOfBoundsException. Smaller data sets do not show this issue.
Affected Versions: CDH 5.3.7 and lower
Fixed in Versions: CDH 5.3.8 and higher
Bug: MAPREDUCE-6481
Cloudera Bug: CDH-31871
Workaround: None.
MapReduce job failure and rolling upgrade (CDH 5.6.0)
MapReduce jobs might fail during a rolling upgrade to or from CDH 5.6.0. Cloudera recommends that you avoid doing rolling upgrades to CDH 5.6.0.
Bug: None
Cloudera Bug: CDH-38587
Workaround: Restart failed jobs.
Unsupported Features
- FileSystemRMStateStore: Cloudera recommends you use ZKRMStateStore (ZooKeeper-based implementation) to store the ResourceManager's internal state for recovery on restart or failover. Cloudera does not support the use of FileSystemRMStateStore in production.
- ApplicationTimelineServer (also known as Application History Server): Cloudera does not support ApplicationTimelineServer v1. ApplicationTimelineServer v2 is under development and Cloudera does not currently support it.
- Scheduler Reservations: Scheduler reservations are currently at an experimental stage, and Cloudera does not support their use in production.
- Scheduler node-labels: Node-labels are currently experimental with CapacityScheduler. Cloudera does not support their use in production.
- CapacityScheduler: This is deprecated and will be removed from CDH in a future version.
MapReduce1
Oozie workflows not recovered after JobTracker failover on a secure cluster
Delegation tokens created by clients (via JobClient#getDelegationToken()) do not persist when the JobTracker fails over. This limitation means that Oozie workflows will not be recovered successfully in the event of a failover on a secure cluster.
Bug: None
Cloudera Bug: CDH-8913
Workaround: Re-submit the workflow.
Hadoop Pipes should not be used in secure clusters
Hadoop Pipes should not be used in secure clusters. The framework uses a shared password for parent-child communications, and that password is transmitted in the clear. A malicious user could intercept that password and potentially use it to access private data in a running application.
Bug: None
No JobTracker becomes active if both JobTrackers are migrated to other hosts
If JobTrackers in a High Availability configuration are shut down, migrated to new hosts, and then restarted, no JobTracker becomes active. The logs show a Mismatched address exception.
Bug: None
Cloudera Bug: CDH-11801
Workaround: Remove the JobTracker HA znode from ZooKeeper:
$ zkCli.sh rmr /hadoop-ha/<logical name>
Hadoop Pipes may not be usable in an MRv1 Hadoop installation done through tarballs
Under MRv1, MapReduce's C++ interface, Hadoop Pipes, may not be usable with a Hadoop installation done through tarballs unless you build the C++ code on the operating system you are using.
Bug: None
Cloudera Bug: CDH-7304
Workaround: Build the C++ code on the operating system you are using. The C++ code is present under src/c++ in the tarball.