Known Issues and Limitations in CDH 6.1.0
The following sections describe the known issues in CDH 6.1.0, grouped by component:
Operating System Known Issues
Known issues and workarounds related to operating systems are listed below.
Linux kernel security patch and CDH service crashes (CVE-2017-1000364)
A fatal error has been detected by the Java Runtime Environment: SIGBUS (0x7) at pc=0x00007fe91ef6cebc, pid=30321, tid=0x00007fe930c67700
Cloudera services for HDFS and Impala cannot start after applying the patch.
Commonly used Linux distributions are shown in the table below. However, the issue affects any CDH release that runs on RHEL, CentOS, Oracle Linux, SUSE Linux, or Ubuntu and that has had the Linux kernel security patch for CVE-2017-1000364 applied.
If you have already applied the patch for your OS according to the advisories for CVE-2017-1000364, apply the kernel update that contains the fix for your operating system (some of which are listed in the table). If you cannot apply the kernel update, you can work around the issue by increasing the Java thread stack size as detailed in the steps below.
Distribution | Advisories for CVE-2017-1000364 | Advisory updates |
---|---|---|
Oracle Linux 6 | ELSA-2017-1486 | Oracle has fixed this problem in ELSA-2017-1723. |
Oracle Linux 7 | ELSA-2017-1484 | Oracle has also added the fix for Oracle Linux 7 in ELBA-2017-1674. |
RHEL 6 | RHSA-2017-1486 | Red Hat has fixed this problem for RHEL 6 and has marked this advisory as outdated and superseded by RHSA-2017-1723. |
RHEL 7 | RHSA-2017-1484 | Red Hat has fixed this problem for RHEL 7 and has marked this advisory as outdated and superseded by RHBA-2017-1674. |
SLES | CVE-2017-1000364 | SUSE has also fixed this problem and the patch names are included in this same advisory. |
Workaround
If you cannot apply the kernel update, you can set the Java thread stack size to -Xss1280k for the affected services using the appropriate Java configuration option or the environment advanced configuration snippet, as detailed below.
For role instances that have specific Java configuration options properties:
- Log in to Cloudera Manager Admin Console.
- Select the Impala service, and then click the Configuration tab.
- Type java in the search field to display Java related configuration parameters. The Java Configuration Options for Catalog Server property field displays. Type -Xss1280k in the entry field, adding to any existing settings.
- Click Save Changes.
- Navigate to the HDFS service.
- Click the Configuration tab.
- Click the Scope filter DataNode. The Java Configuration Options for DataNode field displays among the properties listed. Enter -Xss1280k into the field, adding to any existing properties.
- Click Save Changes.
- Select the Scope filter NFS Gateway. The Java Configuration Options for NFS Gateway field displays among the properties listed. Enter -Xss1280k into the field, adding to any existing properties.
- Click Save Changes.
- Restart the affected roles (or configure the safety valves in next section and restart when finished with all configurations).
For role instances that do not have specific Java configuration options:
- Log in to Cloudera Manager Admin Console.
- Select the Impala service, and then click the Configuration tab.
- Click the Scope filter Impala Daemon and Category filter Advanced.
- Type impala daemon environment in the search field to find the safety valve entry field.
- In the Impala Daemon Environment Advanced Configuration Snippet (Safety Valve), enter:
JAVA_TOOL_OPTIONS=-Xss1280K
- Click Save Changes.
- Click the Scope filter Impala StateStore and Category filter Advanced.
- In the Impala StateStore Environment Advanced Configuration Snippet (Safety Valve), enter:
JAVA_TOOL_OPTIONS=-Xss1280K
- Click Save Changes.
- Restart the affected roles.
The table below summarizes the parameters that can be set for the affected services:
Service | Settable Java Configuration Option |
---|---|
HDFS DataNode | Java Configuration Options for DataNode |
HDFS NFS Gateway | Java Configuration Options for NFS Gateway |
Impala Catalog Server | Java Configuration Options for Catalog Server |
Impala Daemon | Impala Daemon Environment Advanced Configuration Snippet (Safety Valve): JAVA_TOOL_OPTIONS=-Xss1280K |
Impala StateStore | Impala StateStore Environment Advanced Configuration Snippet (Safety Valve): JAVA_TOOL_OPTIONS=-Xss1280K |
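The steps above ask you to add -Xss1280k to any existing Java options. The following is a minimal Python sketch of that string manipulation, useful when preparing the value before pasting it into Cloudera Manager. The function name is hypothetical, and replacing an earlier -Xss value (rather than leaving two on the line) is a design assumption of this sketch, not something the procedure prescribes.

```python
def add_stack_size(existing_options: str, stack_size: str = "-Xss1280k") -> str:
    """Return the option string with exactly one -Xss setting.

    Assumption: an older -Xss value is dropped so that the JVM
    does not receive two conflicting stack-size flags.
    """
    kept = [opt for opt in existing_options.split() if not opt.startswith("-Xss")]
    kept.append(stack_size)
    return " ".join(kept)


if __name__ == "__main__":
    # Example options are illustrative only.
    print(add_stack_size("-Xms1g -XX:+UseG1GC"))
```

All other options are preserved in their original order; only the stack size is appended or replaced.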
Cloudera Issue: CDH-55771
Leap-Second Events
Impact: After a leap-second event, Java applications (including CDH services) that use older Java and Linux kernel versions may consume almost 100% CPU. See https://access.redhat.com/articles/15145.
Leap-second events are tied to the time synchronization methods of the Linux kernel, the Linux distribution and version, and the Java version used by applications running on affected kernels.
Although Java is increasingly agnostic to system clock progression (and less susceptible to a kernel's mishandling of a leap-second event), using JDK 7 or 8 should prevent issues at the CDH level (for CDH components that use the Java Virtual Machine).
Immediate action required:
(1) Ensure that the kernel is up to date.
- RHEL6/7, CentOS 6/7 - 2.6.32-298 or higher
- Oracle Enterprise Linux (OEL) - Kernels built in 2013 or later
- SLES12 - No action required.
(2) Ensure that the Java version is up to date.
- Java 8 - No action required.
(3) Ensure that your systems use either NTP or PTP synchronization.
For systems not using time synchronization, update both the OS tzdata and Java tzdata packages to the tzdata-2016g version, at a minimum. For OS tzdata package updates, contact OS support or check updated OS repositories. For Java tzdata package updates, see Oracle's Timezone Updater Tool.
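The kernel threshold given for RHEL and CentOS (2.6.32-298 or higher) can be checked mechanically. The sketch below, in Python, parses a kernel release string such as the one reported by `uname -r` and compares it against that minimum. The function names and the regular expression are assumptions made for illustration; release strings that do not follow the `major.minor.patch-build` pattern are rejected rather than guessed at.

```python
import re

# Minimum RHEL/CentOS kernel taken from the list above: 2.6.32-298.
MIN_KERNEL = (2, 6, 32, 298)


def kernel_tuple(release: str) -> tuple:
    """Turn a release such as '2.6.32-642.el6.x86_64' into (2, 6, 32, 642)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def kernel_is_current(release: str) -> bool:
    """True when the release is at or above the documented minimum."""
    return kernel_tuple(release) >= MIN_KERNEL


if __name__ == "__main__":
    import platform
    try:
        print(platform.release(), kernel_is_current(platform.release()))
    except ValueError as err:
        print(err)
```

Tuple comparison is used so that, for example, a 3.10.x kernel compares as newer than any 2.6.32 build.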
Cloudera Issue: CDH-44788, TSB-189
Apache Accumulo Known Issues
There are no notable known issues in this release of Apache Accumulo.
Apache Crunch Known Issues
Apache Flume Known Issues
Fast Replay does not work with encrypted File Channel
If an encrypted file channel is set to use fast replay, the replay will fail and the channel will fail to start.
Workaround: Disable fast replay for the encrypted channel by setting use-fast-replay to false.
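To find channels that need this workaround, a Flume agent properties file can be scanned for file channels that combine encryption settings with fast replay. The Python sketch below assumes the usual `<agent>.channels.<channel>.<key>` property layout and treats any `encryption.`-prefixed key as a sign that the channel is encrypted; both are assumptions of this example, and the agent and channel names in the usage are hypothetical.

```python
def encrypted_channels_with_fast_replay(properties_text: str) -> list:
    """Return '<agent>.channels.<channel>' prefixes that enable
    use-fast-replay and also carry encryption.* settings."""
    settings = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()

    flagged = set()
    for key, value in settings.items():
        parts = key.split(".")
        is_fast_replay = (
            len(parts) == 4
            and parts[1] == "channels"
            and parts[3] == "use-fast-replay"
            and value.lower() == "true"
        )
        if not is_fast_replay:
            continue
        channel_prefix = ".".join(parts[:3])
        # Assumption: any encryption.* key marks the channel as encrypted.
        if any(k.startswith(channel_prefix + ".encryption.") for k in settings):
            flagged.add(channel_prefix)
    return sorted(flagged)


if __name__ == "__main__":
    sample = "a1.channels.c1.use-fast-replay = true\na1.channels.c1.encryption.activeKey = key-0\n"
    for prefix in encrypted_channels_with_fast_replay(sample):
        print(f"{prefix}.use-fast-replay = false")
```

For each flagged channel, the usage prints the property line that disables fast replay.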
Apache Issue: FLUME-1885
Apache Hadoop Known Issues
This page includes known issues and related topics, including:
Deprecated Properties
Several Hadoop and HDFS properties have been deprecated as of Hadoop 3.0 and later. For details, see Deprecated Properties.
Hadoop Common
KMS Load Balancing Provider Fails to invalidate Cache on Key Delete
The KMS Load Balancing Provider does not correctly invalidate the cache on key delete operations. As a result, data can be leaked from the framework for a short period of time, bounded by the value of the hadoop.kms.current.key.cache.timeout.ms property (default: 30,000 ms). When the KMS is deployed in an HA pattern, the KMSLoadBalancingProvider class sends the delete operation to only one KMS role instance, chosen in round-robin fashion. The code lacks a call to invalidate the cache across all instances, so information about the deleted key, including its metadata and key material, can remain in the cache on one or more KMS instances until the key cache timeout expires.
Products affected:
- CDH
- HDP
- CDP
Releases affected:
- CDH 5.x
- CDH 6.x
- CDP 7.0.x
- CDP 7.1.4 and earlier
- HDP 2.6 and later
Users affected: Customers with data-at-rest encryption enabled who have more than one KMS role instance and the service's key cache enabled.
Impact: Key metadata and key material may remain active within the service cache.
Severity: Medium
Action required:
- CDH customers: Upgrade to CDP 7.1.5 or request a patch.
- HDP customers: Request a patch.
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-434: KMS Load Balancing Provider Fails to invalidate Cache on Key Delete
HDFS
Possible HDFS Erasure Coded (EC) Data Files Corruption in EC Reconstruction
Cloudera has detected two bugs that can cause corruption of HDFS Erasure Coded (EC) files during the data reconstruction process.
The first bug can be hit during DataNode decommissioning. Due to a bug in the data reconstruction logic during decommissioning, some parity blocks may be generated with a content of all zeros.
Usually the NameNode makes a simple copy of the block when re-replicating it during decommissioning. However, if a decommissioning DataNode is already assigned more than the replication streams hard limit (set by the dfs.namenode.replication.max-streams-hard-limit property, default 4), the node is treated as busy. Instead of a simple copy being made, the parity blocks may then be reconstructed as all zeros.
Subsequently if any other data blocks in the same EC group are lost (due to node failure or disk failure), the reconstruction may use a bad parity block to generate bad data blocks. So, once parity blocks are corrupted, any further reconstruction in the same block group can propagate further corruptions in the same block group.
The second issue occurs in a corner case when a DataNode times out in the reconstruction process. It will reschedule a read from another good DataNode. However, the stale DataNode reader may have polluted the buffer and subsequent reconstruction which uses the polluted buffer will suffer from EC block corruption.
Products affected:
- CDH
- HDP
- CDP Private Cloud Base
Releases affected:
- CDH 6.0.x
- CDH 6.1.x
- CDH 6.2.x
- CDH 6.3.x
- HDP 3.1.x
- CDP 7.1.x
Users affected:
- Using an affected version of the product.
- Have enabled EC policy on one or more HDFS directories and have some EC files.
- Decommissioning DataNodes after enabling the EC policy increases the probability of corruption.
- In rare cases, EC reconstruction can create dirty buffer issues that lead to data corruption.
To list the erasure coded files in the cluster, run:
hdfs fsck / -files | grep "erasure-coded: policy="
Example output:
/ectest/dirWithPolicy/sample-sales-1.csv 215 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s): OK
If there are any file paths listed in the output of the above command, and if you have decommissioned DataNodes after creating those files, your EC files may have been affected by this bug.
If no files were listed by the above command, then your data is not affected. However, if you plan to use EC or if you have enabled EC policy on any directory in the past, then we strongly recommend requesting a hotfix from Cloudera.
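When the fsck output is long, it can help to extract the EC file paths and policies programmatically. The Python sketch below parses lines in the format of the sample output shown above. The regular expression is derived only from that single sample line, so it is an assumption that other fsck versions print the same layout; paths containing spaces are not handled.

```python
import re

# Pattern modeled on the sample line:
# /path 215 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s): OK
EC_LINE = re.compile(
    r"^(?P<path>\S+) \d+ bytes, erasure-coded: policy=(?P<policy>[^,]+),"
)


def ec_files(fsck_output: str) -> list:
    """Return (path, policy) pairs for every erasure coded file listed."""
    found = []
    for line in fsck_output.splitlines():
        match = EC_LINE.match(line.strip())
        if match:
            found.append((match.group("path"), match.group("policy")))
    return found


if __name__ == "__main__":
    import sys
    # Usage (assumed): hdfs fsck / -files | python this_script.py
    for path, policy in ec_files(sys.stdin.read()):
        print(policy, path)
```

An empty result corresponds to the "no files were listed" case described above.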
Severity: High
Impact: If the cluster contains erasure coded files and DataNodes have been decommissioned, the data files are potentially corrupted. HDFS/NameNode cannot self-detect and self-recover the corrupted files because checksums are also updated during reconstruction. The HDFS client therefore may not detect the corruption while reading the affected blocks, but applications may still be impacted. Even in the case of normal reconstruction, the second (dirty buffer) issue can trigger corruption.
- If EC is enabled, request a hotfix from Cloudera immediately.
- If EC was enabled and DataNodes were decommissioned afterwards, contact Cloudera Support. Cloudera has implemented tools to check for possible corruption.
- If no DataNodes have been decommissioned since EC was enabled, do not decommission DataNodes until the hotfix is applied.
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: Cloudera Customer Advisory: Possible HDFS Erasure Coded (EC) Data Files Corruption in EC Reconstruction
HDFS Snapshot corruption
A fix to HDFS snapshot functionality caused a regression in the affected CDH releases. When a snapshot is deleted, internal data structure in the NameNode can become inconsistent and the checkpoint operation on the Standby NameNode can fail.
Products affected: HDFS
Releases affected:
- CDH 5.4.0 - 5.15.1, 5.16.0
- CDH 6.0.0 - 6.2.1, 6.3.0, 6.3.1, 6.3.2
Users affected: Any clusters with HDFS Snapshots enabled
Impact: A fix to HDFS snapshot functionality caused a regression in the affected CDH releases. When a snapshot is deleted, internal data structure in the NameNode can become inconsistent and the checkpoint operation on the Standby NameNode can fail.
Standby NameNode detects the inconsistent snapshot data structure and shuts itself down. To recover from this situation, the fsimage must be repaired and put back into both NameNodes' fsimage directory for the Standby NameNode to start normally. The Active NameNode stays up. However no fsimage checkpoint is performed because the Standby NameNode is down.
hdfs dfs -deleteSnapshot /path snapshot_123
deleteSnapshot: java.lang.IllegalStateException
The recovery of the corrupt fsimage can result in the loss of snapshots.
- Upgrade: Update to a version of CDH containing the fix.
- Workaround: Alternatively, avoid using snapshots. Cloudera BDR uses snapshots automatically when the relevant directories are snapshottable. Hence, we strongly recommend avoiding the upgrade to the affected releases if you are using BDR. For information and instructions, see Enabling and Disabling HDFS Snapshots.
Addressed in release/refresh/patch: CDH 6.3.3
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-390: HDFS Snapshot corruption
CVE-2018-1296 Permissive Apache Hadoop HDFS listXAttr Authorization Exposes Extended Attribute Key/Value Pairs
HDFS exposes extended attribute key/value pairs during listXAttrs, verifying only path-level search access to the directory rather than path-level read permission to the referent.
Products affected: Apache HDFS
Releases affected:
- CDH 5.4.0 - 5.15.1, 5.16.0
- CDH 6.0.0, 6.0.1, 6.1.0
Users affected: Users who store sensitive data in extended attributes, such as users of HDFS encryption.
Date/time of detection: December 12, 2017
Detected by: Rushabh Shah, Yahoo! Inc., Hadoop committer
Severity (Low/Medium/High): Medium
Impact: HDFS exposes extended attribute key/value pairs during listXAttrs, verifying only path-level search access to the directory rather than path-level read permission to the referent. This affects features that store sensitive data in extended attributes.
CVE: CVE-2018-1296
- Upgrade: Update to a version of CDH containing the fix.
- Workaround: If a file contains sensitive data in extended attributes, users and admins need to change the permission to prevent others from listing the directory that contains the file.
Addressed in release/refresh/patch:
- CDH 5.15.2, 5.16.1
- CDH 6.1.1, 6.2.0
Clusters running CDH 5.16.1, 6.1.0, or 6.1.1 can lose some HDFS file permissions any time the NameNode is restarted
When a cluster is upgraded to 5.16.1, 6.1.0, or 6.1.1, roles with SELECT and/or INSERT privileges on an Impala database or table have the REFRESH privilege added as part of the upgrade process. HDFS ACLs for roles with the REFRESH privilege are set with empty permissions whenever the NameNode is restarted. This can cause any jobs or queries run by users within affected roles to fail because they can no longer access the affected Impala databases or tables.
Products Affected: HDFS and components that access files in HDFS
Affected Versions: CDH 5.16.1, 6.1.0, 6.1.1
Users Affected: Clusters with Impala and HDFS ACLs managed by Sentry upgrading from any release to CDH 5.16.1, 6.1.0, and 6.1.1.
Severity (Low/Medium/High): High
Root Cause and Impact: The new privilege REFRESH was introduced in CDH 5.16 and 6.1 and applies to Impala databases and tables. When a cluster is upgraded to 5.16.1, 6.1.0, or 6.1.1, roles with SELECT or INSERT privileges on an Impala database or table will have the REFRESH privilege added during the upgrade.
HDFS ACLs for roles with the REFRESH privilege are set with empty permissions whenever the NameNode is restarted. The NameNode is restarted during the upgrade.
For example, if a group appdev is in role appdev_role and has SELECT access to the Impala table "project", the HDFS ACLs prior to the upgrade would look similar to:
group: appdev group::r--
After the upgrade the HDFS ACLs will be set with no permissions and will look like this:
group: appdev group::---
Any jobs or queries run by users within affected roles will fail because they will no longer be able to access affected Impala database or tables. This impacts any SQL client accessing the affected databases and tables. For example, if a Hive client is used to access a table created in Impala it will also fail. Jobs accessing the files directly through HDFS, e.g. via Spark, will also be impacted.
The HDFS ACLs are reset whenever the NameNode is restarted.
Immediate action required: If possible, do not upgrade to releases CDH 5.16.1, 6.1.0, or 6.1.1 if Impala is used and Sentry manages HDFS ACLs within your environment. Subsequent CDH releases will resolve the problem with a product fix under SENTRY-2490.
If an upgrade is being considered, reach out to your account team to discuss other possibilities, and to receive additional insight into future product release schedules.
If an upgrade must be executed, contact Cloudera Support indicating the upgrade plan and why an upgrade is being executed. Options are available to assist with the upgrade if necessary.
Addressed in release/refresh/patch: Patches for 5.16.1, 6.1.0 and 6.1.1 are available for major supported operating systems. Customers are encouraged to contact Cloudera Support for a patch. The patch should be applied immediately after upgrade to any of the affected versions.
The fix for this TSB will be included in 6.1.2, 6.2.0, 5.16.2, and 5.17.0.
OIV ReverseXML processor fails
The HDFS OIV ReverseXML processor fails if the XML file contains escaped characters.
Affected Versions: CDH 6.x
Apache Issue: HDFS-12828
Cannot move encrypted files to trash
With HDFS encryption enabled, you cannot move encrypted files or directories to the trash directory.
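Workaround: To remove encrypted files or directories, bypass the trash with the -skipTrash option:
hdfs dfs -rm -r -skipTrash /testdir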
Affected Versions: All CDH versions
Apache Issue: HADOOP-10902
HDFS NFS gateway and CDH installation (using packages) limitation
HDFS NFS gateway works as shipped ("out of the box") only on RHEL-compatible systems, but not on SLES or Ubuntu. Because of a bug in native versions of portmap/rpcbind, the HDFS NFS gateway does not work out of the box on SLES or Ubuntu systems when CDH has been installed from the command-line, using packages. It does work on supported versions of RHEL-compatible systems on which rpcbind-0.2.0-10.el6 or later is installed, and it does work if you use Cloudera Manager to install CDH, or if you start the gateway as root. For more information, see CDH and Cloudera Manager Supported Operating Systems.
Workaround:
- On Red Hat and similar systems, make sure rpcbind-0.2.0-10.el6 or later is installed.
- On SLES and Ubuntu systems, do one of the following:
  - Install CDH using Cloudera Manager; or
  - Start the NFS gateway as root; or
  - Start the NFS gateway without using packages; or
  - Run rpcbind in insecure mode, using the -i option. Keep in mind that this allows anyone from a remote host to bind to the portmap.
No error when changing permission to 777 on .snapshot directory
Snapshots are read-only; running chmod 777 on the .snapshot directory does not change this, but does not produce an error (though other illegal operations do).
Affected Versions: All CDH versions
Apache Issue: HDFS-4981
Snapshot operations are not supported by ViewFileSystem
Affected Versions: All CDH versions
Snapshots do not retain directories' quotas settings
Affected Versions: All CDH versions
Apache Issue: HDFS-4897
Permissions for dfs.namenode.name.dir incorrectly set
Hadoop daemons should set permissions for the dfs.namenode.name.dir (or dfs.name.dir) directories to drwx------ (700), but in fact these permissions are set to the file-system default, usually drwxr-xr-x (755).
Workaround: Use chmod to set permissions to 700.
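Whether a NameNode metadata directory already has the intended permissions can be verified before and after running chmod. The Python sketch below checks a local directory's permission bits against 700 using only the standard library. The function name is hypothetical, and the path in the usage line is an example only, not a default taken from this document.

```python
import os
import stat


def has_mode_700(path: str) -> bool:
    """True when the permission bits of path are exactly rwx------ (700)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o700


if __name__ == "__main__":
    import sys
    # Usage (assumed): python check_mode.py /path/to/dfs/nn
    for directory in sys.argv[1:]:
        status = "ok" if has_mode_700(directory) else "needs chmod 700"
        print(directory, status)
```

`stat.S_IMODE` strips the file-type bits, so only the permission bits are compared.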
Affected Versions: All CDH versions
Apache Issue: HDFS-2470
hadoop fsck -move does not work in a cluster with host-based Kerberos
Workaround: Use hadoop fsck -delete
Affected Versions: All CDH versions
Apache Issue: None
Block report can exceed maximum RPC buffer size on some DataNodes
On a DataNode with a large number of blocks, the block report may exceed the maximum RPC buffer size.
Workaround: Increase the value of ipc.maximum.data.length, for example:
<property>
  <name>ipc.maximum.data.length</name>
  <value>268435456</value>
</property>
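The value 268435456 in the property above is 256 MiB expressed in bytes. As a minimal sketch, the following Python snippet derives that number and renders the same property element with the standard library XML module, which can be handy when a different size is needed. The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def property_element(name: str, value) -> ET.Element:
    """Build a Hadoop-style <property> element with <name> and <value>."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return prop


# 256 MiB in bytes, matching the value shown above.
max_bytes = 256 * 1024 * 1024
snippet = ET.tostring(
    property_element("ipc.maximum.data.length", max_bytes), encoding="unicode"
)
print(snippet)
```

The output is a single-line element; whitespace between the tags is not significant to Hadoop's configuration parser.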
Affected Versions: All CDH versions
Apache Issue: None
MapReduce2 and YARN
YARN Resource Managers will stay in standby state after failover or startup
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt
This issue is fixed as YARN-7913.
Products affected: CDH with Fair Scheduler
Releases affected:
- CDH 6.0.x
- CDH 6.1.x
- CDH 6.2.0, CDH 6.2.1
- CDH 6.3.0, CDH 6.3.1, CDH 6.3.2, CDH 6.3.3
Users affected:
Any cluster running the Hadoop YARN service with the following configuration:
- Scheduler set to Fair Scheduler
- The YARN Resource Manager Work Preserving Recovery feature is enabled. That includes High Availability setups.
Impact:
On startup or failover the YARN Resource Manager will process the state store to recover the workload that is currently running in the cluster. The recovery fails with a “null pointer exception” being logged.
Due to the recovery failure the YARN Resource Manager will not become active. In a cluster with High Availability configured the standby YARN Resource Manager will fail with the same exception leaving both YARN Resource Managers in a standby state. Even if the YARN Resource Managers are restarted, they still stay in standby state.
- Customers requiring an urgent fix who are using CDH 6.2.x or earlier: Raise a support case to request a new patch.
- Customers on CDH 6.3.x: Upgrade to the latest maintenance release.
Addressed in release/refresh/patch:
- CDH 6.3.4
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-408: YARN Resource Managers will stay in standby state after failover or startup
NodeManager fails because of the changed default location of container executor binary
The default location of container-executor binary and .cfg files was changed to /var/lib/yarn-ce. It used to be /opt/cloudera/parcels/<CDH_parcel_version>. Because of this change, if you did not have the mount options -noexec and -nosuid set on /opt, the NodeManager can fail to start up as these options are set on /var.
Affected versions: CDH 5.16.1, all CDH 6 versions
Workaround: Either remove the -noexec and -nosuid mount options on /var or change the container-executor binary and .cfg path using the CMF_YARN_SAFE_CONTAINER_EXECUTOR_DIR environment variable.
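Before choosing between the two workarounds, it can be useful to confirm whether the mount that holds the container-executor actually carries the restrictive options. The Python sketch below parses text in the format of /proc/mounts and reports whether noexec or nosuid is set on a given mount point. The function names are hypothetical, and the sketch only inspects the exact mount point passed in, so a separate mount below /var would need to be checked by name.

```python
def mount_options(mounts_text: str, mount_point: str) -> set:
    """Return the option set for mount_point from /proc/mounts-style text.

    If the mount point appears more than once, the last entry wins.
    """
    options = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))
    return options


def blocks_container_executor(mounts_text: str, mount_point: str = "/var") -> bool:
    """True when noexec or nosuid is set on the mount point."""
    return bool({"noexec", "nosuid"} & mount_options(mounts_text, mount_point))


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as handle:
            print(blocks_container_executor(handle.read()))
    except OSError as err:
        print(f"cannot read /proc/mounts: {err}")
```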
The Standby Resource Manager redirects /jmx and /metrics requests to the Active Resource Manager.
- If Enable Kerberos Authentication for HTTP Web-Console is disabled: Cloudera Manager shows statistics for the wrong server.
- If Enable Kerberos Authentication for HTTP Web-Console is enabled: connection from the agent to the standby fails with the HTTPError: HTTP Error 401: Authentication required error message. As a result, the health of the Standby Resource Manager will become bad.
Workaround: N/A
Affected Versions: CDH 6.0.x, CDH 6.1.0
Fixed Version: CDH 6.1.1
Cloudera Issue: CDH-76040
YARN's Continuous Scheduling can cause slowness in Oozie
When Continuous Scheduling is enabled in YARN, it can cause slowness in Oozie due to long delays in communicating with YARN. In Cloudera Manager 5.9.0 and higher, Enable Fair Scheduler Continuous Scheduling is turned off by default.
Workaround: Turn off Enable Fair Scheduler Continuous Scheduling in Cloudera Manager YARN Configuration. To keep equivalent benefits of this feature, turn on Fair Scheduler Assign Multiple Tasks.
Affected Versions: All CDH versions
Cloudera Issue: CDH-60788
JobHistory URL mismatch after server relocation
After moving the JobHistory Server to a new host, the URLs listed for the JobHistory Server on the ResourceManager web UI still point to the old JobHistory Server. This affects existing jobs only. New jobs started after the move are not affected.
Workaround: For any existing jobs that have the incorrect JobHistory Server URL, there is no option other than to allow the jobs to roll off the history over time. For new jobs, make sure that all clients have the updated mapred-site.xml that references the correct JobHistory Server.
Affected Versions: All CDH versions
Apache Issue: None
History link in ResourceManager web UI broken for killed Spark applications
When a Spark application is killed, the history link in the ResourceManager web UI does not work.
Workaround: To view the history for a killed Spark application, see the Spark HistoryServer web UI instead.
Affected Versions: All CDH versions
Apache Issue: None
Cloudera Issue: CDH-49165
Routable IP address required by ResourceManager
ResourceManager requires routable host:port addresses for yarn.resourcemanager.scheduler.address, and does not support using the wildcard 0.0.0.0 address.
Workaround: Set the address, in the form host:port, either in the client-side configuration, or on the command line when you submit the job.
Affected Versions: All CDH versions
Apache Issue: None
Cloudera Issue: CDH-6808
Amazon S3 copy may time out
The Amazon S3 filesystem does not support renaming files, and performs a copy operation instead. If the file to be moved is very large, the operation can time out because S3 does not report progress during the operation.
Workaround: Use -Dmapred.task.timeout=15000000 to increase the MR task timeout.
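The timeout in the workaround above is given in milliseconds, so 15000000 corresponds to 4 hours and 10 minutes. The Python sketch below shows the conversion in both directions; the helper name is hypothetical and exists only to build the flag for a different duration.

```python
from datetime import timedelta


def task_timeout_flag(duration: timedelta) -> str:
    """Render a duration as the -Dmapred.task.timeout flag (milliseconds)."""
    millis = int(duration.total_seconds() * 1000)
    return f"-Dmapred.task.timeout={millis}"


# The value from the workaround, expressed as a duration.
print(timedelta(milliseconds=15_000_000))  # 4:10:00
# The same duration rendered back into the flag.
print(task_timeout_flag(timedelta(hours=4, minutes=10)))
```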
Affected Versions: All CDH versions
Apache Issue: MAPREDUCE-972
Cloudera Issue: CDH-17955
Apache HBase Known Issues
Cloudera Navigator plugin impacts HBase performance
Navigator Audit logging for HBase access can have a large impact on HBase performance, most noticeably during data ingestion.
Component affected: HBase
Products affected: CDH
Releases affected: CDH 6.x
Impact: 4x performance increase was observed in batchMutate calls after disabling Navigator Audit.
Severity: High
- In Cloudera Manager, navigate to the HBase service configuration.
- Find the Enable Audit Collection property and clear it.
- Restart the HBase service.
Upgrade: Upgrade to CDP where Navigator is no longer used.
HBASE-25206: snapshot and cloned table corruption when original table is deleted
HBASE-25206 can cause data loss either through corrupting an existing hbase snapshot or destroying data that backs a clone of a previous snapshot.
Component affected: HBase
Products affected:
- HDP
- CDH
- CDP
Releases affected:
- CDH 6.x.x
- HDP 3.1.5
- CDP PVC Base 7.1.x
- Cloudera Runtime (Public Cloud) 7.0.x
- Cloudera Runtime (Public Cloud) 7.1.x
- Cloudera Runtime (Public Cloud) 7.2.0
- Cloudera Runtime (Public Cloud) 7.2.1
- Cloudera Runtime (Public Cloud) 7.2.2
Users affected: Users of the affected releases.
Impact: Potential risk of Data Loss.
Severity: High
- Make HBase do the clean up work for the splits:
  - Before dropping a table that has any snapshots, first ensure that any regions that resulted from a split have fully rewritten their data and cleanup has happened for the original host region.
  - If there are any remaining children of a split that have links to their parent still, then we first need to issue a major compaction for those regions (or the entire table).
  - After doing the major compaction we need to ensure it has finished before proceeding. There should no longer be any split pointers (named like "<target hfile>.<target region>").
  - Whether or not we needed to do a major compaction we must always tell the catalog janitor to run to ensure the hfiles from any parent regions are moved to the archive.
  - We must wait for the catalog janitor to finish.
  - At this point it is safe to delete the original table without data loss.
- Manually do the archiving:
  - Alternatively, as a part of deleting a table we can manually move all of its files into the archive. First disable the table. Next make sure each region and family combination that is present in the active data area is present in the archive. Finally move all hfiles and links from the active area to the archive.
  - At this point it is safe to drop the table.
- Addressed in release/refresh/patch: Cloudera Runtime 7.2.6.0
Apache issue: HBASE-25206
KB article: For the latest update on this issue see the corresponding Knowledge article: TSB 2021-453: HBASE-25206 "snapshot and cloned table corruption when original table is deleted"
HBase Performance Issue
The HDFS short-circuit setting dfs.client.read.shortcircuit is overridden and disabled by hbase-default.xml. HDFS short-circuit reads access data in HDFS through a domain socket (file) instead of a network socket. This removes the TCP overhead of reading data from HDFS, which can meaningfully improve HBase performance (by as much as 30-40%).
Users can restore short-circuit reads by explicitly setting dfs.client.read.shortcircuit in HBase configuration via the configuration management tool for their product (e.g. Cloudera Manager or Ambari).
Products affected:
- CDP
- CDH
- HDP
Releases affected:
- CDP 7.x
- CDH 6.x
- HDP 3.x
Impact: HBase reads with high data-locality will not execute as fast as previously. HBase random read performance is heavily affected as random reads are expected to have low latency (e.g. Get, Multi-Get). Scan workloads would also be affected, but may be less impacted as latency of scans is greater.
Severity: High
- Cloudera Manager:
HBase → Configurations → HBase (Service-wide) → HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml→
dfs.client.read.shortcircuit=true
dfs.domain.socket.path=< Add same value which is configured in hdfs-site.xml >
- Ambari:
HBase → CONFIGS → Advanced → Custom hbase-site →
dfs.client.read.shortcircuit=true
dfs.domain.socket.path=< Add same value which is configured in hdfs-site.xml >
After making these configuration changes, restart the HBase service.
Cloudera will continue to pursue product changes which may alleviate the need to make these configuration changes.
For CDP 7.1.1.0 and newer, the metric shortCircuitBytesRead can be viewed for each RegionServer under the RegionServer/Server JMX metrics endpoint. When short circuit reads are not enabled, this metric will be zero. When short circuit reads are enabled and the data locality for this RegionServer is greater than zero, the metric should be greater than zero.
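The metric check described above can be scripted against the JSON that a RegionServer's JMX endpoint returns. The Python sketch below assumes the usual Hadoop JMX layout, a top-level "beans" array of objects, and looks for the first bean that carries shortCircuitBytesRead. The function names are hypothetical, and the payload in the usage is a made-up sample, not real output.

```python
import json


def short_circuit_bytes_read(jmx_json: str) -> int:
    """Return shortCircuitBytesRead from the first bean that reports it."""
    for bean in json.loads(jmx_json).get("beans", []):
        if "shortCircuitBytesRead" in bean:
            return int(bean["shortCircuitBytesRead"])
    raise KeyError("shortCircuitBytesRead not found in JMX output")


def short_circuit_reads_observed(jmx_json: str) -> bool:
    """True when the RegionServer has read any bytes via short-circuit."""
    return short_circuit_bytes_read(jmx_json) > 0


if __name__ == "__main__":
    # Hypothetical payload for illustration only.
    sample = json.dumps({"beans": [{"name": "example", "shortCircuitBytesRead": 0}]})
    print(short_circuit_reads_observed(sample))
```

A zero value is ambiguous on its own: as noted above, it can mean either that short-circuit reads are disabled or that the RegionServer has no local data.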
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2021-463: HBase Performance Issue
Default limits for PressureAwareCompactionThroughputController are too low
HDP and CDH releases suffer from low compaction throughput limits, which cause storefiles to back up faster than compactions can re-write them. This was originally identified upstream in HBASE-21000.
Products affected:
- HDP
- CDH
Releases affected:
- HDP 3.0.0 through HDP 3.1.2
- CDH 6.0.x
- CDH 6.1.x
- CDH 6.2.x
- CDH 6.3.0, 6.3.1, 6.3.2, 6.3.3
Users affected: Users of above mentioned HDP and CDH versions.
Severity: Medium
Impact: For non-read-only workloads, this will eventually cause back-pressure onto new writes when the blocking store files limit is reached.
- Upgrade: Upgrade to the latest release version: CDP 7.1.4, HDP 3.1.5, CDH 6.3.4
- Workaround:
- Set the hbase.hstore.compaction.throughput.higher.bound property to 104857600 and the hbase.hstore.compaction.throughput.lower.bound property to 52428800 in hbase-site.xml.
- An alternative solution is to set the hbase.regionserver.throughput.controller property to org.apache.hadoop.hbase.regionserver.throttle.NoLimitThroughputController which will remove all compaction throughput limitations (which has been observed to cause other pressure).
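The two bounds in the first workaround are byte values: 104857600 is 100 MiB and 52428800 is 50 MiB. The Python sketch below derives them and renders the corresponding property entries for hbase-site.xml. The function name is hypothetical, and the layout of the generated XML is one reasonable formatting, not a required one.

```python
from xml.sax.saxutils import escape

MIB = 1024 * 1024

# Values from the workaround above, written as multiples of 1 MiB.
COMPACTION_BOUNDS = {
    "hbase.hstore.compaction.throughput.higher.bound": 100 * MIB,
    "hbase.hstore.compaction.throughput.lower.bound": 50 * MIB,
}


def to_site_properties(props: dict) -> str:
    """Render name/value pairs as <property> blocks."""
    blocks = []
    for name, value in props.items():
        blocks.append(
            "<property>\n"
            f"  <name>{escape(name)}</name>\n"
            f"  <value>{escape(str(value))}</value>\n"
            "</property>"
        )
    return "\n".join(blocks)


print(to_site_properties(COMPACTION_BOUNDS))
```

Keeping the lower bound at or below the higher bound is the only relationship the sketch relies on; the specific values come from the workaround text.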
Apache issue: HBASE-21000
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: Cloudera Customer Advisory: Default limits for PressureAwareCompactionThroughputController are too low
Data loss with restore snapshot
The restore snapshot command causes data loss when the target table was split or truncated after snapshot creation.
Products affected: HBase
Releases affected:
- CDH 6.0.x
- CDH 6.1.x
- CDH 6.2.0
- CDH 6.3.0
Users affected: Users relying on Restore Snapshot functionality.
Impact: The restored table can be missing data if a split or truncate occurred after the snapshot was created.
Immediate action required: Update to a version of CDH containing the fix.
hbase> disable 'table'
hbase> drop 'table'
hbase> clone_snapshot 'snapshot_name', 'table'
hbase> enable 'table'
- CDH 6.2.1
- CDH 6.3.2
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-379: Data loss with restore snapshot
CDH users must not use Apache HBase's OfflineMetaRepair tool
OfflineMetaRepair helps you to rebuild the HBase meta table from the underlying file system. This tool is often used to correct meta table corruption or loss. It is designed to work only with hbase-1.x (CDH 5.x). Users must not run the OfflineMetaRepair tool against CDH 6.x since it uses hbase-2.x. If a user runs OfflineMetaRepair tool in CDH 6.x, then it will break or corrupt the HBase meta table.
If you have already corrupted your meta table, or you believe your meta table requires the use of something like the former OfflineMetaRepair tool, do not attempt any further changes. Contact Cloudera Support.
Products affected: CDH
- CDH 6.0.0, 6.0.1
- CDH 6.1.0, 6.1.1
- CDH 6.2.0
- CDH 6.3.0
Users affected: Clusters with HBase installed.
Impact: Cluster becomes inoperable.
Immediate action required: Update to a version of CDH containing the fix.
Workaround: Do not run OfflineMetaRepair tool.
- CDH 6.2.1
- CDH 6.3.2
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-376: CDH users must not use Apache HBase's OfflineMetaRepair tool
Multiple HBase Services on the Same CDH Cluster is not Supported
Cloudera Manager does not allow you to deploy multiple HBase services on the same host of an HDFS cluster because, by design, a DataNode can have only a single HBase service per host. It is possible to have two HBase services on the same HDFS cluster, but they have to be on different DataNodes, meaning that there will be one RegionServer per DataNode per HBase cluster. However, that requires additional configuration; for example, you have to pin /hbase_enc and /hbase to prevent the HDFS balancer from causing issues with data locality.
If Cloudera Manager is not used, you can manage multiple configurations per host for different RegionServers that are part of different HBase clusters but that can lead to multiple issues and difficult troubleshooting procedures. Thus, Cloudera does not support managing multiple HBase services on the same CDH cluster.
IOException from Timeouts
CDH 5.12.0 includes the fix HBASE-16604, where the internal scanner that retries in case of IOException from timeouts could potentially miss data. Java clients were properly updated to account for the new behavior, but thrift clients will now see exceptions where the previous missing data would be.
Workaround: Create a new scanner and retry the operation when encountering this issue.
IntegrationTestReplication fails if replication does not finish before the verify phase begins
During IntegrationTestReplication, if the verify phase starts before the replication phase finishes, the test will fail because the target cluster does not contain all of the data. If the HBase services in the target cluster do not have enough memory, long garbage-collection pauses might occur.
Workaround: Use the -t flag to set the timeout value before starting verification.
Cloudera Issue: None.
HDFS encryption with HBase
Cloudera has tested the performance impact of using HDFS encryption with HBase. The overall overhead of HDFS encryption on HBase performance is in the range of 3 to 4% for both read and update workloads. Scan performance has not been thoroughly tested.
ExportSnapshot or DistCp operations may fail on the Amazon s3a:// protocol
ExportSnapshot or DistCp operations may fail on AWS when using certain JDK 8 versions, due to an incompatibility between the AWS Java SDK 1.9.x and the joda-time date-parsing module.
Workaround: Use joda-time 2.8.1 or higher, which is included in AWS Java SDK 1.10.1 or higher.
Cloudera Issue: None.
An operating-system level tuning issue in RHEL7 causes significant latency regressions
There are two distinct causes for the regressions, depending on the workload:
- For a cached workload, the regression may be up to 11%, as compared to RHEL6. The cause relates to differences in the CPU's C-state (power saving state) behavior. With the same workload, the CPU is around 40% busier in RHEL7, and the CPU spends more time transitioning between C-states in RHEL7. Transitions out of deeper C-states add latency. When CPUs are configured to never enter a C-state lower than 1, RHEL7 is slightly faster than RHEL6 on the cached workload. The root cause is still under investigation and may be hardware-dependent.
- For an IO-bound workload, the regression may be up to 8%, even with common C-state settings. A 6% difference in average disk service time has been observed, which in turn seems to be caused by a 10% higher average read size at the drive on RHEL7. The read sizes issued by HBase are the same in both cases, so the root cause seems to be a change in the EXT4 filesystem or the Linux block IO layer. The root cause is still under investigation.
Bug: None
Severity: Medium
Workaround: Avoid using RHEL 7 if you have a latency-critical workload. For a cached workload, consider tuning the C-state (power-saving) behavior of your CPUs.
Export to Azure Blob Storage (the wasb:// or wasbs:// protocol) is not supported
CDH 5.3 and higher supports Azure Blob Storage for some applications. However, a null pointer exception occurs when you specify a wasb:// or wasbs:// location in the --copy-to option of the ExportSnapshot command or as the output directory (the second positional argument) of the Export command.
Workaround: None.
Apache Issue: HADOOP-12717
AccessController postOperation problems in asynchronous operations
When security and Access Control are enabled, the following problems occur:
- If a Delete Table fails for a reason other than missing permissions, the access rights are removed but the table may still exist and may be used again.
- If hbaseAdmin.modifyTable() is used to delete column families, the rights are not removed from the Access Control List (ACL) table. The postOperation is implemented only for postDeleteColumn().
- If Create Table fails, full rights for that table persist for the user who attempted to create it. If another user later succeeds in creating the table, the user who made the failed attempt still has the full rights.
Workaround: None
Apache Issue: HBASE-6992
Apache Hive/HCatalog/Hive on Spark Known Issues
This topic contains:
Hive Known Issues
BDR - Hive restore failing during import
When the table filter used during a Hive cloud restore differs from the table filter used to create the Hive cloud backup, the import step fails with a "table not found" error. Currently, this affects only the cloud restore scenario.
Products affected: Cloudera Manager
- Cloudera Manager 5.15, 5.16
- Cloudera Manager 6.1.x
- Cloudera Manager 6.2.x
- Cloudera Manager 6.3.x
Users affected: BDR, Hive cloud restore, where restore uses a subset of tables from the exported tables
- Limited, the hive cloud restore all tables works properly.
- The hive cloud restore from the hive cloud backup created prior to Cloudera Manager 5.15 would work without any problem.
- No other BDR functionality is affected.
- Workaround: Not available. Importing specific tables fails. Importing ALL tables continues to work properly.
- Upgrade: Upgrade to a Cloudera Manager version containing the fix.
Addressed in release/refresh/patch: Cloudera Manager 7.0 and higher versions
Query with an empty WHERE clause problematic if vectorization is off
SELECT COUNT (DISTINCT cint) FROM alltypesorc WHERE cstring1;
SELECT 1 WHERE 1;
If vectorization is turned on and no rules turn off the vectorization, queries run as expected.
Workaround: Rewrite queries with casts or equals.
Affected Versions: 6.3.x, 6.2.x, 6.1.x, 6.0.x
Apache Issue: HIVE-15408
Cloudera Issue: CDH-81649
Query with DISTINCT can fail if vectorization is on
A query can fail when vectorization is turned on, the query contains DISTINCT, and other rules do not turn off the vectorization. A query-specific error message appears, for example:
Error: Error while compiling statement: FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: The column KEY._col2:0._col0 is not in the vectorization context column map {KEY._col0=0, KEY._col1=1, KEY._col2=2}. (state=42000,code=40000)
set hive.vectorized.execution.enabled=false;
Affected Versions: 6.3.x, 6.2.x, 6.1.x, 6.0.x
Apache Issue: HIVE-19032
Cloudera Issue: CDH-81341
When vectorization is enabled on any file type (ORC, Parquet) queries that divide by zero using the modulo operator (%) return an error
When vectorization is enabled for Hive on any file type, including ORC and Parquet, if the query divides by zero using the modulo operator (%), it returns the following error: Arithmetic exception [divide by] 0. For example, if you run the following query this issue is triggered: SELECT 100 % column_c1 FROM table_t1; and the value in column_c1 is zero. The divide operator (/) is not affected by this issue.
Workaround: Disable vectorization for the query that is triggering this at either the session level by using the SET statement or at the server level by disabling the property with Cloudera Manager. For information about how to enable or disable query vectorization, see Enabling Hive Query Vectorization.
Affected Versions: When query vectorization is enabled for Hive, this issue affects Hive ORC tables in all versions of CDH and affects Hive Parquet tables in CDH 6.0 and later
Apache Issue: HIVE-19564
Cloudera Issue: CDH-71211
When vectorization is enabled for Hive on any file type (ORC, Parquet) queries that perform comparisons in the SELECT clause on large values in columns with the data type of BIGINT might return wrong results
When vectorization is enabled for Hive on any file type, including ORC and Parquet, if the query performs a comparison operation between very large values in columns that are BIGINT data types in the SELECT clause of the query, incorrect results might be returned. Comparison operators include ==, !=, <, <=, >, and >=. This issue does not occur when the comparison operation is performed in the filtering clause of the query. This issue can also occur when the difference of values in such columns is out of range for a LONG (64-bit) data type. For example, if column_c1 stores 8976171455044006767 and column_c2 stores -7272907770454997143, a query such as SELECT column_c1 < column_c2 FROM table_test returns true instead of false because the difference (8976171455044006767 - (-7272907770454997143)) is 1.6249079225499E19 which is greater than 9.22337203685478E18, which is the maximum possible value that a LONG (64-bit) data type can hold.
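The arithmetic in the example above can be reproduced with a short Python sketch. It models a comparison that looks only at the sign of a 64-bit signed difference. This is an illustrative assumption about how such an overflow produces the wrong answer, not a description of Hive's actual vectorized code; the function names are hypothetical.

```python
def wrap_int64(value: int) -> int:
    """Reduce an arbitrary Python int to two's-complement signed 64-bit."""
    return (value + 2**63) % 2**64 - 2**63


def less_than_via_difference(a: int, b: int) -> bool:
    """Compare by the sign of a 64-bit difference; wrong when a - b overflows."""
    return wrap_int64(a - b) < 0


c1 = 8976171455044006767
c2 = -7272907770454997143

print("exact comparison:     ", c1 < c2)
print("difference-based test:", less_than_via_difference(c1, c2))
```

The exact difference exceeds the 64-bit maximum, so the wrapped value turns negative and the difference-based test reports that column_c1 is smaller, which mirrors the incorrect true result described above.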
Workaround: Use a DECIMAL type instead of BIGINT for columns that might contain very large values. Another option is to disable vectorization for the query that is triggering this at either the session level by using the SET statement or at the server level by disabling the property with Cloudera Manager. For information about how to enable or disable query vectorization, see Enabling Hive Query Vectorization.
Affected Versions: When query vectorization is enabled for Hive, this issue affects Hive ORC tables in all versions of CDH and affects Hive Parquet tables in CDH 6.0 and later
Apache Issue: HIVE-20207
Cloudera Issue: CDH-70996
Specified column position in the ORDER BY clause is not supported for SELECT * queries
CREATE TABLE decimal_1 (id decimal(5,0));
SELECT * FROM decimal_1 ORDER BY 1 limit 100;
Error while compiling statement: FAILED: SemanticException [Error 10219]: Position in ORDER BY is not supported when using SELECT *
Instead, the query must list out the columns it is selecting.
Affected Versions: CDH 6.0.0 and higher
Cloudera Issue: CDH-68550
DirectSQL with PostgreSQL
Hive does not support direct SQL queries with a PostgreSQL database. It supports this feature only with MySQL, MariaDB, and Oracle. With PostgreSQL, direct SQL is disabled as a precaution, because issues have been reported upstream where it is not possible to fall back on DataNucleus in the event of some failures, along with other non-standard behaviors. For more information, see Hive Configuration Properties.
Affected Versions: All CDH versions
Cloudera Issue: CDH-49017
ALTER PARTITION … SET LOCATION does not work on Amazon S3 or between S3 and HDFS
Cloudera recommends that you do not use ALTER PARTITION … SET LOCATION on S3 or between S3 and HDFS. The rest of the ALTER PARTITION commands work as expected.
Affected Versions: All CDH versions
Cloudera Issue: CDH-42420
Commands run against an Oracle-backed metastore might fail
javax.jdo.JDODataStoreException Incompatible data type for column TBLS.VIEW_EXPANDED_TEXT : was CLOB (datastore), but type expected was LONGVARCHAR (metadata). Please check that the type in the datastore and the type specified in the MetaData are consistent.
This error might occur if the metastore is run on top of an Oracle database with the configuration property datanucleus.validateColumns set to true.
Workaround: Set datanucleus.validateColumns=false in the hive-site.xml configuration file.
Affected Versions: All CDH versions
Cannot create archive partitions with external HAR (Hadoop Archive) tables
ALTER TABLE ... ARCHIVE PARTITION is not supported on external tables.
Affected Versions: All CDH versions
Cloudera Issue: CDH-9638
Object types Server and URI are not supported in "SHOW GRANT ROLE roleName on OBJECT objectName" statements
Workaround: Use SHOW GRANT ROLE roleName to list all privileges granted to the role.
Affected Versions: All CDH versions
Cloudera Issue: CDH-19430
Logging differences create Supportability Issues
In the event you need Apache Hive support from Cloudera, the availability of logs is critical. Some CDH releases do not enable log4j2 logging for Hive by default. Because of this, logs are not generated. Furthermore, the specified CDH releases are not configured to remove old log files to make room for new ones. This can cause the new logs to be lost. When Hive logs are missing, Support cannot troubleshoot Hive problems efficiently.
Components affected: Hive
Products affected: Hive
- CDH 6.1
- CDH 6.2
- CDH 6.3
Users affected: Hive users
Severity: Medium
Impact: The absence of Hive log files causes delays in troubleshooting Hive problems.
- Open Cloudera Manager.
- Select .
- Click the Configuration tab.
- In the Search field, enter Hive Metastore Server Logging Advanced Configuration Snippet (Safety Valve).
- Add the following XML to the field (or switch to Editor mode, and enter each property and its value in the fields provided).
<property>
  <name>rootLogger.appenderRefs</name>
  <value>root, console, DRFA, PerfLogger</value>
</property>
<property>
  <name>logger.PerfLogger.name</name>
  <value>org.apache.hadoop.hive.ql.log.PerfLogger</value>
</property>
<property>
  <name>logger.PerfLogger.level</name>
  <value>DEBUG</value>
</property>
<property>
  <name>appender.DRFA.filePattern</name>
  <value>${log.dir}/${log.file}.%i</value>
</property>
<property>
  <name>appender.DRFA.strategy.fileIndex</name>
  <value>min</value>
</property>
- In the Search field, enter HiveServer2 Logging Advanced Configuration Snippet (Safety Valve).
- Add the XML properties from step 5.
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-384: Logging differences in CDH 6 create Supportability Issues
HCatalog Known Issues
There are no notable known issues in this release of HCatalog.
Hive on Spark (HoS) Known Issues
A query fails with IllegalArgumentException Size requested for unknown type: java.util.Collection
WITH t2 AS (SELECT array(1,2) AS c1 UNION ALL SELECT array(2,3) AS c1) SELECT collect_list(c1) FROM t2
Workaround: Create a table to store the array data.
Affected Versions: 6.3.x, 6.2.x, 6.1.x
Cloudera Issue: CDH-80169
Hive on Spark queries fail with "Timed out waiting for client to connect" for an unknown reason
If this exception is preceded by logs of the form "client.RpcRetryingCaller: Call exception...", then this failure is due to an unavailable HBase service. On a secure cluster, spark-submit will try to obtain delegation tokens from HBase, even though Hive on Spark might not need them. So if HBase is unavailable, spark-submit throws an exception.
Workaround: Fix the HBase service, or set spark.yarn.security.tokens.hbase.enabled to false.
Affected Versions: CDH 5.7.0 and higher
Cloudera Issues: CDH-59591, CDH-59599
Hue Known Issues
Cloudera Hue is vulnerable to Cross-Site Scripting attacks
- CVE-2021-29994 - The Add Description field in the Table schema browser does not sanitize user inputs as expected.
- CVE-2021-32480 - Default Home direct button in Filebrowser is also susceptible to XSS attack.
- CVE-2021-32481 - The Error snippet dialog of the Hue UI does not sanitize user inputs.
Products affected: Hue
- CDP Public Cloud 7.2.10 and lower
- CDP Private Cloud Base 7.1.6 and lower
- CDP Private Cloud Plus 1.2 and lower (NOTE: CDP Private Cloud Plus was renamed to CDP Private Cloud Experiences for version 1.2)
- Cloudera Data Warehouse (DWX) 1.1.2-b1484 (CDH 7.2.11.0-59) or lower
- CDH 6.3.4 and lower
Users affected: All users of the affected versions
- CVE-2021-29994 - 5.5 (Medium) CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:U/C:L/I:L/A:L
- CVE-2021-32480 - 5.5 (Medium) CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:U/C:L/I:L/A:L
- CVE-2021-32481 - 5.5 (Medium) CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:U/C:L/I:L/A:L
Severity (Low/Medium/High): Medium
Impact: Security vulnerabilities as described in the CVEs
- Upgrade (recommended):
- CDP Public Cloud users should upgrade to 7.2.11
- CDP Private Cloud Base users should upgrade to CDP 7.1.7
- CDP Private Cloud Plus users should upgrade to CDP PVC 1.3
- Cloudera Data Warehouse users should upgrade to the latest version DWX1.1.2-b1793 & CDH 2021.0.1-b10
- CDH users should request a patch
Hue Silently Disables StartTLS in LDAP Connections
There are two mechanisms to secure communication to an LDAP server. One is to use an ‘ldaps’ connection, where all traffic is encrypted inside a TLS tunnel - much like ‘https’. The other is to use ‘StartTLS’, where traffic begins unencrypted in the “ldap” protocol and then upgrades itself to a TLS connection.
If StartTLS is enabled in the Hue configuration but the ‘ldap_cert’ parameter is not configured, then Hue silently disables StartTLS.
StartTLS will not be used for synchronization or import, even if StartTLS is enabled and the ‘ldap_cert’ parameter is set.
The result is that connections that the administrator assumes to be secured, using StartTLS, are not actually secure.
CVE: CVE-2019-19146
Date/time of detection: 22nd March, 2019
Detected by: Ben Gooley, Cloudera
Severity (Low/Medium/High): 8.8 High CVSS AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H
Products affected: CDH
- CDH 5.x
- CDH 6.1.0
- CDH 6.1.1
- CDH 6.2.0
- CDH 6.2.1
- CDH 6.3.0
Users affected: All users who have StartTLS enabled in the Hue configuration and use LDAP as the authentication backend to log in to Hue.
Impact: Sensitive data exposure.
- Upgrade (recommended): Update to a version of CDH containing the fix.
- Workaround: Use “ldaps” instead of “ldap” and StartTLS.
Addressed in release/refresh/patch: CDH 6.3.1 and above
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-371: Hue Silently Disables StartTLS in LDAP Connections
Hue allows unsigned SAML assertions
If Hue receives an unsigned assertion, it continues to process it as valid. This means it is possible for an end-user to forge or remove the signature and manipulate a SAML assertion to gain access without a successful authentication.
Products affected: Hue, CDH
- CDH 5.15.x and earlier
- CDH 5.16.0, 5.16.1
- CDH 6.0.x
- CDH 6.1.x
Users affected: All users who are using SAML with Hue.
CVE: CVE-2019-14775
Date/time of detection: January 2019
Detected by: Joel Snape
Severity (Low/Medium/High): High
Impact:
This is a significant security risk as it allows anyone to fake their access validity and therefore access Hue, even if they should not have access. In more detail: if Hue receives an unsigned assertion, it continues to process it as valid. This means it is possible for an end-user to forge or remove the signature and manipulate a SAML assertion to gain access without a successful authentication.
- Upgrade (recommended): Upgrade to a version of CDH containing the fix.
- Workaround: None
- CDH 5.16.2
- CDH 6.2.0
Hue external users granted super user privileges in C6
When using either the LdapBackend or the SAML2Backend authentication backends in Hue, users that are created on login when logging in for the first time are granted superuser privileges in CDH 6. This does not apply to users that are created through the User Admin application in Hue.
Products affected: Hue
Releases affected: CDH 6.0.0, CDH 6.0.1, CDH 6.1.0
Users affected: All users
Date/time of detection: Dec/12/18
Severity (Low/Medium/High): Medium
Impact:
The superuser privilege is granted to any user that logs in to Hue when LDAP or SAML authentication is used. For example, if you have the create_users_on_login property set to true in the Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini, and you are using LDAP or SAML authentication, a user that logs in to Hue for the first time is created with superuser privileges and can perform the following actions:
- Create/Delete users and groups
- Assign users to groups
- Alter group permissions
- Synchronize Hue users with your LDAP server
- Create local users and groups (these local users can login to Hue only if the mode of multi-backend authentication is set up as LdapBackend and AllowFirstUserDjangoBackend)
- Assign users to groups
- Alter group permissions
- When users are synced with your LDAP server manually by using the User Admin page in Hue.
- When you are using other authentication methods. For example:
- AllowFirstUserDjangoBackend
- Spnego
- PAM
- Oauth
- Local users, including users created by unexpected superusers, can log in through AllowFirstUserDjangoBackend.
- Local users in Hue that are created as hive, hdfs, or solr have privileges to access protected data and alter permissions in the security app.
- Removing the AllowFirstUserDjangoBackend authentication backend can stop local users from logging in to Hue, but it requires the administrator to have Cloudera Manager access.
CVE: CVE-2019-7319
Immediate action required: Upgrade and follow the instructions below.
Addressed in release/refresh/patch: CDH 6.1.1 and CDH 6.2.0
UPDATE useradmin_userprofile SET `creation_method` = 'EXTERNAL' WHERE `creation_method` = 'CreationMethod.EXTERNAL';
After executing the UPDATE statement, new Hue users are no longer automatically created as superusers.
To find out the list of superusers, run SQL query:
SELECT username FROM auth_user WHERE is_superuser = 1;
- Log in to the Hue UI as an administrator.
- In the upper right corner of the page, click the user drop-down list and select Manage User.
- In the User Admin page, make sure that the Users tab is selected and click the name of the user in the list that you want to edit.
- In the Hue Users - Edit user page, click Step 3: Advanced.
- Clear the checkbox for Superuser status.
- At the bottom of the page, click Update user to save the change.
For the latest update on this issue see the corresponding Knowledge article:
TSB 2019-360: Hue external users granted super user privileges in C6
Hue does not support the Spark App
Hue does not currently support the Spark application.
Logs are not updating in /var/log/hue after upgrading to CDH 6
If, after upgrading to CDH 6, you find that the logs in /var/log/hue are not being updated, the alternatives link was lost during the upgrade.
Workaround: To resolve this issue, open a terminal window and perform the following on every Hue server:
For RHEL/CentOS:
/usr/sbin/alternatives --install /etc/hue/conf hue-conf /opt/cloudera/parcels/CDH/etc/hue/conf.empty 10
For SLES:
/usr/sbin/update-alternatives --install /etc/hue/conf hue-conf /opt/cloudera/parcels/CDH/etc/hue/conf.empty 10
Apache Impala Known Issues
The following sections describe known issues and workarounds in Impala, as of the current production release. This page summarizes the most serious or frequently encountered issues in the current release, to help you make planning decisions about installing and upgrading. Any workarounds are listed here. The bug links take you to the Impala issues site, where you can see the diagnosis and whether a fix is in the pipeline.
Continue reading:
- Impala Known Issues: Startup
- Impala Known Issues: Crashes and Hangs
- Impala Known Issues: Performance
- Impala Known Issues: Security
- Impala logs the session / operation secret on most RPCs at INFO level
- Authenticated user with access to active session or query id can hijack other Impala session or query
- XSS Cloudera Manager
- Impala does not support Heimdal Kerberos
- System-wide auth-to-local mapping not applied correctly to Kudu service account
- Impala Known Issues: Resources
- Impala Known Issues: Correctness
- Impala Known Issues: Metadata
- Impala Known Issues: Interoperability
- Impala Known Issues: Limitations
- Impala Known Issues: Miscellaneous / Older Issues
Impala Known Issues: Startup
These issues can prevent one or more Impala-related daemons from starting properly.
Impala requires FQDN from hostname command on kerberized clusters
The method Impala uses to retrieve the host name while constructing the Kerberos principal is the gethostname() system call. This function might not always return the fully qualified domain name, depending on the network configuration. If the daemons cannot determine the FQDN, Impala does not start on a kerberized cluster.
Workaround: Test whether a host is affected by checking whether the output of the hostname command includes the FQDN. On hosts where hostname returns only the short name, pass the command-line flag --hostname=fully_qualified_domain_name in the startup options of all Impala-related daemons.
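The host check can be sketched in Python, whose socket.gethostname() wraps the same gethostname() system call. Treating any name that contains a dot-separated domain part as fully qualified is a heuristic assumed here for illustration, and the function name is hypothetical.

```python
import socket


def looks_fully_qualified(name: str) -> bool:
    """Heuristic: a fully qualified name has a host part and a domain part."""
    host, _, domain = name.strip().rstrip(".").partition(".")
    return bool(host) and bool(domain)


name = socket.gethostname()  # same system call the text refers to
if looks_fully_qualified(name):
    print(f"{name}: looks fully qualified")
else:
    print(f"{name}: short name only; consider the --hostname startup flag")
```

Running this on each host that runs an Impala daemon gives a quick indication of which hosts might need the flag.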
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-4978
Impala Known Issues: Crashes and Hangs
These issues can cause Impala to quit or become unresponsive.
Unable to view large catalog objects in catalogd Web UI
In catalogd Web UI, you can list metadata objects and view their details. These details are accessed via a link and printed to a string formatted using thrift's DebugProtocol. Printing large objects (> 1 GB) in Web UI can crash catalogd.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-6841
Impala Known Issues: Performance
These issues involve the performance of operations such as queries or DDL statements.
Metadata operations block read-only operations on unrelated tables
Metadata operations that change the state of a table, like COMPUTE STATS or ALTER RECOVER PARTITIONS, may delay metadata propagation of unrelated unloaded tables triggered by statements like DESCRIBE or SELECT queries.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-6671
Impala Known Issues: Security
These issues relate to security features, such as Kerberos authentication, Sentry authorization, encryption, auditing, and redaction.
Impala logs the session / operation secret on most RPCs at INFO level
Impala logs contain the session / operation secret. With this information a person who has access to the Impala logs might be able to hijack other users' sessions. This means the attacker is able to execute statements for which they do not have the necessary privileges otherwise. Impala deployments where Apache Sentry or Apache Ranger authorization is enabled may be vulnerable to privilege escalation. Impala deployments where audit logging is enabled may be vulnerable to incorrect audit logging.
Restricting access to the Impala logs that expose secrets will reduce the risk of an attack. Additionally, restricting access to trusted users for the Impala deployment will also reduce the risk of an attack. Log redaction techniques can be used to redact secrets from the logs. For more information, see the Cloudera Manager documentation.
For log redaction, users can create a rule with the search pattern secret \(string\) [=:].* and a replacement such as secret=LOG-REDACTED
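The effect of such a rule can be previewed with a small Python sketch before configuring it. The pattern and replacement are the ones quoted above; the sample log line is invented for illustration, and the sketch assumes the pattern behaves the same way in Python's re module as in the regular expression engine used by the redaction feature.

```python
import re

# Search pattern and replacement quoted in the text.
PATTERN = re.compile(r"secret \(string\) [=:].*")
REPLACEMENT = "secret=LOG-REDACTED"


def redact(line: str) -> str:
    """Replace the secret and everything after it on the same line."""
    return PATTERN.sub(REPLACEMENT, line)


# Hypothetical log line, made up for this example.
sample = 'I0101 12:00:00.000000 1234 server.cc:1] query(): secret (string) = "abc123"'
print(redact(sample))
```

Because the pattern ends in .*, everything after the secret on the same line is also removed, which trades some log detail for safety.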
This vulnerability is fixed upstream under IMPALA-10600.
- CDP Private Cloud Base
- CDP Public Cloud
- CDH
- CDP Private Cloud Base 7.0.3, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5 and 7.1.6
- CDP Public Cloud 7.0.0, 7.0.1, 7.0.2, 7.1.0, 7.2.0, 7.2.1, 7.2.2, 7.2.6, 7.2.7, and 7.2.8
- All CDH 6.3.4 and lower releases
Users affected: Impala users of the affected releases
Severity (Low/Medium/High): 7.5 (High) CVSS:3.1/AV:N/AC:H/PR:L/UI:N/S:U/C:H/I:H/A:H
Impact: Unauthorized access
CVE: CVE-2021-28131
Immediate action required: Upgrade to a CDP Private Cloud Base or CDP Public Cloud version containing the fix.
- CDP Private Cloud Base 7.1.7
- CDP Public Cloud 7.2.9 or higher versions
Authenticated user with access to active session or query id can hijack other Impala session or query
If an authenticated Impala user supplies a valid query id to Impala's HS2 and Beeswax interfaces, they can perform operations on other sessions or queries when normally they do not have privileges to do so.
- CDH 5.16.x and lower
- CDH 6.0.x
- CDH 6.1.x
- CDH 6.2.0
Users affected: All Impala users of affected versions.
Date/time of detection: 21st May 2019
Severity (Low/Medium/High): 7.5 (High) (CVSS 3.0: AV:N/AC:H/PR:L/UI:N/S:U/C:H/I:N/A:N)
Impact: Neither the original issue nor the fix affects the normal use of the system.
CVE: CVE-2019-10084
Immediate action required: There is no workaround. Upgrade to a version of CDH containing the fix.
Addressed in release/refresh/patch: CDH 6.2.1 and higher versions
XSS Cloudera Manager
Malicious Impala queries can result in Cross Site Scripting (XSS) when viewed in Cloudera Manager.
Products affected: Apache Impala
- Cloudera Manager 5.13.x, 5.14.x, 5.15.1, 5.15.2, 5.16.1
- Cloudera Manager 6.0.0, 6.0.1, 6.1.0
Users affected: All Cloudera Manager Users
Date/time of detection: November 2018
Severity (Low/Medium/High): High
Impact: When a malicious user generates a piece of JavaScript in the impala-shell and then goes to the Queries tab of the Impala service in Cloudera Manager, that piece of JavaScript code gets evaluated, resulting in an XSS.
CVE: CVE-2019-14449
Immediate action required: There is no workaround. Upgrade to the latest available maintenance release.
- Cloudera Manager 5.16.2
- Cloudera Manager 6.0.2, 6.1.1, 6.2.0, 6.3.0
Impala does not support Heimdal Kerberos
Heimdal Kerberos is not supported in Impala.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-7072
System-wide auth-to-local mapping not applied correctly to Kudu service account
Due to system auth_to_local mapping, the principal may be mapped to some local name.
When running with Kerberos enabled, you may hit the following error message where <random-string> is some random string which doesn't match the primary in the Kerberos principal.
WARNINGS: TransmitData() to X.X.X.X:27000 failed: Remote error: Not authorized: {username='<random-string>', principal='impala/redacted'} is not allowed to access DataStreamService
Workaround: Start Impala with the --use_system_auth_to_local=false flag to ignore the system-wide auth_to_local mappings configured in /etc/krb5.conf.
Affected Versions: CDH 5.15, CDH 6.1 and higher
Apache Issue: KUDU-2198
Impala Known Issues: Resources
These issues involve memory or disk usage, including out-of-memory conditions, the spill-to-disk feature, and resource management features.
Handling large rows during upgrade to CDH 5.13 / Impala 2.10 or higher
After an upgrade to CDH 5.13 / Impala 2.10 or higher, users who process very large column values (long strings), or have increased the --read_size configuration setting from its default of 8 MB, might encounter capacity errors for some queries that previously worked.
Resolution: After the upgrade, follow the instructions in Handling Large Rows During Upgrade to CDH 5.13 / Impala 2.10 or Higher to check if your queries are affected by these changes and to modify your configuration settings if so.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-6028
Configuration to prevent crashes caused by thread resource limits
Impala could encounter a serious error due to resource usage under very high concurrency. The error message is similar to:
F0629 08:20:02.956413 29088 llvm-codegen.cc:111] LLVM hit fatal error: Unable to allocate section memory! terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::thread_resource_error> >'
Workaround:
In CDH 6.0 and lower versions of CDH, configure each host running an impalad daemon with the following settings:
echo 2000000 > /proc/sys/kernel/threads-max
echo 2000000 > /proc/sys/kernel/pid_max
echo 8000000 > /proc/sys/vm/max_map_count
In CDH 6.1 and higher versions, it is unlikely that you will hit the thread resource limit. Configure each host running an impalad daemon with the following setting:
echo 8000000 > /proc/sys/vm/max_map_count
To make the setting persist across reboots:
- Add the following line to /etc/sysctl.conf:
vm.max_map_count=8000000
- Run the following command:
sysctl -p
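Before applying the workaround, it can help to see which limits on a host are currently below the values quoted above. The following is a minimal sketch, not a Cloudera tool: the function names are hypothetical, and the thresholds are simply the numbers from this section (on CDH 6.1 and higher, only vm/max_map_count is expected to matter).

```python
from pathlib import Path

# Thresholds copied from the workaround text above.
RECOMMENDED = {
    "kernel/threads-max": 2_000_000,
    "kernel/pid_max": 2_000_000,
    "vm/max_map_count": 8_000_000,
}


def read_limits(proc_sys: Path = Path("/proc/sys")) -> dict:
    """Read the current value of each limit; files that do not exist are skipped."""
    current = {}
    for name in RECOMMENDED:
        path = proc_sys / name
        if path.is_file():
            current[name] = int(path.read_text().split()[0])
    return current


def below_recommended(current: dict) -> dict:
    """Return {name: (current, recommended)} for every limit below its threshold."""
    return {
        name: (value, RECOMMENDED[name])
        for name, value in current.items()
        if value < RECOMMENDED[name]
    }
```

Calling below_recommended(read_limits()) on an impalad host returns an empty dictionary when no change is needed.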
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-5605
Breakpad minidumps can be very large when the thread count is high
The size of the breakpad minidump files grows linearly with the number of threads. By default, each thread adds 8 KB to the minidump size. Minidump files could consume significant disk space when the daemons have a high number of threads.
Workaround: Add --minidump_size_limit_hint_kb=size to set a soft upper limit on the size of each minidump file. If the minidump file would exceed that limit, Impala reduces the amount of information for each thread from 8 KB to 2 KB. (Full thread information is captured for the first 20 threads, then 2 KB per thread after that.) The minidump file can still grow larger than the "hinted" size. For example, if you have 10,000 threads, the minidump file can be more than 20 MB.
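The arithmetic behind the 10,000-thread example can be sketched as follows. This is an estimate of the per-thread portion only, using the figures stated above (8 KB per thread at full detail, 2 KB once the limit applies, full detail kept for the first 20 threads). The function name is hypothetical, and other minidump contents are not modeled.

```python
FULL_KB = 8         # per-thread size at full detail
REDUCED_KB = 2      # per-thread size once the size-limit hint takes effect
FULL_THREADS = 20   # threads that keep full detail even when the limit applies


def estimate_thread_kb(threads: int, reduced: bool) -> int:
    """Estimate the KB of thread information written to a minidump."""
    if not reduced:
        return threads * FULL_KB
    full = min(threads, FULL_THREADS)
    return full * FULL_KB + (threads - full) * REDUCED_KB


# 10,000 threads: 20 * 8 KB + 9,980 * 2 KB = 20,120 KB with the limit applied,
# compared with 80,000 KB at full detail.
print(estimate_thread_kb(10_000, reduced=True))
print(estimate_thread_kb(10_000, reduced=False))
```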
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-3509
Process mem limit does not account for the JVM's memory usage
Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the impalad daemon.
Workaround: To monitor overall memory usage, use the top command, or add the memory figures in the Impala web UI /memz tab to JVM memory usage shown on the /metrics tab.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-691
Impala Known Issues: Correctness
These issues can cause incorrect or unexpected results from queries. They typically only arise in very specific circumstances.
Timestamp type-casted to varchar in a binary predicate can produce incorrect result
> select * from (select cast('2018-12-11 09:59:37' as timestamp) as ts) tbl where cast(ts as varchar(10)) = '2018-12-11';
The output has 0 rows, although the predicate should match the single row.
Releases affected:
- CDH 5.15.0, 5.15.1, 5.15.2, 5.16.0, 5.16.1
- CDH 6.0.0, 6.0.1, 6.1.0, 6.1.1
Addressed in release/refresh/patch:
- CDH 5.16.2
- CDH 6.2.0
For the latest update on this issue see the corresponding Knowledge article: TSB 2019-358: Timestamp type-casted to varchar in a binary predicate can produce incorrect result
Incorrect result due to constant evaluation in query with outer join
explain SELECT 1 FROM alltypestiny a1
  INNER JOIN alltypesagg a2 ON a1.smallint_col = a2.year AND false
  RIGHT JOIN alltypes a3 ON a1.year = a1.bigint_col;
+---------------------------------------------------------+
| Explain String                                          |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.00KB VCores=1 |
|                                                         |
| 00:EMPTYSET                                             |
+---------------------------------------------------------+
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-3094
% escaping does not work correctly in a LIKE clause
If the final character in the RHS argument of a LIKE operator is an escaped \% character, it does not match a % final character of the LHS argument.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-2422
Crash: impala::Coordinator::ValidateCollectionSlots
A query could encounter a serious error if it includes multiple nested levels of INNER JOIN clauses involving subqueries.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-2603
Impala Known Issues: Metadata
These issues affect how Impala interacts with metadata. They cover areas such as the metastore database and the Impala Catalog Server daemon.
Concurrent catalog operations with heavy DDL workloads can cause queries with SYNC_DDL to fail fast
ERROR: CatalogException: Couldn't retrieve the catalog topic version for the SYNC_DDL operation after 3 attempts.The operation has been successfully executed but its effects may have not been broadcast to all the coordinators.
The catalog operation is actually successful as the change has been committed to HMS and Catalog Server cache, but when Catalog Server notices a longer than expected time for it to broadcast the changes, it fails fast.
The coordinator daemons eventually sync up in the background.
Affected Versions: CDH versions 6.0 and 6.1
Apache Issue: IMPALA-7961 / CDH-76345
Impala Known Issues: Interoperability
These issues affect the ability to interchange data between Impala and other systems. They cover areas such as data types and file formats.
Queries Stuck on Failed HDFS Calls and not Timing out
In CDH 6.2 / Impala 3.2 and higher, if the following error appears multiple times within a short period while running a query, the connection between the impalad and the HDFS NameNode is in a bad state and the impalad must be restarted:
"hdfsOpenFile() for <filename> at backend <hostname:port> failed to finish before the <hdfs_operation_timeout_sec> second timeout "
In CDH 6.1 / Impala 3.1 and lower, the same issue would cause Impala to wait for a long time or hang without showing the above error message.
Workaround: Restart the impalad in the bad state.
Affected Versions: All versions of Impala
Apache Issue: HADOOP-15720
Configuration needed for Flume to be compatible with Impala
For compatibility with Impala, the value for the Flume HDFS Sink hdfs.writeFormat must be set to Text, rather than its default value of Writable. The hdfs.writeFormat setting must be changed to Text before creating data files with Flume; otherwise, those files cannot be read by either Impala or Hive.
Resolution: This information has been requested to be added to the upstream Flume documentation.
Affected Versions: All CDH 6 versions
Cloudera Issue: CDH-13199
Avro Scanner fails to parse some schemas
The default value in Avro schema must match the first union type. For example, if the default value is null, then the first type in the UNION must be "null".
Workaround: Swap the order of the fields in the schema specification. For example, use ["null", "string"] instead of ["string", "null"]. Note that the files written with the problematic schema must be rewritten with the new schema because Avro files have embedded schemas.
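To find affected fields before rewriting files, a schema can be scanned for unions whose default does not match the first branch. The following is a rough sketch using only the Python standard library. The function names are hypothetical, only primitive first branches are checked, and it is not a full Avro schema validator.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "string", "bytes"}


def _avro_names_for(default):
    """Avro primitive type names that a JSON default value can satisfy."""
    if default is None:
        return {"null"}
    if isinstance(default, bool):  # check bool before int: bool subclasses int
        return {"boolean"}
    if isinstance(default, int):
        return {"int", "long", "float", "double"}
    if isinstance(default, float):
        return {"float", "double"}
    if isinstance(default, str):
        return {"string", "bytes"}
    return set()


def fields_with_mismatched_default(schema_json: str) -> list:
    """Return names of union fields whose default does not match the first branch."""
    schema = json.loads(schema_json)
    bad = []
    for field in schema.get("fields", []):
        ftype = field.get("type")
        if not isinstance(ftype, list) or not ftype or "default" not in field:
            continue  # not a union, or no default: nothing to check
        first = ftype[0]
        if not isinstance(first, str) or first not in PRIMITIVES:
            continue  # complex or named first branch: out of scope for this sketch
        if first not in _avro_names_for(field["default"]):
            bad.append(field["name"])
    return bad
```

For a field declared as ["string", "null"] with a null default, the function reports the field; after swapping to ["null", "string"], it does not.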
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-635
Impala BE cannot parse Avro schema that contains a trailing semi-colon
If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is queried.
Workaround: Remove the trailing semicolon from the Avro schema.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-1024
Incorrect results with basic predicate on CHAR typed column
When comparing a CHAR column value to a string literal, the literal value is not blank-padded and so the comparison might fail when it should match.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-1652
Impala Known Issues: Limitations
These issues are current limitations of Impala that require evaluation as you plan how to integrate Impala into your data management workflow.
Set limits on size of expression trees
Very deeply nested expressions within queries can exceed internal Impala limits, leading to excessive memory usage.
Workaround: Avoid queries with extremely large expression trees. Setting the query option disable_codegen=true may reduce the impact, at a cost of longer query runtime.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-4551
Impala does not support running on clusters with federated namespaces
Impala does not support running on clusters with federated namespaces. The impalad process will not start on a node running such a filesystem based on the org.apache.hadoop.fs.viewfs.ViewFs class.
Workaround: Use standard HDFS on all Impala nodes.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-77
Hue and BDR require separate parameters for Impala Load Balancer
Cloudera Manager supports a single parameter for specifying the Impala Daemon Load Balancer. However, because BDR and Hue need to use different ports when connecting to the load balancer, it is not possible to configure the load balancer value so that BDR and Hue will work correctly in the same cluster.
Workaround: To configure BDR with Impala, use the load balancer configuration either without a port specification or with the Beeswax port.
To configure Hue, use the Hue Server Advanced Configuration Snippet (Safety Valve) for impalad_flags to specify the load balancer address with the HiveServer2 port.
Affected Versions: CDH versions from 5.11 to 6.0.1
Cloudera Issue: OPSAPS-46641
Impala Known Issues: Miscellaneous / Older Issues
These issues do not fall into one of the above categories or have not been categorized yet.
Unable to Correctly Parse the Terabyte Unit
Impala does not support parsing strings that contain "TB" when used as a unit for terabytes. The flags related to memory limits may be affected, such as the flags for scratch space and data cache.
Workaround: Use other supported units to specify values, e.g. GB or MB.
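If configuration values are generated by scripts, a terabyte value can be rewritten into gigabytes before it is passed to Impala. The sketch below is illustrative only. The helper name is hypothetical, and it assumes binary units (1 TB = 1024 GB), which should be confirmed for the specific flag being set.

```python
import re
from decimal import Decimal

_TB_VALUE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*TB\s*$", re.IGNORECASE)


def tb_to_gb_spec(value: str) -> str:
    """Rewrite a value such as '2TB' as '2048GB'; other values pass through."""
    match = _TB_VALUE.match(value)
    if match is None:
        return value.strip()  # not a terabyte value; leave it unchanged
    gigabytes = Decimal(match.group(1)) * 1024  # assumption: binary units
    if gigabytes != gigabytes.to_integral_value():
        raise ValueError(f"{value!r} is not a whole number of GB")
    return f"{int(gigabytes)}GB"
```

For example, tb_to_gb_spec("2TB") yields "2048GB", while a value already expressed in GB or MB is returned as given.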
Affected Versions: CDH 6.3.x and lower versions
Fixed Versions: CDH 6.4.0
Apache Issue: IMPALA-8829
A failed CTAS does not drop the table if the insert fails
If a CREATE TABLE AS SELECT operation successfully creates the target table but an error occurs while querying the source table or copying the data, the new table is left behind rather than being dropped.
Workaround: Drop the new table manually after a failed CREATE TABLE AS SELECT.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-2005
Casting scenarios with invalid/inconsistent results
Using a CAST function to convert large literal values to smaller types, or to convert special values such as NaN or Inf, produces values not consistent with other database systems. This could lead to unexpected results from queries.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-1821
Impala Parser issue when using fully qualified table names that start with a number
A fully qualified table name starting with a number could cause a parsing error. In a name such as db.571_market, the decimal point followed by digits is interpreted as a floating-point number.
Workaround: Surround each part of the fully qualified name with backticks (``).
Affected Versions: All CDH 6 versions
Fixed Versions: CDH 6.2.0
Apache Issue: IMPALA-941
Impala should tolerate bad locale settings
If the LC_* environment variables specify an unsupported locale, Impala does not start.
Workaround: Add LC_ALL="C" to the environment settings for both the Impala daemon and the Statestore daemon. See Modifying Impala Startup Options for details about modifying these environment settings.
Resolution: Fixing this issue would require an upgrade to Boost 1.47 in the Impala distribution.
Affected Versions: All CDH 6 versions
Apache Issue: IMPALA-532
EMC Isilon Known Issues
CDH 6.0 is not currently supported on EMC Isilon.
Affected Versions: All CDH 6 versions
Apache Kafka Known Issues
Potential to bypass transaction and idempotent ACL checks in Apache Kafka
It is possible to manually craft a Produce request which bypasses transaction and idempotent ACL validation. Only authenticated clients with Write permission on the respective topics are able to exploit this vulnerability.
Products affected:
- CDH
- CDK Powered by Apache Kafka
Releases affected:
- CDH versions 6.0.x, 6.1.x, 6.2.0
- CDK versions 3.0.x, 3.1.x, 4.0.x
Users affected: All users who run Kafka in CDH and CDK.
Date/time of detection: September, 2018
Severity (Low/Medium/High): 7.1 (High) (CVSS:3.0/AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:H/A:H)
Impact: Attackers can exploit this issue to bypass certain security restrictions to perform unauthorized actions. This can aid in further attacks.
CVE: CVE-2018-17196
Immediate action required: Update to a version of CDH containing the fix.
Addressed in release/refresh/patch:
- CDH 6.2.1, 6.3.2
- CDK 4.1.0
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-378: Potential to bypass transaction and idempotent ACL checks in Apache Kafka
Topics Created with the "kafka-topics" Tool Might Not Be Secured
Topics that are created and deleted via Kafka are secured (for example, auto-created topics). However, most topic creation and deletion is done via the kafka-topics tool or a third-party tool, either of which talks directly to ZooKeeper. Because security is then the responsibility of ZooKeeper authorization and authentication, Kafka cannot prevent users from making ZooKeeper changes. Anyone with access to ZooKeeper can create and delete topics, although they cannot describe, read, or write to the topics even if they can create them.
The following tools talk directly to ZooKeeper:
- kafka-topics.sh
- kafka-configs.sh
- kafka-preferred-replica-election.sh
- kafka-reassign-partitions.sh
"offsets.topic.replication.factor" Must Be Less Than or Equal to the Number of Live Brokers
The offsets.topic.replication.factor broker configuration is now enforced upon auto topic creation. Internal auto topic creation will fail with a GROUP_COORDINATOR_NOT_AVAILABLE error until the cluster size meets this replication factor requirement.
Requests Fail When Sending to a Nonexistent Topic with "auto.create.topics.enable" Set to True
The first few produce requests fail when sending to a nonexistent topic with auto.create.topics.enable set to true.
Workaround: Increase the number of retries in the Producer configuration setting retries.
Custom Kerberos Principal Names Cannot Be Used for Kerberized ZooKeeper and Kafka instances
When using ZooKeeper authentication and a custom Kerberos principal, Kerberos-enabled Kafka does not start.
Workaround: None. You must disable ZooKeeper authentication for Kafka or use the default Kerberos principals for ZooKeeper and Kafka.
Performance Degradation When SSL Is Enabled
Significant performance degradation can occur when SSL is enabled. The impact varies depending on your CPU, JVM version, and message size. Consumers are typically more affected than producers.
Workaround: Configure brokers and clients with ssl.secure.random.implementation = SHA1PRNG. It often reduces this degradation drastically, but its effect is CPU and JVM dependent.
Affected Versions: CDK 2.x and later
Fixed Versions: None
Apache Issue: KAFKA-2561
Cloudera Issue: None
Kafka Broker Fails to Start Due to Slow Sentry and HMS startup
This issue is encountered on cluster startup and is caused by misalignment between Kafka, Sentry, and HMS. The slow startup of HMS slows down Sentry startup which consequently makes the Kafka connection to Sentry time out. Ultimately, the Kafka broker will be unable to start.
Workaround: Manually increase the number of remote procedure call retries between Sentry and Kafka through the Sentry Client Advanced Configuration Snippet (Safety Valve) for sentry-site.xml property.
- Find the Sentry Client Advanced Configuration Snippet (Safety Valve) for sentry-site.xml property.
- Click on the add button.
- Enter the following data:
- Name: sentry.service.client.rpc.retry-total
- Value: 20
- Enter a Reason for change, and then click Save Changes to commit the changes.
- Return to the Home page by clicking the Cloudera Manager logo.
- Click the restart stale services icon next to the Sentry service to invoke the cluster restart wizard.
- Click Restart Stale Services.
- Click Restart Now.
- Click Finish.
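For reference, the name and value entered in the steps above correspond to one property entry in sentry-site.xml. The sketch below renders that entry in the standard Hadoop-style property form. The helper name is hypothetical, and the exact text Cloudera Manager writes to the file is an assumption.

```python
from xml.sax.saxutils import escape


def render_property(name: str, value: str) -> str:
    """Render one Hadoop-style <property> element with escaped text."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )


# Name and value taken from the workaround steps above.
snippet = render_property("sentry.service.client.rpc.retry-total", "20")
print(snippet)
```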
Affected Versions: CDH 6.1.0 and higher
Fixed Versions: N/A
Cloudera Issue: CDH-74713
Kafka JMX Tool Cannot Connect to JMX
The Kafka JMX tool cannot connect to the JMX agent of the Kafka Broker or MirrorMaker if the specified address of the JMX remote connector is bound to 127.0.0.1.
- In Cloudera Manager go to and select the affected broker.
- Find the Additional Broker Java Options and Additional MirrorMaker Java Options properties and add the following Java option to the configuration:
-Djava.rmi.server.hostname=127.0.0.1
- Restart the affected brokers.
Affected Versions: CDH 6.0.0 and higher
Fixed Versions: CDH 6.2.0
Cloudera Issue: OPSAPS-48695
The Idempotent and Transactional Capabilities of Kafka are Incompatible with Sentry
The idempotent and transactional capabilities of Kafka are not compatible with Sentry. The issue is due to Sentry being unable to handle authorization policies for Kafka transactions. As a result, users cannot use Kafka transactions in combination with Sentry.
Workaround: Use the Sentry super user in applications where idempotent producing is a requirement or disable Sentry.
Affected Versions: CDK 4.0 and later, CDH 6.0.0, 6.0.1, 6.1.0, 6.1.1, 6.2.0, 6.3.0
Fixed Versions: CDH 6.2.1, 6.3.1
Apache Issue: N/A
Cloudera Issue: CDH-80606
Kafka Garbage Collection Logs are Written to the Process Directory
MirrorMaker Does Not Start When Sentry is Enabled
When MirrorMaker is used in conjunction with Sentry, MirrorMaker reports an authorization issue and does not start. This is due to Sentry being unable to authorize the kafka_mirror_maker principal which is automatically created.
- Create the kafka_mirror_maker Linux user ID and the kafka_mirror_maker Linux group ID on the MirrorMaker hosts. Use the following command:
useradd kafka_mirror_maker
- Create the necessary Sentry rules for the kafka_mirror_maker group.
Affected Versions: CDH 6.0.0 and later
Fixed Versions: N/A
Apache Issue: N/A
Cloudera Issue: CDH-53706
Apache Kudu Known Issues
The following are known bugs and issues in Kudu. Note that this list is not exhaustive, and is meant to communicate only the most important known issues.
Kudu Masters unable to join back after a restart
In a multi-master Kudu environment, if a master is restarted or goes offline for a few minutes, it can occasionally have trouble rejoining the cluster on startup. For example, in a cluster with three Kudu masters, if one of the other two masters is stopped or dies during this time, the overall Kudu cluster is down because a majority of the masters are not running.
This issue is resolved by the KUDU-2748 upstream JIRA.
Products affected: Apache Kudu
Releases affected:
- CDH 5.14.0, 5.14.2, 5.14.4
- CDH 5.15.0, 5.15.1, 5.15.2
- CDH 5.16.1, 5.16.2
- CDH 6.0.0, 6.0.1
- CDH 6.1.0, 6.1.1
- CDH 6.2.0, 6.2.1
- CDH 6.3.0
For the latest update on this issue see the corresponding Knowledge article: TSB 2020-442: Kudu Masters unable to join back after a restart
Inconsistent rows returned from queries in Kudu
Due to KUDU-2463, upon restarting Kudu, inconsistent rows may be returned from tables that have not recently been written to, resulting in any of the following:
- multiple rows for the same key being returned
- deleted data being returned
- inconsistent results consistently being returned for the same query
If this happens, resolve the conflicts by writing to the affected Kudu partitions, either by:
- re-deleting the known and deleted data
- upserting the most up-to-date version of affected rows.
Products affected: Apache Kudu
Releases affected:
- CDH 5.12.2, 5.13.3, 5.14.4, 5.15.1, 5.16.1
- CDH 6.0.1, 6.1.0, 6.1.1
Addressed in release/refresh/patch:
- CDH 5.16.2
- CDH 6.2.0
For the latest update on this issue see the corresponding Knowledge article: TSB 2019-353: Inconsistent rows returned from queries in Kudu
C++ Client Fails to Re-acquire Authentication Token in Multi-master Clusters
A security-related issue can cause Impala queries to start failing on busy clusters in the following scenario:
- The cluster runs with --rpc_authentication set to optional or required. The default is optional. Secure clusters use required.
- The cluster is using multiple masters.
- Impala queries happen frequently enough that the leader master connection to some impalad isn't idle-closed (more than 1 query per 65 seconds).
- The connection stays alive for longer than the authentication token timeout (1 week by default).
- A master leadership change occurs after the authentication token expiration.
I0904 13:53:08.748968 95857 client-internal.cc:283] Unable to determine the new leader Master: Not authorized: Client connection negotiation failed: client connection to 10.164.44.13:7051: FATAL_INVALID_AUTHENTICATION_TOKEN: Not authorized: authentication token expired
I0904 13:53:10.389009 95861 status.cc:125] Unable to open Kudu table: Timed out: GetTableSchema timed out after deadline expired
    @ 0x95b1e9 impala::Status::Status()
    @ 0xff22d4 impala::KuduScanNodeBase::Open()
    @ 0xff101e impala::KuduScanNode::Open()
    @ 0xb73ced impala::FragmentInstanceState::Open()
    @ 0xb7532b impala::FragmentInstanceState::Exec()
    @ 0xb64ae8 impala::QueryState::ExecFInstance()
    @ 0xd15193 impala::Thread::SuperviseThread()
    @ 0xd158d4 boost::detail::thread_data<>::run()
    @ 0x129188a (unknown)
    @ 0x7f717ceade25 start_thread
    @ 0x7f717cbdb34d __clone
Unable to open Kudu table: Timed out: GetTableSchema timed out after deadline expired
- Restart the affected Impala Daemons. Restarting a daemon ensures the problem will not reoccur for at least the authentication token lifetime, which defaults to one week.
- Increase the authentication token lifetime (--authn_token_validity_seconds). Beware that raising this lifetime increases the window of vulnerability of the cluster if a client is compromised. It is recommended that you keep the token lifetime at one month maximum for a secure cluster. For unsecured clusters, a longer token lifetime is acceptable, and a 3 month lifetime is recommended.
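Because --authn_token_validity_seconds takes a value in seconds, the lifetimes mentioned above need converting. The sketch below shows the arithmetic. Treating one month as 30 days and three months as 90 days is an assumption made here for illustration, and the helper name is hypothetical.

```python
from datetime import timedelta


def validity_seconds(days: int) -> int:
    """Convert a token lifetime in days to whole seconds."""
    return int(timedelta(days=days).total_seconds())


ONE_WEEK = validity_seconds(7)       # the default lifetime stated above
ONE_MONTH = validity_seconds(30)     # assumption: a month is taken as 30 days
THREE_MONTHS = validity_seconds(90)  # assumption: three months as 90 days

# Flag name taken from the text above; the value is the one-month lifetime.
print(f"--authn_token_validity_seconds={ONE_MONTH}")
```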
Affected Versions: From CDH 5.11 through CDH 6.0.1
Apache Issue: KUDU-2580
Timeout Possible with Log Force Synchronization Option
If the Kudu master is configured with the -log_force_fsync_all option, tablet servers and clients will experience frequent timeouts, and the cluster may become unusable.
Affected Versions: All CDH 6 versions
Longer Startup Times with a Large Number of Tablets
If a tablet server has a very large number of tablets, it may take several minutes to start up. It is recommended to limit the number of tablets per server to 1000 or fewer. The maximum allowed number of tablets is 2000 per server. Consider this limitation when pre-splitting your tables. If you notice slow start-up times, you can monitor the number of tablets per server in the web UI.
Affected Versions: All CDH 6 versions
Fault Tolerant Scan Memory Issue
Unlike regular scans, fault tolerant scans will allocate all required memory when the scan begins rather than as it progresses. This can be significant for big tablets. Moreover, this memory usage isn't counted towards the tablet server's overall memory limit, raising the likelihood of the tablet server being out-of-memory killed by the kernel.
Affected Versions: CDH 6.2 / Kudu 1.9 and lower
Apache Issue: KUDU-2466
Descriptions for Kudu TLS/SSL Settings in Cloudera Manager
Use the descriptions in the following table to better understand the TLS/SSL settings in the Cloudera Manager Admin Console.
Field | Usage Notes |
---|---|
Kerberos Principal | Set to the default principal, kudu. |
Enable Secure Authentication And Encryption | Select this checkbox to enable authentication and RPC encryption between all Kudu clients and servers, as well as between individual servers. Only enable this property after you have configured Kerberos. |
Master TLS/SSL Server Private Key File (PEM Format) | Set to the path containing the Kudu master host's private key (PEM-format). This is used to enable TLS/SSL encryption (over HTTPS) for browser-based connections to the Kudu master web UI. |
Tablet Server TLS/SSL Server Private Key File (PEM Format) | Set to the path containing the Kudu tablet server host's private key (PEM-format). This is used to enable TLS/SSL encryption (over HTTPS) for browser-based connections to Kudu tablet server web UIs. |
Master TLS/SSL Server Certificate File (PEM Format) | Set to the path containing the signed certificate (PEM-format) for the Kudu master host's private key (set in Master TLS/SSL Server Private Key File). The certificate file can be created by concatenating all the appropriate root and intermediate certificates required to verify trust. |
Tablet Server TLS/SSL Server Certificate File (PEM Format) | Set to the path containing the signed certificate (PEM-format) for the Kudu tablet server host's private key (set in Tablet Server TLS/SSL Server Private Key File). The certificate file can be created by concatenating all the appropriate root and intermediate certificates required to verify trust. |
Master TLS/SSL Server CA Certificate (PEM Format) | Disregard this field. |
Tablet Server TLS/SSL Server CA Certificate (PEM Format) | Disregard this field. |
Enable TLS/SSL for Master Server | Enables HTTPS encryption on the Kudu master web UI. |
Enable TLS/SSL for Tablet Server | Enables HTTPS encryption on the Kudu tablet server web UIs. |
Affected Versions: All CDH 6 versions
Apache Oozie Known Issues
Oozie database upgrade fails when PostgreSQL version 9.6 or higher is used
Oozie database upgrade fails when PostgreSQL version 9.6 or higher is used due to a sys table change in PostgreSQL from version 9.5 to 9.6. The failure only happens if Oozie uses a JDBC driver earlier than 9.4.1209.
- After the parcels of the new version are distributed, replace the PostgreSQL JDBC driver with a newer one (version 9.4.1209 or higher) in the new parcel, at the following locations:
- /opt/cloudera/parcels/${newparcel.version}/lib/oozie/lib/
- /opt/cloudera/parcels/${newparcel.version}/lib/oozie/libtools/
- Perform the upgrade.
For package-based installations, replace the driver at the following locations instead:
- /usr/lib/oozie/libtools/
- /usr/lib/oozie/lib/
You can download the driver from the PostgreSQL JDBC driver homepage.
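Whether an installed driver is older than 9.4.1209 can often be read from the jar file name. The following sketch compares versions parsed from names such as postgresql-9.4.1208.jar. The naming pattern is an assumption about how the jar is named on your hosts, and the helper names are hypothetical.

```python
import re
from typing import Optional, Tuple

MIN_VERSION = (9, 4, 1209)  # minimum driver version stated above

# Assumed naming: postgresql-<major>.<minor>.<build>.jar, or the older
# postgresql-<major>.<minor>-<build>.jdbcNN.jar form.
_JAR = re.compile(r"postgresql-(\d+)\.(\d+)[.-](\d+)")


def driver_version(jar_name: str) -> Optional[Tuple[int, ...]]:
    """Extract a (major, minor, build) tuple from a jar name, or None."""
    match = _JAR.search(jar_name)
    return tuple(int(part) for part in match.groups()) if match else None


def driver_is_new_enough(jar_name: str) -> bool:
    """True only when a version can be parsed and it is at least MIN_VERSION."""
    version = driver_version(jar_name)
    return version is not None and version >= MIN_VERSION
```

A jar whose name carries no version is reported as not new enough, so it should be checked by hand.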
Affected Versions: CDH 6.0.0 and higher
Fixed Version: CDH 6.2.1 and higher
Cloudera Issue: CDH-75951
Oozie jobs fail (gracefully) on secure YARN clusters when JobHistory server is down
If the JobHistory server is down on a YARN (MRv2) cluster, Oozie attempts to submit a job, by default, three times. If the job fails, Oozie automatically puts the workflow in a SUSPEND state.
Workaround: When the JobHistory server is running again, use the resume command to tell Oozie to continue the workflow from the point at which it left off.
Affected Versions: CDH 5 and higher
Cloudera Issue: CDH-14623
Apache Parquet Known Issues
There are no known issues in Parquet.
Apache Pig Known Issues
There are no known issues in this release.
Cloudera Search Known Issues
The current release includes the following known limitations:
Default Solr core names cannot be changed (limitation)
Although it is technically possible to give user-defined Solr core names during core creation, this should be avoided in the context of Cloudera Search. Cloudera Manager expects core names in the default "collection_shardX_replicaY" format. Altering core names results in Cloudera Manager being unable to fetch Solr metrics for the given core, and this may eventually corrupt data collection for co-located cores, or even shard-level and server-level charts.
Processing UpdateRequest with delegation token throws NullPointerException
When using the Spark Crunch Indexer or another client application which utilizes the SolrJ API to send Solr Update requests with delegation token authentication, the server side processing of the request might fail with a NullPointerException.
Affected Versions: CDH 6.0.0, 6.0.1, 6.1.0, 6.1.1, 6.2.0, 6.2.1, 6.3.0, 6.3.1, 6.3.2
Fixed Version: CDH 6.3.3
Apache Issue: SOLR-13921
Cloudera Issue: CDH-82599
Solr service with no added collections causes the upgrade process to fail
The upgrade fails with the error Failed to execute command Bootstrap Solr Collections on service Solr if there are no collections present in Solr.
Workaround: If there are no collections added to it, remove the Solr service from your cluster before you start the upgrade.
Affected Versions: CDH 6.0.0, 6.0.1, 6.1.0, 6.1.1, 6.2.0, 6.2.1, 6.3.0, 6.3.1, 6.3.2
Fixed Version: CDH 6.3.3
Cloudera Issue: CDH-82042
HBase Lily indexer might fail to write role log files
In certain scenarios the HBase Lily Indexer (Key-Value Store Indexer) fails to write its role log files.
Workaround: None
Affected Versions: CDH 6.0.0, 6.0.1, 6.1.0, 6.1.1, 6.2.0, 6.2.1, 6.3.0, 6.3.1, 6.3.2
Fixed Version: CDH 6.3.3
Cloudera Issue: CDH-82342
Adding a new indexer instance to HBase Lily Indexer fails with GSSException
When Kerberos authentication is enabled, adding a new indexer instance to the HBase Lily Indexer (Key-Value Store Indexer) might fail because authentication fails when Lily communicates with the HBase Master process. An exception similar to the following is thrown:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
- Go to .
- Make sure the Sentry Service configuration option points to a Sentry service instance instead of none.
Affected Versions: CDH 6.0.0, 6.0.1, 6.1.0, 6.1.1, 6.2.0, 6.2.1, 6.3.0, 6.3.1, 6.3.2
Fixed Version: CDH 6.3.3
Cloudera Issue: CDH-82566
Solr SQL, Graph, and Stream Handlers are Disabled if Collection Uses Document-Level Security
The Solr SQL, Graph, and Stream handlers do not support document-level security, and are disabled if document-level security is enabled on the collection. If necessary, these handlers can be re-enabled by setting the following Java system properties, but document-level security is not enforced for these handlers:
- SQL: solr.sentry.enableSqlQuery=true
- Graph: solr.sentry.enableGraphQuery=true
- Stream: solr.sentry.enableStreams=true
Workaround: None
Affected Versions: All CDH 6 releases
Cloudera Issue: CDH-66345
Collection Creation No Longer Supports Automatically Selecting A Configuration If Only One Exists
Before CDH 5.5.0, a collection could be created without specifying a configuration. If no -c value was specified, then:
- If there was only one configuration, that configuration was chosen.
- If the collection name matched a configuration name, that configuration was chosen.
Search for CDH 5.5.0 includes multiple built-in configurations. As a result, there is no longer a case in which only one configuration can be chosen by default.
Workaround: Explicitly specify the collection configuration to use by passing -c <configName> to solrctl collection --create.
Affected Versions: CDH 5.5.0 and higher
Cloudera Issue: CDH-34050
CrunchIndexerTool which includes Spark indexer requires specific input file format specifications
If the --input-file-format option is specified with CrunchIndexerTool, then its argument must be text, avro, or avroParquet, rather than a fully qualified class name.
Workaround: None
Affected Versions: All
Cloudera Issue: CDH-22190
The quickstart.sh file does not validate ZooKeeper and the NameNode on some operating systems
The quickstart.sh file uses the timeout function to determine if ZooKeeper and the NameNode are available. To ensure this check can complete as intended, quickstart.sh determines if the operating system on which the script is running supports timeout. If the script detects that the operating system does not support timeout, the script continues without checking if the NameNode and ZooKeeper are available. If your environment is configured properly or you are using an operating system that supports timeout, this issue does not apply.
Workaround: This issue only occurs in some operating systems. If timeout is not available, the quickstart continues and final validation is always done by the MapReduce jobs and Solr commands that are run by the quickstart.
Affected Versions: All
Cloudera Issue: CDH-19923
Field value class guessing and Automatic schema field addition are not supported with the MapReduceIndexerTool nor the HBaseMapReduceIndexerTool
The MapReduceIndexerTool and the HBaseMapReduceIndexerTool can be used with a Managed Schema created via NRT indexing of documents or via the Solr Schema API. However, neither tool supports adding fields automatically to the schema during ingest.
Workaround: Define the schema before running the MapReduceIndexerTool or HBaseMapReduceIndexerTool. In non-schemaless mode, define the schema using the schema.xml file. In schemaless mode, either define the schema using the Solr Schema API or index sample documents using NRT indexing before invoking the tools. In either case, Cloudera recommends that you verify that the schema is what you expect using the List Fields API command.
Affected Versions: All
Cloudera Issue: CDH-26856
The Browse and Spell Request Handlers are not enabled in schemaless mode
The Browse and Spell Request Handlers require certain fields be present in the schema. Since those fields cannot be guaranteed to exist in a Schemaless setup, the Browse and Spell Request Handlers are not enabled by default.
Workaround: If you require the “Browse” and “Spell” Request Handlers, add them to the solrconfig.xml configuration file. Generate a non-schemaless configuration to see the usual settings and modify the required fields to fit your schema.
Affected Versions: All
Cloudera Issue: CDH-19407
Enabling blockcache writing may result in unusable indexes
It is possible to create indexes with solr.hdfs.blockcache.write.enabled set to true. Such indexes may appear corrupt to readers, and reading these indexes may irrecoverably corrupt them. Blockcache writing is disabled by default.
Workaround: None
Affected Versions: All
Cloudera Issue: CDH-17978
Users with insufficient Solr permissions may receive a "Page Loading" message from the Solr Web Admin UI
Users who are not authorized to use the Solr Admin UI are not given a page explaining that access is denied, and instead receive a web page that never finishes loading.
Workaround: None
Affected Versions: All
Cloudera Issue: CDH-58276
Using MapReduceIndexerTool or HBaseMapReduceIndexerTool multiple times may produce duplicate entries in a collection.
Repeatedly running the MapReduceIndexerTool on the same set of input files can result in duplicate entries in the Solr collection. This occurs because the tool can only insert documents and cannot update or delete existing Solr documents. This issue does not apply to the HBaseMapReduceIndexerTool unless it is run with more than zero reducers.
Workaround: To avoid this issue, use HBaseMapReduceIndexerTool with zero reducers. This must be done without Kerberos.
Affected Versions: All
Cloudera Issue: CDH-15441
Deleting collections might fail if hosts are unavailable
It is possible to delete a collection when hosts that host some of the collection are unavailable. After such a deletion, if the previously unavailable hosts are brought back online, the deleted collection may be restored.
Workaround: Ensure all hosts are online before deleting collections.
Affected Versions: All
Cloudera Issue: CDH-58694
Saving search results is not supported
Cloudera Search does not support the ability to save search results.
Workaround: None
Affected Versions: All
Cloudera Issue: CDH-21162
HDFS Federation is not supported
Cloudera Search does not support HDFS Federation.
Workaround: None
Affected Versions: All
Cloudera Issue: CDH-11357
Solr contrib modules are not supported
Solr contrib modules are not supported. Morphlines, the Spark Crunch indexer, and the MapReduce and Lily HBase indexers are part of the Cloudera Search product itself and are therefore supported.
Workaround: None
Affected Versions: All
Cloudera Issue: CDH-72658
Using the Sentry Service with Cloudera Search may introduce latency
Using the Sentry Service with Cloudera Search may introduce latency because authorization requests must be sent to the Sentry Service.
Workaround: You can alleviate this latency by enabling caching for the Sentry Service. For instructions, see: Enabling Caching for the Sentry Service.
Affected Versions: All
Cloudera Issue: CDH-73407
Solr Sentry integration limitation where two Solr deployments depend on the same Sentry service
If multiple Solr instances are configured to depend on the same Sentry service, it is not possible to create unique Solr Sentry privileges per Solr deployment. Since privileges are enforced in all Solr instances simultaneously, you cannot add distinct privileges that apply to one Solr cluster, but not to another.
Workaround: None
Affected Versions: All
Cloudera Issue: CDH-72676
Collection state goes down after Solr SSL
If you enable TLS/SSL on a Solr instance with existing collections, the collections will break and become unavailable. Collections created after enabling TLS/SSL are not affected by this issue.
Workaround: Recreate the collection after enabling TLS. For more information, see How to update existing collections in Non-SSL to SSL in Solr.
Affected Versions: All
Cloudera Issue: CDPD-4139
Apache Sentry Known Issues
Sentry does not support Kafka topic name with more than 64 characters
A Kafka topic name can have 249 characters, but Sentry only supports topic names up to 64 characters.
Workaround: Keep Kafka topic names to 64 characters or fewer.
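A simple pre-flight check can flag topic names that exceed the Sentry limit before the topics are created. The following Python sketch is illustrative only. The constant and function names are hypothetical, and the 64-character limit is taken from the description above.

```python
# Longest topic name that Sentry supports, per the description above.
SENTRY_MAX_TOPIC_NAME_LENGTH = 64


def topics_too_long_for_sentry(topic_names):
    """Return the topic names that are longer than Sentry supports."""
    return [
        name for name in topic_names
        if len(name) > SENTRY_MAX_TOPIC_NAME_LENGTH
    ]
```

A name of exactly 64 characters passes the check. Anything longer is returned so that it can be renamed.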
Affected Versions: All CDH 5.x and 6.x versions
Cloudera Issue: CDH-64317
When granting privileges, a single transaction per grant causes long delays
Sentry takes a long time to grant or revoke a large number of column-level privileges that are requested in a single statement. For example, if you execute the following command:
GRANT SELECT(col1, col2, …) ON TABLE table1;
Sentry applies the grants to each column separately and the refresh process causes long delays.
Workaround: Split the grant statement up into smaller chunks. This prevents the refresh process from causing delays.
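The splitting can be scripted. The following Python sketch generates several smaller GRANT statements from one long column list. The function name chunked_column_grants, the TO ROLE clause with a sample role name, and the default chunk size of 50 are assumptions made for illustration. This document does not recommend a specific chunk size.

```python
def chunked_column_grants(table, columns, role, chunk_size=50):
    """Yield GRANT statements that each cover at most chunk_size columns."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(columns), chunk_size):
        chunk = columns[start:start + chunk_size]
        yield "GRANT SELECT({}) ON TABLE {} TO ROLE {};".format(
            ", ".join(chunk), table, role
        )


if __name__ == "__main__":
    # Hypothetical table, columns, and role.
    for stmt in chunked_column_grants("table1", ["col1", "col2", "col3"], "role1", chunk_size=2):
        print(stmt)
```

Each generated statement can then be executed separately, for example from beeline.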
Affected Versions:
- CDH 5.14.4
- CDH 5.15.1
- CDH 5.16.0
- CDH 6.1.0
Fixed Versions:
- CDH 5.16.1 and above
- CDH 6.2.0 and above
Cloudera Issue: CDH-74982
SHOW ROLE GRANT GROUP raises exception for a group that was never granted a role
If you run the command SHOW ROLE GRANT GROUP for a group that has never been granted a role, beeline raises an exception. However, if you run the same command for a group that does not have any roles, but has at one time been granted a role, you do not get an exception, but instead get an empty list of roles granted to the group.
Workaround: Granting a role to the group prevents the exception.
Affected Versions:
- CDH 5.16.0
- CDH 6.0.0
Cloudera Issue: CDH-71694
GRANT/REVOKE operations could fail if there are too many concurrent requests
Under a significant workload, GRANT and REVOKE operations can fail.
Workaround: If you need to make many privilege changes, schedule them so that only a few run at the same time.
Affected Versions: CDH 5.13.0 and above
Apache Issue: SENTRY-1855
Cloudera Issue: CDH-56553
Creating large set of Sentry roles results in performance problems
Using more than a thousand roles/permissions might cause significant performance problems.
Workaround: Plan your roles so that groups have as few roles as possible and roles have as few permissions as possible.
Affected Versions: CDH 5.13.0 and above
Cloudera Issue: CDH-59010
Users can't track jobs with Hive and Sentry
As a prerequisite of enabling Sentry, Hive impersonation is turned off, which means all YARN jobs are submitted to the Hive job queue, and are run as the hive user. This is an issue because the YARN History Server now has to block users from accessing logs for their own jobs, since their own usernames are not associated with the jobs. As a result, end users cannot access any job logs unless they can get sudo access to the cluster as the hdfs, hive or other admin users.
In CDH 5.8 (and higher), Hive overrides the default configuration, mapred.job.queuename, and places incoming jobs into the connected user's job queue, even though the submitting user remains hive. Hive obtains the relevant queue/username information for each job by using YARN's fair-scheduler.xml file.
Affected Versions: CDH 5.2.0 and above
Cloudera Issue: CDH-22890
Column-level privileges are not supported on Hive Metastore views
GRANT and REVOKE for column-level privileges are not supported on Hive Metastore views.
Affected Versions: All CDH versions
Apache Issue: SENTRY-754
SELECT privilege on all columns does not equate to SELECT privilege on table
Users who have been explicitly granted the SELECT privilege on all columns of a table do not have permission to perform table-level operations. For example, operations such as SELECT COUNT (1) or SELECT COUNT (*) will not work even if you have the SELECT privilege on all columns.
There is one exception to this. The SELECT * FROM TABLE command will work even if you do not have explicit table-level access.
Affected Versions: All CDH versions
Apache Issue: SENTRY-838
The EXPLAIN SELECT operation works without table or column-level privileges
Users are able to run the EXPLAIN SELECT operation, exposing metadata for all columns, even for tables/columns to which they weren't explicitly granted access.
Affected Versions: All CDH versions
Apache Issue: SENTRY-849
Object types Server and URI are not supported in SHOW GRANT ROLE roleName on OBJECT objectName
Workaround: Use SHOW GRANT ROLE roleName to list all privileges granted to the role.
Affected Versions: All CDH versions
Apache Issue: N/A
Cloudera Issue: CDH-19430
Relative URI paths not supported by Sentry
Sentry supports only absolute (not relative) URI paths in permission grants. Although some early releases (for example, CDH 5.7.0) might not have raised explicit errors when relative paths were set, upgrading a system that uses relative paths causes the system to lose Sentry permissions.
Resolution: Revoke privileges that have been set using relative paths, and grant permissions using absolute paths before upgrading.
Affected Versions: All versions. Relative paths are not supported in Sentry for permission grants.
Absolute (Use this form) | Relative (Do not use this form) |
---|---|
hdfs://absolute/path/ | hdfs://relative/path |
s3a://bucketname/ | s3a://bucketname |
Apache Spark Known Issues
The following sections describe the current known issues and limitations in Apache Spark 2.x as distributed with CDH 6.1.x. In some cases, a feature from the upstream Apache Spark project is currently not considered reliable enough to be supported by Cloudera.
Continue reading:
- Shuffle+Repartition on a DataFrame could lead to incorrect answers
- Shuffle+Repartition on an RDD could lead to incorrect answers
- RDD.repartition() has different failure handling in Spark 2.4 and may cause job failures
- Spark Streaming write-ahead logs do not run on HDFS directories with Erasure Coding enabled
- PySpark broadcast variables fail when disk encryption is enabled
- Structured Streaming exactly-once fault tolerance constraints
- Spark SQL does not respect size limit for the varchar type
- Spark SQL does not prevent you from writing key types not supported by Avro tables
- Spark SQL does not support timestamp in Avro tables
- Spark SQL does not respect Sentry ACLs when communicating with Hive metastore
- Dynamic allocation and Spark Streaming
- Limitation with Region Pruning for HBase Tables
- Running spark-submit with --principal and --keytab arguments does not work in client mode
- The --proxy-user argument does not work in client mode
- History link in ResourceManager web UI broken for killed Spark applications
- ORC file format is not supported
Shuffle+Repartition on a DataFrame could lead to incorrect answers
When a repartition follows a shuffle, the assignment of rows to partitions is nondeterministic. If Spark has to recompute a partition, for example, due to an executor failure, the retry can consume a different set of input rows than the original computation. As a result, some rows can be dropped, and others can be duplicated.
Products affected: CDS Powered By Apache Spark
Affected Versions:
- CDH 6.0.0, 6.0.1, 6.1.0, 6.1.1
- CDS 2.1.0 release 1, release 2
- CDS 2.2.0 release 1, release 2
Fixed Versions:
- CDH 6.2.0, 6.3.0
- CDS 2.1.0 release 3
- CDS 2.2.0 release 3
- CDS 2.3.0 release 3
Shuffle+Repartition on an RDD could lead to incorrect answers
When a repartition follows a shuffle, the assignment of records to partitions is nondeterministic. If Spark has to recompute a partition, for example, due to an executor failure, the retry can consume a different set of input records than the original computation. As a result, some records can be dropped, and others can be duplicated.
Products affected: CDS Powered By Apache Spark
Affected Versions:
- CDH 6.0.0, 6.0.1, 6.1.0, 6.1.1
- CDS 2.1.0 release 1, release 2, release 3
- CDS 2.2.0 release 1, release 2, release 3
- CDS 2.3.0 release 1, release 2, release 3
Fixed Versions:
- CDH 6.2.0, 6.3.0
- CDS 2.1.0 release 4
- CDS 2.2.0 release 4
- CDS 2.3.0 release 4
RDD.repartition() has different failure handling in Spark 2.4 and may cause job failures
The RDD.repartition() transformation, which reshuffles data in the RDD randomly to create either more or fewer partitions and then balances it across the partitions, was using a round-robin method to distribute data that caused incorrect answers to be returned for RDD jobs. This issue has been corrected, but it introduced a behavior change in RDD job failure handling. Now, Spark actively fails a job if there is a fetch failure that was caused by a node failure after repartitioning.
Workaround: Use the RDD.checkpoint() method to save the intermediate RDD data to HDFS. First, call SparkContext.setCheckpointDir(directory: String) to set the checkpoint directory where the intermediate data will be saved. Note that the directory must be an HDFS path. Then mark the RDD for checkpointing by calling RDD.checkpoint() when you use the RDD.repartition() transformation.
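The workaround can be sketched in PySpark terms as follows. The helper name repartition_with_checkpoint is hypothetical, and sc and rdd are expected to be a SparkContext and an RDD created from it. Placing the checkpoint on the RDD before it is repartitioned is an assumption of this sketch, because the text above does not spell out the ordering.

```python
def repartition_with_checkpoint(sc, rdd, num_partitions, checkpoint_dir):
    """Mark rdd for checkpointing, then return it repartitioned.

    sc is expected to be a SparkContext and rdd an RDD created from it.
    """
    # Set the directory where the intermediate data is saved.
    # Per the workaround above, this must be an HDFS path.
    sc.setCheckpointDir(checkpoint_dir)
    # Mark the RDD for checkpointing before repartitioning it
    # (this ordering is an assumption of the sketch).
    rdd.checkpoint()
    return rdd.repartition(num_partitions)
```

A call might look like repartition_with_checkpoint(sc, rdd, 8, "hdfs:///tmp/checkpoints"), where the partition count and the directory are placeholder values.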
Apache Issue: SPARK-23243
Cloudera Issue: CDH-76413
Spark Streaming write-ahead logs do not run on HDFS directories with Erasure Coding enabled
Spark Streaming write-ahead logs (WALs) cannot run on HDFS directories when Erasure Coding is enabled. Erasure Coding does not support hflush(), hsync(), and append(), which prevents the WALs from running.
Workaround: Configure Spark Streaming with a checkpoint directory that does not have Erasure Coding enabled on it. You can set the checkpoint directory with ssc.checkpoint("directory_name"). For example:
ssc.checkpoint("_checkpoint")
Affected Versions: CDH 6.1.0
Fixed Versions: CDH 6.2.0
Apache Issue: SPARK-26094
Cloudera Issue: CDH-61127
PySpark broadcast variables fail when disk encryption is enabled
When disk encryption is enabled, PySpark broadcast variables fail with the following stack trace:
Traceback (most recent call last):
  File "broadcast.py", line 37, in <module>
    words_new.value
  File "/pyspark.zip/pyspark/broadcast.py", line 137, in value
  File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path
  File "pyspark.zip/pyspark/broadcast.py", line 128, in load
EOFError: Ran out of input
Workaround: None
Affected Versions: CDH 6.0.1, CDH 6.1.0
Fixed Versions: CDH 6.1.1, CDH 6.2.0
Apache Issue: SPARK-26201
Cloudera Issue: CDH-76055
Structured Streaming exactly-once fault tolerance constraints
In Spark Structured Streaming, the exactly-once fault tolerance for file sink is valid only for files that are in the manifest. These files are located in the _spark_metadata subdirectory of the file sink output directory. Only process files that have file names starting with digits. Other temporary files can also appear in this directory, but they should not be processed. Typically, the names of these temporary files start with a period (".").
To list the valid manifest files, excluding the temporary files, run a command like the following as the appropriate user. This example assumes that your output directory is located at /tmp/output:
hadoop fs -ls /tmp/output/_spark_metadata/[0-9]*
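The same filter can be applied in code when the directory listing has been obtained some other way. The following Python sketch keeps only paths whose file name starts with a digit, mirroring the [0-9]* glob above. The function name valid_manifest_files and the sample paths in the usage note are hypothetical.

```python
import posixpath
import re

# Matches file names whose first character is an ASCII digit.
_STARTS_WITH_DIGIT = re.compile(r"[0-9]")


def valid_manifest_files(paths):
    """Keep only paths whose file name starts with a digit."""
    return [
        path for path in paths
        if _STARTS_WITH_DIGIT.match(posixpath.basename(path))
    ]
```

Given /tmp/output/_spark_metadata/0 and /tmp/output/_spark_metadata/.2.tmp, only the first path is returned, because the second file name starts with a period.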
Workaround: None
Affected Versions: CDH 6.1.0 and higher
Cloudera Issue: CDH-75191
Spark SQL does not respect size limit for the varchar type
Spark SQL treats varchar as a string (that is, there is no size limit). The observed behavior is that Spark reads and writes these columns as regular strings; if inserted values exceed the size limit, no error occurs. The data is truncated when read from Hive, but not when read from Spark.
Workaround: None
Affected Versions: CDH 5.5.0 and higher
Apache Issue: SPARK-5918
Cloudera Issue: CDH-33642
Spark SQL does not prevent you from writing key types not supported by Avro tables
Spark allows you to declare DataFrames with any key type. Avro supports only string keys and trying to write any other key type to an Avro table will fail.
Workaround: None
Affected Versions: CDH 5.5.0 and higher
Cloudera Issue: CDH-33648
Spark SQL does not support timestamp in Avro tables
Workaround: None
Affected Versions: CDH 5.5.0 and higher
Cloudera Issue: CDH-33649
Spark SQL does not respect Sentry ACLs when communicating with Hive metastore
Even if a user is configured via Sentry to not have read permission to a Hive table, a Spark SQL job running as that user can still read the table's metadata directly from the Hive metastore.
Cloudera Issue: CDH-76468
Dynamic allocation and Spark Streaming
If you are using Spark Streaming, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.
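One way to apply this setting is to pass it on the spark-submit command line with --conf. The following Python sketch builds such a command. The helper name streaming_submit_command and the application file name in the usage note are hypothetical. The property name is the one given above.

```python
def streaming_submit_command(app_path, extra_args=()):
    """Return a spark-submit argv list with dynamic allocation disabled."""
    return [
        "spark-submit",
        "--conf", "spark.dynamicAllocation.enabled=false",
        app_path,
        *extra_args,
    ]


if __name__ == "__main__":
    # Hypothetical application file, for illustration only.
    print(" ".join(streaming_submit_command("streaming_app.py")))
```

The returned list can be passed to subprocess.run on a gateway host where spark-submit is available.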
Limitation with Region Pruning for HBase Tables
When SparkSQL accesses an HBase table through the HiveContext, region pruning is not performed. This limitation can result in slower performance for some SparkSQL queries against tables that use the HBase SerDes than when the same table is accessed through Impala or Hive.
Workaround: None
Affected Versions: All
Cloudera Issue: CDH-56330
Running spark-submit with --principal and --keytab arguments does not work in client mode
The spark-submit script's --principal and --keytab arguments do not work with Spark-on-YARN's client mode.
Workaround: Use cluster mode instead.
Affected Versions: All
The --proxy-user argument does not work in client mode
Using the --proxy-user argument in client mode does not work and is not supported.
Workaround: Use cluster mode instead.
Affected Versions: All
History link in ResourceManager web UI broken for killed Spark applications
When a Spark application is killed, the history link in the ResourceManager web UI does not work.
Workaround: To view the history for a killed Spark application, see the Spark HistoryServer web UI instead.
Affected Versions: All CDH versions
Apache Issue: None
Cloudera Issue: CDH-49165
ORC file format is not supported
Currently, Cloudera does not support reading and writing Hive tables containing data files in the Apache ORC (Optimized Row Columnar) format from Spark applications. Cloudera recommends using Apache Parquet format for columnar data. That file format can be used with Spark, Hive, and Impala.
Apache Sqoop Known Issues
Column names cannot start with a number when importing data with the --as-parquetfile option.
Currently, Sqoop uses an Avro schema when writing data as a Parquet file. The Avro schema requires that column names do not start with numbers, so Sqoop renames such columns by prepending an underscore character. This can lead to issues when you want to reuse the data in other tools, such as Impala.
Workaround: Rename the columns to comply with Avro limitations (start with letters or underscore, as specified in the Avro documentation).
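Before running the import, you can check which column names would be affected. The following Python sketch flags names that do not follow the Avro naming rule (a letter or underscore first, then only letters, digits, or underscores). The function name columns_needing_rename and the sample column names in the usage note are hypothetical.

```python
import re

# Avro name rule: start with a letter or underscore,
# then contain only letters, digits, or underscores.
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def columns_needing_rename(column_names):
    """Return the column names that do not comply with Avro naming rules."""
    return [name for name in column_names if not AVRO_NAME.fullmatch(name)]
```

For a column list such as id, 2020_sales, and _ok, only 2020_sales is reported, because it starts with a digit.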
Cloudera Issue: None
MySQL JDBC driver shipped with CentOS 6 systems does not work with Sqoop
CentOS 6 systems currently ship with version 5.1.17 of the MySQL JDBC driver. This version does not work correctly with Sqoop.
Workaround: Install version 5.1.31 of the JDBC driver as detailed in Installing the JDBC Drivers for Sqoop 1.
Affected Versions: MySQL JDBC 5.1.17, 5.1.4, 5.3.0
Cloudera Issue: CDH-23180
MS SQL Server "integratedSecurity" option unavailable in Sqoop
The integratedSecurity option is not available in the Sqoop CLI.
Workaround: None
Cloudera Issue: None
Sqoop1 (doc import + --as-parquetfile) limitation with KMS/KTS Encryption at Rest
sqoop import --connect jdbc:db2://djaxludb1001:61035/DDBAT003 --username=dh810202 --P --target-dir /data/hive_scratch/ASDISBURSEMENT --delete-target-dir -m1 --query "select disbursementnumber,disbursementdate,xmldata FROM DB2dba.ASDISBURSEMENT where DISBURSEMENTNUMBER = 2011113210000115311 AND \$CONDITIONS" -hive-import --hive-database adminserver -hive-table asdisbursement_dave --map-column-java XMLDATA=String --as-parquetfile
16/12/05 12:23:46 INFO mapreduce.Job: map 100% reduce 0%
16/12/05 12:23:46 INFO mapreduce.Job: Job job_1480530522947_0096 failed with state FAILED due to: Job commit failed: org.kitesdk.data.DatasetIOException: Could not move contents of hdfs://AJAX01-ns/tmp/adminserver/.temp/job_1480530522947_0096/mr/job_1480530522947_0096 to hdfs://AJAX01-ns/data/RetiredApps/INS/AdminServer/asdisbursement_dave
<SNIP>
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): /tmp/adminserver/.temp/job_1480530522947_0096/mr/job_1480530522947_0096/5ddcac42-5d69-4e46-88c2-17bbedac4858.parquet can't be moved into an encryption zone.
Workaround: If you use the Parquet Hadoop API based implementation for importing into Parquet, specify a --target-dir that is in the same encryption zone as the Hive warehouse directory.
If you use the Kite Dataset API based implementation, use an alternate data file type, for example text or avro.
Apache Issue: SQOOP-2943
Cloudera Issue: CDH-40826
Doc import as Parquet files may result in out-of-memory errors
Out-of-memory errors can occur in the following cases:
- With many very large rows (multiple megabytes per row) before the initial-page-run check (ColumnWriter)
- When rows vary significantly in size, so that the next-page-size check is based on small rows and is set very high, and is then followed by many large rows
Workaround: None, other than restructuring the data.
Apache Issue: PARQUET-99
Apache ZooKeeper Known Issues
There are no known issues in this release.