Known Issues in MapReduce and YARN

This topic describes known issues, unsupported features and limitations for using MapReduce and YARN in this release of Cloudera Runtime.

Known Issues

OPSAPS-56577: If a Kerberos principal other than "yarn" is configured for the YARN service, Cloudera Manager erroneously skips adding the custom principal to the YARN keytab, causing YARN to fail to start with a Kerberos authentication failure. This also affects Ambari to Cloudera Manager migrations if Ambari is configured with a principal other than "yarn" for the YARN service. A similar issue affects Hive when Hive LLAP is used.
Workaround: You must specify "yarn" as the Kerberos principal for YARN, and specify "hive" as the Kerberos principal for Hive. When performing an Ambari to Cloudera Manager migration, set the principals for both services to those values before performing the migration.
Fair Scheduler to Capacity Scheduler migration - fs2cs tool

The fs2cs tool does not convert all Fair Scheduler queue configurations to Capacity Scheduler queue configurations.

Workaround: You must manually configure the queue properties that the fs2cs tool does not convert. For information about using the fs2cs tool and its limitations, see Fair Scheduler to Capacity Scheduler transition.
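A typical invocation of the tool might look like the following sketch; the input and output paths are placeholders, and the full set of options is described in the transition guide:
  yarn fs2cs \
    --yarnsiteconfig /path/to/yarn-site.xml \
    --fsconfig /path/to/fair-scheduler.xml \
    --output-directory /tmp/fs2cs-output
Review the generated files in the output directory and add the queue properties that were not converted before applying the configuration.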
CDPD-12123: HDFS replication performance is degraded or below expectations with CDH/CM 7.1.1.
After the upgrade, the queue in which the user runs workloads cannot grow beyond its configured capacity up to its maximum capacity.
Workaround: After the cluster is upgraded to CDP Private Cloud Base 7.1.1, you may need to set yarn.scheduler.capacity.<queuepath>.user-limit-factor to a value greater than 1. This setting allows queue usage to grow beyond the configured capacity, up to the queue's configured maximum capacity.
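For example, for a hypothetical queue path of root.default, the underlying Capacity Scheduler property would look like the following (in CDP this value is normally set through Queue Manager or Cloudera Manager rather than by editing capacity-scheduler.xml directly; 2 is only an example value):
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>2</value>
  </property>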
DOCS-5966: Third-party applications do not launch if the MapReduce framework path is not included in the client configuration
The MapReduce application framework is loaded from HDFS instead of being installed locally on the NodeManagers. By default, the mapreduce.application.framework.path property is set to the appropriate value, but third-party applications that use their own configurations do not include it and therefore fail to launch.
Workaround: Set the mapreduce.application.framework.path property to the appropriate value in the configuration used by third-party applications.
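For example, the third-party application's mapred-site.xml can carry the same value as the cluster-generated client configuration; the archive path below is only a placeholder, so copy the actual value from a mapred-site.xml generated by Cloudera Manager for your Runtime version:
  <property>
    <name>mapreduce.application.framework.path</name>
    <value>hdfs:///user/yarn/mapreduce/mr-framework/<version>-mr-framework.tar.gz#mr-framework</value>
  </property>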
JobHistory URL mismatch after server relocation
After moving the JobHistory Server to a new host, the URLs listed for the JobHistory Server on the ResourceManager web UI still point to the old JobHistory Server. This affects existing jobs only. New jobs started after the move are not affected.
Workaround: For any existing jobs that have the incorrect JobHistory Server URL, there is no option other than to allow the jobs to roll off the history over time. For new jobs, make sure that all clients have the updated mapred-site.xml that references the correct JobHistory Server.
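A minimal sketch of the relevant client-side mapred-site.xml entries, assuming the default JobHistory Server ports (10020 and 19888) and a placeholder host name:
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>new-jhs-host.example.com:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>new-jhs-host.example.com:19888</value>
  </property>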
CDH-49165: History link in ResourceManager web UI broken for killed Spark applications
When a Spark application is killed, the history link in the ResourceManager web UI does not work.
Workaround: To view the history for a killed Spark application, see the Spark HistoryServer web UI instead.
CDH-6808: Routable IP address required by ResourceManager
ResourceManager requires routable host:port addresses for yarn.resourcemanager.scheduler.address, and does not support using the wildcard 0.0.0.0 address.
Workaround: Set the address, in the form host:port, either in the client-side configuration, or on the command line when you submit the job.
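For example, in the client-side yarn-site.xml, with a placeholder host name and the default scheduler port 8030:
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>rm-host.example.com:8030</value>
  </property>
Applications that use ToolRunner can instead pass the same setting on the command line when submitting the job, for example -Dyarn.resourcemanager.scheduler.address=rm-host.example.com:8030.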
OPSAPS-52066: Stacks under Logs Directory for Hadoop daemons are not accessible from Knox Gateway.
Stacks under the Logs directory for Hadoop daemons, such as NameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer, are not accessible from Knox Gateway.
Workaround: Administrators can SSH directly to the Hadoop daemon's host to collect the stacks under the Logs directory.
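For example, for a NodeManager host with the default stacks location under the role's log directory (the host name and path are placeholders; adjust them for your deployment):
  ssh admin@nodemanager-host.example.com
  ls /var/log/hadoop-yarn/stacks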
COMPX-1445: Queue Manager operations are failing when Queue Manager is installed separately from YARN
If Queue Manager is not selected during the YARN installation, Queue Manager operations fail: Queue Manager reports that 0 queues are configured and shows several failures. This happens because the ZooKeeper configuration store is not enabled.
Workaround:
  1. In Cloudera Manager, select the YARN service.
  2. Click the Configuration tab.
  3. Find the Queue Manager Service property.
  4. Select the Queue Manager service that the YARN service instance depends on.
  5. Click Save Changes.
  6. Restart all services that are marked stale in Cloudera Manager.
COMPX-1451: Queue Manager does not support multiple ResourceManagers
When YARN High Availability is enabled, there are multiple ResourceManagers. Queue Manager receives multiple ResourceManager URLs for a High Availability cluster, but it determines the active ResourceManager URL only when the Queue Manager page is loaded. If the currently active ResourceManager goes down while the user is still using the Queue Manager UI, Queue Manager cannot handle the change gracefully.
Workaround: Reload the Queue Manager page manually.
COMPX-3134: YARN applications can get stuck due to a NullPointerException in Capacity Scheduler
If you enable asynchronous scheduling (yarn.scheduler.capacity.schedule-asynchronously.enable=true) in Capacity Scheduler, there is an edge case where a NullPointerException can cause the scheduler thread to exit and applications to get stuck without allocated resources. You can recognize this condition by the NullPointerException thrown by Capacity Scheduler.
Workaround: Restart the ResourceManager and check whether resources are allocated to the applications that were stuck.
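For example, after the restart you can use the standard YARN CLI to confirm that no applications remain stuck waiting for resources:
  yarn application -list -appStates ACCEPTED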
YARN cannot start if Kerberos principal name is changed
If the Kerberos principal name is changed in Cloudera Manager after launch, YARN cannot start. In this case the keytabs are generated correctly, but YARN cannot access ZooKeeper with the new Kerberos principal name because the znodes still have the old ACLs.
There are two possible workarounds:
  • Delete the znode and restart the YARN service (a sketch of this option follows this list).
  • Use the reset ZK ACLs command. This also sets the znodes below /rmstore/ZKRMStateRoot to world:anyone:cdrwa, which is less secure.
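A minimal sketch of the first workaround, assuming the default state-store parent path /rmstore (the value of yarn.resourcemanager.zk-state-store.parent-path) and a placeholder ZooKeeper server; stop the YARN service first and make sure you are authenticated with sufficient privileges to delete the znode:
  zookeeper-client -server zk-host.example.com:2181
  deleteall /rmstore
Then restart the YARN service.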
COMPX-8687: Missing access check for getAppAttempts
When the Job ACL feature is enabled using Cloudera Manager (YARN > Configuration > Enable JOB ACL property), the mapreduce.cluster.acls.enabled property is not propagated to all configuration files, including the yarn-site.xml configuration file. As a result, the ResourceManager process uses the default value of this property. The default value of mapreduce.cluster.acls.enabled is false.
Workaround: Enable the Job ACL feature using an advanced configuration snippet:
  1. In Cloudera Manager select the YARN service.
  2. Click Configuration.
  3. Find the YARN Service MapReduce Advanced Configuration Snippet (Safety Valve) property.
  4. Click the plus icon and add the following:
    • Name: mapreduce.cluster.acls.enabled
    • Value: true
  5. Click Save Changes.
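The safety valve entry above corresponds to the following property definition in the generated configuration files:
  <property>
    <name>mapreduce.cluster.acls.enabled</name>
    <value>true</value>
  </property>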

Unsupported Features

The following YARN features are currently not supported in Cloudera Data Platform:
  • GPU support for Docker
  • Hadoop Pipes
  • Fair Scheduler
  • Application Timeline Server (ATS 2 and ATS 1.5)
  • Container Resizing
  • Distributed or Centralized Allocation of Opportunistic Containers
  • Distributed Scheduling
  • Native Services
  • Pluggable Scheduler Configuration
  • Queue Priority Support
  • Reservation REST APIs
  • Resource Estimator Service
  • Resource Profiles
  • (non-Zookeeper) ResourceManager State Store
  • Shared Cache
  • YARN Federation
  • Rolling Log Aggregation
  • Docker on YARN (DockerContainerExecutor) on Data Hub clusters
  • Moving jobs between queues
  • Dynamic Resource Pools

Technical Service Bulletins

TSB 2021-539: Capacity Scheduler queue pending metrics can become negative in certain production workload scenarios causing blocked queues
The pending metrics of Capacity Scheduler queues can become negative in certain production workload scenarios.

Once this metric becomes negative, the scheduler is unable to schedule any further resource requests on the affected queue. As a result, new applications remain stuck in the ACCEPTED state unless the YARN ResourceManager is restarted or failed over.
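If ResourceManager High Availability is enabled, you can, for example, verify which ResourceManager is active before and after the restart or failover using the standard YARN CLI:
  yarn rmadmin -getAllServiceState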

Knowledge article
For the latest update on this issue see the corresponding Knowledge article: TSB 2021-539: Capacity Scheduler queue pending metrics can become negative in certain production workload scenarios causing blocked queues