Known Issues in MapReduce and YARN
Learn about the known issues in MapReduce and YARN, their impact or changes to functionality, and the available workarounds.
Known Issues
- COMPX-10909: Placement rules may not work correctly when the username contains a dot and the default queue is set to that queue
- Usernames that contain a dot do not work well with Capacity Scheduler placement rules.
- COMPX-12559: Queue ACL user interface lacks clarity for users
- When YARN ACLs are enabled, all users are administrators at the root queue level by default. However, the queue properties page of the YARN Queue Manager user interface is blank and does not show that these users are administrators.
- COMPX-12021: Queue Manager configurations on the Scheduler Configuration page are not working
- When you set the following properties on the YARN Queue Manager UI, they are written to capacity-scheduler.xml, where they have no effect on YARN. These properties must be set in yarn-site.xml, which does not happen when you set them through YARN Queue Manager; see the sketch after this list.
- "Maximum Application Priority" – "yarn.cluster.max-application-priority"
- "Enable Monitoring Policies" – "yarn.resourcemanager.scheduler.monitor.enable"
- "Monitoring Policies" – "yarn.resourcemanager.scheduler.monitor.policies"
- "Preemption: Observe Only" – "yarn.resourcemanager.monitor.capacity.preemption.observe_only"
- "Preemption: Monitoring Interval (ms)" – "yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval"
- "Preemption: Maximum Wait Before Kill (ms)" – "yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill"
- "Preemption: Total Resources Per Round" – "yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round"
- "Preemption: Over Capacity Tolerance" – "yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity"
- "Preemption: Maximum Termination Factor" – "yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor"
- "Enable Intra Queue Preemption" – "yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.enabled"
- COMPX-6214: When there are more than 600 queues in a cluster, potential timeouts occur due to performance reasons, and these are visible in the Configuration Service.
- A Cloudera Manager proxy timeout configuration field has now been added; this is tracked in OPSAPS-60554. For this release, the timeout is increased from 20 seconds to 5 minutes. However, if this problem occurs, Cloudera recommends that you increase the proxy timeout value further.
- OPSAPS-61245: The YARN NodeManager container executor's banned.users list is a static list that contains the default superusers, ensuring that no container is launched by a user with elevated privileges. If the process user is changed to a custom value, it is not included in the list automatically.
- To ensure that no container is launched by the new process user, add that user to the banned.users list manually; a sketch is shown below.
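A minimal sketch of what this looks like in container-executor.cfg on the NodeManager hosts, assuming the Linux container executor is in use; the custom process user yarnadmin below is hypothetical:
# container-executor.cfg (sketch); yarnadmin stands in for the custom process user
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin,yarnadmin
min.user.id=1000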
- COMPX-6054: PlacementPolicy rules (default rule) are not honored when the limit of 2 is breached for AQC
- If there is a dynamic parent queue, a limit of two levels applies. If more levels of queues would have to be created under this dynamic parent queue, it should be detected as an invalid path and the placement should fall through to the next rule. However, this does not happen: the queue creation fails and, subsequently, the application submission fails as well. This is essentially a discrepancy between AQC validation and MappingRule validation.
- COMPX-5817: The Queue Manager UI is not able to present a view of the pre-upgrade queue structure. CM Store is not supported, and therefore YARN does not preserve any of the pre-upgrade queue structure.
- When a Data Hub cluster is deleted, all saved configurations are also deleted. All YARN configurations are saved in CM Store, which is yet to be supported in Data Hub and Cloudera Manager. Hence, the YARN queue structure is also lost when a Data Hub cluster is deleted, upgraded, or restored.
- COMPX-6628: Unable to delete single leaf queue assigned to a partition.
- In the current implementation, you cannot delete a single leaf queue assigned to a partition.
- COMPX-5240: Restarting parent queue does not restart child queues in weight mode
- When a dynamic auto child creation enabled parent queue is stopped in weight mode, its static and dynamically created child queues are also stopped. However, when the dynamic auto child creation enabled parent queue is restarted, its child queues remain stopped. In addition, the dynamically created child queues cannot be restarted manually through the YARN Queue Manager UI either.
- COMPX-5244: Root queue should not be enabled for auto-queue creation
- After dynamic auto child creation is enabled for a queue using the YARN Queue Manager UI, you cannot disable it using the YARN Queue Manager UI. That can cause problems when you want to switch between resource allocation modes, for example from weight mode to relative mode. The YARN Queue Manager UI does not let you switch resource allocation mode if there is at least one dynamic auto child creation enabled parent queue in your queue hierarchy.
- COMPX-5589: Unable to add new queue to leaf queue with partition capacity in Weight/Absolute mode
- Scenario
- The user creates one or more partitions.
- Assigns a partition to a parent with children.
- Switches to the partition to distribute the capacities.
- Creates a new child queue under one of the leaf queues, but the following error is displayed:
Error : 2021-03-05 17:21:26,734 ERROR com.cloudera.cpx.server.api.repositories.SchedulerRepository: Validation failed for Add queue operation. Error message: CapacityScheduler configuration validation failed:java.io.IOException: Failed to re-init queues : Parent queue 'root.test2' have children queue used mixed of weight mode, percentage and absolute mode, it is not allowed, please double check, details: {Queue=root.test2.test2childNew, label= uses weight mode}. {Queue=root.test2.test2childNew, label=partition uses percentage mode}
- COMPX-5264: Unable to switch to Weight mode on creating a managed parent queue in Relative mode
- In the current implementation, if there is an existing managed queue in Relative mode, then conversion to Weight mode is not allowed.
- COMPX-5549: Queue Manager UI sets maximum-capacity to null when you switch mode with multiple partitions
- If you associate a partition with one or more queues and then switch the allocation mode before assigning capacities to the queues, an Operation Failed error is displayed as the max-capacity is set to null.
- COMPX-4992: Unable to switch to absolute mode after deleting a partition using YARN Queue Manager
- If you delete a partition (node label) which has been associated with queues and those queues have capacities configured for that partition (node label), the CS.xml still contains the partition (node label) information. Hence, you cannot switch to absolute mode after deleting the partition (node label).
- COMPX-3181: Application logs do not work for Azure and AWS clusters
- YARN application log aggregation fails for any YARN job (MR, Tez, Spark, and so on) that does not use cloud storage, or that uses a cloud storage location other than the one configured for YARN logs (yarn.nodemanager.remote-app-log-dir); see the sketch below.
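A minimal sketch of the property in yarn-site.xml, assuming logs are aggregated to the same cloud storage that the jobs use; the bucket and path below are hypothetical:
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>s3a://example-logs-bucket/yarn-app-logs</value>
</property>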
- COMPX-1445: Queue Manager operations fail when Queue Manager is installed separately from YARN
- If Queue Manager is not selected during the YARN installation, Queue Manager operations fail. Queue Manager reports that 0 queues are configured and several failures are present. This is because the ZooKeeper configuration store is not enabled; see the sketch below.
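For reference, the ZooKeeper-backed scheduler configuration store is selected through the yarn.scheduler.configuration.store.class property; the yarn-site.xml entry below is a sketch only, as enabling the store on an existing cluster may involve additional steps:
<property>
  <name>yarn.scheduler.configuration.store.class</name>
  <value>zk</value>
</property>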
- COMPX-1451: Queue Manager does not support multiple ResourceManagers
- When YARN High Availability is enabled, there are multiple ResourceManagers. Queue Manager receives multiple ResourceManager URLs for a High Availability cluster. It picks the active ResourceManager URL only when the Queue Manager page is loaded. Queue Manager cannot handle it gracefully when the currently active ResourceManager goes down while the user is still using the Queue Manager UI.
- COMPX-3329: Autorestart is not enabled for Queue Manager in Data Hub
- In a Data Hub cluster, Queue Manager is installed with autorestart disabled. Hence, if Queue Manager goes down, it will not restart automatically.
- Third party applications do not launch if MapReduce framework path is not included in the client configuration
- The MapReduce application framework is loaded from HDFS instead of being present on the NodeManagers. By default, the mapreduce.application.framework.path property is set to the appropriate value, but third-party applications that ship their own configurations will not launch unless that configuration carries the framework path as well; a sketch is shown below.
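A minimal sketch of the client-side mapred-site.xml entries that such an application's configuration would need to carry; the archive location and the mr-framework alias below are illustrative:
<property>
  <name>mapreduce.application.framework.path</name>
  <value>hdfs:///user/yarn/mapreduce/mr-framework/mr-framework.tar.gz#mr-framework</value>
</property>
<property>
  <name>mapreduce.application.classpath</name>
  <value>$PWD/mr-framework/*:$PWD/mr-framework/lib/*</value>
</property>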
- OPSAPS-57067: Yarn Service in Cloudera Manager reports stale configuration: yarn.cluster.scaling.recommendation.enable.
- This issue does not affect the functionality. Restarting the Yarn service fixes this issue.
- JobHistory URL mismatch after server relocation
- After moving the JobHistory Server to a new host, the URLs listed for the JobHistory Server on the ResourceManager web UI still point to the old JobHistory Server. This affects existing jobs only. New jobs started after the move are not affected.
- CDH-49165: History link in ResourceManager web UI broken for killed Spark applications
- When a Spark application is killed, the history link in the ResourceManager web UI does not work.
- CDH-6808: Routable IP address required by ResourceManager
- ResourceManager requires routable host:port addresses for yarn.resourcemanager.scheduler.address and does not support using the wildcard 0.0.0.0 address. A sketch of a routable address is shown below.
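For illustration, the address must name a specific, routable host; resourcemanager.example.com below is a placeholder and 8030 is the default scheduler port:
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>resourcemanager.example.com:8030</value>
</property>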
- OPSAPS-52066: Stacks under the Logs Directory for Hadoop daemons are not accessible from the Knox Gateway.
- Stacks under the Logs directory for Hadoop daemons, such as NameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer, are not accessible from the Knox Gateway.
- CDPD-2936: Application logs are not accessible in WebUI2 or Cloudera Manager
- Logs of running containers in the NodeManager local directory cannot be accessed either in Cloudera Manager or in WebUI2 due to log aggregation.
- YARN cannot start if Kerberos principal name is changed
- If the Kerberos principal name is changed in Cloudera Manager after launch, YARN is not able to start. In such a case, the keytabs are generated correctly, but YARN cannot access ZooKeeper with the new Kerberos principal name because of the old ACLs.
- Queue Manager does not open on using a custom user with a default Kerberos principal
- If a custom user is used with the default Kerberos principal, the Queue Manager web UI displays an HTTP ERROR 400 error.
- COMPX-3303: Auto queue deletion is not supported in relative and absolute resource allocation mode
- The auto queue deletion feature is enabled by default, and as a result dynamically created child queues are automatically deleted 300 seconds after the last job running on them finishes. However, this feature is not supported in relative and absolute resource allocation modes.
- COMPX-8687: Missing access check for getAppAttempts
- When the Job ACL feature is enabled using Cloudera Manager (the mapreduce.cluster.acls.enabled property), the property is not propagated to all configuration files, including the yarn-site.xml configuration file. As a result, the ResourceManager process uses the default value of this property. The default value of mapreduce.cluster.acls.enabled is false. A sketch of setting the property explicitly for the ResourceManager is shown below.
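A minimal sketch of setting the property explicitly in yarn-site.xml (for example through an advanced configuration snippet for the ResourceManager); whether true is the right value depends on your ACL policy:
<property>
  <name>mapreduce.cluster.acls.enabled</name>
  <value>true</value>
</property>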
- COMPX-7493: YARN Tracking URL that is shown in the command line does not work when Knox is enabled
- When Knox is configured for YARN, the Tracking URL printed in the command line of a YARN application such as spark-submit shows the direct URL instead of the Knox Gateway URL.
Technical Service Bulletins
- TSB 2023-641: InvalidClassException while editing queue configurations in YARN Queue Manager UI
- Under the situations described in the Knowledge article referenced below, a user may encounter an InvalidClassException error message while editing queue configurations in the Apache Hadoop YARN (YARN) Queue Manager UI.
- Knowledge article
- For the latest update on this issue, see the corresponding Knowledge article: TSB 2023-641: InvalidClassException while editing queue configurations in YARN Queue Manager UI.
Unsupported Features
- The following YARN features are currently not supported in Cloudera Data Platform:
- Application Timeline Server v2 (ATSv2)
- Auxiliary Services
- Container Resizing
- Distributed or Centralized Allocation of Opportunistic Containers
- Distributed Scheduling
- Docker on YARN (DockerContainerExecutor) on Data Hub clusters
- Fair Scheduler
- GPU support for Docker
- Hadoop Pipes
- Moving jobs between queues
- Native Services
- Pluggable Scheduler Configuration
- Queue Priority Support
- Reservation REST APIs
- Resource Estimator Service
- Resource Profiles
- (non-Zookeeper) ResourceManager State Store
- Rolling Log Aggregation
- Shared Cache
- YARN Federation