Error Messages and Various Failures

These error messages may indicate an issue with Kerberos authentication or other issues.

Cluster cannot run jobs after Kerberos enabled

Symptom: Cluster previously configured without Kerberos authentication may fail to run jobs for certain users on certain TaskTrackers (MRv1) or NodeManagers (YARN) after enabling Kerberos for the cluster. Errors may display in the TaskTracker or NodeManager logs. The following example errors are from TaskTracker on MRv1:

10/11/03 01:29:55 INFO mapred.JobClient: Task Id : attempt_201011021321_0004_m_000011_0, Status : FAILED
Error initializing attempt_201011021321_0004_m_000011_0: 
java.io.IOException: org.apache.hadoop.util.Shell$ExitCodeException: 
at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:212) 
at org.apache.hadoop.mapred.LinuxTaskController.initializeUser(LinuxTaskController.java:442) 
at org.apache.hadoop.mapreduce.server.tasktracker.Localizer.initializeUserDirs(Localizer.java:272) 
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:963) 
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2209) 
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2174) 
Caused by: org.apache.hadoop.util.Shell$ExitCodeException: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:250) 
at org.apache.hadoop.util.Shell.run(Shell.java:177) 
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:370) 
at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:203) 
... 5 more
Possible cause: The issue is caused by legacy content in directories for TaskTracker and NodeManager that may exist after configuring Kerberos authentication for a cluster that previously did not use Kerberos. The sequence of events leading to this issue is as follows:
  1. Cluster that had not been configured for Kerberos authentication was used to run jobs, which created local user directory (or directories) on each TaskTracker or NodeManager host.
  2. Cluster was then configured to use Kerberos authentication.
  3. Users try running jobs on the newly secured cluster but local user directories on TaskTrackers or NodeManagers are owned by the wrong user or have overly-permissive permissions.
These directories should have been cleaned up at the time Kerberos authentication was enabled for the cluster.

Steps to resolve: Delete mapred.local.dir or yarn.nodemanager.local-dirs directories across the cluster for affected users.

NameNode fails to start

Symptom: Login failure occurs when attempting to start the NameNode. With debugging enabled on the cluster, the following KrbException messages may display:
Caused by: KrbException: Integrity check on decrypted field failed (31) - PREAUTH_FAILED}}
Caused by: KrbException: Identifier does not match expected value (906)

Possible cause: This issue may be due to incorrect configuration for AES-256. By default, certain operating systems—CentOS/Red Hat Enterprise Linux 5.6 (and higher), Ubuntu—use AES-256 encryption which requires installing the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File on all hosts (see JCE Policy File for AES-256 Encryption for details), or disabling AES-256 support in the kdc.conf or krb5.com (see disable AES-256 encryption from the Kerberos instance for details).

Steps to resolve: KrbException 31 and KrbException 906 can be caused by various issues, but the most likely cause is incorrectly configured AES-256 encryption. Resolving the issue should start by determining the type of encryption configured for the cluster.

To verify the type of encryption configured for the cluster:

  1. On the local KDC host, type this command to create a test principal:
    $ kadmin -q "addprinc test" 
  2. On a cluster host, type this command to start a Kerberos session as test:
    $ kinit test 
  3. On a cluster host, type this command to view the encryption type in use:
    $ klist -e 

    If AES-256 is being used, output such as the following displays:

    Ticket cache: FILE:/tmp/krb5cc_0
    Default principal: test@SCM
    Valid starting     Expires            Service principal
    05/19/11 13:25:04  05/20/11 13:25:04  krbtgt/SCM@SCM
        Etype (skey, tkt): AES-256 CTS mode with 96-bit SHA-1 HMAC, AES-256 CTS mode with 96-bit SHA-1 HMAC 

To remove AES-256 encryption from the Kerberos configuration files:

  • Remove aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file.
  • After changing the configuration, restart the KDC and the kadmin servers.
  • Change TGT principal (krbtgt/REALM@REALM) and other principal passwords as needed.
  • In the [realms] section of the kdc.conf file, for the realm associated with HADOOP.LOCALDOMAIN, add (or replace if it exists already) the following variable:
    supported_enctypes = des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal des-cbc-crc:v4 des-cbc-crc:afs3
  • Recreate the hdfs keytab file and mapred keytab file using the instructions in Viewing or Regenerating Kerberos Credentials Using Cloudera Manager.

Clients cannot connect to NameNode

Symptom: The NameNode keytab file does not have an AES-256 entry but the client tickets do. The NameNode starts but clients cannot connect to it. The error message does not specify "AES256" but rather contains enctype code "18."

Possible cause: Issue related to AES-256 encryption and the JCE library.

Steps to resolve: Verify that Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File is installed or remove aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file, as detailed above.

Hadoop commands run in local realm but not in remote realm

Symptom: After enabling cross-realm trust, authenticating as a principal in the local realm lets you successfully run Hadoop commands, but authenticating as a principal in the remote realm does not.

Possible cause: This issue is often due to principals in the two realms having different encryption types or different passwords for the cross-realm principal in each realm. Because the local and remote realm each issue their own TGTs, the local commands run but the service ticket needed for the local and remote realms to communicate cannot be granted.

Steps to resolve: Add the cross-realm krbtgt principal and its encryption types to the MIT KDC server using kadmin.local or kadmin as appropriate (local access vs. remote):
kadmin: addprinc -e "enc_type_list" krbtgt/LOCAL-REALM.EXAMPLE.COM@MAIN-REALM.COMPANY.COM
Specify the types of encryption supported by the cross-realm principal (krbtgt), for example, AES, DES, or RC4. Multiple encryption types can be specified in the enc_type_list as long as one of the encryption types matches that of the tickets granted by the KDC in the remote realm. For example:
kadmin:  addprinc -e "aes256-cts:normal rc4-hmac:normal des3-hmac-sha1:normal" krbtgt/LOCAL-REALM.EXAMPLE.COM@MAIN-REALM.COMPANY.COM

Users cannot obtain credentials when running Hadoop jobs or commands

Symptom: Users attempt to authenticate but message such as the following displays:
13/01/15 17:44:48 DEBUG ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. 
(63) - No service creds)]

Possible cause: Ticket message may be too large for the UDP protocol (which is used by SASL by default).

Steps to resolve: Force Kerberos to use TCP instead of UDP by adding the following parameter to libdefaults in the krb5.conf file on the clients where the problem is occurring:
[libdefaults]
udp_preference_limit = 1

Configure krb5.conf through Cloudera Manager, this will automatically get added to krb5.conf.

Bogus replay exceptions in service logs

Symptom: Multiple valid requests to Kerberos protected services are identified as replay attempts when they are not. The following exception shows up in the logs for one or more of the Hadoop daemons:

2013-02-28 22:49:03,152 INFO  ipc.Server (Server.java:doRead(571)) - IPC Server listener on 8020: readAndProcess threw exception javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism l
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))]
        at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:159)
        at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1040)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1213)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:566)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:363)
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
        at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:741)
        at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:323)
        at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:267)
        at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:137)
        ... 4 more
Caused by: KrbException: Request is a replay (34)
        at sun.security.krb5.KrbApReq.authenticate(KrbApReq.java:300)
        at sun.security.krb5.KrbApReq.<init>(KrbApReq.java:134)
        at sun.security.jgss.krb5.InitSecContextToken.<init>(InitSecContextToken.java:79)
        at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:724)
        ... 7 more

This issue can also manifest as poor performance for clients of the cluster, including dropped connections, timeouts attempting to make RPC calls, and so on.

Possible cause: Kerberos uses a second-resolution timestamp to protect against replay attacks (where an attacker can record network traffic, and play back recorded requests later to gain elevated privileges). That is, incoming requests are cached by Kerberos for a little while, and if there are similar requests within a few seconds, Kerberos will be able to detect them as replay attack attempts (see MIT Kerberos replay cache for more information).

However, if there are multiple valid Kerberos requests coming in at the same time, these may also be misjudged as attacks for the following reasons:

  • Multiple services in the cluster are using the same Kerberos principal. All secure clients that run on multiple machines should use unique Kerberos principals for each machine. For example, rather than connecting as a service principal myservice@EXAMPLE.COM, services should have per-host principals such as myservice/host123.example.com@EXAMPLE.COM.
  • Clocks not synchronized: All hosts should run NTP so that clocks are kept in sync between clients and servers.

Steps to resolve:

While having different principals for each service, and clocks in sync helps mitigate the issue, there are, however, cases where even if all of the above are implemented, the problem still persists. In such a case, disabling the cache (and the replay protection as a consequence), will allow parallel requests to succeed. This compromise between usability and security can be applied by setting the KRB5RCACHETYPE environment variable to none.

Note that the KRB5RCACHETYPE is not automatically detected by Java applications. For Java-based components:

  • Ensure that the cluster runs on JDK 8.
  • To disable the replay cache, add -Dsun.security.krb5.rcache=none to the Java Opts/Arguments of the targeted JVM. For example, HiveServer2 or the Sentry service.

Cloudera Manager cluster services fail to start

Symptom: One or more of the cluster services fails to start. For example, the DataNode fails to start. Error messages in the log files may display the following error messages:
Exception in secureMain 
java.lang.ExceptionInInitializerError 
at javax.crypto.KeyGenerator.nextSpi(KeyGenerator.java:324) 
at javax.crypto.KeyGenerator.<init>(KeyGenerator.java:157) 
...
Caused by: java.lang.SecurityException: The jurisdiction policy files are not signed by a trusted signer! 
at javax.crypto.JarVerifier.verifyPolicySigned(JarVerifier.java:289) 
at javax.crypto.JceSecurity.loadPolicies(JceSecurity.java:316) 
at javax.crypto.JceSecurity.setupJurisdictionPolicies(JceSecurity.java:261) 
...

Possible cause: This is another example of a mismatch for AES-256 encryption. Services cannot start when the version of the JCE policy file does not match the version of Java installed on a node because the cryptographic signatures for the JCE policy files cannot be verified, resulting in the message shown above.

Solution: Download the correct JCE policy files for the version of Java you are running:

Download and unpack the zip file. Copy the two JAR files to the $JAVA_HOME/jre/lib/security directory on each node within the cluster.

Error Messages

Incorrect permission Java exception (java.io.IOException)

Symptom: An incorrect permission error displays when trying to run a job:
java.io.IOException: Incorrect permission for
/var/folders/B3/B3d2vCm4F+mmWzVPB89W6E+++TI/-Tmp-/tmpYTil84/dfs/data/data1,
expected: rwxr-xr-x, while actual: rwxrwxr-x
       at org.apache.hadoop.util.DiskChecker.checkPermission(DiskChecker.java:107)
       at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:144)
       at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:160)
       at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1484)
       at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1432)
       at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1408)
       at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:418)
      ...

Possible cause: The daemon has umask 0002 rather than 0022.

Steps to resolve: Make sure that the umask for hdfs and mapred is 0022.

MapReduce (MRv1) Errors

Jobs won't run and cannot access files in mapred.local.dir

Symptom: The TaskTracker log contains the following error message:
WARN org.apache.hadoop.mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (1) 

Possible cause:

Steps to resolve:
  1. Add the mapred user to the mapred and hadoop groups on all hosts.
  2. Restart all TaskTrackers.

Jobs cannot run and TaskTracker cannot create local mapred directory

Symptom: The TaskTracker log contains the following error message:
11/08/17 14:44:06 INFO mapred.TaskController: main : user is atm
11/08/17 14:44:06 INFO mapred.TaskController: Failed to create directory /var/log/hadoop/cache/mapred/mapred/local1/taskTracker/atm - No such file or directory
11/08/17 14:44:06 WARN mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (20)
        at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:191)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
        at org.apache.hadoop.util.Shell.run(Shell.java:182)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
        at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:184)
        ... 8 more

Possible cause: Mismatch of mapred.local.dir values specified in mapred-site.xml and taskcontroller.cfg. These values should be the same.

Steps to resolve: Verify that the setting for mapred.local.dir is the same in both mapred-site.xml and taskcontroller.cfg, and reconfigure if necessary.

Jobs cannot run and TaskTracker cannot create Hadoop logs directory

Symptom: The TaskTracker log contains an error message similar to the following:
11/08/17 14:48:23 INFO mapred.TaskController: Failed to create directory /home/atm/src/cloudera/hadoop/build/hadoop-0.23.2-cdh3u1-SNAPSHOT/logs1/userlogs/job_201108171441_0004 - No such file or directory
11/08/17 14:48:23 WARN mapred.TaskTracker: Exception while localization java.io.IOException: Job initialization failed (255)
        at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:191)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
        at org.apache.hadoop.util.Shell.run(Shell.java:182)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
        at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:184)
        ... 8 more

Possible cause: Misconfiguration issue.

Steps to resolve: In MRv1, the default value specified for hadoop.log.dir in mapred-site.xml is /var/log/hadoop-0.20-mapreduce. The path must be owned and be writable by the mapred user. If you change the default value specified for hadoop.log.dir, make sure the value is identical in mapred-site.xml and taskcontroller.cfg. If the values are different, the error message above is returned.