Monitoring and DiagnosticsPDF version

Solutions to Common Problems

The table below describes solutions to common cluster configuration problems.

Symptom Reason Solution
Cloudera Manager
The Cloudera Manager service will not be running as it exited abnormally.

Running service cloudera-scm-server status will print following message "cloudera-scm-server dead but pid file exists".

The Cloudera Manager Server log file /var/log/cloudera-scm-server/cloudera-scm-server.log will have a stacktrace with "java.lang.OutOfMemoryError" logged.

Out of memory. Examine the heap dump that the Cloudera Manager Server creates when it runs out of memory. The heap dump file is created in the /tmp directory, has file extension .hprof and file permission of 600. Its owner and group will be the owner and group of the Cloudera Manager server process, normally cloudera-scm:cloudera-scm.
You are unable to start service on the Cloudera Manager server, that is, service cloudera-scm-server start does not work and there are errors in the log file located at /var/log/cloudera-scm-server/cloudera-scm-server.log The server has been disconnected from the database or the database has stopped responding or has shut down. Go to /etc/cloudera-scm-server/db.properties and make sure the database you are trying to connect to is listed there and has been started.
Logs include APPARENT DEADLOCK entries for c3p0. These deadlock messages are cause by the c3p0 process not making progress at the expected rate. This can indicate either that c3p0 is deadlocked or that its progress is slow enough to trigger these messages. In many cases, progress is occurring and these messages should not be seen as catastrophic. There are a variety of ways to react to these log entries.
  • You may ignore these messages if system performance is not otherwise affected. Because these entries often occur during slow progress, they may be ignored in some cases.
  • You may modify the timer triggers. If c3p0 is making slow progress, increasing the period of time during which progress is evaluated stop the log entries from occurring. The default time between Timer triggers is 10 seconds and is configurable indirectly by configuring maxAdministrativeTaskTime.
  • You may increase the number of threads in the c3p0 pool, thereby increasing the resources available to make progress on tasks.
Starting Services
After you click the Start button to start a service, the Finished status does not display.

This may not be merely a case of the status not getting displayed. It could be for a number of reasons such as network connectivity issues or subcommand failures.

The host is disconnected from the Server, as will be indicated by missing heartbeats on the Hosts tab.
  • Look at the logs for the service for causes of the problem.
  • Restart the Agents on the hosts where the heartbeats are missing.
Subcommands failed resulting in errors in the log file indicating that either the command timed out or the target port was already occupied
  • Look at the log file at /var/log/cloudera-scm-server/cloudera-scm-server.log for more details on the errors. For example, if the port is already occupied you should see an "Address in use" error.
  • Go to the Hosts > Status tab. Click the Name of the host you want to inspect. Now go to the Processes tab and check the Stdout/Stderr logs to diagnose the cause of the failure. For example, if any binaries are missing or if Java could not be found.
After you click Start to start a service, the Finished status displays but there are error messages. The subcommands to start service components (such as JobTracker and one or more TaskTrackers) do not start. A port specified in the Configuration tab of the service is already being used in your cluster. For example, the JobTracker port is in use by another process. Enter an available port number in the port property (such as JobTracker port) in the Configuration tab of the service.
There are incorrect directories specified in the Configuration tab of the service (such as the log directory). Enter correct directories in the Configuration tab of the service.
Job is Failing No space left on device. One approach is to use a system monitoring tool such as Nagios to alert on the disk space or quickly check disk space across all systems. If you do not have Nagios or equivalent you can do the following to determine the source of the space issue:

In the JobTracker Web UI, drill down from the job, to the map or reduce, to the task attempt details to see which TaskTracker the task executed and failed on due to disk space. For example: http://JTHost:50030/taskdetails.jsp?tipid=TaskID . You can see on which host the task is failing in the Machine column.

In the NameNode Web UI, inspect the % used column on the NameNode Live Nodes page: http://namenode:50070/dfsnodelist.jsp?whatNodes=LIVE.

Send Test Alert and Diagnose SMTP Errors
You have enabled sending alerts from the Cloudera Manager Admin Console, however, Cloudera Manager does not seem to be sending any alerts.

Using the Send Test Alert link under Administration > Alerts shows success even though you do not receive an alert email.

There is possibly a mismatch of protocol or port numbers between your mail server and the Alert Publisher. For example, if the Alert Publisher is sending alerts to SMTPS on port 465 and your mail servers are not configured for SMTPS, you wouldn't receive any alerts. Use the following steps to make changes to the Alert Publisher configuration:
  1. In the Cloudera Manager Admin Console, click the Cloudera Management Service.
  2. Click the Configuration tab.
  3. Select Scope > Alert Publisher.
  4. Click the Main category.
  5. Change Alerts: Mail Server Protocol to smtp (or smtps).
  6. Click the Ports and Addresses category and change Alerts: Mail Server TCP Port to 25 (or to 465 for SMTPS)
  7. Enter a Reason for change, and then click Save Changes to commit the changes.
  8. Restart the Alert Publisher.