Troubleshooting Cluster Configuration and Operation
This section contains solutions to some common problems that prevent you from using Cloudera Manager and describes how to use Cloudera Manager log and notification management tools to diagnose problems.
Solutions to Common Problems
Symptom | Reason | Solution |
---|---|---|
Cloudera Manager | ||
The Cloudera Manager service is not running because it exited abnormally.
Running service cloudera-scm-server status prints the message "cloudera-scm-server dead but pid file exists". The Cloudera Manager Server log file /var/log/cloudera-scm-server/cloudera-scm-server.log contains a stack trace with "java.lang.OutOfMemoryError" logged. |
Out of memory. | Examine the heap dump that the Cloudera Manager Server creates when it runs out of memory. The heap dump file is created in the /tmp directory, has file extension .hprof and file permission of 600. Its owner and group will be the owner and group of the Cloudera Manager server process, normally cloudera-scm:cloudera-scm. |
You are unable to start the Cloudera Manager Server service; that is, service cloudera-scm-server start does not work and there are errors in the log file located at /var/log/cloudera-scm-server/cloudera-scm-server.log | The server has been disconnected from the database, or the database has stopped responding or has shut down. | Open /etc/cloudera-scm-server/db.properties and make sure the database you are trying to connect to is listed there and has been started. |
Logs include APPARENT DEADLOCK entries for c3p0. | These deadlock messages are caused by the c3p0 process not making progress at the expected rate. This can indicate either that c3p0 is deadlocked or that its progress is slow enough to trigger these messages. In many cases, progress is occurring and these messages should not be seen as catastrophic. | There are a variety of ways to react to these log entries.
|
Starting Services | ||
After you click the Start button to start a service, the Finished status doesn't display.
This may not be merely a case of the status not getting displayed. It could be for a number of reasons such as network connectivity issues or subcommand failures. |
The host is disconnected from the Server, as will be indicated by missing heartbeats on the Hosts tab. |
|
Subcommands failed, resulting in errors in the log file indicating that either the command timed out or the target port was already occupied. |
|
|
After you click Start to start a service, the Finished status displays but there are error messages. The subcommands to start service components (such as JobTracker and one or more TaskTrackers) do not start. | A port specified in the Configuration tab of the service is already being used in your cluster. For example, the JobTracker port is in use by another process. | Enter an available port number in the port property (such as JobTracker port) in the Configuration tab of the service. |
There are incorrect directories specified in the Configuration tab of the service (such as the log directory). | Enter correct directories in the Configuration tab of the service. | |
Job is Failing | No space left on device. | One approach is to use a system monitoring tool such as Nagios to alert on disk space or to quickly check disk space across all systems.
If you do not have Nagios or an equivalent, you can do the following to determine the source of the space issue:
In the JobTracker Web UI, drill down from the job, to the map or reduce, to the task attempt details to see which TaskTracker the task executed and failed on due to disk space. For example: http://JTHost:50030/taskdetails.jsp?tipid=TaskID. You can see on which host the task is failing in the Machine column. In the NameNode Web UI, inspect the % used column on the NameNode Live Nodes page: http://namenode:50070/dfsnodelist.jsp?whatNodes=LIVE. |
Send Test Alert and Diagnose SMTP Errors | ||
You have enabled sending alerts from the Cloudera Manager Admin Console; however, Cloudera Manager does not seem to be sending any alerts.
Using the Send Test Alert link under shows success even though you do not receive an alert email. |
There is possibly a mismatch of protocol or port numbers between your mail server and the Alert Publisher. For example, if the Alert Publisher is sending alerts to SMTPS on port 465 and your mail servers are not configured for SMTPS, you wouldn't receive any alerts. | Use the following steps to make changes to the Alert Publisher configuration:
|
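Two of the checks in the table above, whether a port is already occupied and how full a disk is, can be scripted on a host. The following is a minimal sketch, assuming Python 3 is available there. The function names `port_in_use` and `percent_used` are illustrative and are not part of Cloudera Manager. Port 50030 is taken from the JobTracker Web UI URL above.

```python
import shutil
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # Illustrative values only: adjust the port and path for your cluster.
    print("port 50030 in use:", port_in_use(50030))
    print("/ used: %.1f%%" % percent_used("/"))
```

The port check only detects a listener that is reachable on the given address; a process bound to a different interface would not be reported, so treat a negative result as a hint rather than proof.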
Logs and Events
For information about problems, check the logs and events:
- Logs present log information for services, filtered by role, host, or keywords as well as log level (severity).
- The Cloudera Manager Server log contains information on the server and host agents.
- The Events tab lets you search for and display events and alerts that have occurred within a selected time range filtered by service, hosts, or keywords.
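When the Admin Console is not reachable, a similar keyword and severity filter can be applied to a log file directly. The sketch below assumes that each log entry carries a severity token such as WARN or ERROR somewhere on its first line (a log4j-style layout); that layout and the function name `filter_log` are assumptions for illustration, not a Cloudera Manager API.

```python
import re

# Severity names in increasing order of importance (assumed log4j-style).
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
LEVEL_RE = re.compile(r"\b(" + "|".join(LEVELS) + r")\b")


def filter_log(lines, keyword=None, min_level="WARN"):
    """Yield lines at or above min_level that contain keyword (if given)."""
    threshold = LEVELS.index(min_level)
    for line in lines:
        match = LEVEL_RE.search(line)
        if match is None:
            # Continuation lines such as stack-trace frames carry no level.
            continue
        if LEVELS.index(match.group(1)) < threshold:
            continue
        if keyword is not None and keyword not in line:
            continue
        yield line.rstrip("\n")
```

For example, passing an open handle on /var/log/cloudera-scm-server/cloudera-scm-server.log together with `keyword="APPARENT DEADLOCK"` would list only the c3p0 warnings described earlier. Because lines without a severity token are skipped, stack-trace frames that follow a matching entry are not included.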