Troubleshooting Cloudera Director

This topic contains information on problems that can occur when you set up, configure, or use Cloudera Director, their causes, and their solutions.

Viewing Cloudera Director Logs

To help you troubleshoot problems, you can view the Cloudera Director logs. Log files are in the following locations:
  • Cloudera Director client
    • One shared log file per user account:
      $HOME/.cloudera-director/logs/application.log
  • Cloudera Director server
    • One file for all clusters:
      /var/log/cloudera-director-server/application.log
Cloudera Director normally logs only error, warning, and informational messages. To configure it to log debug level messages, edit the file logback.xml, which can be found at the following locations:
  • Cloudera Director client: /etc/cloudera-director-client/logback.xml
  • Cloudera Director server: /etc/cloudera-director-server/logback.xml
The XML file configures the logback logging library. To turn on all debug logging for Cloudera Director and its libraries, change the "root" element as follows:
<root level="DEBUG">
Enabling debug logging significantly increases the size of the logs, and may include more information than needed for troubleshooting. Once you discover the specific loggers that carry the information you care most about, you can narrow the scope of debug logging to those loggers only. For example, if after turning on all debug logging you find that the messages emitted from Cloudera Director itself are most important, set the root level back to INFO and add a new logger element like the following alongside the existing logger elements:
<logger name="com.cloudera.launchpad" level="DEBUG"/>
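
If you prefer to script this change rather than edit the file by hand, the following is a minimal Python sketch of the same two steps (root level back to INFO, one logger at DEBUG). The function name narrow_debug_logging and the sample configuration are hypothetical; a real logback.xml will contain other elements. Note that the standard library XML parser drops comments, so review the output before replacing the original file.

```python
import xml.etree.ElementTree as ET

def narrow_debug_logging(logback_xml, logger_name, root_level="INFO"):
    """Return logback XML with <root> at root_level and logger_name at DEBUG."""
    config = ET.fromstring(logback_xml)
    root = config.find("root")
    if root is None:
        raise ValueError("no <root> element found")
    root.set("level", root_level)
    # Reuse an existing <logger> entry for this name if there is one.
    for logger in config.findall("logger"):
        if logger.get("name") == logger_name:
            break
    else:
        logger = ET.Element("logger", {"name": logger_name})
        logger.tail = "\n  "
        # Insert just before <root> so it sits with the other logger elements.
        config.insert(list(config).index(root), logger)
    logger.set("level", "DEBUG")
    return ET.tostring(config, encoding="unicode")

# Hypothetical, heavily shortened configuration used only for illustration.
sample = """<configuration>
  <logger name="org.springframework" level="WARN"/>
  <root level="DEBUG">
    <appender-ref ref="FILE"/>
  </root>
</configuration>"""

print(narrow_debug_logging(sample, "com.cloudera.launchpad"))
```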

The logback.xml file can be reconfigured in many other ways to adjust how logging is performed. See Logback configuration in the Logback project documentation to learn more. Note that major changes to log format and contents will hamper the effectiveness of Cloudera support, if you should need to forward logs to them as part of troubleshooting.

Backing Up the H2 Embedded Database

By default, Cloudera Director uses an H2 embedded database to store environment and cluster data. The H2 embedded database file is located at:
/var/lib/cloudera-director-server/state.h2.db

Back up the state.h2.db file to avoid losing environment and cluster data. To ensure that your backup copy can be restored, use the H2 backup tools instead of simply copying the file. For more information, see the H2 Tutorial.

Manual Modifications to the Cloudera Director Database

Manual modifications to the Cloudera Director database are not supported. Modifications made outside of Cloudera Director control can lead to permanent data loss and unrecoverable errors in Cloudera Director.

Slow or Failed OS Updates in Some AWS Regions

Symptom

In AWS, Cloudera Director triggers operating system updates and performs software downloads on instances it allocates in your chosen region. Depending on the local network configuration, these updates and downloads may be slow or may fail.

Solutions

Consider trying one or more of the following steps:
  • Disable instance normalization. This stops Cloudera Director from performing its usual automated setup work on new instances. Replace that work with your own, either by building a custom AMI with the work already done or by using a bootstrap script. Normalization can be disabled in a configuration file; contact Cloudera support for guidance on what else you need to do.

  • Create a preloaded AMI. Cloudera Director can avoid downloading Cloudera Manager and CDH software if it is already present in expected locations on instances. This also speeds up deployment and cluster bootstrap processes, even when download speeds from Cloudera repositories are reasonable. See the documentation for more information.

  • Mirror Cloudera repositories. Instead of preloading an AMI with Cloudera software, you can host the software at local mirrors and point Cloudera Director to them as alternative download locations. As with preloaded AMIs, taking this step can speed up bootstrap processes and make your architecture less vulnerable to network problems. See the documentation for more information.

Cloudera Director Bootstrap Fails with Error Message

Symptom

Cloudera Director fails to bootstrap with the error message, "DNS is not configured correctly on at least one instance."

The Cloudera Agent log (cloudera-scm-agent.log) may contain a related entry; the DNS Issues section shows an example of a name resolution failure in that log.

Cause

Needs to be diagnosed.

Solution

Verify that DNS is configured properly. Check the server logs, which may contain additional warning messages and information about why DNS detection failed. For example, this error can appear when an invalid ssh user has been set.

Cloudera Manager API Call Fails

Symptom

A Cloudera Manager API call fails in Cloudera Director.

Cause

Needs to be diagnosed. (See Solution immediately below.)

Solution

Enable API debugging in Cloudera Manager by going to Settings on the Administration tab in Cloudera Manager and clicking the checkbox Enable Debugging of API. Then look at the Cloudera Manager server logs to get more information on why the API call failed.

Cloudera Director Cannot Manage a Cluster That Was Kerberized Through Cloudera Manager

Symptom

Cloudera Director cannot manage a cluster after Cloudera Manager is used to enable Kerberos on the cluster.

Cause

Once a cluster is deployed through Cloudera Director, some changes to the cluster that are made using Cloudera Manager cause Cloudera Director to be out of sync and unable to manage the cluster. See Cloudera Director and Cloudera Manager Usage.

Solution

Deploy a new kerberized cluster, use distcp to transfer data from the old cluster to the new one, and then destroy the old cluster.

RDS Name Conflicts

Symptom

RDS name conflicts occur when creating multiple clusters with the same configuration file.

Cause

Most often, deletion of an older RDS instance has not completed when you try to launch a new cluster using the same configuration file, and therefore the same RDS name.

Solution

Allow more time for an RDS instance to be completely removed before creating a new cluster with the same configuration file, or change the name of the RDS instance in the configuration files for new clusters.

New Cluster Fails to Start Because of Missing Roles

Symptom

A new cluster will not start because roles are missing.

Cause

Cloudera Director does not validate that all required roles are assigned when provisioning a cluster. This can lead to failures during the initial run of a new cluster. For example, if the gateway instance group was removed but the Flume Agent and Kafka Broker roles were assigned to that group, the cluster fails to start.

Solution

Ensure that all required role types for the CDH services included in the cluster are assigned to instances before starting the cluster.

Cloudera Director Server Will Not Start with Unsupported Java Version

Symptom

Cloudera Director server will not start, and /var/log/cloudera-director-server/cloudera-director-server.out has the following error:
Exception in thread "main" java.lang.UnsupportedClassVersionError: com/cloudera/launchpad/Server : Unsupported major.minor version 51.0

Cause

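You are running Cloudera Director server against an older, unsupported version of the Oracle Java SE Development Kit (JDK). Class file version 51.0 corresponds to Java 7, so this error means the server was started with Java 6 or earlier.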

Solution

Update to Oracle JDK version 7 or 8.
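
To read the version number in an error like the one above, the following Python sketch maps the class file major version to the Java release that introduced it. The function name required_java_version is hypothetical; the mapping (major version minus 44) holds for Java 5 and later.

```python
import re

def required_java_version(error_line):
    """Return the minimum Java release able to load the class named in the error."""
    match = re.search(r"Unsupported major\.minor version (\d+)\.(\d+)", error_line)
    if match is None:
        return None
    major = int(match.group(1))
    # Class file major versions increase by one per release:
    # 49 -> Java 5, 50 -> Java 6, 51 -> Java 7, 52 -> Java 8.
    return major - 44

line = ('Exception in thread "main" java.lang.UnsupportedClassVersionError: '
        'com/cloudera/launchpad/Server : Unsupported major.minor version 51.0')
print(required_java_version(line))  # prints 7
```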

Error Occurs if Tags Contain Unquoted Special Characters

Symptom

When you use a configuration file with the bootstrap command to start Cloudera Director client, or with the bootstrap-remote command to set up a cluster with Cloudera Director server, an error message is displayed if a tag or configuration name contains unquoted special characters. This applies to characters that are special in HOCON, including periods. For example, if the added configuration is in the form x.y, the following error message may be displayed:
com.typesafe.config.ConfigException$WrongType: ... <x> has type OBJECT rather than STRING
This means that x.y must be in quotes, as in "x.y".

Cause

Cloudera Director validation checks to ensure that special characters in configurations are enclosed in double quotes.

Solution

Use double quotes for special characters in configurations. An example of a configuration that would require double quotes is "log.dirs" in Kafka.
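
The reason for the OBJECT rather than STRING message is that HOCON reads an unquoted, dotted name as a path into nested objects. The following Python sketch only mimics that expansion to show the difference; it is not a HOCON parser, the function name expand_unquoted_key is hypothetical, and the value /data/kafka is a made-up example.

```python
def expand_unquoted_key(key, value):
    """Mimic how HOCON treats an unquoted, dotted key as a nested path."""
    result = value
    for part in reversed(key.split(".")):
        result = {part: result}
    return result

# Unquoted: log.dirs is read as a path, so "log" becomes an object.
print(expand_unquoted_key("log.dirs", "/data/kafka"))

# Quoted: "log.dirs" stays a single name whose value is a string.
print({"log.dirs": "/data/kafka"})
```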

DNS Issues

Symptom

Cloudera Director fails to bootstrap a cluster with a DNS error. The Cloudera Agent log (cloudera-scm-agent.log) will show an entry similar to the following:
[27/Mar/2017 20:26:16 +0000] 12596 Thread-13 https ERROR Failed to retrieve/store URL:
http://ip-10-202-202-109.ec2.internal:7180/cmf/parcel/download/CDH-5.10.0-1.cdh5.10.0.p0.41-el7.parcel.torrent -> 
/opt/cloudera/parcel-cache/CDH-5.10.0-1.cdh5.10.0.p0.41-el7.parcel.torrent 
<urlopen error [Errno -2] Name or service not known>

Cause

This can be caused by one of the following:
  • DNS Hostnames is not set to Yes in the Edit DNS Hostnames VPC configuration setting.
  • The Amazon Virtual Private Cloud (VPC) is not set up for forward and reverse hostname resolution. Forward and reverse DNS resolution is a requirement for many components of the Cloudera EDH platform, including Cloudera Director.

Solutions

In the AWS Management Console, go to Services > Networking and click VPC. In the VPC Dashboard, select your VPC and click Action. In the shortcut menu, click Edit DNS Hostnames and click Yes. If this does not fix the issue, continue with the instructions that follow to configure forward and reverse hostname resolution.

Configure the VPC for forward and reverse hostname resolution. You can verify if DNS is working as expected on a host by issuing the following one-line Python command:
python -c "import socket; print(socket.getfqdn()); print(socket.gethostbyname(socket.getfqdn()))"
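
The one-line command prints the host name and its forward lookup only. A slightly longer sketch, shown below, also performs the reverse lookup and compares the result with the original name. The function name check_dns and its parameters are hypothetical; the resolver functions are passed in as arguments only so that the check can be exercised without a network.

```python
import socket

def check_dns(fqdn=None, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ok, details) for forward and reverse resolution of a host name."""
    fqdn = fqdn or socket.getfqdn()
    try:
        ip = forward(fqdn)             # forward: name -> address
        reverse_name = reverse(ip)[0]  # reverse: address -> primary name
    except socket.error as exc:
        return False, "%s: lookup failed (%s)" % (fqdn, exc)
    # Compare names case-insensitively and ignore a trailing dot.
    ok = reverse_name.lower().rstrip(".") == fqdn.lower().rstrip(".")
    return ok, "%s -> %s -> %s" % (fqdn, ip, reverse_name)
```

Calling check_dns() with no arguments on a cluster host uses the system resolver. A False result means either that a lookup failed or that the reverse lookup returned a different name than the one you started with.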

For more information on DNS and Amazon VPCs, see DHCP Options Sets in the Amazon VPC documentation.

If you are using Amazon-provided DNS, perform these steps to configure DHCP options:
  1. Log in to the AWS Management Console.

  2. Select VPC from the Services navigation list box.

  3. In the left pane, click Your VPCs. A list of currently configured VPCs is displayed.

  4. Select the VPC you are using and note the DHCP options set ID.

  5. In the left pane, click DHCP Option Sets. A list of currently configured DHCP Option Sets is displayed.

  6. Select the option set used by the VPC.

  7. Check for an entry similar to the following and make sure the domain-name is specified. For example:

    domain-name = ec2.internal
    domain-name-servers = AmazonProvidedDNS 
  8. If it is not configured correctly, create a new DHCP option set for the specified region and assign it to the VPC. For information on how to specify the correct domain name, see the AWS Documentation.

Server Does Not Start

Symptom

The Cloudera Director server does not start or quickly exits with an Out of Memory exception.

Cause

The Cloudera Director server is running on a machine with insufficient memory.

Solution

Run Cloudera Director on an instance that has at least 1 GB of free memory. See Resource Requirements for more details on Cloudera Director hardware requirements.
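
On a Linux instance, one way to check the 1 GB guideline before starting the server is to read /proc/meminfo. The sketch below is an assumption about how you might do that, not part of Cloudera Director; the function name available_memory_mb is hypothetical.

```python
def available_memory_mb(meminfo_text):
    """Return MemAvailable (or MemFree as a fallback) in MB from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])  # /proc/meminfo reports kB
    kb = values.get("MemAvailable", values.get("MemFree"))
    if kb is None:
        raise ValueError("no MemAvailable or MemFree entry found")
    return kb // 1024

# Shortened sample of /proc/meminfo content, for illustration only.
sample_meminfo = "MemTotal: 2048000 kB\nMemFree: 512000 kB\nMemAvailable: 1536000 kB\n"
print(available_memory_mb(sample_meminfo) >= 1024)
```

To check a real host, pass the contents of /proc/meminfo to the function and compare the result with 1024.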

Problem When Removing Hosts from a Cluster

Symptom

A Modify Cluster operation fails to complete.

Cause

You are trying to shrink the cluster below the HDFS replication factor. See Removing Instances from a Cluster (Note) for more information about replication factors.

Solution

Do not attempt to shrink a cluster below the HDFS replication factor. Doing so can result in a loss of data.
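
As a quick arithmetic check before a Modify Cluster operation, the following Python sketch compares the number of DataNode instances that would remain with the replication factor. It assumes that every instance being removed runs a DataNode and uses the HDFS default replication factor of 3 unless told otherwise; the function name can_remove_instances is hypothetical.

```python
def can_remove_instances(datanode_count, instances_to_remove, replication_factor=3):
    """Return True if enough DataNodes remain to hold every block replica."""
    remaining = datanode_count - instances_to_remove
    return remaining >= replication_factor

print(can_remove_instances(5, 2))  # 3 DataNodes remain: True
print(can_remove_instances(5, 3))  # 2 DataNodes remain: False
```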

Problems Connecting to Cloudera Director Server

Symptom

You are unable to connect to the Cloudera Director server.

Cause

Security group or iptables settings are blocking connections to the server. Some operating systems have iptables turned on by default, and it must be turned off. For more information about configuring security groups, see Setting up the AWS Environment. For commands to turn off iptables, see either Installing Cloudera Director Server and Client on the EC2 Instance or Installing Cloudera Director Server and Client on Google Compute Engine.

Solution

Check security group and iptables settings and reconfigure if necessary.
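
To confirm whether the server port is reachable from your workstation, you can attempt a plain TCP connection. The following Python sketch is one way to do that; the function name can_connect is hypothetical, and the host name and port you pass in depend on your installation (7189 is assumed here to be the port the server listens on).

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_connect("director.example.com", 7189) returns False when a security group or iptables rule blocks the port, when nothing is listening on it, or when the host name does not resolve. The host name in this example is a placeholder.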