Sending Usage and Diagnostic Data to Cloudera

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

Cloudera Manager collects anonymous usage information, takes regularly scheduled snapshots of the state of your cluster, and automatically sends both to Cloudera anonymously. This data helps Cloudera improve and optimize Cloudera Manager.

If you have a Cloudera Enterprise license, you can also trigger the collection of diagnostic data and send it to Cloudera Support to help resolve a problem.

Configuring a Proxy Server

To configure a proxy server through which usage and diagnostic data is uploaded, follow the instructions in Configuring Network Settings.

Managing Anonymous Usage Data Collection

Cloudera Manager uses Google Analytics to send anonymous usage information to Cloudera. The information helps Cloudera improve Cloudera Manager. By default, anonymous usage data collection is enabled. To change this setting:

  1. Select Administration > Settings.
  2. Under the Other category, set the Allow Usage Data Collection property.
  3. Click Save Changes to commit the changes.

Managing Hue Analytics Data Collection

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

Hue tracks anonymized pages and application versions to collect information used to compare each application's usage levels. The data collected does not include hostnames or IDs. For example, the data has the format /2.3.0/pig, /2.5.0/beeswax/execute. You can restrict data collection as follows:
  1. Go to the Hue service.
  2. Click the Configuration tab.
  3. Select Scope > Hue.
  4. Locate the Enable Usage Data Collection property or search for it by typing its name in the Search box.
  5. Clear the Enable Usage Data Collection checkbox.

    To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.

  6. Click Save Changes to commit the changes.
  7. Restart the Hue service.

Diagnostic Data Collection

To help solve problems when using Cloudera Manager on your cluster, Cloudera Manager collects diagnostic data on a regular schedule and automatically sends it to Cloudera. By default, Cloudera Manager collects this data weekly and sends it automatically. Cloudera analyzes this data and uses it to improve the software. If Cloudera discovers a serious issue, Cloudera searches this diagnostic data and notifies customers with Cloudera Enterprise licenses who might encounter problems due to the issue. You can schedule data collection daily, weekly, or monthly, or disable scheduled collection entirely. You can also send a collected data set manually.

Automatically sending diagnostic data requires that the Cloudera Manager Server host has Internet access and is configured to send data automatically. If your Cloudera Manager Server does not have Internet access, and you have a Cloudera Enterprise license, you can manually send the diagnostic data as described in Manually Triggering Collection and Transfer of Diagnostic Data to Cloudera.

Automatically sending diagnostic data can occasionally fail with the error message "Could not send data to Cloudera." To work around this issue, you can manually send the data to Cloudera Support.

What Data Does Cloudera Manager Collect?

Cloudera Manager collects and returns a significant amount of information about the health and performance of the cluster. It includes:
  • Up to 1000 Cloudera Manager audit events: Configuration changes and the addition or removal of users, roles, services, and so on.
  • One day's worth of Cloudera Manager events: This includes critical errors Cloudera Manager watches for and more.
  • Data about the cluster structure which includes a list of all hosts, roles, and services along with the configurations that are set through Cloudera Manager. Where passwords are set in Cloudera Manager, the passwords are not returned.
  • Cloudera Manager license and version number.
  • Current health information for hosts, services, and roles. Includes results of health tests run by Cloudera Manager.
  • Heartbeat information from each host, service, and role. This includes status and some information about memory, disk, and processor usage.
  • The results of running Host Inspector.
  • One day's worth of Cloudera Manager metrics. If you are using Cloudera Express, host metrics are not included.
  • A download of the debug pages for Cloudera Manager roles.
  • For each host in the cluster, the result of running a number of system-level commands on that host.
  • Logs from each role on the cluster, as well as the Cloudera Manager server and agent logs.
  • Which parcels are activated for which clusters.
  • Whether there's an active trial, and if so, metadata about the trial.
  • Metadata about the Cloudera Manager Server, such as its JMX metrics, stack traces, and the database or host it's running with.
  • HDFS or Hive replication schedules (including command history) for the deployment.
  • Impala query logs.
  • Cloudera Data Science Workbench collects aggregate usage data by sending limited tracking events to Google Analytics and Cloudera servers. No customer data or personal information is sent as part of these bundles.

Configuring the Frequency of Diagnostic Data Collection

By default, Cloudera Manager collects diagnostic data on a weekly basis. You can change the frequency to daily, weekly, monthly, or never. If you are a Cloudera Enterprise customer and you set the schedule to never, you can still collect and send data to Cloudera on demand. If you are a Cloudera Express customer and you set the schedule to never, data is not collected or sent to Cloudera.

  1. Select Administration > Settings.
  2. Under the Support category, click Scheduled Diagnostic Data Collection Frequency and select the frequency.
  3. To set the day and time of day that the collection will be performed, click Scheduled Diagnostic Data Collection Time and specify the date and time in the pop-up control.
  4. Click Save Changes to commit the changes.

You can see the current setting of the data collection frequency by viewing Support > Scheduled Diagnostics: in the main navigation bar.

Specifying the Diagnostic Data Directory

You can configure the directory where collected data is stored.

  1. Select Administration > Settings.
  2. Under the Support category, set the Diagnostic Data Bundle Directory to a directory on the host running Cloudera Manager Server. The directory must exist and be writable by the cloudera-scm user. If this field is left blank, the data is stored in /tmp.
  3. Click Save Changes to commit the changes.
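Before saving, it can help to confirm that the directory meets the two requirements in step 2. The following is a minimal sketch, not part of Cloudera Manager: the function name check_bundle_directory is hypothetical, and the check applies to whichever user runs the script, so it assumes you run it as cloudera-scm (for example, with sudo -u cloudera-scm).

```python
import os
import tempfile


def check_bundle_directory(path):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
    # Creating files in a directory needs both write and execute permission.
    elif not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems


# Demonstration against a throwaway directory rather than a real bundle path.
with tempfile.TemporaryDirectory() as existing:
    print(check_bundle_directory(existing))
    print(check_bundle_directory(os.path.join(existing, "missing")))
```

Note that os.access reports permissions for the user running the script, so a result obtained as root or as your own account says little about what cloudera-scm can do.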

Redaction of Sensitive Information from Diagnostic Bundles

By default, Cloudera Manager redacts known sensitive information from diagnostic bundles. Cloudera Manager uses a set of standard rules to redact passwords and secrets. You can add redaction rules that use regular expressions to specify other data you want redacted from the bundles.

To specify redaction rules for diagnostic bundles:
  1. Go to Administration > Settings and search for the Redaction Parameters for Diagnostic Bundles parameter.

    The edit screen for the property displays.

  2. To add a new rule, click the icon. You can add one of the following:
    1. Credit Card numbers (with separator)
    2. Social Security Card numbers (with separator)
    3. Email addresses
    4. Custom rule (You must supply values for the Search and Replace fields, and optionally, the Trigger field.)
  3. To modify a rule, click the icon.
  4. Edit the redaction rules as needed. Each rule has a description field where you can enter free text describing the rule and you can modify the following three fields:
    • Search - Regular expression to compare against the data. For example, the regular expression \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4} searches for a credit card number pattern. Segments of data that match the regular expression are redacted using the Replace string.
    • Replace - String used to redact (obfuscate) data, such as a pattern of Xs to replace digits of a credit card number: XXXX-XXXX-XXXX-XXXX.
    • Trigger - Optional simple string to search for before applying the regular expression. The redactor applies the Search regular expression only if the Trigger string is found. Using the Trigger field improves performance because simple string matching is faster than regular expression matching.
  5. To delete a redaction rule, click the icon.
  6. Click Save Changes.
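The way the Search, Replace, and Trigger fields interact can be illustrated with a short sketch. This is not Cloudera Manager's redactor: the RedactionRule class, the redact function, and the password rule are hypothetical, and Python's re module may differ in details from the regular expression dialect Cloudera Manager uses. The credit card pattern and the replacement string are the ones given in step 4.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RedactionRule:
    """Hypothetical stand-in for one rule; field names mirror the UI labels."""
    search: str                    # regular expression to match
    replace: str                   # string substituted for each match
    trigger: Optional[str] = None  # optional plain string checked first


def redact(line: str, rules: List[RedactionRule]) -> str:
    """Apply each rule in order, skipping the regex when the trigger is absent."""
    for rule in rules:
        # Cheap substring test first; the regex only runs if it passes.
        if rule.trigger is not None and rule.trigger not in line:
            continue
        line = re.sub(rule.search, rule.replace, line)
    return line


card_rule = RedactionRule(
    search=r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}",
    replace="XXXX-XXXX-XXXX-XXXX",
)
# Illustrative custom rule; the pattern is an assumption, not a built-in rule.
password_rule = RedactionRule(
    search=r"password=\S+",
    replace="password=REDACTED",
    trigger="password=",
)

rules = [card_rule, password_rule]
print(redact("card 1234-5678-9012-3456 on file", rules))
print(redact("connecting with password=hunter2", rules))
```

In this sketch a line that does not contain the Trigger string never reaches the regular expression for that rule, which is where the performance benefit described above comes from.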

Collecting and Sending Diagnostic Data to Cloudera

Disabling the Automatic Sending of Diagnostic Data from a Manually Triggered Collection

If you do not want data automatically sent to Cloudera after manually triggering data collection, you can disable this feature. The data you collect will be saved and can be downloaded for sending to Cloudera Support at a later time.

  1. Select Administration > Settings.
  2. Under the Support category, clear the Send Diagnostic Data to Cloudera Automatically checkbox.
  3. Click Save Changes to commit the changes.

Manually Triggering Collection and Transfer of Diagnostic Data to Cloudera

To troubleshoot specific problems, or to re-send an automatic bundle that failed to send, you can manually send diagnostic data to Cloudera:

  1. Optionally, change the System Identifier property:
    1. Select Administration > Settings.
    2. Under the Other category, set the System Identifier property and click Save Changes.
  2. Under the Support menu at the top right of the navigation bar, choose Send Diagnostic Data. The Send Diagnostic Data form displays.
  3. Fill in or change the information here as appropriate:
    • Optionally, you can improve performance by reducing the size of the data bundle that is sent. Click Restrict log and metrics collection to expand this section of the form. The three filters, Host, Service, and Role Type, allow you to restrict the data that is sent. Cloudera Manager collects logs and metrics only for roles that match all three filters.
    • Select one of the following under Data Selection:
      • Select By Target Size to manually set the maximum size of the bundle. Cloudera Manager populates the End Time based on the setting of the Time Range selector. You should change this to be a few minutes after you observed the problem or condition that you are trying to capture. The time range is based on the timezone of the host where Cloudera Manager Server is running.
      • Select By Date Range to manually set the Start Time and End Time to collect the diagnostic data. Click the Estimate button to calculate the size of the bundle based on the start and end times. If the bundle is too large, narrow the selection using the start and end times or by selecting additional filters.
    • If you have a support ticket open with Cloudera Support, include the support ticket number in the field provided.
  4. Depending on whether you have disabled automatic sending of data, do one of the following:
    • Click Collect and Upload Diagnostic Data to Cloudera Support. A Running Commands window shows you the progress of the data collection steps. When these steps are complete, the collected data is sent to Cloudera.
    • Click Collect Diagnostic Data only. A Command Details window shows you the progress of the data collection steps.
      1. In the Command Details window, click Download Result Data to download and save a zip file of the information.
      2. Send the data to Cloudera Support by doing one of the following:
        • Send the bundle using a Python script:
          1. Download the phone_home script.
          2. Copy the script and the downloaded data file to a host that has Internet access.
          3. Run the following command on that host:
            python phone_home.py --file <downloaded data file>
        • Attach the bundle to the SFDC case. Do not rename the bundle, because renaming can delay processing.
        • Contact Cloudera Support and arrange to send the data file.
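The Host, Service, and Role Type filters described in step 3 combine with AND semantics: a role is included only if it matches all three. The sketch below illustrates that logic only. The Role class, the matches function, and the host and service names are hypothetical, and treating an unset filter as "no restriction" is an assumption about the form's behavior.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Role:
    """Hypothetical description of one role instance in a cluster."""
    host: str
    service: str
    role_type: str


def matches(role: Role,
            hosts: Optional[Set[str]] = None,
            services: Optional[Set[str]] = None,
            role_types: Optional[Set[str]] = None) -> bool:
    """True only if the role passes every filter that is set.

    A filter left as None is treated as unrestricted (an assumption).
    """
    return (
        (hosts is None or role.host in hosts)
        and (services is None or role.service in services)
        and (role_types is None or role.role_type in role_types)
    )


# Made-up inventory for illustration.
roles = [
    Role("node1.example.com", "hdfs", "DATANODE"),
    Role("node1.example.com", "yarn", "NODEMANAGER"),
    Role("node2.example.com", "hdfs", "DATANODE"),
]

selected = [
    r for r in roles
    if matches(r, hosts={"node1.example.com"}, services={"hdfs"},
               role_types={"DATANODE"})
]
print(selected)
```

Because the filters are combined rather than alternatives, each filter you add can only shrink the bundle, which is why adding filters is a way to reduce a bundle that is too large.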