Workload XM Diagnostic Data Collection

When you enable Workload XM, the Cloudera Management Service starts the Telemetry Publisher role. Telemetry Publisher collects metrics, configuration files, and log files from the Impala, Oozie, Hive, YARN, and Spark services for jobs running on CDH clusters, and transmits them to Workload XM. Telemetry Publisher collects metrics for all clusters that use the environments where Workload XM is enabled. This topic describes the sources of information sent to Workload XM and how that data is redacted.

Sources of Data Sent to Workload XM

The above diagram shows the sources from which you can configure Telemetry Publisher to collect diagnostic data. This data is collected in the following ways:

  • Pull — Telemetry Publisher pulls diagnostic data from these services periodically (once per minute, by default). These sources are indicated with the outbound arrows leading from Telemetry Publisher in the above diagram. They are Oozie, YARN, and Spark.
  • Push — A Cloudera Manager Agent pushes diagnostic data from these services to Telemetry Publisher within 5 seconds after a job finishes. These sources are indicated with the inbound arrows to Telemetry Publisher. They are Hive and Impala.

After the diagnostic data reaches Telemetry Publisher, the data is stored temporarily in the Telemetry Publisher data directory and exported to Workload XM periodically (once per minute).
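The pull, push, and export flow described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class name TelemetrySpool, its methods, and the in-memory list that stands in for the data directory are all hypothetical and are not part of Telemetry Publisher or any Cloudera API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TelemetrySpool:
    """Hypothetical stand-in for Telemetry Publisher's temporary data directory."""

    # Maps a pull-based service name (for example Oozie, YARN, Spark)
    # to a function that returns its recently completed work.
    pull_sources: Dict[str, Callable[[], List[dict]]]
    pending: List[dict] = field(default_factory=list)

    def receive_push(self, service: str, record: dict) -> None:
        # Push path: an agent hands over data for a service such as Hive or Impala.
        self.pending.append({"service": service, **record})

    def pull_once(self) -> None:
        # Pull path: poll every pull-based service one time.
        # In the text this happens once per minute by default.
        for service, fetch in self.pull_sources.items():
            for record in fetch():
                self.pending.append({"service": service, **record})

    def export_once(self, send: Callable[[List[dict]], None]) -> int:
        # Export path: ship everything staged so far, then clear the staging area.
        # In the text this also happens once per minute.
        batch, self.pending = self.pending, []
        if batch:
            send(batch)
        return len(batch)
```

In this sketch a scheduler would call pull_once and export_once on their own timers, while receive_push is called whenever an agent delivers data. Staging records before export keeps collection and transmission independent of each other.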

Diagnostic Data Collection Details

The diagnostic data collected by Telemetry Publisher and sent to Workload XM includes the following:

  • MapReduce Jobs — Telemetry Publisher polls the YARN Job History Server for recently completed MapReduce jobs. For each of these jobs, Telemetry Publisher collects the configuration file and the jhist file (the job history file, which contains job and task counters) from HDFS. Telemetry Publisher can also be configured to collect MapReduce task logs from HDFS and send them to Workload XM. By default, this log collection is turned off.
  • Spark Applications — Telemetry Publisher polls the Spark History Server for recently completed Spark applications. For each of these applications, Telemetry Publisher collects its event log from HDFS. Telemetry Publisher collects Spark application data only from Spark version 2.2 and later. Telemetry Publisher can also be configured to collect the executor logs of Spark applications from HDFS and send them to Workload XM, but this data collection is turned off by default.
  • Oozie Workflows — Telemetry Publisher polls Oozie servers for recently completed Oozie workflows and sends their details to Workload XM.
  • Hive Queries — Telemetry Publisher uses the same mechanism that Cloudera Navigator uses to collect Hive query audits: a Cloudera Manager agent. The agent periodically searches for query detail files, which HiveServer2 generates after a query completes, and then sends the details from those files to Telemetry Publisher.
  • Impala Queries — A Cloudera Manager agent periodically looks for query profiles of recently completed queries and sends them to Telemetry Publisher.

Redaction Capabilities for Diagnostic Data

The diagnostic data collected by Telemetry Publisher might contain sensitive data in the job configurations or the logs. There are several ways you can redact sensitive data before it is sent to Telemetry Publisher. Cloudera recommends enabling the following redaction features even if you are not sending diagnostic data to Telemetry Publisher:

  • Log and query redaction — This feature redacts information in the logs and queries that Telemetry Publisher collects, using filters defined as regular expressions. Refer to the Workload XM documentation and the Cloudera Manager documentation for information about log and query redaction.
  • MapReduce job properties redaction — You can redact job configuration properties before they are stored in HDFS. Since Telemetry Publisher reads job configuration files from HDFS, it only fetches redacted configuration information. See Redacting MapReduce Job Properties for more information.
  • Spark event and executor log redaction — The Spark2 on YARN service has the spark.redaction.regex configuration property, which can be used to redact sensitive data from event and executor logs. When this configuration property is enabled, Telemetry Publisher sends only redacted data to Workload XM. This configuration property is enabled by default, but can be overridden by using safety valves in Cloudera Manager or in the Spark application itself.
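As a rough illustration of how regular-expression redaction of this kind behaves, the following Python sketch applies two approaches: search-and-replace filters over free text (as with log and query redaction) and masking of property values whose names match a pattern (one way that a setting such as spark.redaction.regex can be understood). Every pattern, replacement string, and function name here is an assumption chosen for the example; none of them is a default shipped by Cloudera Manager or Spark.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical filters: (search pattern, replacement) pairs applied to free text.
LOG_FILTERS: List[Tuple[re.Pattern, str]] = [
    # Mask anything shaped like a 16-digit card number written in groups of four.
    (re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"), "XXXX-XXXX-XXXX-XXXX"),
    # Keep the "password=" prefix but mask the value that follows it.
    (re.compile(r"(?i)(password=)\S+"), r"\1[REDACTED]"),
]

# Illustrative pattern for sensitive property names (an assumption, not a shipped default).
KEY_PATTERN = re.compile(r"(?i)secret|password")


def redact_text(text: str) -> str:
    """Apply every text filter in order and return the redacted string."""
    for pattern, replacement in LOG_FILTERS:
        text = pattern.sub(replacement, text)
    return text


def redact_properties(props: Dict[str, str]) -> Dict[str, str]:
    """Mask the value of every property whose name matches KEY_PATTERN."""
    return {
        key: "[REDACTED]" if KEY_PATTERN.search(key) else value
        for key, value in props.items()
    }
```

Because redaction in this model happens before the data is stored or collected, a downstream reader such as Telemetry Publisher would only ever see the masked values.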

See Redact Data Before Sending to Workload XM for more information about data redaction in Workload XM.