Data Collection in Cloudera Data Science Workbench

Cloudera Data Science Workbench collects usage and diagnostic data in two ways, both of which can be controlled by system administrators.

Usage Tracking
Diagnostic Bundles

Usage Tracking

Cloudera Data Science Workbench collects aggregate usage data by sending limited tracking events to Google Analytics and Cloudera servers. No customer data or personal information is sent as part of these events.

Disable Usage Tracking

Log in to Cloudera Data Science Workbench as a site administrator.
Click Admin.
Click the Settings tab.
Uncheck Send usage data to Cloudera.

Separately, Cloudera Manager collects anonymous usage information and sends it to Cloudera using Google Analytics. If you are running a CSD-based deployment and want to disable data collection by Cloudera Manager, see Managing Anonymous Usage Data Collection.

Diagnostic Bundles

Using Cloudera Manager

If you are working on a CSD-based deployment, Cloudera Data Science Workbench logs and diagnostic data are available as part of the diagnostic bundles created by Cloudera Manager. By default, Cloudera Manager is configured to collect diagnostic data weekly and to send it to Cloudera automatically. You can schedule data collection to run daily, weekly, or monthly, or disable scheduled collection entirely. To learn how to configure the frequency of data collection or disable it, see Diagnostic Data Collection by Cloudera Manager.

You can also manually trigger a collection and transfer of diagnostic data to Cloudera at any time. For instructions, see Manually Collecting and Sending Diagnostic Data to Cloudera.

You can configure which directory Cloudera Manager uses as a staging directory for diagnostic bundles. To set a different directory, change the Log Staging Directory property for your Cloudera Data Science Workbench instance in Cloudera Manager.

Usage Metrics

If you want to disable collection of usage metrics, configure the Feature flag overrides property in Cloudera Manager as described here:

Log in to Cloudera Manager.
Go to the CDSW service.
Click Configuration.
Search for the Feature flag overrides property. Add the following JSON code to disable usage metrics.
```
{"usage-reporting": false}
```
Click Save Changes.
Restart the CDSW service.
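
Because the property value must be valid JSON, it can help to check the text before pasting it into Cloudera Manager. The following is a minimal sketch in Python, not part of the product: the function name usage_reporting_disabled is hypothetical, and the only key it knows about is the "usage-reporting" key shown in the steps above.

```python
import json


def usage_reporting_disabled(override_text: str) -> bool:
    """Return True if the text is valid JSON that sets "usage-reporting" to false.

    Hypothetical helper for checking a value before it is pasted into the
    Feature flag overrides property; it does not talk to Cloudera Manager.
    """
    try:
        parsed = json.loads(override_text)
    except json.JSONDecodeError:
        # Typos such as Python-style False or single quotes are not valid JSON.
        return False
    return isinstance(parsed, dict) and parsed.get("usage-reporting") is False


if __name__ == "__main__":
    print(usage_reporting_disabled('{"usage-reporting": false}'))  # True
    print(usage_reporting_disabled("{'usage-reporting': False}"))  # False
```

A check like this catches the most common mistake, which is writing the boolean or the quotes in a form that JSON does not accept.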

Using the Command Line

System administrators can create diagnostic bundles with the cdsw logs command. By default, sensitive information is redacted from the log files in the bundle. This redacted bundle is the one you should attach to any case opened with Cloudera Support. Its filename has the form cdsw-logs-$hostname-$date-$time.redacted.tar.gz.

If you want to turn off redaction of log files, you can use the -x|--skip-redaction option as demonstrated below.

cdsw logs --skip-redaction

The unredacted bundle is meant for internal use only. Retain it at least for the duration of the support case, in case any critical information was redacted from the bundle sent to Cloudera. You can still share it with Cloudera at your discretion. Its filename has the form cdsw-logs-$hostname-$date-$time.tar.gz.
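
Since the two bundle types differ only in their filename suffix, a small check before uploading can reduce the risk of attaching the wrong one to a support case. This is a sketch under stated assumptions: the function name bundle_kind is hypothetical, the sample filenames use a made-up hostname and date format, and only the cdsw-logs- prefix and the two suffixes described above are relied on.

```python
def bundle_kind(filename: str) -> str:
    """Classify a diagnostic bundle as "redacted", "unredacted", or "unknown".

    Relies only on the documented filename forms:
      cdsw-logs-$hostname-$date-$time.redacted.tar.gz
      cdsw-logs-$hostname-$date-$time.tar.gz
    """
    if not filename.startswith("cdsw-logs-"):
        return "unknown"
    # Check the longer suffix first, because it also ends in ".tar.gz".
    if filename.endswith(".redacted.tar.gz"):
        return "redacted"
    if filename.endswith(".tar.gz"):
        return "unredacted"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical filenames; the real date and time format may differ.
    print(bundle_kind("cdsw-logs-host1-20200101-120000.redacted.tar.gz"))  # redacted
    print(bundle_kind("cdsw-logs-host1-20200101-120000.tar.gz"))  # unredacted
```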

The contents of both archives are stored as plain text and can easily be inspected by system administrators. Both forms are designed to be easily diff-able.
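
As an illustration of diffing two bundles, the sketch below uses Python's standard tarfile and difflib modules to compare files that have the same path in both archives. The function names are hypothetical, and the sketch assumes nothing about the internal layout of a bundle; if the archives place their files under differently named top-level directories, the paths would need to be normalized first.

```python
import difflib
import tarfile


def read_text_members(bundle_path):
    """Map each regular file in a .tar.gz archive to its lines of text."""
    members = {}
    with tarfile.open(bundle_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            data = archive.extractfile(member).read()
            text = data.decode("utf-8", errors="replace")
            members[member.name] = text.splitlines(keepends=True)
    return members


def diff_bundles(old_path, new_path):
    """Return unified-diff lines for every file path found in either archive."""
    old = read_text_members(old_path)
    new = read_text_members(new_path)
    lines = []
    for name in sorted(set(old) | set(new)):
        # A file missing from one archive is compared against empty content.
        lines.extend(
            difflib.unified_diff(
                old.get(name, []),
                new.get(name, []),
                fromfile="old/" + name,
                tofile="new/" + name,
            )
        )
    return lines
```

Calling diff_bundles with the paths of two bundles returns a list of diff lines that can be joined and printed; an empty list means no differences were found in the files compared.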

Usage Metrics

Starting with version 1.7, CDSW also gathers highly redacted information about which features are being used. When you create a diagnostic bundle, this information is packaged alongside the diagnostic information. If you want to turn off collection of information on feature usage, use the -u|--skip-usage-events flag when you generate the diagnostic bundle. For example:

cdsw logs --skip-usage-events
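
If bundle creation is scripted, the flags can be assembled programmatically. The sketch below is one possible wrapper, not a Cloudera-provided tool: the function names are hypothetical, it uses only the cdsw logs command and the two long flags documented above, and it assumes that both flags can be passed in the same invocation, which this page does not state explicitly.

```python
import subprocess


def cdsw_logs_command(skip_redaction=False, skip_usage_events=False):
    """Build the argument list for a cdsw logs invocation."""
    command = ["cdsw", "logs"]
    if skip_redaction:
        command.append("--skip-redaction")
    if skip_usage_events:
        # Assumption: this flag may be combined with --skip-redaction.
        command.append("--skip-usage-events")
    return command


def create_bundle(skip_redaction=False, skip_usage_events=False):
    """Run cdsw logs on the current host; raises if the command fails."""
    command = cdsw_logs_command(skip_redaction, skip_usage_events)
    subprocess.run(command, check=True)
```

Keeping the command construction separate from execution makes it possible to log or review the exact command before anything runs on the host.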

Information Collected in Diagnostic Bundles

Cloudera Data Science Workbench diagnostic bundles collect the following information:

System information such as hostnames, operating system, kernel modules and settings, available hardware, and system logs.
Cloudera Data Science Workbench version, status information, configuration, and the results of install-time validation checks.
Details about file systems, devices, and mounts in use.
CDH cluster configuration, including information about Java, Kerberos, installed parcels, and CDH services such as Spark 2.
Network configuration and status, including interfaces, routing configuration, and reachability.
Status information for system services such as Docker, Kubernetes, NFS, and NTP.
Listings for processes, open files, and network sockets.
Reachability, configuration, and logs for Cloudera Data Science Workbench application components.
Hashed Cloudera Data Science Workbench user names.
Information about Cloudera Data Science Workbench workloads (sessions, jobs, experiments, models, applications), including editor, kernel type, engine, ownership, termination status, and performance data.
