Cloudera Data Catalog Profilers

Profilers create metadata annotations that summarize the content and shape characteristics of the data assets (such as distribution of values in a box plot or histogram).

Cloudera Data Catalog supports two types of environments:

- VM-based environments: you can create a Cloudera Data Hub cluster for a profiler engine that runs data profiling operations as a pipeline on data located in one of your data lakes. You can install the profiler agent in a data lake and set up a specific schedule to generate various types of data profiles.
- Compute Cluster enabled environments: after you launch a profiler cluster, an internal service provisions new Kubernetes pods, then schedules and runs profiler jobs on demand.


Profiler Name	Description
Data Compliance Profiler	Automatically classifies your data with preconfigured tags, such as PII, PCI, and HIPAA.
Activity Profiler	Summarizes Ranger audit logs.
Statistics Collector Profiler	Provides summary statistics like Maximum, Minimum, Mean, Unique, and Null values at the Hive column level.
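
To make the kind of summary produced at the column level more concrete, the following is a minimal sketch in plain Python. It is not the profiler's implementation: the function name `column_stats` and the sample values are hypothetical, and `None` is assumed to stand in for a NULL value.

```python
from statistics import mean


def column_stats(values):
    """Summarize one column. None is treated as a NULL value (assumption)."""
    non_null = [v for v in values if v is not None]
    return {
        "max": max(non_null) if non_null else None,
        "min": min(non_null) if non_null else None,
        "mean": mean(non_null) if non_null else None,
        "unique": len(set(non_null)),           # distinct non-null values
        "nulls": len(values) - len(non_null),   # count of NULL entries
    }


# Hypothetical sample column with one NULL entry
stats = column_stats([3, 7, None, 7, 10])
print(stats)
```

For the sample column, the sketch reports a maximum of 10, a minimum of 3, a mean of 6.75, three unique values, and one null value.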

Limitations

In Compute Cluster enabled environments, profilers support only tables that are stored on AWS S3 storage.
Supported file formats:
- Compute Cluster enabled environments:
  - Statistics Collector profilers and Data Compliance profilers
    - CSV
    - Parquet
    - Iceberg tables
    - ORC
    - Avro
Note
When trying to profile any unsupported assets, the status SKIPPED is shown in Profiler Details > Job History > Job Summary > Profiled Assets.
Compute Cluster based profilers might hang if the underlying AWS environment cannot provide the memory required for the executor instances. In this case, reconfigure your executors with 4-5 GB of memory in Profiler Details > Configuration.
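
The storage and file format limitations above can be sketched as a simple pre-check for Compute Cluster enabled environments. This is an illustration only, not a Cloudera API: the function `expected_status`, the lowercase format identifiers, and the `PROFILED` label are hypothetical. Only the `SKIPPED` status comes from the product.

```python
# Formats listed as supported in Compute Cluster enabled environments.
# The lowercase identifiers are an assumption made for this sketch.
SUPPORTED_FORMATS = {"csv", "parquet", "iceberg", "orc", "avro"}


def expected_status(file_format, storage):
    """Guess whether an asset would be profiled or shown as SKIPPED.

    'PROFILED' is a hypothetical label; only SKIPPED is a documented status.
    """
    on_s3 = storage.lower() == "s3"
    supported = file_format.lower() in SUPPORTED_FORMATS
    return "PROFILED" if on_s3 and supported else "SKIPPED"


print(expected_status("Parquet", "s3"))  # supported format on S3
print(expected_status("JSON", "s3"))     # format not in the supported list
print(expected_status("ORC", "hdfs"))    # supported format, but not on S3
```

A check like this can help you anticipate which assets appear as SKIPPED in the job summary before you schedule a profiler run.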