Cloudera Data Catalog Profilers
Profilers create metadata annotations that summarize the content and shape of your data assets, such as the distribution of values shown in a box plot or histogram.
Cloudera Data Catalog supports two types of environments:
- When using the VM-based environment, you can create a Cloudera Data Hub cluster for a profiler engine to run data profiling operations as a pipeline on data located in one of your data lakes. You can install the profiler agent in a data lake and set up a specific schedule to generate various types of data profiles.
- When using the Compute Cluster enabled environment, after you launch a profiler cluster, an internal service provisions new Kubernetes pods and schedules and runs profiler jobs on demand.
Profiler Name | Description |
---|---|
Data Compliance Profiler | Automatically classifies your data with preconfigured tags such as PII, PCI, and HIPAA. |
Activity Profiler | A Ranger audit log summarizer. |
Statistics Collector Profiler | Provides summary statistics like Maximum, Minimum, Mean, Unique, and Null values at the Hive column level. |
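
To make the Statistics Collector description concrete, the following is a minimal Python sketch of the kind of per-column summary it describes. It is an illustration only and not Cloudera's implementation: the function name `column_summary` and the use of `None` to represent a null value are assumptions.

```python
from statistics import mean


def column_summary(values):
    """Summarize one column of values; None stands in for a NULL (assumption)."""
    non_null = [v for v in values if v is not None]
    return {
        "max": max(non_null) if non_null else None,
        "min": min(non_null) if non_null else None,
        "mean": mean(non_null) if non_null else None,
        "unique": len(set(non_null)),           # distinct non-null values
        "nulls": len(values) - len(non_null),   # count of NULL entries
    }


# Hypothetical column with one NULL and one repeated value.
summary = column_summary([3, 7, None, 7, 10])
print(summary)
```

An actual profiler would compute these values at scale over Hive columns; the sketch only shows what each statistic means for a single in-memory column.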
Limitations
- In Compute Cluster enabled environments, profilers only support tables stored on AWS S3.
- Supported file formats:
  - VM-based environments:
    - CSV
    - Avro
  - Compute Cluster enabled environments:
    - Statistics Collector profilers and Data Compliance profilers
      - CSV
      - Parquet
      - Iceberg tables
      - ORC
      - Avro
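
The Compute Cluster limits listed above can be expressed as a simple pre-check. This is a hedged sketch rather than a Cloudera API: the names `COMPUTE_CLUSTER_FORMATS`, `S3_SCHEMES`, and `can_profile_on_compute_cluster` are hypothetical. It also assumes that table locations are available as URI strings and that S3 locations start with `s3://` or `s3a://`.

```python
# Hypothetical names; the format set mirrors the Compute Cluster list above.
COMPUTE_CLUSTER_FORMATS = {"csv", "parquet", "iceberg", "orc", "avro"}

# Assumption: S3-backed table locations use one of these URI schemes.
S3_SCHEMES = ("s3://", "s3a://")


def can_profile_on_compute_cluster(location: str, file_format: str) -> bool:
    """Return True if the table is on S3 and uses a listed file format."""
    on_s3 = location.lower().startswith(S3_SCHEMES)
    return on_s3 and file_format.lower() in COMPUTE_CLUSTER_FORMATS


print(can_profile_on_compute_cluster("s3a://bucket/sales", "Parquet"))
print(can_profile_on_compute_cluster("abfs://container/sales", "parquet"))
```

A check like this could run before scheduling a Statistics Collector or Data Compliance profiler job, so that unsupported tables are skipped early.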