Cloudera Data Catalog Profilers

Profilers create metadata annotations that summarize the content and shape characteristics of data assets (such as the distribution of values shown in a box plot or histogram).

The Cloudera Data Catalog profiler uses Kubernetes-enabled job scheduling and runs profiler jobs on demand.
Profiler Name                 | Description
Data Compliance Profiler      | Automatically classifies your data with preconfigured tags such as PII, PCI, and HIPAA.
Activity Profiler             | A Ranger audit log summarizer.
Statistics Collector Profiler | Provides summary statistics such as maximum, minimum, mean, unique, and null values at the Hive column level.
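
To make the Statistics Collector output concrete, the following is a minimal sketch of the kind of per-column summary described above. It is not the profiler's implementation: the function name summarize_column and the sample data are hypothetical, and None is assumed to stand in for a NULL value.

```python
from statistics import mean


def summarize_column(values):
    """Summarize one column; None is treated as a NULL value (assumption)."""
    non_null = [v for v in values if v is not None]
    return {
        "max": max(non_null) if non_null else None,
        "min": min(non_null) if non_null else None,
        "mean": mean(non_null) if non_null else None,
        "unique": len(set(non_null)),           # distinct non-null values
        "nulls": len(values) - len(non_null),   # count of NULL entries
    }


# Hypothetical sample column with one NULL entry.
ages = [34, 27, None, 41, 27]
print(summarize_column(ages))
```

The sketch excludes NULL values before computing the maximum, minimum, and mean, and reports them separately as a count. Whether the actual profiler handles NULL values the same way is not stated here.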

Limitations

  • Cloudera Data Catalog on premises versions 1.5.5 SP1 and lower do not support Iceberg tables. In Cloudera Data Catalog on premises 1.5.5 SP2 and higher, Iceberg tables can be profiled.
  • In Compute Cluster enabled environments, profilers support only tables that are stored on AWS S3.
  • Supported file formats:
    • Statistics Collector profilers and Data Compliance profilers
      • CSV
      • Parquet
      • Iceberg tables
      • ORC
      • Avro
  • Profilers might hang if the underlying AWS environment cannot provide enough memory for the executor instances. In this case, reconfigure the executors with 4-5 GB of memory in Profiler Details > Configuration.
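
The storage and file format limitations above can be sketched as a simple pre-check run before scheduling a profiler job. This is an illustration only, not part of Cloudera Data Catalog: the function can_profile, its parameters, and the s3:// and s3a:// location prefixes are assumptions made for the example.

```python
# Formats listed above for the Statistics Collector and Data Compliance profilers.
SUPPORTED_FORMATS = {"csv", "parquet", "iceberg", "orc", "avro"}


def can_profile(table_format, location, compute_cluster_enabled,
                iceberg_supported=True):
    """Return (ok, reason) for a table, based on the limitations listed above.

    iceberg_supported is a hypothetical flag: set it to False for
    on premises versions 1.5.5 SP1 and lower.
    """
    fmt = table_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        return False, f"unsupported format: {table_format}"
    if fmt == "iceberg" and not iceberg_supported:
        return False, "Iceberg tables require on premises 1.5.5 SP2 or higher"
    # Assumption: S3-backed tables are recognizable by their location prefix.
    on_s3 = location.lower().startswith(("s3://", "s3a://"))
    if compute_cluster_enabled and not on_s3:
        return False, "Compute Cluster enabled environments require AWS S3 storage"
    return True, "ok"


print(can_profile("Parquet", "s3a://example-bucket/sales", True))
print(can_profile("ORC", "hdfs://namenode/warehouse/sales", True))
```

Returning a reason alongside the boolean keeps the check easy to surface in logs or a UI message when a table is skipped.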