Cloudera Data Catalog Profilers

Profilers create metadata annotations that summarize the content and shape of data assets, for example the distribution of values shown in a box plot or histogram.

Cloudera Data Catalog supports two types of environments, and profilers run differently in each:

  • When using the VM-based environment, you can create a Cloudera Data Hub cluster for a profiler engine to run data profiling operations as a pipeline on data located in one of your data lakes. You can install the profiler agent in a data lake and set up a specific schedule to generate various types of data profiles.
  • When using the Compute Cluster enabled environment, after you launch a profiler cluster, an internal service provisions new Kubernetes pods to schedule and run profiler jobs on demand.

Profiler Name                   Description
Data Compliance Profiler        Automatically classifies your data with preconfigured tags such as PII, PCI, and HIPAA.
Activity Profiler               Summarizes Ranger audit logs.
Statistics Collector Profiler   Provides summary statistics such as maximum, minimum, mean, unique, and null values at the Hive column level.
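
To make the column-level summary described for the Statistics Collector Profiler more concrete, the following is a minimal Python sketch of that kind of calculation on an in-memory column. It is only an illustration: the function name summarize_column and the output keys are hypothetical and are not part of any Cloudera API, and the actual profiler may compute its statistics differently.

```python
from statistics import mean
from typing import Optional, Sequence, Union

Number = Union[int, float]


def summarize_column(values: Sequence[Optional[Number]]) -> dict:
    """Return max, min, mean, unique count, and null count for one column.

    Hypothetical helper for illustration; None stands in for a NULL value.
    """
    non_null = [v for v in values if v is not None]
    return {
        "max": max(non_null) if non_null else None,
        "min": min(non_null) if non_null else None,
        "mean": mean(non_null) if non_null else None,
        "unique": len(set(non_null)),          # distinct non-null values
        "nulls": len(values) - len(non_null),  # count of NULL entries
    }


stats = summarize_column([3, 7, None, 7, 10])
print(stats)
```

For the sample column above, the summary reports a maximum of 10, a minimum of 3, a mean of 6.75, three unique values, and one null.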

Limitations

  • In Compute Cluster enabled environments, profilers support only tables stored on AWS S3.
  • Supported file formats:
    • VM-based environments:
      • CSV
      • Avro
    • Compute Cluster enabled environments:
      • Statistics Collector profilers and Data Compliance profilers:
        • CSV
        • Parquet
        • Iceberg tables
        • ORC
        • Avro
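
The limitations above can be captured as a small lookup, which may be handy when checking a table before scheduling a profiler job. This is a sketch under stated assumptions: the names SUPPORTED_FORMATS and is_profilable and the environment keys "vm" and "compute_cluster" are hypothetical and do not come from any Cloudera API. The Compute Cluster entry reflects the formats listed for the Statistics Collector and Data Compliance profilers.

```python
# Hypothetical lookup built from the limitations listed above.
SUPPORTED_FORMATS = {
    "vm": {"csv", "avro"},
    "compute_cluster": {"csv", "parquet", "iceberg", "orc", "avro"},
}


def is_profilable(environment: str, file_format: str, storage: str = "s3") -> bool:
    """Return True if a table with this format and storage can be profiled.

    Only the Compute Cluster restriction to AWS S3 is modeled, because it is
    the only storage limitation stated above.
    """
    formats = SUPPORTED_FORMATS.get(environment)
    if formats is None:
        raise ValueError(f"unknown environment: {environment!r}")
    if environment == "compute_cluster" and storage.lower() != "s3":
        return False
    return file_format.lower() in formats


print(is_profilable("compute_cluster", "Parquet"))
print(is_profilable("compute_cluster", "parquet", storage="hdfs"))
print(is_profilable("vm", "parquet"))
```

Keeping the rules in a single dictionary means a change in the supported formats requires updating only one place.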