
Setting up column name based tagging

In VM-based environments with Cloudera Public Cloud runtime 7.2.18.500 or later, you can use column name based tagging to profile columns whose data quality might not trigger the column value based checks of the Cluster Sensitivity Profiler. This is typically useful for tables where a large share of rows contain a different type of data than the targeted data type, or no data at all.

A new classification must be created in Apache Atlas in advance. This classification (called a tag in Cloudera Data Catalog) is matched against tag rules to trigger the profiling. For more information, see Creating classifications.

  1. Create a tag rule for the tag previously created in Atlas to be applied to the column to be profiled.
    1. Go to Profilers > Tag Rules.
    2. Click + New.
    3. In the Tags field, enter the name of the previously created Atlas classification.
  2. Click + in the Resources tab of Tag Rules to create a regular expression that matches your column name.
  3. Select the regular expression matching the column name in Column Name Expression.
  4. Go to the profiler's configuration using the following path: Cloudera Manager > Clusters > profiler_scheduler > Configuration.
    1. Search for "spark" and edit Profiler Scheduler Spark conf.
  5. Add the following configuration snippet to set the level of confidence for the profiler to apply a tag:
    spark.sensitive.tagRule.<***TAG RULE NAME***>.<***TAG NAME***>.<***COLUMN NAME***>.confidence = value=100
    
  6. Click Save Changes.
  7. Wait until the changes are saved and the Restart button appears. Restart the scheduler service.
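
Before entering a regular expression in Column Name Expression, it can help to check it against a few of your column names offline. The following is a minimal sketch in Python, not part of the product: the function name, the sample column names, and the pattern are hypothetical. It also assumes that the whole column name must match the expression. The regular expression engine used by Cloudera Data Catalog may behave differently from Python's `re` module, so treat the result as a rough check only.

```python
import re


def matching_columns(pattern: str, columns: list[str]) -> list[str]:
    """Return the column names that the pattern matches in full."""
    compiled = re.compile(pattern)
    # fullmatch() requires the entire column name to match (an assumption
    # about how the tag rule is evaluated, not documented behavior).
    return [name for name in columns if compiled.fullmatch(name)]


# Hypothetical column names and pattern, for illustration only.
columns = ["customer_email", "email_address", "phone", "emails_sent_count"]
pattern = r".*email(_address)?"

print(matching_columns(pattern, columns))
```

If a column you expect to be tagged is missing from the output, adjust the expression before creating the tag rule.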