Setting up column name based tagging

In VM-based environments with Cloudera Public Cloud runtime 7.2.18.500 or later, you can use column name based tagging to profile columns whose data quality might not trigger the column value based checks of the Cluster Sensitivity Profiler. This is typically useful for tables where a large proportion of rows contain a different type of data than the targeted data type, or no data at all.

You must create a new classification in Apache Atlas in advance. This classification (called a tag in Cloudera Data Catalog) is matched against tag rules to trigger the profiling. For more information, see Creating classifications.

Create a tag rule that applies the tag previously created in Atlas to the column to be profiled.
1. Go to Profilers > Tag Rules.
2. Click + New.
3. In the Tags field, enter the name of the previously created Atlas classification.
Click + in the Resources tab of Tag Rules to create a regular expression that matches your column name.

note
The regular expression must match the entire column name that you created this rule for. A partial match is not sufficient.
Select the regular expression matching the column name in Column Name Expression.

note
Because this rule is created exclusively to tag columns based on their name, skip the Column Value Expression field.
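The full-match requirement for the column name expression can be checked before you save the rule. The following is a minimal sketch in Python, not part of the product: the pattern and the column names are hypothetical, and the regular expression dialect used by the profiler may differ from Python's `re` module, so treat it only as an illustration of full versus partial matching.

```python
import re

# Hypothetical column-name pattern, used for illustration only.
COLUMN_NAME_PATTERN = r"(?i)cust(omer)?_email"


def matches_column_name(pattern: str, column_name: str) -> bool:
    """Return True only when the pattern matches the entire column name."""
    return re.fullmatch(pattern, column_name) is not None


# Hypothetical column names to try against the pattern.
candidates = ["customer_email", "cust_email", "customer_email_backup", "email"]
results = {name: matches_column_name(COLUMN_NAME_PATTERN, name) for name in candidates}

for name, matched in results.items():
    print(f"{name}: {'full match' if matched else 'no full match'}")
```

In this sketch, `customer_email_backup` contains the pattern but is not a full match, so a rule written this way would not apply to that column.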
Go to the profiler's configuration using the following path: Cloudera Manager > Clusters > profiler_scheduler > Configuration.
1. Search for "spark" and edit Profiler Scheduler Spark conf.
Add the following configuration snippet to set the level of confidence for the profiler to apply a tag:
```
spark.sensitive.tagRule.<***TAG RULE NAME***>.<***TAG NAME***>.<***COLUMN NAME***>.confidence = value=100
```
note
Although any value from 0 to 100 (both inclusive) is supported, it is recommended to set the value to 100, because this rule is used exclusively for column name matching.
The column name test and the column value test together have a total weight of 100%. If the confidence assigned to the column name matching is x, then the confidence assigned to the column value matching is (100-x) by default. The profiler suggests a tag for the column if the combined match score from testing both the column name and the column values is 70% or more.
Multiple configuration snippets can be used, each with a different tag name for different Cluster Sensitivity Profilers.
Click Save Changes.
Wait until the changes are saved and the Restart button appears, then restart the scheduler service.
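The weighting described in the confidence note can be worked through with a small calculation. The following is a minimal sketch in Python under stated assumptions: the function names are hypothetical, and it assumes that the column value part of the score scales linearly with the fraction of values that match. The source states only that the two weights add up to 100 and that a tag is suggested at a combined score of 70% or more, so the actual scoring inside the profiler may differ.

```python
SUGGESTION_THRESHOLD = 70.0  # combined score (in percent) stated in the note above


def combined_score(name_confidence: float, name_matched: bool,
                   value_match_ratio: float) -> float:
    """Combine the column name and column value tests into one score.

    name_confidence: weight (0 to 100) given to the column name test.
    name_matched: whether the column name expression fully matched.
    value_match_ratio: assumed fraction (0.0 to 1.0) of values that matched.
    """
    if not 0 <= name_confidence <= 100:
        raise ValueError("name_confidence must be between 0 and 100")
    if not 0.0 <= value_match_ratio <= 1.0:
        raise ValueError("value_match_ratio must be between 0.0 and 1.0")
    value_confidence = 100 - name_confidence  # the remaining weight, (100 - x)
    name_score = name_confidence if name_matched else 0
    # Assumption: the value score is proportional to the matching fraction.
    value_score = value_confidence * value_match_ratio
    return name_score + value_score


def tag_suggested(name_confidence: float, name_matched: bool,
                  value_match_ratio: float) -> bool:
    """Return True when the combined score reaches the 70% threshold."""
    score = combined_score(name_confidence, name_matched, value_match_ratio)
    return score >= SUGGESTION_THRESHOLD


# With confidence 100, the name match alone decides the outcome.
print(tag_suggested(100, True, 0.0))
# With confidence 50, the column values must also contribute to the score.
print(tag_suggested(50, True, 0.3))
```

Under these assumptions, a confidence of 100 lets a column with poor or missing data still be tagged on its name, which is the case this procedure targets, while lower values make the result depend partly on the column values.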