Configuring the Cluster Sensitivity Profiler

In addition to the generic configuration, you can configure scheduling and available resources for the Cluster Sensitivity Profiler.

Go to Profilers and select your data lake.
Go to Profilers > Configs.
Select Cluster Sensitivity Profiler.
The Detail page is displayed.
Click the toggle to enable or disable the profiler.
Select a schedule to run the profiler. The schedule is implemented as a Quartz cron expression.
For more information, see Understanding the Cron Expression generator.
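As a rough illustration of the schedule format, a Quartz cron expression has six space-separated fields (seconds, minutes, hours, day of month, month, day of week) plus an optional seventh field for the year. The helper below is a hypothetical sketch, not part of the product, that only labels the fields by position. The example expression is illustrative.

```python
# Field names of a Quartz cron expression, in order. The year field is optional.
QUARTZ_FIELDS = [
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
]


def label_quartz_fields(expression: str) -> dict:
    """Split a Quartz cron expression and label each field by position.

    Hypothetical helper for illustration only; it does not validate field values.
    """
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("a Quartz cron expression has 6 or 7 fields")
    return dict(zip(QUARTZ_FIELDS, parts))


# Illustrative schedule: every day at 02:00:00.
print(label_quartz_fields("0 0 2 * * ?"))
```

In Quartz, `?` means "no specific value" and is used in either the day-of-month or the day-of-week field.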
Select Last Run Check and set a period if needed.

Note

The Last Run Check configuration enables profilers to avoid profiling the same asset on each scheduled run.

For example, if a cron job is scheduled to start in about an hour and the Last Run Check configuration is enabled with a period of two days, the job scheduler filters out any asset that was already profiled in the last two days.

If the Last Run Check configuration is disabled, assets will be picked up for profiling as per the job cron schedule, honoring the asset filter rules.
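The effect of Last Run Check can be sketched as a simple filter. This is a minimal illustration, assuming each asset carries a timestamp of its last profiling run. The function name, the data shape, and the exact cutoff comparison are hypothetical and do not describe the scheduler's actual implementation. Asset filter rules are omitted here.

```python
from datetime import datetime, timedelta


def assets_due_for_profiling(last_profiled, now, last_run_check_days=None):
    """Return the names of assets that a scheduled run would pick up.

    last_profiled: dict mapping asset name -> datetime of the last profiling
        run, or None if the asset was never profiled (assumed data shape).
    last_run_check_days: the Last Run Check period; None means the check is
        disabled and every asset is picked up.
    """
    if last_run_check_days is None:
        return sorted(last_profiled)
    cutoff = now - timedelta(days=last_run_check_days)
    # Skip assets that were profiled within the Last Run Check period.
    return sorted(
        name for name, ts in last_profiled.items()
        if ts is None or ts < cutoff
    )


now = datetime(2024, 1, 10, 12, 0)
history = {
    "sales.orders": now - timedelta(days=1),     # profiled yesterday
    "sales.customers": now - timedelta(days=3),  # profiled three days ago
    "hr.employees": None,                        # never profiled
}
print(assets_due_for_profiling(history, now, last_run_check_days=2))
```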
Set the sampling configurations. When an asset or table is profiled, the profiler does not scan the whole table; it selects only a sample of the records.
1. Set the Sample Count – Indicates the number of times a table must be sampled for profiling. Cloudera recommends setting a value between 3 and 30.
2. Set the Sample Factor – Controls the randomization of records. Lower values promote better random samples. For example, with a value of 0.001, a new random number is generated for each record retrieved from Hive; if that number is less than or equal to the provided proportion (0.001), the record is included in the result set. The value range is from 0.001 through 0.5.
3. Set the Sample Records – Indicates the number of records to be retrieved in a given sample. Think of this as the LIMIT clause of a SQL query. The value range is from 100 through 100,000.
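How the three sampling settings interact can be sketched in a few lines. This is an in-memory simulation under stated assumptions, not the profiler's actual query logic: the function name is hypothetical, rows come from a Python list rather than Hive, and each sample is assumed to stop as soon as it reaches the record limit.

```python
import random


def sample_table(rows, sample_count, sample_factor, sample_records, seed=None):
    """Draw sample_count independent samples from rows.

    A row is kept when a fresh random number is <= sample_factor, and each
    sample stops once it holds sample_records rows (similar to a LIMIT clause).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(sample_count):
        picked = []
        for row in rows:
            if rng.random() <= sample_factor:
                picked.append(row)
                if len(picked) == sample_records:
                    break
        samples.append(picked)
    return samples


table = list(range(100_000))  # stand-in for table rows
samples = sample_table(table, sample_count=3, sample_factor=0.001,
                       sample_records=100, seed=42)
print([len(s) for s in samples])
```

With a low Sample Factor the selected rows are spread across the whole table, while a high factor fills the record limit from the first rows encountered, which is one way to read the advice that lower values promote better random samples.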
Continue with the Pod Configurations and set the Kubernetes job resources.
Pod configurations specify the resources that are allocated to a pod when the profiler job starts to run. Because all profilers are submitted as Kubernetes jobs, you must decide whether to add or reduce resources to handle workloads of various sizes.
- Pod CPU limit – Specifies the maximum number of cores that can be allocated to a pod. The value range is from 1 through 8.
- Pod CPU Requirements – Specifies the minimum number of CPUs that will be allocated to a pod when it is provisioned. If the node where a pod is running has enough resources available, a container can use more of a resource than its request specifies. However, a container is not allowed to use more than its resource limit. The value range is from 1 through 8.
- Pod Memory limit – Specifies the maximum amount of memory that can be allocated to a pod. The value range is from 1 through 256.
- Pod Memory Requirements – Specifies the minimum amount of RAM that will be allocated to a pod when it is provisioned. If the node where a pod is running has enough resources available, a container can use more of a resource than its request specifies. However, a container is not allowed to use more than its resource limit. The value range is from 1 through 256.
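Before saving, it can help to sanity-check a set of pod values against the ranges listed above. The sketch below is a hypothetical helper, not part of the product. It also checks that each request does not exceed its limit, which Kubernetes requires for a pod specification. Memory values are taken in whatever unit the configuration page uses.

```python
def validate_pod_resources(cpu_request, cpu_limit, memory_request, memory_limit):
    """Return a list of problems found in the pod settings (empty if none).

    Ranges follow the documented values: CPU 1 through 8, memory 1 through 256.
    """
    errors = []
    checks = [
        ("Pod CPU Requirements", cpu_request, 1, 8),
        ("Pod CPU limit", cpu_limit, 1, 8),
        ("Pod Memory Requirements", memory_request, 1, 256),
        ("Pod Memory limit", memory_limit, 1, 256),
    ]
    for name, value, low, high in checks:
        if not low <= value <= high:
            errors.append(f"{name} must be from {low} through {high}, got {value}")
    # A request is the guaranteed minimum, so it cannot exceed the limit.
    if cpu_request > cpu_limit:
        errors.append("Pod CPU Requirements exceeds Pod CPU limit")
    if memory_request > memory_limit:
        errors.append("Pod Memory Requirements exceeds Pod Memory limit")
    return errors


print(validate_pod_resources(cpu_request=2, cpu_limit=4,
                             memory_request=8, memory_limit=16))
```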
Update the Executor Configurations.
Executor configurations specify the runtime configuration. These settings must be updated when you change the pod configurations or when you require additional compute power.
- Number of workers – Specifies the number of processes that are used by the distributed computing framework. The value range is from 1 through 8.
- Number of threads per worker – Specifies the number of threads used by each worker to complete the job. The value range is from 1 through 8.
- Worker Memory limit in GB – Enforces a memory usage threshold for a given worker to prevent memory overutilization. For example, if you have an 8 GB pod and 4 threads, the value of this parameter must be 2 GB. The value range is from 1 through 4.
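The arithmetic in the Worker Memory limit example (an 8 GB pod with 4 threads gives 2 GB) can be written out as follows. This sketch only reproduces that one example and the documented 1 through 4 range. The function name is hypothetical, and the product may apply a different sizing rule in other cases.

```python
def worker_memory_limit_gb(pod_memory_gb: float, threads: int) -> float:
    """Divide pod memory evenly, following the example in the text above.

    Assumption: the limit is pod memory divided by the thread count, as in
    the 8 GB / 4 threads = 2 GB example.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    limit = pod_memory_gb / threads
    # The documented range for Worker Memory limit in GB is 1 through 4.
    if not 1 <= limit <= 4:
        raise ValueError(f"{limit} GB is outside the allowed range 1 through 4")
    return limit


print(worker_memory_limit_gb(8, 4))  # prints 2.0
```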
Add Asset Filter Rules as needed to customize the selection of assets to be profiled.
1. Set your Deny List and Allow List.
  The profiler will skip profiling assets that meet any criteria in the Deny List and will include assets that meet any criteria in the Allow List.
  1. Select the Deny List or Allow List tab.
  2. Click Add New to define new rules.
  3. Select one of the following keys from the drop-down list:
    - Database name
    - Asset name
    - Asset owner
    - Path to the asset
    - Created date
  4. Select the operator from the drop-down list. Depending on the key selected, you can select an operator such as equals or contains. For example, you can select assets whose names contain a particular string.
  5. Enter the value corresponding to the key. For example, you can enter a string as mentioned in the previous example.
  6. Click Done. Once the rule is added, you can enable or disable it as needed by clicking the state toggle.
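The rule evaluation described in this step can be sketched as follows. This is a minimal model under stated assumptions, not the profiler's actual logic: the `Rule` class, the key names, and the asset dictionary are hypothetical; a Deny List match is assumed to take precedence over an Allow List match; and an empty Allow List is assumed to allow every asset that is not denied.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    key: str        # for example "database_name" or "asset_name" (assumed names)
    operator: str   # "equals" or "contains"
    value: str
    enabled: bool = True  # mirrors the state toggle on each rule


def rule_matches(rule: Rule, asset: dict) -> bool:
    """Check a single rule against an asset's attributes."""
    field = str(asset.get(rule.key, ""))
    if rule.operator == "equals":
        return field == rule.value
    if rule.operator == "contains":
        return rule.value in field
    raise ValueError(f"unsupported operator: {rule.operator}")


def should_profile(asset: dict, deny_list: list, allow_list: list) -> bool:
    """Decide whether an asset is profiled, ignoring disabled rules."""
    active_deny = [r for r in deny_list if r.enabled]
    active_allow = [r for r in allow_list if r.enabled]
    # Assumption: any Deny List match excludes the asset outright.
    if any(rule_matches(r, asset) for r in active_deny):
        return False
    # Assumption: with no active Allow List rules, nothing else is filtered.
    if not active_allow:
        return True
    return any(rule_matches(r, asset) for r in active_allow)


deny = [Rule("asset_name", "contains", "tmp")]
allow = [Rule("database_name", "equals", "sales")]
print(should_profile({"database_name": "sales", "asset_name": "orders"}, deny, allow))
```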
Click Save to apply the configuration changes to the selected profiler.