Additional Configuration for Hive Column Profiler

Beyond the generic configuration, the Hive Column Profiler has additional parameters that you can optionally edit.

  1. Click Profilers in the main navigation menu on the left.
  2. Click Configs to view all of the configured profilers.
  3. Select the cluster for which you need to edit profiler configuration.

    The list of profilers for the selected cluster is displayed.

  4. Select the Hive Column Profiler to edit its configuration.

    You can use the toggle button to enable or disable the Hive Column Profiler.

    The Hive Column Profiler detail page is displayed. It contains the following sections:

    • Profiler Configurations

    • Pod Configurations

    • Executor Configurations

    • Asset Filter Rules

    Profiler Configurations
    Sampling (profiler) configurations let you regulate the sampling behaviour of the profiler. When an asset or table is profiled, the profiler does not scan the whole table; it samples records as it reads them.

    • Sample Count: Indicates the number of times a table is sampled for profiling. A value lower than 3 or higher than 30 is not recommended.
    • Sample Factor: Controls the randomisation of records. Lower values promote better random samples, and higher values result in poorer samples. With a value of 0.001, a new random number is generated for each record retrieved from Hive. If the random number is less than or equal to the configured proportion (0.001), the record is included in the result set. If it is greater, the record is ignored.
    • Sample Records: Indicates the number of records to retrieve in a given sample. Think of this as the LIMIT clause of the SQL query.
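    The interaction of the three sampling parameters can be sketched in Python. This is only an illustration under stated assumptions, not the profiler's actual implementation: the function names take_sample and sample_table are hypothetical, the table is modelled as an in-memory list, and the stop-at-limit behaviour is inferred from the LIMIT analogy above.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take_sample(rows: Sequence[T], sample_factor: float,
                sample_records: int, rng: random.Random) -> List[T]:
    """Keep a row when its random draw is <= sample_factor,
    and stop once sample_records rows have been collected."""
    picked: List[T] = []
    for row in rows:
        if len(picked) >= sample_records:
            break  # acts like the LIMIT clause (Sample Records)
        # A new random number in [0, 1) is drawn for every row.
        if rng.random() <= sample_factor:
            picked.append(row)
    return picked


def sample_table(rows: Sequence[T], sample_count: int, sample_factor: float,
                 sample_records: int, seed: int = 0) -> List[List[T]]:
    """Sample the same table sample_count times (Sample Count)."""
    rng = random.Random(seed)
    return [take_sample(rows, sample_factor, sample_records, rng)
            for _ in range(sample_count)]


if __name__ == "__main__":
    table = list(range(100_000))  # stand-in for a Hive table (assumption)
    samples = sample_table(table, sample_count=3,
                           sample_factor=0.001, sample_records=50)
    for number, sample in enumerate(samples, start=1):
        print(f"sample {number}: {len(sample)} rows")
```

    In this sketch, a small Sample Factor spreads the selected rows across the whole table, while a factor close to 1 simply returns the first rows read.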
    Pod Configurations

    As all profilers are submitted as Kubernetes jobs, you must decide whether to add or reduce resources to handle workloads of various sizes.

    Pod configurations specify the resources that are allocated to a Pod when the profiler job starts to run.

    • Pod CPU limit: Indicates the maximum number of cores that can be allocated to a Pod. Examples of accepted values are 0.5, 1, 2, 500m, and 250m.
    • Pod CPU Requirements: The minimum number of CPUs that are allocated to a Pod when it is provisioned. If the node where a Pod is running has enough resources available, a container is allowed to use more of a resource than its request specifies. However, a container is not allowed to use more than its resource limit. Examples of accepted values are 0.5, 1, 2, 500m, and 250m.
    • Pod Memory limit: The maximum amount of memory that can be allocated to a Pod. Examples of accepted values are 128974848, 129e6, 129M, 128974848000m, and 123Mi.
    • Pod Memory Requirements: The minimum amount of RAM that is allocated to a Pod when it is provisioned. If the node where a Pod is running has enough resources available, a container is allowed to use more of a resource than its request specifies. However, a container is not allowed to use more than its resource limit.
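    The accepted value formats follow the Kubernetes resource quantity notation, in which the suffix m means one thousandth, M means 10^6, and Mi means 2^20. The following Python sketch shows how such values compare. It is a simplified parser written for illustration, not the Kubernetes implementation, and the helper names parse_quantity and request_fits_limit are hypothetical.

```python
from decimal import Decimal

# Suffix multipliers from the Kubernetes quantity notation.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40, "Pi": Decimal(2) ** 50, "Ei": Decimal(2) ** 60,
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
    "T": Decimal(10) ** 12, "P": Decimal(10) ** 15, "E": Decimal(10) ** 18,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a value such as '500m', '129M' or '123Mi' to a plain number
    (cores for CPU values, bytes for memory values)."""
    text = text.strip()
    # Check two-character suffixes first so that 'Mi' is not read as 'i'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[:-len(suffix)]) * _SUFFIXES[suffix]
    # No suffix: a plain or exponent number such as '128974848' or '129e6'.
    return Decimal(text)


def request_fits_limit(request: str, limit: str) -> bool:
    """A request (minimum) must not exceed the corresponding limit (maximum)."""
    return parse_quantity(request) <= parse_quantity(limit)


if __name__ == "__main__":
    for value in ["0.5", "500m", "128974848", "129e6", "129M",
                  "128974848000m", "123Mi"]:
        print(f"{value:>15} -> {parse_quantity(value)}")
    print(request_fits_limit("250m", "1"))
```

    The memory examples listed above all describe roughly the same amount, about 129 MB, and 500m CPU is the same as 0.5 cores.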
    Executor Configurations

    Executor configurations are the runtime configurations. These configurations must be changed if you change the Pod configurations and when additional compute power is required.

    • Number of workers: Indicates the number of processes that are used by the distributed computing framework.

    • Number of threads per worker: Indicates the number of threads used by each worker to complete the job.

    • Worker Memory limit in GB: To avoid overutilization of memory, this parameter enforces an upper threshold on memory usage for a given worker. For example, if you have an 8 GB Pod and 4 threads, set this parameter to 2 GB.
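    The arithmetic behind the 8 GB example can be written as a small Python helper. The function name worker_memory_limit_gb is hypothetical, and the rule of dividing Pod memory evenly by the thread count is inferred from that single example, so treat it as a starting point rather than a definitive sizing formula.

```python
def worker_memory_limit_gb(pod_memory_gb: float, threads: int) -> float:
    """Split the Pod memory evenly across threads (assumed rule,
    derived from the 8 GB / 4 threads = 2 GB example)."""
    if threads <= 0:
        raise ValueError("threads must be a positive integer")
    return pod_memory_gb / threads


if __name__ == "__main__":
    # 8 GB Pod with 4 threads, as in the example in the text.
    print(worker_memory_limit_gb(8, 4))
```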