Overview

This processor deduplicates records either within a single record set (one FlowFile) or across multiple files, up to an entire data lake, using a DistributedMapCacheClient controller service. In the single-file case, it uses either a HashSet or a Bloom filter to provide extremely fast in-memory lookups with a high degree of accuracy. In the multi-file case, it uses the controller service to compare a generated hash against a map cache stored in one of the caching options that Apache NiFi supports.
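
The core mechanic behind both strategies can be sketched in a few lines of plain Java. This is an illustration of the technique only, not NiFi's actual implementation (it assumes Java 17 for HexFormat): a hash is derived from each record's identifying value, and a set, standing in for the Bloom filter or map cache, decides whether that value has already been seen.

        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.util.HashSet;
        import java.util.HexFormat;
        import java.util.List;
        import java.util.Set;

        public class DedupSketch {
            public static void main(String[] args) throws Exception {
                List<String> records = List.of("John,Smith", "Jane,Doe", "John,Smith");
                // Stands in for the HashSet/Bloom filter (single file) or map cache (multiple files)
                Set<String> seen = new HashSet<>();
                MessageDigest digest = MessageDigest.getInstance("SHA-256");

                for (String record : records) {
                    // Hash the record's identifying value and check it against the "cache"
                    byte[] hash = digest.digest(record.getBytes(StandardCharsets.UTF_8));
                    String key = HexFormat.of().formatHex(hash);
                    System.out.println((seen.add(key) ? "unique:    " : "duplicate: ") + record);
                }
            }
        }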

Configuring single file deduplication

Choose the "Single File" option under the configuration property labeled "Deduplication Strategy," then choose whether to use a HashSet or a Bloom filter. Be mindful to set size limits that are in line with the number of records in the record sets you typically process.
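
As an example, a single-file configuration using a Bloom filter might look like the following. The property names reflect the processor's configuration dialog in recent NiFi releases, but treat them as illustrative and confirm them against your version's documentation:

        Deduplication Strategy : Single File
        Filter Type            : BloomFilter
        Filter Capacity Hint   : 25000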

Configuring multi-file deduplication

Select the "Multiple Files" option under "Deduplication Strategy" and then configure a DistributedMapCacheClient service. It is possible to configure a cache identifier in multiple ways:

  1. Generate a hash of the entire record by specifying no dynamic properties.
  2. Generate a hash using dynamic properties to specify particular fields to use.
  3. Manually specify a single RecordPath statement in the cache identifier property (see the sketch after this list).
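
For the third approach, the cache identifier property holds a single RecordPath statement. For instance, assuming each record carries a unique purchaseId field (an illustrative name), the configuration might be:

        Cache Identifier : /purchaseId

The first approach needs no additional configuration, and the second is described in the next section.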

The role of dynamic properties

Dynamic properties should have a human-readable name for the property name and a RecordPath operation for the value. The RecordPath operations are used to extract values from the record, which are then assembled into a unique identifier. Consider the following example.
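
Suppose two dynamic properties are defined as follows (the property names are arbitrary, human-readable labels; the values are RecordPath expressions):

        first_name => /firstName
        last_name  => /lastName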

Record:

        {
            "firstName": "John",
            "lastName": "Smith"
        }
    

This record will yield an identifier containing "John" and "Smith" before a hash is generated from that final value.

If any RecordPath fails to match a record (for example, the referenced field is missing), an exception will be raised and the FlowFile will be routed to the failure relationship.
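
For instance, given the dynamic properties above, a record missing the lastName field would cause the /lastName RecordPath to fail, routing the FlowFile to failure:

        {
            "firstName": "John"
        }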