JSON file

The JSON file is an optional component. Cloudera Lakehouse Optimizer uses the threshold values in the file during the evaluation phase to compare the current and expected statistics. Each action argument is supported by a corresponding Spark action.

You can use the following list of action arguments in the JSON file as required for your use case:
{
  "expireSnapshot": {
    "expireOlderThan": 432000000,
    "retainLast": 5,
    "cleanExpiredFiles": true
  },
  "rewriteManifest" : {
    "useCaching": true,
    "fileCountMax": 100,
    "manifestFileSize": 8388608,
    "smallFileRatioMax": 0.5
  },
  "rewriteDataFiles" : {
    "targetFileSize": 536870912,
    "maxConcurrentRewriteFileGroups": 5,
    "minInputFiles": 5,
    "partialProgressEnabled": true,
    "partialProgressMaxCommits": 10,
    "deleteFileThreshold": 2000000,
    "useStartingSequenceNumber": false,
    "rewriteAll": false
  },
  "rewritePositionDelete" : {
    "enabled" : false,
    "targetFileSize": 67108864,
    "maxConcurrentGroupRewrite": 5,
    "minInputFiles": 6,
    "partialProgressMaxCommits": 10,
    "partialProgressEnabled": true
  },
  "deleteOrphanFiles" : {
    "olderThan": 259200000
  },
  "description": "An example policy constant",
  "cron": "0 4 * ? * *"
}
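The time thresholds in the example are in milliseconds and the file sizes are in bytes, which makes them hard to read at a glance. The following Python sketch parses a policy document and converts a few of those values to days and MiB. The helper names (`load_policy`, `ms_to_days`, `bytes_to_mib`) and the inline excerpt are illustrative assumptions; they are not part of Cloudera Lakehouse Optimizer.

```python
import json

# Hypothetical excerpt of a policy file; the values are copied from the example.
POLICY_TEXT = """
{
  "expireSnapshot": {"expireOlderThan": 432000000, "retainLast": 5},
  "rewriteDataFiles": {"targetFileSize": 536870912},
  "rewritePositionDelete": {"targetFileSize": 67108864},
  "deleteOrphanFiles": {"olderThan": 259200000}
}
"""

MS_PER_DAY = 24 * 3600 * 1000
BYTES_PER_MIB = 1024 * 1024


def ms_to_days(ms: int) -> float:
    """Convert a millisecond threshold to days."""
    return ms / MS_PER_DAY


def bytes_to_mib(size: int) -> float:
    """Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes)."""
    return size / BYTES_PER_MIB


def load_policy(text: str) -> dict:
    """Parse policy JSON and report the position of any syntax error."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"invalid policy JSON at line {err.lineno}, column {err.colno}: {err.msg}"
        ) from err


policy = load_policy(POLICY_TEXT)
print(ms_to_days(policy["expireSnapshot"]["expireOlderThan"]))     # 5.0
print(ms_to_days(policy["deleteOrphanFiles"]["olderThan"]))        # 3.0
print(bytes_to_mib(policy["rewriteDataFiles"]["targetFileSize"]))  # 512.0
```

Parsing the file before you submit it also catches syntax slips such as a missing comma between two arguments, which `json.loads` reports with a line and column number.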
The following table lists the action arguments that you can specify in the JSON file, which are then sent to the Spark engine to use during the maintenance actions:
Action argument Value Description
expireSnapshot
enabled Default is true Determines whether to evaluate the actions and generate the action arguments.
cleanExpiredFiles Default is true Removes the expired snapshots permanently.
expireOlderThan Default is 120 * 3600 * 1000 ms, that is 5 days.

Minimum is 10 seconds

Deletes the snapshot when the snapshot is older than the set time.

For example, a snapshot is deleted after 5 days by default.

retainLast Default is 5

Minimum is 1

Deletes the oldest snapshot when the number of snapshots exceeds the set value.

For example, by default the first snapshot gets deleted automatically after the sixth snapshot is created.

expireSnapshotId No default value Expires the specified snapshot.
rewriteManifest
enabled Default is true Determines whether to evaluate the actions and generate the action arguments.
useCaching Default is true Uses caching during the rewrite manifest operation.
targetFileSize Default is 8388608 bytes Specifies the target manifest file size in bytes.
rewriteDataFiles
enabled Default is true Determines whether to evaluate the actions and generate the action arguments.
targetFileSize Default is 512 MB

Minimum is 1 KB

Maximum is 64 GB

Determines the target output file size after compaction.
maxConcurrentRewriteFileGroups Default is 5

Minimum is 1

Maximum is 1000

Defines the maximum number of file groups to be simultaneously rewritten.
minInputFiles Default is 5

Minimum is 1

Rewrites a file group when the file group exceeds the specified number of files, regardless of other criteria. In effect, this is the number of small files tolerated per partition.
partialProgressMaxCommits Default is 10

Minimum is 1

Defines the maximum number of commits that the rewrite action is allowed to produce when partial progress is enabled.
deleteFileThreshold Default is 2000000

Minimum is 1

Defines the minimum number of deletes that must be associated with a data file for it to be considered for the rewriting action.
partialProgressEnabled Default is false Enables committing groups of files before the rewrite operation completes. This ensures that the changes are committed and snapshots are created even while the rewrite operation is in progress. If a table is not updated frequently, retain the value as false.
useStartingSequenceNumber Default is false Uses the sequence number of the snapshot at the start of the compaction operation instead of the sequence number of the newly produced snapshot.
rewriteAll Default is false Forces a rewrite of all the files, overriding other options. Ensures full compaction of the tables.
deleteOrphanFiles
enabled Default is true Determines whether to evaluate the actions and generate the action arguments.
olderThan Default is 72 * 3600 * 1000 ms, that is 3 days.

Minimum is 10 seconds

Removes orphan files created before the specified time.
rewritePositionDelete
enabled Default is true Determines whether to evaluate the actions and generate the action arguments.
targetFileSize Default is 64 MB

Minimum is 1 KB

Determines the target output file size after the rewrite positional delete operation.
maxConcurrentGroupRewrite Default is 5

Minimum is 1

Maximum is 1000

Defines the maximum number of file groups to be simultaneously rewritten.
minInputFiles Default is 5

Minimum is 1

Rewrites a file group when the file group exceeds the specified number of files regardless of other criteria.
partialProgressMaxCommits Default is 10 Defines the maximum number of commits that are allowed during the rewrite operation. This ensures that the changes are committed and snapshots are created even while the rewrite operation is in progress.
partialProgressEnabled Default is true Enables committing groups of files before the rewrite operation completes. For more information, see Partial Progress Enabled.
rewrite-job-order No default value Forces the rewrite job order based on the chosen value.
You can choose one of the following values:
  • bytes (asc) rewrites the smallest job groups first.
  • bytes (desc) rewrites the largest job groups first.
  • files (asc) rewrites the job groups with the least number of files first.
  • files (desc) rewrites the job groups with the most files first.
  • none rewrites the job groups in the order they were planned.
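The minimum and maximum values listed in the table can be checked before a policy is submitted. The following Python sketch is one way to do that, under a few assumptions: the `BOUNDS` mapping and the `validate_policy` function are hypothetical names, 1 KB is taken as 1024 bytes and 64 GB as 64 * 1024^3 bytes (consistent with 512 MB being written as 536870912 in the example), and the 10-second minimums are expressed as 10000 ms. It does not reproduce the validation that Cloudera Lakehouse Optimizer itself performs.

```python
from typing import Dict, List, Optional, Tuple

KIB = 1024
GIB = 1024 ** 3

# (action, argument) -> (minimum, maximum); None means no documented bound.
BOUNDS: Dict[Tuple[str, str], Tuple[Optional[int], Optional[int]]] = {
    ("expireSnapshot", "expireOlderThan"): (10_000, None),  # 10 seconds in ms
    ("expireSnapshot", "retainLast"): (1, None),
    ("rewriteDataFiles", "targetFileSize"): (KIB, 64 * GIB),
    ("rewriteDataFiles", "maxConcurrentRewriteFileGroups"): (1, 1000),
    ("rewriteDataFiles", "minInputFiles"): (1, None),
    ("rewriteDataFiles", "partialProgressMaxCommits"): (1, None),
    ("rewriteDataFiles", "deleteFileThreshold"): (1, None),
    ("deleteOrphanFiles", "olderThan"): (10_000, None),  # 10 seconds in ms
    ("rewritePositionDelete", "targetFileSize"): (KIB, None),
    ("rewritePositionDelete", "maxConcurrentGroupRewrite"): (1, 1000),
    ("rewritePositionDelete", "minInputFiles"): (1, None),
}


def validate_policy(policy: dict) -> List[str]:
    """Return a message for every argument outside its documented range."""
    problems = []
    for (action, arg), (low, high) in BOUNDS.items():
        value = policy.get(action, {}).get(arg)
        if value is None:
            continue  # argument omitted, so the default applies
        if low is not None and value < low:
            problems.append(f"{action}.{arg} = {value} is below the minimum {low}")
        if high is not None and value > high:
            problems.append(f"{action}.{arg} = {value} is above the maximum {high}")
    return problems


# Example: retainLast below its minimum and too many concurrent file groups.
sample = {
    "expireSnapshot": {"retainLast": 0, "expireOlderThan": 432000000},
    "rewriteDataFiles": {"maxConcurrentRewriteFileGroups": 2000},
}
for message in validate_policy(sample):
    print(message)
```

Arguments that are left out of the policy are skipped, on the assumption that the documented default is used in that case.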