JSON file
The JSON file is an optional component. Cloudera Lakehouse Optimizer uses the threshold values in the file during the evaluation phase to compare the current statistics with the expected statistics. Each action argument is supported by a corresponding Spark action.
{
"expireSnapshot": {
"expireOlderThan": 432000000,
"retainLast": 5,
"cleanExpiredFiles": true
},
"rewriteManifest" : {
"useCaching": true,
"fileCountMax": 100,
"manifestFileSize": 8388608,
"smallFileRatioMax": 0.5
},
"rewriteDataFiles" : {
"targetFileSize": 536870912,
"maxConcurrentRewriteFileGroups": 5,
"minInputFiles": 5,
"partialProgressEnabled": true,
"partialProgressMaxCommits": 10,
"deleteFileThreshold": 2000000,
"useStartingSequenceNumber": false,
"rewriteAll": false
},
"rewritePositionDelete" : {
"enabled" : false,
"targetFileSize": 67108864,
"maxConcurrentGroupRewrite": 5,
"minInputFiles": 6,
"partialProgressMaxCommits": 10,
"partialProgressEnabled": true
},
"deleteOrphanFiles" : {
"olderThan": 259200000
},
"description": "An example policy constant",
"cron": "0 4 * ? * *"
}

| Action argument | Value | Description |
|---|---|---|
| expireSnapshot | | |
| enabled | Default is true. | Determines whether to evaluate the actions and generate the action arguments. |
| cleanExpiredFiles | Default is true. | Removes the expired snapshots permanently. |
| expireOlderThan | Default is 120 * 3600 * 1000 ms, that is, 5 days. Minimum is 10 seconds. | Deletes a snapshot when the snapshot is older than the set time. For example, a snapshot is deleted after 5 days by default. |
| retainLast | Default is 5. Minimum is 1. | Deletes the oldest snapshot when the number of snapshots exceeds the set value. For example, by default the first snapshot is deleted automatically after the sixth snapshot is created. |
| expireSnapshotId | No default value. | Expires the specified snapshot. |
| rewriteManifest | | |
| enabled | Default is true. | Determines whether to evaluate the actions and generate the action arguments. |
| useCaching | Default is true. | Uses caching during the rewrite manifest operation. |
| targetFileSize | Default is 8388608 bytes. | Specifies the target manifest file size in bytes. |
| rewriteDataFiles | | |
| enabled | Default is true. | Determines whether to evaluate the actions and generate the action arguments. |
| targetFileSize | Default is 512 MB. Minimum is 1 KB. Maximum is 64 GB. | Determines the target output file size after compaction. |
| maxConcurrentRewriteFileGroups | Default is 5. Minimum is 1. Maximum is 1000. | Defines the maximum number of file groups that can be rewritten simultaneously. |
| minInputFiles | Default is 5. Minimum is 1. | Rewrites a file group when the file group exceeds the specified number of files, regardless of other criteria. For example, the number of small files tolerated per partition. |
| partialProgressMaxCommits | Default is 10. Minimum is 1. | Defines the maximum number of commits that the rewrite action is allowed to make when partial progress is enabled. |
| deleteFileThreshold | Default is 2000000. Minimum is 1. | Defines the minimum number of deletes that must be associated with a data file for it to be considered for rewriting. |
| partialProgressEnabled | Default is false. | Enables committing groups of files before the rewrite operation completes, so that changes are committed and snapshots are created while the rewrite operation is still in progress. If a table is not updated frequently, retain the value as false. |
| use-starting-sequence-number | Default is false. | Uses the sequence number of the snapshot at the start of the compaction operation instead of that of the newly produced snapshot. |
| rewrite-all | Default is false. | Forces a rewrite of all the files, overriding other options. Ensures full compaction of the tables. |
| deleteOrphanFiles | | |
| enabled | Default is true. | Determines whether to evaluate the actions and generate the action arguments. |
| olderThan (in ms) | Default is 72 * 3600 * 1000 ms, that is, 3 days. Minimum is 10 seconds. | Removes orphan files created before the specified time. |
| rewritePositionDelete | | |
| enabled | Default is true. | Determines whether to evaluate the actions and generate the action arguments. |
| targetFileSize | Default is 64 MB. Minimum is 1 KB. | Determines the target output file size after the rewrite positional delete operation. |
| maxConcurrentGroupRewrite | Default is 5. Minimum is 1. Maximum is 1000. | Defines the maximum number of file groups that can be rewritten simultaneously. |
| minInputFiles | Default is 5. Minimum is 1. | Rewrites a file group when the file group exceeds the specified number of files, regardless of other criteria. |
| partial-progress.max-commits | Default is 10. | Defines the maximum number of commits that are allowed during the rewrite operation. This ensures that the changes are committed and snapshots are created even while the rewrite operation is in progress. |
| partialProgressEnabled | Default is true. | Enables committing groups of files before the rewrite operation completes. For more information, see Partial Progress Enabled. |
| rewrite-job-order | No default value. | Forces the rewrite job order based on the chosen value. You can choose one of the following values: |

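The minimum and maximum values listed in the table can be checked before a policy file is used. The following Python sketch is illustrative only and is not part of Cloudera Lakehouse Optimizer: the names `LIMITS` and `check_policy` are hypothetical, and the sketch assumes binary units (1 KB = 1024 bytes), which matches the sample values 536870912 (512 MB) and 67108864 (64 MB). It covers only the numeric arguments for which the table states a bound.

```python
import json

# Units used by the policy file: durations in milliseconds, sizes in bytes.
MS_PER_HOUR = 3600 * 1000
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3  # assumption: binary units

# (action, argument) -> (minimum, maximum); None means no documented maximum.
# Bounds are copied from the table; LIMITS is a hypothetical helper structure.
LIMITS = {
    ("expireSnapshot", "expireOlderThan"): (10 * 1000, None),
    ("expireSnapshot", "retainLast"): (1, None),
    ("rewriteDataFiles", "targetFileSize"): (1 * KB, 64 * GB),
    ("rewriteDataFiles", "maxConcurrentRewriteFileGroups"): (1, 1000),
    ("rewriteDataFiles", "minInputFiles"): (1, None),
    ("rewriteDataFiles", "partialProgressMaxCommits"): (1, None),
    ("rewriteDataFiles", "deleteFileThreshold"): (1, None),
    ("rewritePositionDelete", "targetFileSize"): (1 * KB, None),
    ("rewritePositionDelete", "maxConcurrentGroupRewrite"): (1, 1000),
    ("rewritePositionDelete", "minInputFiles"): (1, None),
    ("deleteOrphanFiles", "olderThan"): (10 * 1000, None),
}


def check_policy(policy: dict) -> list:
    """Return a list of messages for values outside the documented bounds."""
    problems = []
    for (action, arg), (low, high) in LIMITS.items():
        value = policy.get(action, {}).get(arg)
        if value is None:
            continue  # argument omitted, so the documented default applies
        if value < low:
            problems.append(f"{action}.{arg}={value} is below the minimum {low}")
        if high is not None and value > high:
            problems.append(f"{action}.{arg}={value} is above the maximum {high}")
    return problems


# A fragment of the sample policy, with the defaults written as arithmetic.
sample = json.loads("""
{
  "expireSnapshot": {"expireOlderThan": 432000000, "retainLast": 5},
  "rewriteDataFiles": {"targetFileSize": 536870912, "minInputFiles": 5},
  "deleteOrphanFiles": {"olderThan": 259200000}
}
""")

if __name__ == "__main__":
    for message in check_policy(sample):
        print(message)
```

Writing the defaults as arithmetic makes the units easier to audit: `120 * MS_PER_HOUR` is the 5-day `expireOlderThan` default, and `72 * MS_PER_HOUR` is the 3-day `olderThan` default.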