CalculateParquetRowGroupOffsets

Description:

The processor generates one FlowFile for each row group of the input Parquet file and adds attributes containing the offsets required to read that row group from the FlowFile's content. It can be used to increase the overall efficiency of processing extremely large Parquet files.

Tags:

parquet, split, partition, break apart, efficient processing, load balance, cluster

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values.

Display Name	API Name	Default Value	Allowable Values	Description
Zero Content Output	Zero Content Output	false	true false	Whether to emit the output FlowFiles without content. If true, the content of the input FlowFile is not copied to the output FlowFiles; if false, it is copied.

Relationships:

Name	Description
success	FlowFiles with attributes that each identify one chunk (row group) of the input file.

Reads Attributes:

None specified.

Writes Attributes:

Name	Description
parquet.file.range.startOffset	Sets the start offset of the selected row group in the Parquet file.
parquet.file.range.endOffset	Sets the end offset of the selected row group in the Parquet file.
record.count	Sets the number of records in the selected row group.
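
As an illustration of how the three attributes above relate to one another, here is a minimal Python sketch that builds one attribute map per row group. It is not the processor's implementation. The RowGroup class, its fields, and the row_group_attributes function are hypothetical names introduced for this example, and the sketch assumes that the end offset equals the start offset plus the row group's compressed size, which this documentation does not confirm.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RowGroup:
    """Hypothetical stand-in for the metadata of one Parquet row group."""
    start_offset: int     # byte position where the row group begins
    compressed_size: int  # number of compressed bytes in the row group
    num_rows: int         # number of records in the row group


def row_group_attributes(groups: List[RowGroup]) -> List[Dict[str, str]]:
    """Build one attribute map per row group (one per output FlowFile)."""
    result = []
    for group in groups:
        result.append({
            "parquet.file.range.startOffset": str(group.start_offset),
            # Assumption: end offset = start offset + compressed size.
            "parquet.file.range.endOffset": str(
                group.start_offset + group.compressed_size
            ),
            "record.count": str(group.num_rows),
        })
    return result


# Two made-up row groups, purely for illustration.
groups = [RowGroup(4, 1000, 500), RowGroup(1004, 800, 300)]
result = row_group_attributes(groups)
for attributes in result:
    print(attributes)
```

The values are written as strings because FlowFile attributes are string key/value pairs. A downstream reader would parse the two offsets back into integers to select the byte range belonging to its row group.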

State management:

This component does not store state.

Restricted:

This component is not restricted.

Input requirement:

This component requires an incoming relationship.

System Resource Considerations:

None specified.