PutIceberg

Description:

This processor uses the Iceberg API to parse records and load them into Iceberg tables. Incoming data is parsed with the configured Record Reader Controller Service and ingested into an Iceberg table using the configured catalog service and the provided table information. The target Iceberg table must already exist, and its schema must match the incoming records: the Record Reader schema must contain every field of the Iceberg schema, and any additional field that is not present in the Iceberg schema is ignored. To avoid the 'small file problem', it is recommended to place a MergeRecord processor before this processor.
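
The schema rule in the description can be illustrated with a minimal sketch. It is not the processor's implementation: the function name check_record_schema and the sample field names are hypothetical, and the comparison is by field name only (types, nesting, and case sensitivity are not modeled).

```python
def check_record_schema(record_fields, table_fields):
    """Return (written, ignored) field names for a name-only schema check.

    Raises ValueError when the record schema lacks a field that the
    table schema contains.
    """
    record_set = set(record_fields)
    missing = [name for name in table_fields if name not in record_set]
    if missing:
        raise ValueError(f"record schema is missing table fields: {missing}")

    table_set = set(table_fields)
    # Fields known only to the reader are dropped rather than rejected.
    ignored = [name for name in record_fields if name not in table_set]
    written = list(table_fields)
    return written, ignored


# Hypothetical example: the reader supplies one extra field.
written, ignored = check_record_schema(
    record_fields=["id", "name", "debug_note"],
    table_fields=["id", "name"],
)
print(written)  # ['id', 'name']
print(ignored)  # ['debug_note']
```

In this sketch an extra reader field is tolerated, while a missing table field is an error, which mirrors the asymmetry the description states.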


Tags:

iceberg, put, table, store, record, parse, orc, parquet, avro

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Record Reader
  API Name: record-reader
  Controller Service API: RecordReaderFactory
  Implementations: Syslog5424Reader, CEFReader, ReaderLookup, CiscoEmblemSyslogMessageReader, CSVReader, GrokReader, SyslogReader, JsonTreeReader, JsonPathReader, XMLReader, AvroReader, JASN1Reader, ExcelReader, ParquetReader, EBCDICRecordReader, WindowsEventLogReader, IPFIXReader, ScriptedReader
  Description: Specifies the Controller Service to use for parsing incoming data and determining the data's schema.

Catalog Service
  API Name: catalog-service
  Controller Service API: IcebergCatalogService
  Implementations: HiveCatalogService, HadoopCatalogService
  Description: Specifies the Controller Service to use for handling references to the table’s metadata files.

Catalog Namespace
  API Name: catalog-namespace
  Description: The namespace of the catalog.
  Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)

Table Name
  API Name: table-name
  Description: The name of the Iceberg table to write to.
  Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)

File Format
  API Name: file-format
  Allowable Values: AVRO, PARQUET, ORC
  Description: File format to use when writing Iceberg data files. If not set, the 'write.format.default' table property is used; its default value is parquet.

Maximum File Size
  API Name: maximum-file-size
  Description: The maximum size of a data file. When this size is exceeded, a new file is created for the remaining data. If not set, the 'write.target-file-size-bytes' table property is used; its default value is 512 MB.
  Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)

Kerberos User Service
  API Name: kerberos-user-service
  Controller Service API: KerberosUserService
  Implementations: KerberosKeytabUserService, KerberosPasswordUserService, KerberosTicketCacheUserService
  Description: Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos.

Number of Commit Retries
  API Name: number-of-commit-retries
  Default Value: 10
  Description: Number of times to retry a commit before failing.
  Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)

Minimum Commit Wait Time
  API Name: minimum-commit-wait-time
  Default Value: 100 ms
  Description: Minimum time to wait before retrying a commit.
  Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)

Maximum Commit Wait Time
  API Name: maximum-commit-wait-time
  Default Value: 2 sec
  Description: Maximum time to wait before retrying a commit.
  Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)

Maximum Commit Duration
  API Name: maximum-commit-duration
  Default Value: 30 sec
  Description: Total retry timeout period for a commit.
  Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
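
The four commit properties (Number of Commit Retries, Minimum Commit Wait Time, Maximum Commit Wait Time, Maximum Commit Duration) bound a retry loop. The sketch below shows one way such bounds could interact. It assumes an exponential backoff that doubles the wait on each retry, which the property descriptions do not state; the function commit_with_retries and its parameters are hypothetical, and the defaults simply mirror the documented default values in seconds.

```python
import time


def commit_with_retries(commit, num_retries=10, min_wait=0.1, max_wait=2.0,
                        max_duration=30.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Call commit() until it succeeds or a retry bound is reached.

    The last exception is re-raised once the retry count or the total
    time budget is exhausted.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return commit()
        except Exception:
            attempt += 1
            elapsed = clock() - start
            if attempt > num_retries or elapsed >= max_duration:
                raise
            # Assumed policy: double the wait on each retry, starting at
            # min_wait and never exceeding max_wait.
            wait = min(max_wait, min_wait * 2 ** (attempt - 1))
            sleep(wait)
```

Under these assumptions the default values give waits of 0.1 s, 0.2 s, 0.4 s, 0.8 s, 1.6 s, and then 2 s for each later retry. The sleep and clock parameters exist only so the loop can be exercised without real delays.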

Relationships:

success
  Description: A FlowFile is routed to this relationship after the data has been ingested successfully.

failure
  Description: A FlowFile is routed to this relationship if the operation failed and retrying it would also fail, for example because of invalid data or an invalid schema.

Reads Attributes:

None specified.

Writes Attributes:

iceberg.record.count
  Description: The number of records in the FlowFile.

State management:

This component does not store state.

Restricted:

This component is not restricted.

System Resource Considerations:

None specified.