UpdateClouderaHiveTable

Description:

This processor uses a Hive JDBC connection and incoming records to generate any Hive 3.0+ table changes needed to support the incoming records.

Tags:

hive, metadata, jdbc, database, table

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name | API Name | Default Value | Allowable Values | Description
Display Name: Record Reader
API Name: record-reader
Controller Service API: RecordReaderFactory
Implementations: WindowsEventLogReader, JASN1Reader, EBCDICRecordReader, YamlTreeReader, CiscoEmblemSyslogMessageReader, ReaderLookup, AvroReader, SyslogReader, CSVReader, GrokReader, IPFIXReader, ParquetReader, JsonTreeReader, ExcelReader, ScriptedReader, JsonPathReader, XMLReader, Syslog5424Reader, CEFReader
Description: The service for reading incoming flow files. The reader is used only to determine the schema of the records; the records themselves are not processed.
Display Name: Hive Database Connection Pooling Service
API Name: hive3-dbcp-service
Controller Service API: ClouderaHiveDBCPService
Implementation: ClouderaHiveConnectionPool
Description: The Hive Controller Service that is used to obtain connection(s) to the Hive database.
Display Name: Table Name
API Name: hive3-table-name
Description: The name of the database table to update. If the table does not exist, it is either created or an error is thrown, depending on the value of the Create Table Strategy property.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Display Name: Partition Clause
API Name: hive3-partition-clause
Description: A comma-separated list of attribute names, each with an optional data type, corresponding to the partition columns of the target table. If the table is partitioned, or is to be created with partitions, each partition column name must be present as an attribute on the FlowFile and listed in this property. This assumes that all incoming records belong to the same partition and that the partition columns are not fields in the record. For example, if PartitionRecord is upstream and two partition columns 'name' (of type string) and 'age' (of type integer) are used, this property can be set to 'name string, age int'. The data types are optional; when partitions are to be created, a column with no specified type defaults to string. For non-string primitive types, specifying the data type for existing partition columns is helpful for interpreting the partition value(s). If the table exists, the data types need not be specified (and are ignored in that case). This property must be set if the table is partitioned, and there must be an attribute for each partition column in the table. The attribute values are used as the partition values, and the resulting output.path attribute value reflects the location of the partition in the filesystem (for use downstream in processors such as PutHDFS).
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
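
The clause format described above can be illustrated with a minimal Python sketch. It is not the processor's implementation: the function names are hypothetical, only simple primitive types are handled (a type containing a comma, such as a decimal with precision and scale, would be split incorrectly), and the 'column=value' directory layout is an assumption based on Hive's usual partition naming, not something this page specifies.

```python
def parse_partition_clause(clause):
    """Split 'name string, age int' into (column, type) pairs.

    A column listed without a type defaults to 'string', as described above.
    """
    columns = []
    for entry in clause.split(","):
        parts = entry.split()
        if not parts:
            continue  # skip empty entries such as a trailing comma
        name = parts[0]
        data_type = parts[1].lower() if len(parts) > 1 else "string"
        columns.append((name, data_type))
    return columns


def partition_subpath(columns, attributes):
    """Build a 'col=value/col=value' path from FlowFile attributes.

    Assumption: the usual Hive directory layout for partitions.
    """
    segments = []
    for name, _ in columns:
        if name not in attributes:
            # every partition column needs a matching FlowFile attribute
            raise KeyError(f"missing attribute for partition column '{name}'")
        segments.append(f"{name}={attributes[name]}")
    return "/".join(segments)
```

For instance, parsing 'name string, age int' and applying the attributes name=alice and age=30 yields the sub-path 'name=alice/age=30' under the table location.
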
Display Name: Create Table Strategy
API Name: hive3-create-table
Default Value: Fail If Not Exists
Allowable Values:
  • Create If Not Exists: Create a table with the given schema if it does not already exist
  • Fail If Not Exists: If the target does not already exist, log an error and route the flowfile to failure
Description: Specifies how to process the target table when it does not exist (for example, create it or fail).
Display Name: Create Table Management Strategy
API Name: hive3-create-table-management
Default Value: Managed
Allowable Values:
  • Managed: Any tables created by this processor will be managed tables (see Hive documentation for details).
  • External: Any tables created by this processor will be external tables located at the 'External Table Location' property value.
  • Use 'hive.table.management.strategy' Attribute: Inspects the 'hive.table.management.strategy' FlowFile attribute to determine the table management strategy. The value of this attribute must be a case-insensitive match to one of the other allowable values (Managed or External).
Description: Specifies (when a table is to be created) whether the table is a managed table or an external table. When External is selected, the 'External Table Location' property must be specified. When Use 'hive.table.management.strategy' Attribute is selected, 'External Table Location' must still be specified, but it can contain Expression Language or be set to the empty string, and it is ignored when the attribute evaluates to 'Managed'. If 'Iceberg' is set as the 'Create Table Storage Handler', the table will be in the 'external' area of the Hive warehouse, but no 'External Table Location' needs to be specified.

This Property is only considered if the [Create Table Strategy] Property has a value of "Create If Not Exists".
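
The attribute-driven option can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and raising an error for a missing or unrecognized attribute value is an assumption, since this page does not say how the processor handles that case.

```python
ATTRIBUTE_NAME = "hive.table.management.strategy"
FIXED_STRATEGIES = ("Managed", "External")


def resolve_management_strategy(configured, attributes):
    """Return 'Managed' or 'External' for a FlowFile.

    configured: the property value; any value other than 'Managed' or
    'External' is treated here as the attribute-driven option.
    attributes: the FlowFile attributes as a dict.
    """
    if configured in FIXED_STRATEGIES:
        return configured
    value = attributes.get(ATTRIBUTE_NAME)
    if value is None:
        # assumption: a missing attribute is treated as an error
        raise ValueError(f"attribute '{ATTRIBUTE_NAME}' is not set")
    for option in FIXED_STRATEGIES:
        if value.lower() == option.lower():  # case-insensitive match
            return option
    raise ValueError(f"unsupported table management strategy: {value!r}")
```

With this sketch, an attribute value of 'external' or 'EXTERNAL' resolves to External, while a value such as 'temporary' is rejected.
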
Display Name: External Table Location
API Name: hive3-external-table-location
Description: Specifies (when an external table is to be created) the file path (in HDFS, for example) at which to store table data.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)

This Property is only considered if the [Create Table Management Strategy] Property is set to one of the following values: [Use 'hive.table.management.strategy' Attribute], [External]
Display Name: Create Table Storage Format
API Name: hive3-storage-format
Default Value: ORC
Allowable Values:
  • TEXTFILE: Stored as plain text files. TEXTFILE is Hive's default file format, unless the configuration parameter hive.default.fileformat has a different setting.
  • SEQUENCEFILE: Stored as compressed Sequence Files.
  • ORC: Stored as ORC file format. Supports ACID Transactions and the Cost-based Optimizer (CBO). Stores column-level metadata.
  • PARQUET: Stored in the Parquet columnar storage format.
  • AVRO: Stored as Avro format.
  • RCFILE: Stored as Record Columnar File format.
Description: If a table is to be created, the specified storage format will be used.

This Property is only considered if the [Create Table Strategy] Property has a value of "Create If Not Exists".
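
To show how the table-creation properties fit together, here is a rough Python sketch that assembles a Hive CREATE TABLE statement from a table name, columns, optional partition columns, a storage format, and an optional external location. The exact statement the processor issues is not documented on this page, so the function name, its parameters, and the generated text are assumptions based on standard Hive DDL; identifiers and the location are not quoted or escaped, and the Iceberg storage handler is not covered.

```python
def build_create_table(table, columns, partition_columns=(),
                       storage_format="ORC", external_location=None):
    """Assemble a Hive CREATE TABLE statement (illustrative only).

    columns and partition_columns are sequences of (name, type) pairs.
    A non-empty external_location produces an external table.
    """
    keyword = "CREATE EXTERNAL TABLE" if external_location else "CREATE TABLE"
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    clauses = [f"{keyword} IF NOT EXISTS {table} ({column_list})"]
    if partition_columns:
        partition_list = ", ".join(
            f"{name} {dtype}" for name, dtype in partition_columns)
        clauses.append(f"PARTITIONED BY ({partition_list})")
    clauses.append(f"STORED AS {storage_format}")
    if external_location:
        clauses.append(f"LOCATION '{external_location}'")
    return " ".join(clauses)


if __name__ == "__main__":
    # hypothetical table, columns, and path, for illustration only
    print(build_create_table(
        "users",
        [("id", "int"), ("email", "string")],
        partition_columns=[("name", "string"), ("age", "int")],
        external_location="/warehouse/external/users",
    ))
```

Keeping the statement assembly in one small function makes it easy to see which property contributes which clause: the management strategy selects the EXTERNAL keyword and LOCATION clause, the partition clause supplies PARTITIONED BY, and the storage format supplies STORED AS.
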
Display Name: Create Table Storage Handler
API Name: hive3-storage-handler
Default Value: Default
Allowable Values:
  • Default: Uses the default Hive table storage handler
  • Iceberg: Uses the Iceberg table storage handler. Use this when creating Iceberg-backed Hive tables.
Description: If a table is to be created, the specified storage handler will be used (Iceberg, for example).

This Property is only considered if the [Create Table Strategy] Property has a value of "Create If Not Exists".
Display Name: Update Field Names
API Name: hive3-update-field-names
Default Value: false
Allowable Values:
  • true
  • false
Description: This property indicates whether to update the output schema such that the field names are set to the exact column names from the specified table. This should be used if the incoming record field names may not match the table's column names in terms of upper- and lower-case. For example, this property should be set to true if the output FlowFile (and target table storage) is Avro format, as Hive/Impala expects the field names to match the column names exactly.
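
The case alignment described above can be pictured with a short Python sketch. It is a simplified illustration, not the processor's code: the function names are hypothetical, and leaving a field unchanged when no column matches is an assumption.

```python
def align_field_names(field_names, column_names):
    """Map each record field name to the table column that matches it
    when case is ignored; unmatched fields keep their original name."""
    by_lower = {column.lower(): column for column in column_names}
    return {field: by_lower.get(field.lower(), field) for field in field_names}


def needs_rewrite(mapping):
    """True if at least one field name would change, which is the case in
    which the records would have to be rewritten with the new names."""
    return any(old != new for old, new in mapping.items())
```

For example, fields 'ID' and 'Email' against columns 'id' and 'email' map to the lower-case column names, so a rewrite would be needed; if every field already matches exactly, nothing changes.
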
Display Name: Record Writer
API Name: hive3-record-writer
Controller Service API: RecordSetWriterFactory
Implementations: AvroRecordSetWriter, XMLRecordSetWriter, ParquetRecordSetWriter, JsonRecordSetWriter, RecordSetWriterLookup, CSVRecordSetWriter, ScriptedRecordSetWriter, FreeFormTextRecordSetWriter
Description: Specifies the Controller Service to use for writing results to a FlowFile. The Record Writer should use Inherit Schema to emulate the inferred schema behavior; that is, an explicit schema need not be defined in the writer, and will be supplied by the same logic used to infer the schema from the column types. If Create Table Strategy is set to 'Create If Not Exists', the Record Writer's output format must match the Record Reader's format in order for the data to be placed in the created table location. Note that this property is only used if 'Update Field Names' is set to true and the field names do not all match the column names exactly. If no update is needed for any field names (or 'Update Field Names' is false), the Record Writer is not used and the input FlowFile is instead routed to success or failure without modification.

This Property is only considered if the [Update Field Names] Property has a value of "true".
Display Name: Query Timeout
API Name: hive3-query-timeout
Default Value: 0
Description: Sets the number of seconds the driver will wait for a query to execute. A value of 0 means no timeout. NOTE: Non-zero values may not be supported by the driver.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)

Relationships:

Name | Description
success | A FlowFile containing records routed to this relationship after the record has been successfully transmitted to Hive.
failure | A FlowFile containing records routed to this relationship if the record could not be transmitted to Hive.

Reads Attributes:

Name | Description
hive.table.management.strategy | This attribute is read if the 'Create Table Management Strategy' property is configured to use the value of this attribute. The value of this attribute should correspond (ignoring case) to a valid option of the 'Create Table Management Strategy' property.

Writes Attributes:

Name | Description
output.table | This attribute is written on the flow files routed to the 'success' and 'failure' relationships, and contains the target table name.
output.path | This attribute is written on the flow files routed to the 'success' and 'failure' relationships, and contains the path on the file system to the table (or partition location if the table is partitioned).
mime.type | Sets the mime.type attribute to the MIME Type specified by the Record Writer, only if a Record Writer is specified and Update Field Names is 'true'.
record.count | Sets the number of records in the FlowFile, only if a Record Writer is specified and Update Field Names is 'true'.

State management:

This component does not store state.

Restricted:

This component is not restricted.

Input requirement:

This component requires an incoming relationship.

System Resource Considerations:

None specified.