PutHive3Streaming

Description:

This processor uses Hive Streaming to send flow file records to an Apache Hive 3.0+ table. If 'Static Partition Values' is not set, then the partition values are expected to be the 'last' fields of each record, so if the table is partitioned on column A for example, then the last field in each record should be field A. If 'Static Partition Values' is set, those values will be used as the partition values, and any record fields corresponding to partition columns will be ignored.
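
The partition rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the processor's implementation: the helper name `split_record` and the check that the trailing field names match the partition columns are assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def split_record(
    record: Dict[str, object],
    partition_columns: Sequence[str],
    static_partition_values: Optional[Sequence[str]] = None,
) -> Tuple[List[object], List[object]]:
    """Return (data_values, partition_values) for one record.

    Hypothetical helper that mirrors the rule in the description.
    """
    fields = list(record.items())
    n = len(partition_columns)

    if static_partition_values is not None:
        if len(static_partition_values) != n:
            raise ValueError("one static value is required per partition column")
        # Static values are used; record fields for partition columns are ignored.
        data = [value for name, value in fields if name not in partition_columns]
        return data, list(static_partition_values)

    # No static values: the last n fields are expected to be the partition columns.
    split = len(fields) - n
    tail_names = [name for name, _ in fields[split:]] if split >= 0 else []
    if tail_names != list(partition_columns):
        raise ValueError(
            f"expected trailing fields {list(partition_columns)}, got {tail_names}"
        )
    return [v for _, v in fields[:split]], [v for _, v in fields[split:]]
```

For a table partitioned on column A, a record such as `{"id": 1, "msg": "hi", "A": "2024"}` yields the data values `[1, "hi"]` and the partition value `["2024"]`. Passing static values replaces the partition value and drops field A from the data.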

Tags:

hive, streaming, put, database, store

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name | API Name | Default Value | Allowable Values | Description
Record Reader | record-reader | | Controller Service API: RecordReaderFactory. Implementations: JASN1Reader, JsonTreeReader, GrokReader, Syslog5424Reader, CiscoEmblemSyslogMessageReader, AvroReader, JsonPathReader, CEFReader, IPFIXReader, WindowsEventLogReader, XMLReader, ScriptedReader, ReaderLookup, YamlTreeReader, ParquetReader, CSVReader, EBCDICRecordReader, ExcelReader, SyslogReader | The service for reading records from incoming flow files.
Hive Metastore URI | hive3-stream-metastore-uri | | | The URI location(s) for the Hive metastore, given as a comma-separated list of Hive metastore URIs. Note that this is not the location of the Hive Server. The default port for the Hive metastore is 9083. If this field is not set, then the 'hive.metastore.uris' property from any provided configuration resources will be used, and if none are provided, then the default value from a default hive-site.xml will be used (usually thrift://localhost:9083).
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Hive Configuration Resources | hive3-config-resources | | | A file, or a comma-separated list of files, containing the Hive configuration (e.g., hive-site.xml). Without this, Hadoop will search the classpath for a 'hive-site.xml' file or will revert to a default configuration. Note that to enable authentication (with Kerberos, for example), the appropriate properties must be set in the configuration files. Also note that if Max Concurrent Tasks is set to a number greater than one, the 'hcatalog.hive.client.cache.disabled' property will be forced to 'true' to avoid concurrency issues. Please see the Hive documentation for more details.

This property expects a comma-separated list of file resources.

Supports Expression Language: true (will be evaluated using Environment variables only)
Database Name | hive3-stream-database-name | | | The name of the database in which to put the data.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Table Name | hive3-stream-table-name | | | The name of the database table in which to put the data.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Static Partition Values | hive3-stream-part-vals | | | Specifies a comma-separated list of the values for the partition columns of the target table. If the incoming records all have the same values for the partition columns, those values can be entered here, resulting in a performance gain. If specified, this property will often contain Expression Language. For example, if PartitionRecord is upstream and two partitions 'name' and 'age' are used, then this property can be set to ${name},${age}. If this property is set, the values will be used as the partition values, and any record fields corresponding to partition columns will be ignored. If this property is not set, then the partition values are expected to be the last fields of each record.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Records per Transaction | hive3-stream-records-per-transaction | 0 | | Number of records to process before committing the transaction. If set to zero (0), all records will be written in a single transaction.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Transactions per Batch | hive3-stream-transactions-per-batch | 1 | | A hint to Hive Streaming indicating how many transactions the processor task will need. The product of Records per Transaction (if not zero) and Transactions per Batch should be larger than the largest expected number of records in the flow file(s); this ensures that any failed transaction batch causes a full rollback.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Call Timeout | hive3-stream-call-timeout | 0 | | The number of seconds allowed for a Hive Streaming operation to complete. A value of 0 indicates the processor should wait indefinitely on operations. Note that although this property supports Expression Language, it will not be evaluated against incoming FlowFile attributes.
Supports Expression Language: true (will be evaluated using Environment variables only)
Disable Streaming Optimizations | hive3-stream-disable-optimizations | false | true, false | Whether to disable streaming optimizations. Disabling streaming optimizations will have a significant impact on performance and memory consumption.
Rollback On Failure | rollback-on-failure | false | true, false | Specifies how to handle errors. By default (false), if an error occurs while processing a FlowFile, the FlowFile is routed to the 'failure' or 'retry' relationship based on the error type, and the processor can continue with the next FlowFile. Alternatively, you may want to roll back the currently processed FlowFiles and stop further processing immediately; to do so, enable this 'Rollback On Failure' property. If enabled, failed FlowFiles stay in the input relationship without being penalized and are processed repeatedly until they are processed successfully or removed by other means. It is important to set an adequate 'Yield Duration' to avoid retrying too frequently.
NOTE: If an error occurs after a Hive streaming transaction derived from the same input FlowFile has already been committed (i.e., the FlowFile contains more records than 'Records per Transaction' and the failure occurs at the second transaction or later), then the succeeded records are transferred to the 'success' relationship while the original input FlowFile stays in the incoming queue. Duplicate records can be created for the succeeded ones when the same FlowFile is processed again.
Kerberos Credentials Service | kerberos-credentials-service | | Controller Service API: KerberosCredentialsService. Implementation: KeytabCredentialsService | Specifies the Kerberos Credentials Controller Service that should be used for authenticating with Kerberos.
Kerberos Principal | kerberos-principal | | | The principal to use when specifying the principal and password directly in the processor for authenticating via Kerberos.
Supports Expression Language: true (will be evaluated using Environment variables only)
Kerberos Password | kerberos-password | | | The password to use when specifying the principal and password directly in the processor for authenticating via Kerberos.
Sensitive Property: true
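
The sizing rule for Transactions per Batch is simple arithmetic, and a small sketch may make it concrete. The function name `min_transactions_per_batch` is hypothetical, and the handling of a zero Records per Transaction (returning the default of 1, since all records then go into a single transaction) is an assumption drawn from the property descriptions rather than documented behavior.

```python
def min_transactions_per_batch(max_records: int, records_per_transaction: int) -> int:
    """Smallest Transactions per Batch such that
    records_per_transaction * result > max_records.

    Hypothetical helper for choosing a property value; not part of NiFi.
    """
    if max_records < 0 or records_per_transaction < 0:
        raise ValueError("arguments must be non-negative")
    if records_per_transaction == 0:
        # Zero means all records are written in one transaction, so the
        # sizing rule does not apply; assume the default of 1 is enough.
        return 1
    # The product must be strictly larger than the record count.
    return max_records // records_per_transaction + 1
```

With Records per Transaction set to 100 and flow files of at most 1000 records, this suggests a Transactions per Batch of 11, because 100 × 10 = 1000 is not larger than 1000.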

Relationships:

Name | Description
retry | The incoming FlowFile is routed to this relationship if its records cannot be transmitted to Hive. Note that some records may have been processed successfully; those records will be routed (as Avro flow files) to the success relationship. The combination of the retry, success, and failure relationships indicates how many records succeeded and/or failed. This can be used to provide a retry capability since full rollback is not possible.
success | A FlowFile containing Avro records is routed to this relationship after the records have been successfully transmitted to Hive.
failure | A FlowFile containing Avro records is routed to this relationship if the records could not be transmitted to Hive.
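
To see why a retried FlowFile can produce duplicates, as described in the note on 'Rollback On Failure', consider this simplified model. It assumes transactions commit sequentially and that every transaction before the failing one held exactly 'Records per Transaction' records; the function name and the 1-based transaction index are hypothetical, and the real processor may behave differently in edge cases.

```python
from typing import Tuple


def committed_before_failure(
    total_records: int,
    records_per_transaction: int,
    failed_transaction: int,
) -> Tuple[int, int]:
    """Return (committed, not_committed) record counts for one FlowFile.

    failed_transaction is the 1-based index of the transaction that failed.
    Simplified model for illustration only.
    """
    if failed_transaction < 1:
        raise ValueError("failed_transaction is 1-based")
    if records_per_transaction == 0:
        # All records share one transaction, so nothing was committed earlier.
        return 0, total_records
    committed = min((failed_transaction - 1) * records_per_transaction, total_records)
    return committed, total_records - committed
```

For a FlowFile of 250 records with Records per Transaction set to 100, a failure in the second transaction leaves 100 records already committed. If the whole FlowFile is processed again, those 100 records are written a second time.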

Reads Attributes:

None specified.

Writes Attributes:

Name | Description
hivestreaming.record.count | This attribute is written on the flow files routed to the 'success' and 'failure' relationships, and contains the number of records from the incoming flow file. All records in a flow file are committed as a single transaction.
query.output.tables | This attribute is written on the flow files routed to the 'success' and 'failure' relationships, and contains the target table name in 'databaseName.tableName' format.

State management:

This component does not store state.

Restricted:

This component is not restricted.

System Resource Considerations:

None specified.