PutClouderaORC

Description:

Reads records from an incoming FlowFile using the provided Record Reader, and writes those records to an ORC file in the location/filesystem specified in the configuration.

Tags:

put, ORC, hadoop, HDFS, filesystem, restricted, record

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name | API Name | Default Value | Allowable Values | Description
Hadoop Configuration Resources | Hadoop Configuration Resources | A file or comma-separated list of files which contain the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see the 'Additional Details' section of PutHDFS's documentation.

This property expects a comma-separated list of file resources.

Supports Expression Language: true (will be evaluated using variable registry only)
Kerberos Credentials Service | kerberos-credentials-service | Controller Service API:
KerberosCredentialsService
Implementation: KeytabCredentialsService
Specifies the Kerberos Credentials Controller Service that should be used for authenticating with Kerberos.
Kerberos User Service | kerberos-user-service | Controller Service API:
KerberosUserService
Implementations: KerberosTicketCacheUserService
KerberosKeytabUserService
KerberosPasswordUserService
Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos.
Kerberos Principal | Kerberos Principal | Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties.
Supports Expression Language: true (will be evaluated using variable registry only)
Kerberos Keytab | Kerberos Keytab | Kerberos keytab associated with the principal. Requires nifi.kerberos.krb5.file to be set in your nifi.properties.

This property requires exactly one file to be provided.

Supports Expression Language: true (will be evaluated using variable registry only)
Kerberos Password | Kerberos Password | Kerberos password associated with the principal.
Sensitive Property: true
Kerberos Relogin Period | Kerberos Relogin Period | 4 hours | Period of time which should pass before attempting a Kerberos relogin. This property has been deprecated and has no effect on processing. Relogins now occur automatically.
Supports Expression Language: true (will be evaluated using variable registry only)
Additional Classpath Resources | Additional Classpath Resources | A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.

This property expects a comma-separated list of resources. Each of the resources may be of any of the following types: directory, file.
Record Reader | record-reader | Controller Service API:
RecordReaderFactory
Implementations: EBCDICRecordReader
JsonTreeReader
GrokReader
ReaderLookup
IPFIXReader
WindowsEventLogReader
ParquetReader
CSVReader
Syslog5424Reader
JASN1Reader
ExcelReader
CiscoEmblemSyslogMessageReader
ScriptedReader
ProtobufReader
JsonPathReader
XMLReader
CEFReader
SyslogReader
AvroReader
YamlTreeReader
The service for reading records from incoming flow files.
Directory | Directory | The parent directory to which files should be written. Will be created if it doesn't exist.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Compression Type | compression-type | NONE
  • NONE No compression
  • ZLIB ZLIB compression
  • SNAPPY Snappy compression
  • LZO LZO compression
The type of compression for the file being written.
Overwrite Files | overwrite | false
  • true
  • false
Whether or not to overwrite existing files in the same directory with the same name. When set to false, flow files will be routed to failure when a file exists in the same directory with the same name.
Permissions umask | permissions-umask | A umask represented as an octal number which determines the permissions of files written to HDFS. This overrides the Hadoop configuration property dfs.umaskmode.
Remote Group | remote-group | Changes the group of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change the group.
Remote Owner | remote-owner | Changes the owner of the HDFS file to this value after it is written. This only works if NiFi is running as a user that has HDFS super user privilege to change the owner.
ORC Configuration Resources | putorc-config-resources | A file or comma-separated list of files which contain the ORC configuration (e.g., hive-site.xml). Without this, Hadoop will search the classpath for a 'hive-site.xml' file or will revert to a default configuration. Please see the ORC documentation for more details.

This property expects a comma-separated list of file resources.
Stripe Size | putorc-stripe-size | 64 MB | The size of the memory buffer (in bytes) for writing stripes to an ORC file.
Buffer Size | putorc-buffer-size | 10 KB | The maximum size of the memory buffers (in bytes) used for compressing and storing a stripe in memory. This is a hint to the ORC writer, which may choose to use a smaller buffer size based on stripe size and number of columns for efficient stripe writing and memory utilization.
Hive Table Name | putorc-hive-table-name | An optional table name to insert into the hive.ddl attribute. The generated DDL can be used by a PutClouderaHiveQL processor (presumably after a PutHDFS processor) to create a table backed by the converted ORC file. If this property is not provided, the full name (including namespace) of the incoming Avro record will be normalized and used as the table name.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Normalize Field Names for Hive | putorc-hive-field-names | true
  • true
  • false
Whether to normalize field names for Hive (e.g., force lowercase). If the ORC file is going to be part of a Hive table, this property should be set to true. To preserve the original field names from the schema, this property should be set to false.
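
As a rough illustration of the Permissions umask property listed above, the sketch below shows how an octal umask clears permission bits. It assumes the usual POSIX-style convention that new files start from mode 666 before the umask is applied; this page does not state the base mode, so treat that value as an assumption. The function name permissions_after_umask is hypothetical and is not part of NiFi or Hadoop.

```python
def permissions_after_umask(umask_octal: str, base: int = 0o666) -> str:
    """Return the octal permission string left after applying a umask.

    base=0o666 is an assumed starting mode for regular files.
    """
    umask = int(umask_octal, 8)  # the property value is written in octal
    if not 0 <= umask <= 0o777:
        raise ValueError(f"umask out of range: {umask_octal}")
    # A umask removes bits: every bit set in the umask is cleared in the result.
    return format(base & ~umask & 0o777, "03o")


if __name__ == "__main__":
    print(permissions_after_umask("022"))  # 644
    print(permissions_after_umask("077"))  # 600
```

Under this assumption, a umask of 022 leaves files readable by everyone but writable only by the owner, while 077 restricts all access to the owner.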

Relationships:

Name | Description
retry | FlowFiles that could not be processed due to issues that can be retried are transferred to this relationship.
success | FlowFiles that have been successfully processed are transferred to this relationship.
failure | FlowFiles that could not be processed due to issues that cannot be retried are transferred to this relationship.

Reads Attributes:

Name | Description
filename | The name of the file to write comes from the value of this attribute.

Writes Attributes:

Name | Description
filename | The name of the file is stored in this attribute.
absolute.hdfs.path | The absolute path to the file is stored in this attribute.
hadoop.file.url | The Hadoop URL for the file is stored in this attribute.
record.count | The number of records written to the ORC file.
hive.ddl | Creates a partial Hive DDL statement for creating an external table in Hive from the destination folder. This can be used in ReplaceText for setting the content to the DDL. To make it valid DDL, add "LOCATION '<path_to_orc_file_in_hdfs>'", where the path is the directory that contains this ORC file on HDFS. For example, this processor can send flow files downstream to ReplaceText to set the content to this DDL (plus the LOCATION clause as described), then to a PutHiveQL processor to create the table if it doesn't exist.
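
The LOCATION step described for hive.ddl can be sketched outside NiFi as plain string handling. This is only an illustration: the helper name complete_hive_ddl, the sample DDL text, and the sample directory are all hypothetical, and the exact DDL the processor emits may differ.

```python
def complete_hive_ddl(partial_ddl: str, orc_directory: str) -> str:
    """Append a LOCATION clause to a partial Hive DDL statement.

    orc_directory is the HDFS directory that contains the ORC file,
    not the path of the file itself.
    """
    if "'" in orc_directory:
        # Quoting rules are not covered here, so reject rather than guess.
        raise ValueError("directory must not contain a single quote")
    statement = partial_ddl.strip().rstrip(";").rstrip()
    return f"{statement} LOCATION '{orc_directory}'"


if __name__ == "__main__":
    # Illustrative value standing in for the hive.ddl attribute.
    hive_ddl = (
        "CREATE EXTERNAL TABLE IF NOT EXISTS my_table "
        "(id INT, name STRING) STORED AS ORC"
    )
    print(complete_hive_ddl(hive_ddl, "/data/orc/my_table"))
```

In a flow, the same concatenation would typically be done in ReplaceText with Expression Language rather than in a script; the Python form is used here only to make the string manipulation explicit.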

State management:

This component does not store state.

Restricted:

Required Permission | Explanation
write distributed filesystem | Provides the operator the ability to write to any file that NiFi has access to in HDFS or the local filesystem.

Input requirement:

This component requires an incoming relationship.

System Resource Considerations:

None specified.