ListCDPObjectStore 2.3.0.4.10.0.0-147

Bundle
com.cloudera | nifi-cdf-objectstore-nar
Description
Retrieves a listing of files from the object store. Each time a listing is performed, the files with the latest timestamp will be excluded and picked up during the next execution of the processor. This is done to ensure that we do not miss any files, or produce duplicates, in the cases where files with the same timestamp are written immediately before and after a single execution of the processor. For each file that is listed, this processor creates a FlowFile that represents the object store file to be fetched in conjunction with FetchCDPObjectStore. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. This Processor does not delete any data from the object store.
Tags
ADLS, AWS, Azure, CDP, GCP, GCS, Google, HCFS, HDFS, S3, filesystem, get, hadoop, ingest, list, source
Input Requirement
FORBIDDEN
Supports Sensitive Dynamic Properties
false
  • Additional Details for ListCDPObjectStore 2.3.0.4.10.0.0-147

    ListCDPObjectStore

    Description

    ListCDPObjectStore lists files from a given folder of the configured object store. In most respects it works identically to ListHDFS, including support for Filter Modes, limiting the scope of the listing by file age, and other features. These features are described in detail in the ListHDFS documentation.

    CDP Object Store processors

    ListCDPObjectStore is part of the CDP Object Store processor family. This comes with a number of consequences listed below.

    Object Store access

    This processor is designed to ease interactions with the object store associated with the NiFi cluster. In CDP Private Cloud, it can be used to interact with HDFS and/or Ozone. In CDP Public Cloud, it can be used to interact with the object store of the underlying cloud provider (S3 for AWS, ADLS for Azure, GCS for Google Cloud, etc.), but not across cloud providers. If the cluster is configured with RAZ, the processor will interact with RAZ to check the Ranger policies when accessing resources in the object store. If RAZ is not enabled, the IDBroker mappings can be used to map CDP users to cloud accounts and policies.

    Configuration file

    This processor needs a configuration file that contains the connection details for the object store. This should be a Hadoop-style XML file, occasionally with additional parameters that are specific to the given kind of object store and authentication method. Unless specified otherwise, the processor looks for the CDP default configuration file, /etc/hadoop/conf/core-site.xml.

    This configuration contains information specific to the object store provider (for example, AWS) which, combined with the underlying Hadoop library, makes it possible to connect to different kinds of stores, authenticate with Kerberos, and authorize with Ranger. In most cases, using this default configuration is recommended.

    Users may override the default location by adding a dynamic parameter named “cdp.configuration.resources”. Multiple configuration files can be given as a comma-separated list. Note, however, that for the additional features provided by the underlying Hadoop library to keep working, a number of additional configuration parameters are needed.
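    To make the "Hadoop-style XML" format concrete, the following is a minimal sketch of how such a file and a comma-separated “cdp.configuration.resources” value could be read. It is an illustration only, not the processor's implementation: the helper names parse_hadoop_xml and split_resources, the sample bucket name, and the second path in the usage note are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content; a real core-site.xml carries many more properties.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://example-bucket/data</value>
  </property>
</configuration>
"""


def parse_hadoop_xml(xml_text):
    """Return the name/value pairs of a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def split_resources(value):
    """Split a comma-separated list of configuration file paths."""
    return [path.strip() for path in value.split(",") if path.strip()]
```

    For example, parse_hadoop_xml(SAMPLE_CORE_SITE) yields a dictionary containing “fs.defaultFS”, and split_resources("/etc/hadoop/conf/core-site.xml, /opt/extra-site.xml") yields the two paths as a list.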

    Storage Location

    If the Storage Location property is not set, the default storage location is used. The default value is defined by the “fs.defaultFS” property of the object store configuration. If the default CDP configuration is used, this is the Data Lake’s object storage. If the Storage Location property is set, the value of “fs.defaultFS” is ignored, and it is important to adjust the authentication and authorization settings accordingly.
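    The precedence described above can be sketched as a small function. This is a hedged illustration of the rule, not the processor's code: effective_storage_location is a hypothetical name, and the configuration is assumed to be available as a plain dictionary.

```python
def effective_storage_location(storage_location, config):
    """Pick the location to list from.

    storage_location: value of the Storage Location property, or None/"" if unset.
    config: dict of object store configuration properties (assumed shape).
    """
    # An explicitly set Storage Location wins; fs.defaultFS is then ignored.
    if storage_location and storage_location.strip():
        return storage_location.strip()
    # Otherwise fall back to the default from the configuration.
    return config.get("fs.defaultFS")
```

    With a configuration whose “fs.defaultFS” is s3a://datalake-bucket (a made-up value), an unset Storage Location resolves to that bucket, while an explicit value such as s3a://other-bucket replaces it.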

    Dynamic parameters

    This processor supports dynamic parameters. All dynamic parameters except the protected ones are passed to the object storage configuration. They are added as additional configuration parameters or, where a parameter already exists, overwrite it. This makes it possible to fine-tune the connection without changing the configuration file. The protected parameters are “fs.defaultFS” and “cdp.configuration.resources”.
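    The merge behavior described above can be sketched as follows. This is an illustrative model under the assumption that both the file-based configuration and the dynamic parameters are simple name/value maps; apply_dynamic_properties is a hypothetical helper, not part of the processor.

```python
# The two protected names come from the text above.
PROTECTED = frozenset({"fs.defaultFS", "cdp.configuration.resources"})


def apply_dynamic_properties(base_config, dynamic_properties):
    """Overlay dynamic parameters on the configuration, skipping protected names."""
    merged = dict(base_config)  # leave the caller's dictionary untouched
    for name, value in dynamic_properties.items():
        if name in PROTECTED:
            continue  # protected parameters are not passed through
        merged[name] = value  # add a new parameter or overwrite an existing one
    return merged
```

    For instance, a dynamic parameter such as fs.s3a.connection.maximum would be added to, or replace the value in, the resulting configuration, whereas a dynamic “fs.defaultFS” would leave the configured value unchanged in this model.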

    Authentication

    This processor supports Kerberos authentication, either through a Kerberos Credential Service or by explicitly providing a CDP Username and CDP Password. Both methods authenticate against the Kerberos service associated with the cluster.

Properties
Dynamic Properties
State Management
Scopes Description
CLUSTER After performing a listing of files, the latest timestamp of all the files listed and the latest timestamp of all the files transferred are both stored. This allows the Processor to list only files that have been added or modified after this date the next time that the Processor is run, without having to store all of the actual filenames/paths which could lead to performance problems. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.
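The timestamp-based bookkeeping described above can be illustrated with a simplified sketch. It is a model of the idea (two stored timestamps, newest files held back for one run), not the processor's actual code; ListingState and list_new_files are hypothetical names, and each listing is assumed to be a list of (path, modification time in milliseconds) pairs.

```python
from dataclasses import dataclass


@dataclass
class ListingState:
    latest_listed: int = -1   # newest modification time seen in any listing
    latest_emitted: int = -1  # newest modification time already transferred


def list_new_files(entries, state):
    """Return (paths to emit, new state) for one run of the listing."""
    # Keep entries newer than the previous listing, plus entries that were
    # held back at exactly the previous newest timestamp.
    candidates = [
        (path, mtime) for path, mtime in entries
        if mtime > state.latest_listed
        or (mtime == state.latest_listed and mtime > state.latest_emitted)
    ]
    if not candidates:
        return [], state
    newest = max(mtime for _, mtime in candidates)
    if newest != state.latest_listed:
        # Something newer appeared: hold back the newest timestamp for one run,
        # since more files with that timestamp may still be written.
        to_emit = [(p, m) for p, m in candidates if m < newest]
    else:
        # Nothing newer arrived since the previous run: release held-back files.
        to_emit = candidates
    emitted_max = max((m for _, m in to_emit), default=state.latest_emitted)
    new_state = ListingState(newest, max(state.latest_emitted, emitted_max))
    return sorted(p for p, _ in to_emit), new_state
```

In this model, a first run over files with timestamps 100, 200, and 200 emits only the file at 100. A second run emits both files at 200, along with any file written at timestamp 200 in between, and a third run over the same listing emits nothing. Only the two timestamps are stored, not the file names.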
Relationships
Name Description
success All FlowFiles are transferred to this relationship
Writes Attributes
Name Description
filename The name of the file that was read from object store.
path The path is set to the absolute path of the file's directory on the object store. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "./". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "/tmp/abc/1/2/3".
objectstore.owner The user that owns the file in the object store.
objectstore.group The group that owns the file in the object store.
objectstore.lastModified The timestamp of when the file in the object store was last modified, as milliseconds since midnight Jan 1, 1970 UTC.
objectstore.length The number of bytes in the file.
objectstore.replication The number of replicas for the file.
objectstore.permissions The permissions for the file in the object store. This is formatted as 3 characters for the owner, 3 for the group, and 3 for other users. For example rw-rw-r--
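As a hedged illustration of how a downstream script might interpret two of these attributes, the sketch below converts objectstore.lastModified to a UTC datetime and objectstore.permissions to an octal mode string. The function names are hypothetical, attribute values are assumed to arrive as strings, and only the plain r, w, and x characters are handled (special bits such as the sticky bit are not).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_modified_to_datetime(millis):
    """Convert epoch milliseconds (string or int) to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=int(millis))


def permissions_to_octal(perms):
    """Convert a 9-character string such as rw-rw-r-- to an octal string such as 664."""
    if len(perms) != 9:
        raise ValueError("expected 9 permission characters, got %r" % perms)
    digits = []
    for start in range(0, 9, 3):
        read, write, execute = perms[start:start + 3]
        value = (4 if read == "r" else 0) \
            + (2 if write == "w" else 0) \
            + (1 if execute == "x" else 0)
        digits.append(str(value))
    return "".join(digits)
```

Applied to the example from the table, permissions_to_octal("rw-rw-r--") gives "664".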
See Also