kite-morphlines-hadoop-core

downloadHdfsFile

The downloadHdfsFile command downloads, on startup, zero or more files or directory trees from HDFS to the local file system. These files are typically static configuration files required by downstream morphline commands, such as Avro schema files, XML join tables, or grok dictionaries. Storing such configuration files in HDFS helps keep configuration consistent and centrally managed across a set of cluster nodes.

The output directory on the local file system defaults to the current working directory of the current process. If the effective output file or directory already exists, it is deleted and then replaced by the downloaded copy.

The command provides the following configuration options:

| Property Name | Default | Description |
|---|---|---|
| inputFiles | | The HDFS files or directories to download, in the form of a list of HDFS URIs. |
| outputDir | "." | The relative or absolute path of the destination directory on the local file system. Parent directories of that directory will be created automatically. Defaults to the current working directory of the current process. |

Example usage:

downloadHdfsFile {
  inputFiles : ["hdfs://c2202.mycompany.com/user/foo/configs/sample-schema.avsc"]
  outputDir : "myconfigs"
}
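
To illustrate how a downloaded file might be consumed by a downstream command, here is a minimal sketch of a complete morphline that downloads the schema and then passes its local path to toAvro. Several details are assumptions rather than statements from the text above: the morphline id is hypothetical, the file is assumed to land at outputDir plus the last component of the HDFS path (here myconfigs/sample-schema.avsc), and commands are assumed to be initialized in the order listed, so that the download completes before toAvro loads its schema.

```hocon
morphlines : [
  {
    id : downloadThenConvert   # hypothetical id, for illustration only
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        # Runs once, when the morphline is started.
        downloadHdfsFile {
          inputFiles : ["hdfs://c2202.mycompany.com/user/foo/configs/sample-schema.avsc"]
          outputDir : "myconfigs"
        }
      }
      {
        # Assumption: the schema is now available locally at
        # myconfigs/sample-schema.avsc (outputDir + file name).
        toAvro {
          schemaFile : "myconfigs/sample-schema.avsc"
        }
      }
    ]
  }
]
```

Because an existing output file is deleted and replaced on each startup, the local copy in this sketch would always mirror what is currently stored in HDFS.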

openHdfsFile

The openHdfsFile command opens an HDFS file for reading and returns a corresponding Java InputStream.

The morphline record input field _attachment_body must contain the HDFS Path of the file to read. The command replaces the HDFS Path in this field with the corresponding Java InputStream. That InputStream can then be parsed by other commands, such as readLine.

The command automatically decompresses gzip files if the file path ends with the ".gz" file name extension.

Example usage:

openHdfsFile {}
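
Since openHdfsFile takes no options, its behavior depends on the commands around it. The following is a minimal sketch of such a pipeline, not a definitive setup. It assumes that the path may be supplied as a plain string in _attachment_body (set here with setValues), and the morphline id and HDFS path are hypothetical.

```hocon
morphlines : [
  {
    id : readHdfsLines   # hypothetical id, for illustration only
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        # Put the HDFS path of the file to read into _attachment_body.
        # Hypothetical path; the ".gz" suffix triggers gzip handling.
        setValues {
          _attachment_body : "hdfs://c2202.mycompany.com/user/foo/logs/sample.log.gz"
        }
      }
      {
        # Replaces the path in _attachment_body with an InputStream.
        openHdfsFile {}
      }
      {
        # Parses the stream, emitting one record per line in the message field.
        readLine {
          charset : UTF-8
        }
      }
      {
        logInfo {
          format : "line: {}"
          args : ["@{message}"]
        }
      }
    ]
  }
]
```

In practice the path would more likely arrive on the incoming record from an upstream source than be hard-coded with setValues.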