Ingesting Data from HDFS
The HDFS spout actively monitors a specified HDFS directory and consumes any new files that appear in the directory, feeding data from HDFS to Storm.
Important
HDFS spout assumes that files visible in the monitored directory are not actively being updated; a file should be made visible to the spout only after it has been completely written. Two common ways to ensure this are to write the file in a different directory and then move it into the monitored directory once the write finishes, or to write it under a temporary name and rename it when the write finishes.
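The publish-then-move pattern can be sketched with standard file operations. This is a minimal illustration, not the spout's own code; the directory and function names are hypothetical, and `os.rename` on a local filesystem stands in for an HDFS rename.

```python
import os
import tempfile

def publish(data: bytes, staging_dir: str, monitored_dir: str, name: str) -> str:
    """Write the file in a staging directory, then move it into the
    monitored directory in one step, so the spout never sees a
    partially written file."""
    staged = os.path.join(staging_dir, name)
    with open(staged, "wb") as f:
        f.write(data)
    final = os.path.join(monitored_dir, name)
    os.rename(staged, final)  # atomic within the same filesystem
    return final
```

The rename is the key step: the file appears in the monitored directory only when its contents are already complete.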
While the spout is actively consuming a file, it renames the file with an .inprogress suffix. After all contents of the file have been consumed, the file is moved to a configurable done directory and the .inprogress suffix is dropped.
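The rename-consume-archive lifecycle described above can be sketched as follows. This is an illustrative sketch using local file operations in place of HDFS calls; the function name and signature are hypothetical.

```python
import os
import tempfile

def consume(path: str, done_dir: str, process) -> None:
    """Mark a file as in progress, feed its contents to `process`,
    then archive it to the done directory without the suffix."""
    inprogress = path + ".inprogress"
    os.rename(path, inprogress)      # claim the file for consumption
    with open(inprogress, "rb") as f:
        for line in f:
            process(line)            # e.g. emit a tuple per line
    base = os.path.basename(path)    # original name, no .inprogress
    os.rename(inprogress, os.path.join(done_dir, base))
```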
Concurrency
If multiple spout instances are used in the topology, each instance consumes a different file. Synchronization among spout instances relies on lock files created in a subdirectory (named .lock by default) under the monitored directory. A lock file with the same name as the file being consumed (without the .inprogress suffix) is created in the lock directory. Once the file is completely consumed, the corresponding lock file is deleted.
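The lock-file protocol can be sketched like this. It is a simplified local-filesystem illustration of the coordination idea, not the spout's implementation; the function names are hypothetical.

```python
import os
import tempfile

LOCK_SUBDIR = ".lock"  # default lock subdirectory name

def try_acquire(monitored_dir: str, filename: str):
    """Create a lock file named after the input file. Returns the lock
    path, or None if another instance already holds the lock."""
    lock_dir = os.path.join(monitored_dir, LOCK_SUBDIR)
    os.makedirs(lock_dir, exist_ok=True)
    lock_path = os.path.join(lock_dir, filename)
    try:
        # O_EXCL makes creation fail if the lock already exists
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None  # another instance is consuming this file
    os.close(fd)
    return lock_path

def release(lock_path: str) -> None:
    os.remove(lock_path)  # file fully consumed; drop the lock
```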
Recovery from failure
Periodically, the spout records in the lock file how much of the file has been consumed. If the spout instance crashes, or if the topology is force-killed, another spout can take over the file and resume from the location recorded in the lock file.
Certain error conditions (such as a spout crash) can leave residual lock files. Such a stale lock file indicates that the corresponding input file has not been completely processed. When detected, ownership of a stale lock file is transferred to another spout.
The hdfsspout.lock.timeout.sec property specifies the duration of inactivity after which a lock file is considered stale; the default timeout is five minutes. For lock-file ownership transfer to succeed, the HDFS lease on the file (held by the previous lock owner) must have expired. Spouts scan for stale lock files before selecting the next file for consumption.
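The checkpoint-and-staleness mechanism can be sketched as follows. This is an illustrative sketch only; the function names are hypothetical, and the 300-second default mirrors the documented five-minute timeout.

```python
import os
import time
import tempfile

def record_progress(lock_path: str, offset: int) -> None:
    """Append a consumption checkpoint; writing also refreshes the
    lock file's modification time."""
    with open(lock_path, "a") as f:
        f.write(f"{offset}\n")

def last_offset(lock_path: str) -> int:
    """Return the most recently recorded offset, to resume from."""
    with open(lock_path) as f:
        offsets = f.read().split()
    return int(offsets[-1]) if offsets else 0

def is_stale(lock_path: str, timeout_sec: float = 300) -> bool:
    """A lock is stale if no progress has been recorded within the
    timeout; a stale lock may be taken over by another instance."""
    age = time.time() - os.path.getmtime(lock_path)
    return age > timeout_sec
```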
Lock on .lock Directory
HDFS spout instances create a DIRLOCK file in the .lock directory to coordinate certain accesses to the .lock directory itself. A spout tries to create the DIRLOCK file when it needs access to the .lock directory, and deletes it when done. In error conditions such as a topology crash, force kill, or untimely death of a spout, this file may not be deleted. Future spout instances will eventually delete the stale DIRLOCK file once it has been inactive for hdfsspout.lock.timeout.sec seconds.
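The DIRLOCK create-or-take-over logic can be sketched like this. Again, this is a simplified local-filesystem illustration, not the spout's code; names are hypothetical and the timeout default mirrors hdfsspout.lock.timeout.sec.

```python
import os
import time
import tempfile

DIRLOCK = "DIRLOCK"

def enter_lock_dir(lock_dir: str, timeout_sec: float = 300):
    """Try to create the DIRLOCK marker. If one already exists but is
    older than the timeout, treat it as stale, delete it, and retry."""
    path = os.path.join(lock_dir, DIRLOCK)
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return path  # we now hold exclusive access to .lock
        except FileExistsError:
            if time.time() - os.path.getmtime(path) > timeout_sec:
                os.remove(path)  # stale DIRLOCK from a dead spout
            else:
                return None      # directory busy; try again later

def exit_lock_dir(path: str) -> None:
    os.remove(path)  # done with the .lock directory
```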
API Support
The HDFS spout supports core Storm but does not currently support Trident.