Data Egress
A Processor that publishes data to an external system has two Relationships: success and
failure. By convention, the Processor name starts with "Put" followed by the protocol that
is used for data transmission. Processors that follow this pattern include PutEmail,
PutSFTP, and PostHTTP (note that the name does not begin with "Put" because this would lead
to confusion, since PUT and POST have special meanings when dealing with HTTP).
This Processor may create or initialize a Connection Pool in a method that uses the
@OnScheduled annotation. However, because communications problems may prevent connections
from being established or cause connections to be terminated, connections themselves are
not created at this point. Rather, the connections are created or leased from the pool in
the onTrigger method.
The onTrigger method first obtains a FlowFile from the ProcessSession via the get method.
If no FlowFile is available, the method returns without obtaining a connection to the
remote resource.
If at least one FlowFile is available, the Processor obtains a connection from the
Connection Pool, if possible, or otherwise creates a new connection. If the Processor is
neither able to lease a connection from the Connection Pool nor create a new connection,
the FlowFile is routed to failure, the event is logged, and the method returns.
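
The lease-or-create step can be sketched in plain Java as follows. This is a minimal
illustration under stated assumptions, not NiFi code: LazyConnectionPool, leaseOrCreate,
giveBack, and idleCount are hypothetical names, and the connection type is left generic. The
constructor stands in for the work done in the @OnScheduled method, and leaseOrCreate stands
in for the work done in onTrigger.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.function.Supplier;

// Plain-Java stand-in for the pattern; none of these types are NiFi classes.
final class LazyConnectionPool<C> {
    private final Deque<C> idle = new ArrayDeque<>();
    private final Supplier<C> factory; // may throw if the remote side is unreachable

    // Counterpart of the @OnScheduled step: the pool exists, but no connection is opened yet.
    LazyConnectionPool(Supplier<C> factory) {
        this.factory = factory;
    }

    // Counterpart of the onTrigger step: lease an idle connection, else try to create one.
    // An empty result means the caller should route the FlowFile to failure, log, and return.
    synchronized Optional<C> leaseOrCreate() {
        C pooled = idle.pollFirst();
        if (pooled != null) {
            return Optional.of(pooled);
        }
        try {
            return Optional.of(factory.get());
        } catch (RuntimeException e) {
            return Optional.empty();
        }
    }

    // Return a leased connection, or add a newly created one, once it is no longer in use.
    synchronized void giveBack(C connection) {
        idle.addFirst(connection);
    }

    synchronized int idleCount() {
        return idle.size();
    }
}
```

Returning an empty Optional rather than throwing keeps the "no connection available" case
an ordinary branch in the caller, which matches the route-to-failure handling described
above.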
If a connection was obtained, the Processor obtains an InputStream to the FlowFile's
content by invoking the read method on the ProcessSession and passing an
InputStreamCallback (which is often an anonymous inner class). From within that callback,
it transmits the contents of the FlowFile to the destination. The event is logged along
with the amount of time taken to transfer the file and the data rate at which the file was
transferred. A SEND event is reported to the ProvenanceReporter by obtaining the reporter
from the ProcessSession via the getProvenanceReporter method and calling the send method on
the reporter. The connection is then returned or added to the Connection Pool, depending on
whether the connection was leased from the pool or newly created by the onTrigger method.
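
The transfer-and-measure step can be sketched in plain Java as follows. This is an
illustration only: TimedTransfer, Result, and bytesPerSecond are hypothetical names, and the
InputStream and OutputStream here stand in for the FlowFile content stream supplied to the
callback and for the stream to the remote destination.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not a NiFi class: copies content and measures the transfer.
final class TimedTransfer {

    // Bytes moved and elapsed time for one send.
    static final class Result {
        final long bytes;
        final long nanos;

        Result(long bytes, long nanos) {
            this.bytes = bytes;
            this.nanos = nanos;
        }

        long millis() {
            return TimeUnit.NANOSECONDS.toMillis(nanos);
        }

        double bytesPerSecond() {
            return TimedTransfer.bytesPerSecond(bytes, nanos);
        }
    }

    // Data rate in bytes per second; reported as 0 when no time was measured.
    static double bytesPerSecond(long bytes, long nanos) {
        return nanos <= 0 ? 0.0 : bytes * 1e9 / nanos;
    }

    // Copies everything from content to destination and records how long it took.
    static Result send(InputStream content, OutputStream destination) throws IOException {
        long start = System.nanoTime();
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = content.read(buffer)) != -1) {
            destination.write(buffer, 0, read);
            total += read;
        }
        destination.flush();
        return new Result(total, System.nanoTime() - start);
    }
}
```

The values in Result are what the log message would contain: the elapsed time and the
data rate for the transfer.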
If there is a communications problem, the connection is typically terminated and not
returned (or added) to the Connection Pool. If there is an issue sending the data to the
remote resource, the desired approach for handling the error depends on a few
considerations. If the issue is related to a network condition, the FlowFile is generally
routed to failure. The FlowFile is not penalized because there is not necessarily a problem
with the data. Unlike the case of the Data Ingress Processor, we typically do not call
yield on the ProcessContext. This is because in the case of ingest, the FlowFile does not
exist until the Processor is able to perform its function. However, in the case of a Put
Processor, the DataFlow Manager may choose to route failure to a different Processor. This
can allow for a "backup" system to be used in the case of problems with one system or can
be used for load distribution across many systems.
If a problem occurs that is data-related, one of two approaches should be taken. First, if
the problem is likely to sort itself out, the FlowFile is penalized and then routed to
failure. This is the case, for instance, with PutFTP, when a FlowFile cannot be transferred
because of a file naming conflict. The presumption is that the file will eventually be
removed from the directory so that the new file can be transferred. As a result, we
penalize the FlowFile and route to failure so that we can try again later. In the other
case, if there is an actual problem with the data (for example, the data does not conform
to some required specification), a different approach may be taken. In this case, it may be
advantageous to break apart the failure relationship into a failure and a communications
failure relationship. This allows the DataFlow Manager to determine how to handle each of
these cases individually. It is important in these situations to clearly document the
differences between the two Relationships in the "description" when creating each
Relationship.
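
The error-handling choices above can be summarized as a small decision function. This is a
sketch in plain Java, not NiFi code: EgressFailureRouting, Problem, and Decision are
hypothetical names. The text does not say whether a FlowFile with invalid data is penalized;
the sketch assumes it is not, since waiting does not fix the data.

```java
// Hypothetical decision helper, not a NiFi class.
final class EgressFailureRouting {

    enum Problem { NETWORK, TRANSIENT_DATA_CONFLICT, INVALID_DATA }

    // Relationship names taken from the text.
    static final String FAILURE = "failure";
    static final String COMMS_FAILURE = "communications failure";

    // Where the FlowFile goes and whether it is penalized first.
    static final class Decision {
        final String relationship;
        final boolean penalize;

        Decision(String relationship, boolean penalize) {
            this.relationship = relationship;
            this.penalize = penalize;
        }
    }

    // splitRelationships is true when the Processor exposes a separate
    // "communications failure" Relationship in addition to "failure".
    static Decision decide(Problem problem, boolean splitRelationships) {
        switch (problem) {
            case NETWORK:
                // Not a data problem, so no penalty.
                return new Decision(splitRelationships ? COMMS_FAILURE : FAILURE, false);
            case TRANSIENT_DATA_CONFLICT:
                // For example a file naming conflict: penalize so the send is retried later.
                return new Decision(FAILURE, true);
            case INVALID_DATA:
                // Assumption: no penalty, because retrying later will not change the data.
                return new Decision(FAILURE, false);
            default:
                throw new IllegalArgumentException("Unknown problem: " + problem);
        }
    }
}
```

With the Relationships split, a flow designer could, for instance, loop communications
failure back for a retry while sending failure to a Processor that handles bad data.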
Connections to remote systems are torn down and the Connection Pool is shut down in a
method annotated with @OnStopped so that resources can be reclaimed.
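
The teardown step can be sketched in plain Java as follows. PoolShutdown and closeAll are
hypothetical names, and the pool is assumed to be a Deque of AutoCloseable connections; in a
Processor, logic like this would be called from the method annotated with @OnStopped.

```java
import java.util.Deque;

// Hypothetical helper, not a NiFi class: drains a pool and closes every idle connection.
final class PoolShutdown {

    // Closes all pooled connections and returns how many closed cleanly.
    static int closeAll(Deque<AutoCloseable> idle) {
        int closed = 0;
        AutoCloseable connection;
        while ((connection = idle.pollFirst()) != null) {
            try {
                connection.close();
                closed++;
            } catch (Exception e) {
                // Keep going so that one failing connection does not leak the rest.
            }
        }
        return closed;
    }
}
```

Draining the pool even when an individual close fails leaves it empty, so the next
scheduling of the Processor starts with no stale connections.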