Falcon can replicate data across multiple clusters using distcp, at the frequency you specify in the feed entity. Falcon uses a pull-based replication mechanism: in every target cluster, for each source cluster, a coordinator is scheduled that pulls the data from the source cluster using distcp. For every feed instance that is replicated, Falcon sends a JMS message reporting the success or failure of the replication instance.
For example, in this feed two clusters are replicating data to a backup cluster:
<clusters>
    <cluster name="Cluster1" type="source" partition="${cluster.name}" delay="days(2)">
        <validity start="2011-11-01T00:00Z" end="2021-11-30T00:00Z"/>
    </cluster>
    <cluster name="Cluster2" type="source" partition="COUNTRY/${cluster.name}">
        <validity start="2011-11-01T00:00Z" end="2021-11-30T00:00Z"/>
    </cluster>
    <cluster name="Backup" type="target">
        <validity start="2011-11-01T00:00Z" end="2011-11-31T00:00Z"/>
    </cluster>
</clusters>
Note: We recommend that the data path be as granular as the frequency of the feed. For example, if you specify the feed frequency in hours, provide a data path of the form /${YEAR}/${MONTH}/${DAY}/${HOUR}.
In this example, two coordinators are scheduled to pull data into the target, Backup: one coordinator pulls the data from a partition in Cluster1, and the other pulls from a partition in Cluster2. A replication delay of 2 days has been set for Cluster1, which means that its replication runs every 30 days with an offset of 2 days. For example, the feed instance that is scheduled for replication on November 30 becomes eligible on December 2.
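To illustrate the recommendation about data path granularity, the following fragment sketches how an hourly feed might pair its frequency with an hour-level data path. The feed name `hourlyClicks` and the base path `/data/clicks` are hypothetical, and other elements that a complete feed entity needs (such as clusters, ACL, and schema) are left out for brevity.

```xml
<!-- Illustrative fragment only: the feed name and base path are assumptions -->
<feed name="hourlyClicks" xmlns="uri:falcon:feed:0.1">
    <!-- One feed instance is expected every hour -->
    <frequency>hours(1)</frequency>

    <!-- Cluster definitions (source and target) would appear here -->

    <locations>
        <!-- The path resolves down to the hour, matching the feed frequency -->
        <location type="data" path="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
    </locations>
</feed>
```

With a layout like this, each hourly instance maps to its own directory, so a replication coordinator copies only the data that belongs to that instance.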
If you are using Falcon for Data Replication, explore the following topics:
Falcon Community Documentation on Language Expression