4. Data Replication

Falcon can replicate data across multiple clusters using distcp, at the frequency you specify in the feed entity. Falcon uses a pull-based replication mechanism: in every target cluster, for each source cluster, a coordinator is scheduled that pulls the data from the source cluster using distcp. For every feed instance that is replicated, Falcon sends a JMS message reporting the success or failure of that replication instance.

For example, in this feed two clusters are replicating data to a backup cluster:

<clusters> 
   <cluster name="Cluster1" type="source" partition="${cluster.name}" delay="days(2)">
       <validity start="2011-11-01T00:00Z" end="2021-11-30T00:00Z"/> 
   </cluster> 
   <cluster name="Cluster2" type="source" partition="COUNTRY/${cluster.name}"> 
       <validity start="2011-11-01T00:00Z" end="2021-11-30T00:00Z"/> 
   </cluster> 
   <cluster name="Backup" type="target"> 
       <validity start="2011-11-01T00:00Z" end="2011-11-30T00:00Z"/>
   </cluster>
</clusters>

	Note
	We recommend that the data path be as granular as the frequency of the feed. For example, if you specify the feed frequency in hours, provide a data path that ends in /${YEAR}/${MONTH}/${DAY}/${HOUR}.
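
As a minimal sketch of this recommendation, an hourly feed might pair its frequency with an hour-level data path as in the fragment below. The path prefix /data/sample is a hypothetical placeholder, and the rest of the feed entity is omitted.

```xml
<!-- Feed instances are produced every hour. -->
<frequency>hours(1)</frequency>

<!-- The data path resolves down to the hour, matching the feed frequency.
     The /data/sample prefix is a hypothetical example, not a required location. -->
<locations>
    <location type="data" path="/data/sample/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
</locations>
```

With a layout like this, each feed instance maps to its own directory, so a replication run copies only the data for that instance.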

In this example, two coordinators are scheduled to pull data into the target cluster, Backup: one coordinator pulls the data from a partition in Cluster1, and the other pulls from a partition in Cluster2. A replication delay of 2 days has been set for Cluster1, which means that replication from Cluster1 runs every 30 days with an offset of 2 days. For example, the feed instance that is scheduled for replication on November 30 becomes eligible on December 2.

If you are using Falcon for Data Replication, explore the following topics:

Falcon Community Documentation on Language Expression
Section 4.1, “distCP Throttle”
Replacing JMS with ActiveMQ
