2. Defining Data Pipelines

To create a data pipeline you must:

Create the cluster specification XML file, also known as a cluster entity. There are several interfaces to define in a cluster entity. For example, here is a cluster entity with all cluster interfaces defined:

Colo: Name of the Data Center
Name: Unique name of the cluster entity
<interface>: Specify the interface type

<?xml version="1.0"?>
<!--
   Cluster Example
 -->
<cluster colo="$MyDataCenter" description="description" name="$MyDataCenter">    
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.4.0" /> <!-- Required for distcp for replications. -->
    <interface type="write" endpoint="hdfs://nn:8020" version="2.4.0" /> <!-- Needed for writing to HDFS-->
    <interface type="execute" endpoint="rm:8050" version="2.4.0" /> <!-- Needed to submit jobs, such as MapReduce jobs -->
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" /> <!-- Required. Submits Oozie jobs.-->
    <interface type="registry" endpoint="thrift://hms:9083" version="0.13.0" /> <!-- Registers/deregisters partitions in the Hive Metastore and gets events on partition availability -->
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" /> <!--Needed for alerts-->
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" /> <!--HDFS directories used by the Falcon server-->
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>

	Note
	Additional properties must be set if you are configuring for a secure cluster. For more information, see Configuring for Secure Clusters.
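
Before submitting a cluster entity, it can help to confirm that the file is well-formed and that the expected interfaces are present. The following is a minimal sketch in Python using only the standard library; it is not part of Falcon. The function name summarize_cluster and the default list of required interfaces are assumptions made for illustration (the default follows the "Required" comment on the workflow interface above), and the embedded XML is a trimmed copy of the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster entity above; host names are the document's placeholders.
CLUSTER_XML = """<?xml version="1.0"?>
<cluster colo="$MyDataCenter" description="description" name="$MyDataCenter">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.4.0" />
    <interface type="write" endpoint="hdfs://nn:8020" version="2.4.0" />
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" />
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>"""


def _local(tag):
    """Strip an XML namespace prefix such as '{uri:falcon:cluster:0.1}'."""
    return tag.rsplit("}", 1)[-1]


def summarize_cluster(xml_text, required=("workflow",)):
    """Return (interfaces, locations, missing) for a cluster entity string.

    'required' is an assumption for this sketch, not a list taken from Falcon.
    """
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    interfaces, locations = {}, {}
    for el in root.iter():
        name = _local(el.tag)
        if name == "interface":
            interfaces[el.get("type")] = el.get("endpoint")
        elif name == "location":
            locations[el.get("name")] = el.get("path")
    missing = sorted(set(required) - set(interfaces))
    return interfaces, locations, missing


if __name__ == "__main__":
    interfaces, locations, missing = summarize_cluster(CLUSTER_XML)
    print("interfaces:", sorted(interfaces))
    print("locations:", sorted(locations))
    print("missing:", missing)
```

A typographic quote in an attribute, or an unclosed tag, makes ET.fromstring raise ParseError, so the same check catches copy-and-paste damage in an entity file.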

Next, create a dataset specification XML file, or feed entity:

Reference the cluster entity to determine which clusters the feed uses.
<frequency>: Specify the frequency of the feed.
<retention limit>: Choose a retention policy for the data to remain on the cluster.
<location>: Provide the HDFS path to the files.
<ACL owner>: Specify the HDFS access permissions.
Optional. Specify a Late Data Handling cut-off.

<?xml version="1.0"?>        
<!--
   Feed Example
 -->
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency> <!--Feed run frequency-->  
  <late-arrival cut-off="hours(6)"/> <!-- Late arrival cut-off -->
  <groups>churnAnalysisFeeds</groups> <!--Feed group, feeds can belong to multiple groups -->
  <tags>externalSource=$MyEDW, externalTarget=Marketing</tags> <!-- Metadata tagging -->
  <clusters> <!-- Target clusters for retention and replication. -->
    <cluster name="$MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <retention limit="days($n)" action="delete"/> <!--Currently delete is the only action available -->
    </cluster>
    <cluster name="$MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations> <!-- Global location across clusters - HDFS paths or Hive tables -->
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>  <!-- Required for HDFS. -->
  <schema location="/none" provider="none"/> <!-- Required for HDFS. -->
</feed>
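
To see how the feed's frequency and location pattern relate, the following Python sketch expands the path pattern from the example for the first few hourly instances. It is an illustration only, not Falcon's implementation. The helper names (parse_frequency, resolve_feed_path, instance_paths) are hypothetical, the sketch handles only the minutes, hours, and days units, and zero-padded two-digit month, day, and hour values are an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from string import Template


def parse_frequency(expr):
    """Turn an expression such as 'hours(1)' into a timedelta.

    Only minutes, hours, and days are handled in this sketch.
    """
    m = re.fullmatch(r"(minutes|hours|days)\((\d+)\)", expr.strip())
    if m is None:
        raise ValueError(f"unsupported frequency expression: {expr!r}")
    return timedelta(**{m.group(1): int(m.group(2))})


def resolve_feed_path(pattern, instance_time):
    """Substitute ${YEAR}, ${MONTH}, ${DAY}, ${HOUR}, ${MINUTE} in a path pattern.

    Zero padding is an assumption of this sketch.
    """
    values = {
        "YEAR": instance_time.strftime("%Y"),
        "MONTH": instance_time.strftime("%m"),
        "DAY": instance_time.strftime("%d"),
        "HOUR": instance_time.strftime("%H"),
        "MINUTE": instance_time.strftime("%M"),
    }
    return Template(pattern).substitute(values)


def instance_paths(pattern, start, frequency, count):
    """List the paths of the first 'count' feed instances beginning at 'start'."""
    step = parse_frequency(frequency)
    return [resolve_feed_path(pattern, start + i * step) for i in range(count)]


if __name__ == "__main__":
    # Values taken from the feed example: hourly frequency, validity start 2012-01-01T00:00Z.
    start = datetime(2012, 1, 1, 0, 0, tzinfo=timezone.utc)
    for path in instance_paths("/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}", start, "hours(1)", 3):
        print(path)
```

Under these assumptions, an hourly feed that starts at 2012-01-01T00:00Z resolves to one directory per hour, beginning with /weblogs/2012-01-01-00.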

Create the process specification XML file:

<cluster name>: Reference the cluster entity to define where the process runs.
<feed>: Reference the feed entity to define the datasets that the process uses.
Optional. Specify Late Data Handling policies or a Retry Policy.

<?xml version="1.0"?>        
<!--
   Process Example
 -->
<process name="process-test" xmlns="uri:falcon:process:0.1">
    <clusters>
      <cluster name="$MyDataCenter">
        <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
      </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order> <!--You can also use LIFO and LASTONLY but FIFO is recommended in most cases--> 
    <frequency>days(1)</frequency> 
    <inputs>
        <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
    </inputs>
    <outputs>
        <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
    </outputs>
    <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
    <retry policy="periodic" delay="minutes(10)" attempts="3"/>
    <late-process policy="exp-backoff" delay="hours(1)">
        <late-input input="input" workflow-path="/apps/clickstream/late" />
    </late-process>
</process>

	Note
	LIFO and LASTONLY are also supported values for <order>.
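
Because a process entity refers to cluster and feed entities by name, a quick cross-check before deployment can catch names that do not match. The following Python sketch extracts those references from a trimmed copy of the process example and compares them with the names you have defined. It is a standalone illustration, not a Falcon API; the function names referenced_entities and unresolved are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the process entity above.
PROCESS_XML = """<?xml version="1.0"?>
<process name="process-test" xmlns="uri:falcon:process:0.1">
    <clusters>
      <cluster name="$MyDataCenter">
        <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
      </cluster>
    </clusters>
    <inputs>
        <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
    </inputs>
    <outputs>
        <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
    </outputs>
</process>"""


def _local(tag):
    """Strip the namespace prefix, for example '{uri:falcon:process:0.1}'."""
    return tag.rsplit("}", 1)[-1]


def referenced_entities(process_xml):
    """Return (cluster_names, feed_names) referenced by a process entity string."""
    root = ET.fromstring(process_xml)
    clusters, feeds = set(), set()
    for el in root.iter():
        name = _local(el.tag)
        if name == "cluster":
            clusters.add(el.get("name"))
        elif name in ("input", "output"):
            feeds.add(el.get("feed"))
    return clusters, feeds


def unresolved(process_xml, known_clusters, known_feeds):
    """Return sorted lists of referenced clusters and feeds that are not defined."""
    clusters, feeds = referenced_entities(process_xml)
    return sorted(clusters - set(known_clusters)), sorted(feeds - set(known_feeds))


if __name__ == "__main__":
    # Names from the cluster and feed examples earlier in this section.
    missing_clusters, missing_feeds = unresolved(PROCESS_XML, {"$MyDataCenter"}, {"testFeed"})
    print("undefined clusters:", missing_clusters)
    print("undefined feeds:", missing_feeds)
```

Run against the examples in this section, the check reports feed-clicks-raw and feed-clicks-clean as undefined, because the feed example is named testFeed. In a real pipeline, the feed names in the process entity must match feed entities that you have defined.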

You can now move on to Deploying Data Pipelines.
