2. Defining Data Pipelines

To create a data pipeline you must:

  1. Create the cluster specification XML file, also known as a cluster entity. A cluster entity defines the following attributes and interfaces; the example after this list defines every cluster interface:

    • Colo: Name of the Data Center

    • Name: Filename of the Data Center

    • <interface>: Specify the interface type

    Cluster Example

    <?xml version="1.0"?>
    <cluster colo="$MyDataCenter" description="description" name="$MyDataCenter" xmlns="uri:falcon:cluster:0.1">
        <interfaces>
            <interface type="readonly" endpoint="hftp://nn:50070" version="2.4.0" /> <!-- Required for distcp for replications. -->
            <interface type="write" endpoint="hdfs://nn:8020" version="2.4.0" /> <!-- Needed for writing to HDFS. -->
            <interface type="execute" endpoint="rm:8050" version="2.4.0" /> <!-- Needed to submit jobs as MapReduce. -->
            <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" /> <!-- Required. Submits Oozie jobs. -->
            <interface type="registry" endpoint="thrift://hms:9083" version="0.13.0" /> <!-- Register/deregister partitions in the Hive Metastore and get events on partition availability. -->
            <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" /> <!-- Needed for alerts. -->
        </interfaces>
        <locations>
            <location name="staging" path="/apps/falcon/prod-cluster/staging" /> <!-- HDFS directories used by the Falcon server. -->
            <location name="temp" path="/tmp" />
            <location name="working" path="/apps/falcon/prod-cluster/working" />
        </locations>
    </cluster>

    Additional properties must be set if you are configuring for a secure cluster. For more information, see Configuring for Secure Clusters.

  2. Next, create a dataset specification XML file, or feed entity:

    • Reference the cluster entity to determine which clusters the feed uses.

    • <frequency>: Specify the frequency of the feed.

    • <retention limit>: Choose a retention policy that sets how long the data remains on the cluster.

    • <location>: Provide the HDFS path to the files.

    • <ACL owner>: Specify the HDFS access permissions.

    • Optional. Specify a Late Data Handling cut-off.

    Feed Example

    <?xml version="1.0"?>
    <feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
      <tags>externalSource=$MyEDW,externalTarget=Marketing</tags> <!-- Metadata tagging -->
      <groups>churnAnalysisFeeds</groups> <!-- Feed group; feeds can belong to multiple groups -->
      <frequency>hours(1)</frequency> <!-- Feed run frequency -->
      <late-arrival cut-off="hours(6)"/> <!-- Late arrival cut-off -->
      <clusters> <!-- Target clusters for retention and replication. -->
        <cluster name="$MyDataCenter" type="source">
          <validity start="$date" end="$date"/>
          <retention limit="days($n)" action="delete"/> <!-- Currently delete is the only action available -->
        </cluster>
        <cluster name="$MyDataCenter-secondary" type="target">
          <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
          <retention limit="days(7)" action="delete"/>
          <locations>
            <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
          </locations>
        </cluster>
      </clusters>
      <locations> <!-- Global location across clusters - HDFS paths or Hive tables -->
        <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
      <ACL owner="hdfs" group="users" permission="0755"/> <!-- Required for HDFS. -->
      <schema location="/none" provider="none"/> <!-- Required for HDFS. -->
    </feed>

  3. Create the process specification XML file:

    • <cluster name>: Reference the cluster entity to define where the process runs.

    • <feed>: Reference the feed entity to define the datasets that the process uses.

    • Optional. Specify Late Data Handling policies or a Retry Policy.

    Process Example

    <?xml version="1.0"?>
    <process name="process-test" xmlns="uri:falcon:process:0.1">
        <clusters>
          <cluster name="$MyDataCenter">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
          </cluster>
        </clusters>
        <order>FIFO</order> <!-- You can also use LIFO and LASTONLY, but FIFO is recommended in most cases -->
        <inputs>
            <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
        </inputs>
        <outputs>
            <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
        </outputs>
        <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
        <retry policy="periodic" delay="minutes(10)" attempts="3"/>
        <late-process policy="exp-backoff" delay="hours(1)">
          <late-input input="input" workflow-path="/apps/clickstream/late" />
        </late-process>
    </process>

In addition to FIFO, <order> also accepts LIFO and LASTONLY.
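
The frequency, cut-off, retention, and delay values in the examples all use a unit(n) notation such as hours(1) or days(7), and the feed locations use ${YEAR}, ${MONTH}, ${DAY}, and ${HOUR} placeholders. The following Python sketch illustrates how those two notations combine into one path per feed instance. It is not part of Falcon: the function names are hypothetical, only minutes, hours, and days are handled, and zero-padded two-digit month, day, and hour values are an assumption.

```python
import re
from datetime import datetime, timedelta


def parse_duration(expr: str) -> timedelta:
    """Turn an expression such as 'hours(6)' into a timedelta.

    Calendar units such as months(n) are rejected because they do not
    map to a fixed length of time.
    """
    match = re.fullmatch(r"\s*(minutes|hours|days)\((\d+)\)\s*", expr)
    if match is None:
        raise ValueError(f"unsupported duration expression: {expr!r}")
    unit, count = match.group(1), int(match.group(2))
    return timedelta(**{unit: count})


def expand_path(template: str, instance: datetime) -> str:
    """Substitute ${YEAR}, ${MONTH}, ${DAY}, and ${HOUR} for one instance time."""
    values = {
        "YEAR": f"{instance.year:04d}",
        "MONTH": f"{instance.month:02d}",  # zero padding is an assumption
        "DAY": f"{instance.day:02d}",
        "HOUR": f"{instance.hour:02d}",
    }
    # Unknown placeholders are left untouched rather than guessed.
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: values.get(m.group(1), m.group(0)),
        template,
    )


def instance_paths(template: str, start: datetime, frequency: str, count: int):
    """List the paths of the first `count` instances, one per frequency step."""
    step = parse_duration(frequency)
    return [expand_path(template, start + i * step) for i in range(count)]
```

For example, instance_paths("/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}", datetime(2012, 1, 1, 23), "hours(1)", 2) yields one path for 23:00 on January 1 and one for 00:00 on January 2.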

You can now move on to Deploying Data Pipelines.
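
Before deploying, it can help to confirm that the three entity files are well formed and that their cross-references resolve: each feed should name a defined cluster, and the process should name defined clusters and feeds. The following Python sketch shows one way to check this with the standard library. It is an illustration only and not a Falcon tool; the function names are hypothetical, and the check covers only the name and feed attributes, not the full entity schemas.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def names(root, tag: str, attr: str) -> set:
    """Collect one attribute from every element with the given local tag name."""
    return {
        el.get(attr)
        for el in root.iter()
        if local(el.tag) == tag and el.get(attr)
    }


def check_references(cluster_xmls, feed_xmls, process_xml):
    """Return a list of problems; an empty list means every reference resolved.

    ET.fromstring raises xml.etree.ElementTree.ParseError on XML that is not
    well formed, so an unclosed tag or a stray curly quote fails here first.
    """
    clusters = {ET.fromstring(text).get("name") for text in cluster_xmls}
    feeds = set()
    problems = []

    for text in feed_xmls:
        feed = ET.fromstring(text)
        feeds.add(feed.get("name"))
        for missing in sorted(names(feed, "cluster", "name") - clusters):
            problems.append(f"feed {feed.get('name')}: unknown cluster {missing}")

    process = ET.fromstring(process_xml)
    for missing in sorted(names(process, "cluster", "name") - clusters):
        problems.append(f"process {process.get('name')}: unknown cluster {missing}")

    used_feeds = names(process, "input", "feed") | names(process, "output", "feed")
    for missing in sorted(used_feeds - feeds):
        problems.append(f"process {process.get('name')}: unknown feed {missing}")

    return problems
```

Pass the file contents as strings without an encoding declaration, or read the files as bytes, because ElementTree rejects a Unicode string that carries an encoding declaration.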