Using the Falcon CLI to Define Data Pipelines
To use the Falcon CLI to define a data pipeline:
Create the cluster specification XML file, also known as a cluster entity. There are several attributes and interfaces to define in a cluster entity. For example, here is a cluster entity with all cluster interfaces defined:
Colo: Name of the Data Center
Name: Name of the cluster entity; feed and process entities reference the cluster by this name
<interface>: Specify the interface type
Important: Permissions on the cluster staging directory must be set to 777 (read/write/execute for owner/group/others). Only Oozie job definitions are written to the staging directory, so setting permissions to 777 does not create any vulnerability.
<?xml version="1.0"?>
<!-- Cluster Example -->
<cluster colo="$MyDataCenter" description="description" name="$MyDataCenter">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.4.0" />        <!-- Required for distcp for replications. -->
    <interface type="write" endpoint="hdfs://nn:8020" version="2.4.0" />            <!-- Needed for writing to HDFS -->
    <interface type="execute" endpoint="rm:8050" version="2.4.0" />                 <!-- Needed to submit jobs as MapReduce -->
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" /> <!-- Required. Submits Oozie jobs. -->
    <interface type="registry" endpoint="thrift://hms:9083" version="0.13.0" />     <!-- Register/deregister partitions in the Hive Metastore and get events on partition availability -->
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" /> <!-- Needed for alerts -->
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" />  <!-- HDFS directories used by the Falcon server -->
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>
Note: Additional properties must be set if you are configuring for a secure cluster. For more information, see "Configuring for Secure Clusters" in the Installing HDP Manually guide.
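After the cluster entity file is saved, it can be submitted to the Falcon server with the Falcon CLI. The following is a minimal sketch; the file name is illustrative and option names can vary slightly between Falcon versions. The HDFS directories referenced in <locations> are assumed to already exist with the permissions noted above.

# Submit the cluster entity (file name is an assumption)
falcon entity -type cluster -submit -file cluster-MyDataCenter.xml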
Next, create a dataset specification XML file, or feed entity:
Reference the cluster entity to determine which clusters the feed uses.
<frequency>: Specify the frequency of the feed.
<retention limit>: Choose a retention policy that determines how long the data remains on the cluster.
<location>: Provide the HDFS path to the files.
<ACL owner>: Specify the HDFS access permissions.
Optional. Specify a Late Data Handling cut-off.
<?xml version="1.0"?>
<!-- Feed Example -->
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>                               <!-- Feed run frequency -->
  <late-arrival cut-off="hours(6)"/>                            <!-- Late arrival cut-off -->
  <groups>churnAnalysisFeeds</groups>                           <!-- Feed group; feeds can belong to multiple groups -->
  <tags>externalSource=$MyEDW,externalTarget=Marketing</tags>   <!-- Metadata tagging -->
  <clusters>  <!-- Target clusters for retention and replication. -->
    <cluster name="$MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <retention limit="days($n)" action="delete"/>  <!-- Currently delete is the only action available -->
    </cluster>
    <cluster name="$MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>  <!-- Global location across clusters - HDFS paths or Hive tables -->
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>  <!-- Required for HDFS. -->
  <schema location="/none" provider="none"/>           <!-- Required for HDFS. -->
</feed>
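To illustrate how the path template resolves: with the global location above, the feed instance for 2014-02-15 10:00 UTC would be read from /weblogs/2014-02-15-10. The feed entity can then be submitted with the Falcon CLI, as sketched below; the file name is illustrative and options can vary by Falcon version.

# Submit the feed entity (file name is an assumption)
falcon entity -type feed -submit -file feed-testFeed.xml

# Review the stored definition to verify the clusters, retention, and locations
falcon entity -type feed -name testFeed -definition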
Create the process specification XML file:
<cluster name>: Reference the cluster entity to define where the process runs.
<feed>: Reference the feed entity to define the datasets that the process uses.
Optional. Specify Late Data Handling policies or a Retry Policy.
<?xml version="1.0"?>
<!-- Process Example -->
<process name="process-test" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="$MyDataCenter">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>  <!-- You can also use LIFO and LASTONLY, but FIFO is recommended in most cases -->
  <frequency>days(1)</frequency>
  <inputs>
    <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
  </outputs>
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late" />
  </late-process>
</process>
Note: LIFO and LASTONLY are also supported values for <order>.
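For reference, the process entity can be submitted and its dependencies checked with the Falcon CLI, as sketched below; the file name is illustrative and options can vary by Falcon version. The entities are typically scheduled after all three have been submitted.

# Submit the process entity (file name is an assumption)
falcon entity -type process -submit -file process-test.xml

# List the cluster and feed entities this process depends on
falcon entity -type process -name process-test -dependency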
You can now move on to Deploying Data Pipelines.