Create the Feed Entity
The feed entity defines the data set that Falcon replicates. Reference your cluster entities to determine which clusters the feed uses.
Create an XML file for the Feed entity.
<?xml version="1.0"?>
Describe the feed.
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
</feed>
Specify the frequency of the feed.
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
</feed>
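As a small sketch of other frequency values: Falcon frequency expressions take the form unit(n), with minutes, hours, days, and months as the units. The 15-minute value below is only an illustration, not a recommendation.

```xml
<!-- Sketch only: run the feed every 15 minutes instead of hourly -->
<!-- Other forms: hours(1), days(1), months(1) -->
<frequency>minutes(15)</frequency>
```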
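Choose a retention policy that determines how long the data remains on the cluster.
Retention is set per cluster with the <retention> element inside each <cluster> definition, so the feed itself is unchanged at this step:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
</feed>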
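A minimal sketch of a retention policy, assuming a cluster named MyDataCenter, placeholder validity dates, and a 7-day limit (all illustrative): the <retention> element goes inside a <cluster> definition, which is added to the feed in the cluster step that follows.

```xml
<!-- Sketch only: cluster name, validity dates, and the 7-day limit are placeholder values -->
<cluster name="MyDataCenter" type="source">
  <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
  <!-- Delete feed instances older than 7 days -->
  <retention limit="days(7)" action="delete"/>
</cluster>
```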
(Optional) Set a late-arrival cut-off policy. The supported policies for late data handling are backoff, exp-backoff (default), and final.
For example, to set a late-arrival cut-off of 6 hours:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
</feed>
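The feed sets only the cut-off. The policy names listed above (backoff, exp-backoff, final) are chosen on the process entity that consumes the feed, in its late-process element. A hedged sketch, assuming a process input named input and a hypothetical workflow path:

```xml
<!-- Sketch only: this fragment belongs in a process entity, not in the feed -->
<!-- The input name and workflow path are hypothetical placeholders -->
<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="input" workflow-path="/apps/late/workflow"/>
</late-process>
```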
Define your source and target clusters for the feed.
For example, for two clusters, MyDataCenter and MyDataCenter-secondary:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
  <!-- Target clusters for retention and replication. -->
  <clusters>
    <cluster name="$MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <!-- Currently delete is the only action available -->
      <retention limit="days($n)" action="delete"/>
    </cluster>
    <cluster name="$MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
</feed>
Specify the HDFS path locations or Hive table locations of the data. For example, to specify the HDFS weblogs location:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
  <!-- Target clusters for retention and replication. -->
  <clusters>
    <cluster name="$MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <!-- Currently delete is the only action available -->
      <retention limit="days($n)" action="delete"/>
    </cluster>
    <cluster name="$MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <locations>
    <!-- Global location across clusters - HDFS paths or Hive tables -->
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
</feed>
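The example above covers only the HDFS case. For a Hive table, a feed uses a table element in place of the locations element. A minimal sketch, assuming a hypothetical database default, table weblogs, and partition key ds:

```xml
<!-- Sketch only: replaces the <locations> element when the feed is backed by a Hive table -->
<!-- Database (default), table (weblogs), and partition key (ds) are hypothetical -->
<table uri="catalog:default:weblogs#ds=${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
```

The URI follows the pattern catalog:database:table#partition-key=value, so the partition expression plays the same role as the dated path in the HDFS example.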
Specify HDFS ACLs. Set the owner, group, and level of permissions for HDFS. For example:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
  <!-- Target clusters for retention and replication. -->
  <clusters>
    <cluster name="$MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <!-- Currently delete is the only action available -->
      <retention limit="days($n)" action="delete"/>
    </cluster>
    <cluster name="$MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Global location across clusters - HDFS paths or Hive tables -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <!-- Required for HDFS. -->
  <ACL owner="hdfs" group="users" permission="0755"/>
</feed>
Specify the location of the schema file for the feed and the schema provider, such as protobuf or thrift. For example:
<?xml version="1.0"?>
<feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feed run frequency -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cut-off -->
  <late-arrival cut-off="hours(6)"/>
  <!-- Target clusters for retention and replication. -->
  <clusters>
    <cluster name="$MyDataCenter" type="source">
      <validity start="$date" end="$date"/>
      <!-- Currently delete is the only action available -->
      <retention limit="days($n)" action="delete"/>
    </cluster>
    <cluster name="$MyDataCenter-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Global location across clusters - HDFS paths or Hive tables -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <!-- Required for HDFS. -->
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/schema" provider="protobuf"/>
</feed>