Data Governance Guide

Define the Data Source: Set Up a Source Cluster Entity

Define where data and processes are stored in the cluster entity.

  1. Create an XML file for the cluster entity. This file contains all properties for the cluster. Begin the file with the XML version declaration:

    <?xml version="1.0"?>
  2. Define the colo and name attributes for the cluster.

    <?xml version="1.0"?>
    <cluster colo="<MyDataCenter>" description="description" 
                name="<MyDataCenterFilename>"> 
    </cluster>
    Note

    colo specifies the data center to which this cluster belongs.

    name is the name of the cluster, which must be unique.

  3. Define the interfaces for the cluster. For each interface, specify the interface type, the endpoint, and the version of the Apache component that provides it. An optional check that each endpoint is reachable is shown after the example.

    For example:

    <cluster colo="<MyDataCenter>" description="description" 
              name="<MyDataCenterFilename>">
         <interfaces>
              
              <!-- Required for distcp for replications. -->
              <interface type="readonly" endpoint="hftp://nn:50070" version="2.4.2" />
        
              <!-- Needed for writing to HDFS-->
              <interface type="write" endpoint="hdfs://nn:8020" version="2.4.2" />
        
              <!-- Required. An execute interface specifies the interface for job tracker.-->
              <interface type="execute" endpoint="rm:8050" version="2.4.2" />
        
              <!-- Required. A workflow interface specifies the interface for workflow engines, such as Oozie.-->
              <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
        
              <!--A registry interface specifies the interface for the metadata catalog, such as Hive Metastore or HCatalog.-->
              <interface type="registry" endpoint="thrift://hms:9083" version="0.13.0" /> 
        
              <!--Messaging interface specifies the interface for sending alerts.-->
              <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" /> 
         </interfaces>
    </cluster>
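
    Optionally, you can confirm that the services behind these endpoints are reachable from the Falcon host before moving on. A minimal sketch, assuming the example host names (nn, os) resolve in your environment and the HDFS and Oozie command-line clients are installed:

        # List the HDFS root through the write endpoint
        hdfs dfs -ls hdfs://nn:8020/

        # Ask the Oozie server for its status through the workflow endpoint
        oozie admin -oozie http://os:11000/oozie -status
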
  4. Provide the locations of the HDFS directories used by the Falcon server. These directories must exist in HDFS before the cluster entity is submitted; see the commands for creating them after the example.

    For example:

    <cluster colo="<MyDataCenter>" description="description" 
              name="<MyDataCenter>">
         <interfaces>
         
              <!-- Required for distcp for replications. -->   
              <interface type="readonly" endpoint="hftp://nn:50070" version="2.4.2" /> 
              
              <!-- Needed for writing to HDFS-->
              <interface type="write" endpoint="hdfs://nn:8020" version="2.4.2" />
              
              <!-- Needed to submit jobs, such as MapReduce-->
              <interface type="execute" endpoint="rm:8050" version="2.4.2" /> 
              
              <!-- Required. Submits Oozie jobs.-->
              <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" /> 
              
              <!--Register/deregister partitions in the Hive Metastore and get events on partition availability-->
              <interface type="registry" endpoint="thrift://hms:9083" version="0.13.0" />
              
              <!--Needed for alerts-->
              <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" /> 
         </interfaces>
         
         <locations>
         
              <!--HDFS directories used by the Falcon server-->
              <location name="staging" path="/apps/falcon/prod-cluster/staging" /> 
              <location name="temp" path="/tmp" />
              <location name="working" path="/apps/falcon/prod-cluster/working" />
         </locations>
    </cluster>
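
    The location paths must point to directories that already exist in HDFS and are writable by the Falcon server. A minimal sketch for creating them, assuming the example paths above; the falcon owner and the exact permissions (a world-writable staging directory, a more restrictive working directory) are typical defaults and may differ in your environment:

        # Create the staging and working directories used by the Falcon server
        hdfs dfs -mkdir -p /apps/falcon/prod-cluster/staging /apps/falcon/prod-cluster/working

        # Assumption: the Falcon service runs as the "falcon" user
        hdfs dfs -chown -R falcon /apps/falcon/prod-cluster

        # Typical permissions: world-writable staging, restricted working directory
        hdfs dfs -chmod 777 /apps/falcon/prod-cluster/staging
        hdfs dfs -chmod 755 /apps/falcon/prod-cluster/working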

    The cluster entity is complete if you are using a non-secure environment. If your environment is secured with Kerberos, continue with the next step.

  5. For secure clusters, define the following properties in all your cluster entities as shown below:

    <cluster colo="<MyDataCenter>" description="description" 
              name="<MyDataCenter>">
              
         <interfaces>
         
              <!-- Required for distcp for replications. -->     
              <interface type="readonly" endpoint="hftp://nn:50070" version="2.4.2" /> 
              
              <!-- Needed for writing to HDFS-->
              <interface type="write" endpoint="hdfs://nn:8020" version="2.4.2" />
              
              <!-- Needed to submit jobs, such as MapReduce-->
              <interface type="execute" endpoint="rm:8050" version="2.4.2" />
              
              <!-- Required. Submits Oozie jobs.-->
              <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" /> 
              
              <!--Register/deregister partitions in the Hive Metastore and get events on partition availability-->
              <interface type="registry" endpoint="thrift://hms:9083" version="0.13.0" />
              
              <!--Needed for alerts-->
              <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" /> 
         </interfaces>
         
         <locations>
         
              <!--HDFS directories used by the Falcon server-->
              <location name="staging" path="/apps/falcon/prod-cluster/staging" /> 
              <location name="temp" path="/tmp" />
              <location name="working" path="/apps/falcon/prod-cluster/working" />
         </locations> 
         
         <properties>
              <property name="dfs.namenode.kerberos.principal" value="nn/$my.internal@EXAMPLE.COM"/>
              <property name="hive.metastore.kerberos.principal" value="hive/$my.internal@EXAMPLE.COM"/>
              <property name="hive.metastore.uris" value="thrift://$my.internal:9083"/>
              <property name="hive.metastore.sasl.enabled" value="true"/>
         </properties>
    </cluster>

    Replace $my.internal@EXAMPLE.COM and $my.internal with your own values. If you are not sure which principals are configured, you can look them up as shown below.
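
    One way to confirm the principal values is to read them from the cluster's own configuration. A sketch, assuming the HDFS client is configured on the host and that the Hive configuration is in the default HDP location /etc/hive/conf:

        # Print the NameNode principal from the active Hadoop configuration
        # (the value may contain a _HOST placeholder that stands for the NameNode host name)
        hdfs getconf -confKey dfs.namenode.kerberos.principal

        # Show the metastore principal configured in hive-site.xml
        grep -A 1 "hive.metastore.kerberos.principal" /etc/hive/conf/hive-site.xml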

    Important

    Make sure the hadoop.security.auth_to_local property in core-site.xml is consistent across all clusters. Inconsistent auth_to_local rules can cause failures when delegation tokens are renewed.
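
After the cluster entity XML is complete, submit it to the Falcon server. A minimal sketch, assuming the definition is saved as cluster.xml and the Falcon CLI is available on the host; the keytab path and falcon service principal shown for the secure case are typical HDP defaults and may differ in your environment:

    # On a Kerberos-secured cluster, authenticate before calling the Falcon CLI
    kinit -kt /etc/security/keytabs/falcon.service.keytab falcon/$my.internal@EXAMPLE.COM

    # Submit the cluster entity definition
    falcon entity -type cluster -submit -file cluster.xml

    # Confirm that the cluster entity is registered
    falcon entity -type cluster -list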