Data Movement and Integration

Chapter 8. Mirroring Data with Falcon

You can mirror data between on-premises clusters, or between an on-premises HDFS cluster and a cluster in the cloud on Microsoft Azure or Amazon S3.

Prepare to Mirror Data

Mirroring data produces an exact copy of the data and keeps both copies synchronized. You can use Falcon to mirror HDFS directories, Hive tables, and snapshots.

Before creating a mirror, complete the following actions:

  1. Set permissions to allow read, write, and execute access to the source and target directories.

    You must be logged in as the owner of the directories.

    Example: If the source directory is /user/ambari-qa/falcon, type the following:

    [bash ~]$ su - root
    [root@bash ~]$ su - ambari-qa
    [ambari-qa@bash ~]$ hadoop fs -chmod 755 /user/ambari-qa/falcon/
  2. Create the source and target cluster entity definitions, if they do not exist.

    See "Creating a Cluster Entity Definition" in Creating Falcon Entity Definitions for more information.

  3. For snapshot mirroring, you must also enable the snapshot capability on the source and target directories.

    You must be logged in as the HDFS service user, and the source and target directories must be owned by the user submitting the job.

    For example:

    [ambari-qa@bash ~]$ su - hdfs
    ## Run the following command on the target cluster
    [hdfs@bash ~]$ hdfs dfsadmin -allowSnapshot /apps/falcon/snapshots/target
    ## Run the following command on the source cluster
    [hdfs@bash ~]$ hdfs dfsadmin -allowSnapshot /apps/falcon/snapshots/source
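
    To confirm that snapshots are now allowed on these directories, you can list the snapshottable directories visible to the current user. This verification step is an addition to the documented procedure; it uses the standard HDFS lsSnapshottableDir command:

    ## Run on each cluster to list directories on which snapshots are allowed
    [hdfs@bash ~]$ hdfs lsSnapshottableDir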

Mirror File System Data Using the Web UI

You can use the Falcon web UI to quickly define and start a mirror job on HDFS.

Prerequisites

Your environment must meet the HDP versioning requirements described in "Replication Between HDP Versions" in Creating Falcon Entity Definitions.

Steps

  1. Ensure that you have set permissions correctly and defined required entities as described in Prepare to Mirror Data.

  2. At the top of the Falcon web UI page, click Create > Mirror > File System.

  3. On the New HDFS Mirror page, specify the values for the following properties:

    Table 8.1. General HDFS Mirror Properties

    Mirror Name

    Name of the mirror job. The naming criteria are as follows:

    • Must be unique

    • Must start with a letter

    • Is case sensitive

    • Can contain a maximum of 40 characters

    • Can include numbers

    • Can use a dash (-) but no other special characters

    • Cannot contain spaces

    Tags

    Enter the key/value pair for metadata tagging to assist in entity search in Falcon. The criteria are as follows:

    • Can contain 1 to 100 characters

    • Can include numbers

    • Can use a dash (-) but no other special characters

    • Cannot contain spaces


    Table 8.2. Source and Target Mirror Properties

    Source Location

    Specify whether the source data is local on HDFS or in the cloud on Microsoft Azure or Amazon S3. If your target is Azure or S3, you can only use HDFS for the source.

    Source Cluster

    Select an existing cluster entity.

    Source Path

    Enter the path to the source data.

    Target Location

    Specify whether the mirror target is local on HDFS or in the cloud on Microsoft Azure or Amazon S3. If your target is Azure or S3, you can only use HDFS for the source.

    Target Cluster

    Select an existing cluster entity to serve as the target for the mirrored data.

    Target Path

    Enter the path to the directory that will contain the mirrored data.

    Run job here

    Choose whether to execute the job on the source or on the target cluster.

    Validity Start and End

    Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as run duration.

    Frequency

    How often the process is generated. Valid frequency types are minutes, hours, days, and months.

    Timezone

    The timezone associated with the validity start and end times. Default timezone is UTC.

    Send alerts to

    A comma-separated list of email addresses to which alerts are sent, in the format name@company.com.

    Table 8.3. Advanced HDFS Mirror Properties

    Max Maps for DistCp

    The maximum number of maps used during replication. This setting impacts performance and throttling. Default is 5.

    Max Bandwidth (MB)

    The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB/s.

    Retry Policy

    Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final. (For the equivalent server-side extension properties, see the sketch after these steps.)

    Delay

    The time period between retry attempts. For example, an Attempts value of 3 and a Delay of 10 minutes would cause workflow retries to occur 10, 20, and 30 minutes after the start time of the workflow. Default is 30 minutes.

    Attempts

    How many times the retry policy should be applied before the job fails. Default is 3.

    Access Control List

    Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

  4. Click Next to view a summary of your entity definition.

  5. (Optional) Click Preview XML to review or edit the entity definition in XML.

  6. After verifying the entity definition, click Save.

    The entity is automatically submitted for verification, but it is not scheduled to run.

  7. Verify that you successfully created the entity.

    1. Type the entity name in the Falcon web UI Search field and press Enter.

    2. If the entity name appears in the search results, it was successfully created.

      For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.

  8. Schedule the entity.

    1. In the search results, click the checkbox next to an entity name with status of Submitted.

    2. Click Schedule.

      After a few seconds a success message displays.
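
The Retry Policy, Delay, and Attempts settings in Table 8.3 have counterparts in the server-side extension properties covered in "Mirror File System Data Using APIs". As a hedged sketch, assuming the hdfs-mirroring extension accepts the same retry properties that appear in the hive-mirroring example later in this chapter, a periodic retry every 10 minutes with 3 attempts would be expressed in the job properties file as:

## Retry settings, reusing property names from the hive-mirroring example
jobRetryPolicy=periodic
jobRetryDelay=minutes(10)
jobRetryAttempts=3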

Mirror Hive Data Using the Web UI

You can quickly mirror Apache Hive databases or tables between source and target clusters with HiveServer2 endpoints. You can also enable TDE encryption for your mirror.

  1. Ensure that you have set permissions correctly and defined required entities as described in Prepare to Mirror Data.

  2. At the top of the Falcon web UI page, click Create > Mirror > Hive.

  3. On the New Hive Mirror page, specify the values for the following properties:

    Table 8.4. General Hive Mirror Properties

    Mirror Name

    Name of the mirror job.

    The naming criteria are as follows:

    • Must be unique

    • Must start with a letter

    • Is case sensitive

    • Can contain 2 to 40 characters

    • Can include numbers

    • Can use a dash (-) but no other special characters

    • Cannot contain spaces

    Tags

    Enter the key/value pair for metadata tagging to assist in entity search in Falcon.

    The criteria are as follows:

    • Can contain 1 to 100 characters

    • Can include numbers

    • Can use a dash (-) but no other special characters

    • Cannot contain spaces


    Table 8.5. Source and Target Hive Mirror Properties

    Cluster, Source & Target

    Select existing cluster entities, one to serve as the source for the mirrored data and one to serve as the target. Cluster entities must be available in Falcon before a mirror job can be created.

    HiveServer2 Endpoint, Source & Target

    Enter the location of the data to be mirrored on the source and the location of the mirrored data on the target. The format is hive2://localhost:10000.

    Hive2 Kerberos Principal, Source & Target

    This field is automatically populated with the value of the service principal for the metastore Thrift server. The value is displayed in Ambari at Hive > Config > Advanced > Advanced hive-site > hive.metastore.kerberos.principal and must be unique.

    Meta Store URI, Source & Target

    Used by the metastore client to connect to the remote metastore. The value is displayed in Ambari at Hive > Config > Advanced > General > hive.metastore.uris.

    Kerberos Principal, Source & Target

    This field is automatically populated. Property=dfs.namenode.kerberos.principal, Value=nn/_HOST@EXAMPLE.COM; the value must be unique.

    Run job here

    Choose whether to execute the job on the source cluster or on the target cluster.

    I want to copy

    Select whether to copy one or more Hive databases, or one or more tables from a single database. You must identify the specific databases and tables to be copied.

    Validity Start and End

    Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as run duration.

    Frequency

    Determines how often the process is generated. Valid frequency types are minutes, hours, days, and months.

    Timezone

    The timezone associated with the validity start and end times. Default timezone is UTC.

    Send alerts to

    A comma-separated list of email addresses to which alerts are sent, in the format name@xyz.com.

    Table 8.6. Advanced Hive Mirror Properties

    TDE Encryption

    Enables encryption of data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Falcon Features for more information.

    Max Maps for DistCp

    The maximum number of maps used during replication. This setting impacts performance and throttling. Default is 5.

    Max Bandwidth (MB)

    The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB/s.

    Retry Policy

    Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final.

    Delay

    The time period between retry attempts. For example, an Attempts value of 3 and a Delay of 10 minutes would cause workflow retries to occur 10, 20, and 30 minutes after the start time of the workflow. Default is 30 minutes.

    Attempts

    How many times the retry policy should be applied before the job fails. Default is 3.

    Access Control List

    Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

  4. Click Next to view a summary of your entity definition.

  5. (Optional) Click Preview XML to review or edit the entity definition in XML.

  6. After verifying the entity definition, click Save.

    The entity is automatically submitted for verification, but it is not scheduled to run.

  7. Verify that you successfully created the entity.

    1. Type the entity name in the Falcon web UI Search field and press Enter.

    2. If the entity name appears in the search results, it was successfully created.

      For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.

  8. Schedule the entity to run the mirror job.

    1. In the search results, click the checkbox next to an entity name with status of Submitted.

    2. Click Schedule.

      After a few seconds a success message displays.
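
You can also confirm the scheduled entity from the command line. This check is an addition to the documented procedure; it assumes the Falcon CLI is available and that the mirror job was saved as a process entity named hive-sales-mirror (a hypothetical name; substitute your own):

falcon entity -type process -name hive-sales-mirror -status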

Mirror Data Using Snapshots

Snapshot-based mirroring is an efficient backup method because only changed content is actually transferred during the mirror job. You can mirror snapshots from a single source directory to a single target directory, which serves as the destination for the backup job.
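
The savings come from HDFS snapshot diffs: the mirror job needs to copy only the paths that changed between two snapshots. As an illustration only (not part of the mirror setup), you can inspect such a diff with the standard hdfs snapshotDiff command; the snapshot names here are hypothetical:

## Show paths created, deleted, or modified between two snapshots
[hdfs@bash ~]$ hdfs snapshotDiff /apps/falcon/snapshots/source snap1 snap2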

Prerequisites

  • Source and target clusters must run Hadoop 2.7.0 or higher.

    Falcon does not validate versions.

  • Source and target clusters should both be secure or both be unsecured.

    This is a recommendation, not a requirement.

  • Source and target clusters must have snapshot capability enabled (the default is "enabled").

  • The user submitting the mirror job must have access permissions on both the source and target clusters.

To mirror snapshot data with the Falcon web UI:

  1. Ensure that you have set permissions correctly, enabled the snapshot capability, and defined required entities as described in Prepare to Mirror Data.

  2. At the top of the Falcon web UI page, click Create > Mirror > Snapshot.

  3. On the New Snapshot Based Mirror page, specify the values for the following properties:

    Table 8.7. Source and Target Snapshot Mirror Properties

    Source, Cluster

    Select an existing source cluster entity. At least one cluster entity must be available in Falcon.

    Target, Cluster

    Select an existing target cluster entity. At least one cluster entity must be available in Falcon.

    Source, Directory

    Enter the path to the source data.

    Source, Delete Snapshot After

    Specify the time period after which the mirrored snapshots are deleted from the source cluster. Snapshots are retained past this date if the number of snapshots is less than the Keep Last setting.

    Source, Keep Last

    Specify the number of snapshots to retain on the source cluster, even if the delete time has been reached. Upon reaching the specified number, the oldest snapshot is deleted when the next job runs.

    Target, Directory

    Enter the path to the location on the target cluster in which the snapshots are stored.

    Target, Delete Snapshot After

    Specify the time period after which the mirrored snapshots are deleted from the target cluster. Snapshots are retained past this date if the number of snapshots is less than the Keep Last setting.

    Target, Keep Last

    Specify the number of snapshots to retain on the target cluster, even if the delete time has been reached. Upon reaching the specified number, the oldest snapshot is deleted when the next job runs.

    Run job here

    Choose whether to execute the job on the source or on the target cluster.

    Run Duration Start and End

    Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as validity time.

    Frequency

    How often the process is generated. Valid frequency types are minutes, hours, days, and months.

    Timezone

    Default timezone is UTC.

    Table 8.8. Advanced Snapshot Mirror Properties

    TDE Encryption

    Enable to encrypt data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Falcon Features for more information.

    Retry Policy

    Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final.

    Delay

    The time period between retry attempts. For example, an Attempts value of 3 and a Delay of 10 minutes would cause workflow retries to occur 10, 20, and 30 minutes after the start time of the workflow. Default is 30 minutes.

    Attempts

    How many times the retry policy should be applied before the job fails. Default is 3.

    Max Maps

    The maximum number of maps used during DistCp replication. This setting impacts performance and throttling. Default is 5.

    Max Bandwidth (MB)

    The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB/s.

    Send alerts to

    A comma-separated list of email addresses to which alerts are sent, in the format name@xyz.com.

    Access Control List

    Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

  4. Click Next to view a summary of your entity definition.

  5. (Optional) Click Preview XML to review or edit the entity definition in XML.

  6. After verifying the entity definition, click Save.

    The entity is automatically submitted for verification, but it is not scheduled to run.

  7. Verify that you successfully created the entity.

    1. Type the entity name in the Falcon web UI Search field and press Enter.

    2. If the entity name appears in the search results, it was successfully created.

      For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.

  8. Schedule the entity.

    1. In the search results, click the checkbox next to an entity name with status of Submitted.

    2. Click Schedule.

      After a few seconds a success message displays.
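
Snapshot mirroring can also be driven by the server-side extensions described in the next section. The following is a hedged sketch only: it assumes the Apache Falcon hdfs-snapshot-mirroring extension, and the directory and retention property names are taken on a best-effort basis from the Apache Falcon extension documentation, so verify them (for example, with falcon extension -definition -extensionName hdfs-snapshot-mirroring) before use. The cluster names and directories reuse examples from this chapter.

falcon extension -submitAndSchedule -extensionName hdfs-snapshot-mirroring -file snapshot-sales.properties

Content of the hypothetical snapshot-sales.properties file:

jobName=snapshot-sales
jobClusterName=primaryCluster
jobValidityStart=2016-06-30T00:00Z
jobValidityEnd=2099-12-31T11:59Z
jobFrequency=days(1)
jobTimezone=UTC
sourceCluster=primaryCluster
sourceSnapshotDir=/apps/falcon/snapshots/source
sourceSnapshotRetentionAgeLimit=days(7)
sourceSnapshotRetentionNumber=3
targetCluster=backupCluster
targetSnapshotDir=/apps/falcon/snapshots/target
targetSnapshotRetentionAgeLimit=days(7)
targetSnapshotRetentionNumber=3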

Mirror File System Data Using APIs

In HDP 2.6, Falcon client-side recipes were deprecated and replaced with more extensible server-side extensions. Existing Falcon workflows that use client-side recipes are still supported, but any new mirror job must use the server-side extensions.

See the Hortonworks Community Connection (HCC) article "DistCp options supported in Falcon mirroring jobs in HDP 2.5" for instructions about using server-side extensions for HDFS and Hive mirroring.

See the Apache Falcon website for additional information about using APIs to mirror data.
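
To discover which server-side extensions your Falcon instance provides, and which properties each accepts, the Apache Falcon extension CLI includes enumerate and definition commands; confirm their availability in your Falcon version:

## List the extensions registered with the Falcon server
falcon extension -enumerate
## Show the properties accepted by a specific extension
falcon extension -definition -extensionName hdfs-mirroring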

Supported DistCp Options for HDFS Mirroring

  • distcpMaxMaps

  • distcpMapBandwidth

  • overwrite

  • ignoreErrors

  • skipChecksum

  • removeDeletedFiles

  • preserveBlockSize

  • preserveReplicationNumber

  • preservePermission

  • preserveUser

  • preserveGroup

  • preserveChecksumType

  • preserveAcl

  • preserveXattr

  • preserveTimes

Following is an example of using extensions to schedule an HDFS mirror job.

falcon extension -submitAndSchedule -extensionName hdfs-mirroring -file sales-monthly.properties

Content of the sales-monthly.properties file:

jobName=sales-monthly
jobValidityStart=2016-06-30T00:00Z
jobValidityEnd=2099-12-31T11:59Z
jobFrequency=minutes(45)
jobTimezone=UTC
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
sourceDir=/user/ambari-qa/sales-monthly/input
targetDir=/user/ambari-qa/sales-monthly/output
removeDeletedFiles=true
skipChecksum=false
preservePermission=true
preserveUser=true
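
After submission, the job can be managed by its jobName. The following commands come from the Apache Falcon extension CLI; confirm the exact options against your Falcon version:

## List jobs created from the hdfs-mirroring extension
falcon extension -list -extensionName hdfs-mirroring
## Suspend and resume the job by name
falcon extension -suspend -jobName sales-monthly
falcon extension -resume -jobName sales-monthly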

Supported DistCp Options for Hive Mirroring

  • distcpMaxMaps

  • distcpMapBandwidth

Following is an example of using extensions to schedule an Apache Hive mirror job.

falcon extension -submitAndSchedule -extensionName hive-mirroring -file hive-sales-monthly.properties


Content of the hive-sales-monthly.properties file:

jobName=hive-sales-monthly
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
jobValidityStart=2016-07-19T00:02Z
jobValidityEnd=2018-05-25T11:02Z
jobFrequency=minutes(30)
jobRetryPolicy=periodic
jobRetryDelay=minutes(30)
jobRetryAttempts=3
distcpMaxMaps=1
distcpMapBandwidth=100
maxEvents=-1
replicationMaxMaps=5
sourceDatabases=default
sourceTables=*
sourceHiveServer2Uri=hive2://primary:10000
targetHiveServer2Uri=hive2://backup:10000
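
Optionally, you can validate a properties file against an extension before submitting it. The -validate option is listed in the Apache Falcon extension CLI documentation, but verify that your Falcon version supports it:

## Check the properties file without scheduling the job
falcon extension -validate -extensionName hive-mirroring -file hive-sales-monthly.properties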