Chapter 8. Mirroring Data with Falcon
You can mirror data between on-premises clusters or between an on-premises HDFS cluster and a cluster in the cloud on Microsoft Azure or Amazon S3.
Prepare to Mirror Data
Mirroring data produces an exact copy of the data and keeps both copies synchronized. You can use Falcon to mirror HDFS directories, Hive tables, and snapshots.
Before creating a mirror, complete the following actions:
Set permissions to allow read, write, and execute access to the source and target directories.
You must be logged in as the owner of the directories.
Example: If the source directory were /user/ambari-qa/falcon, type the following:

[bash ~]$ su - root
[root@bash ~]$ su - ambari-qa
[ambari-qa@bash ~]$ hadoop fs -chmod 755 /user/ambari-qa/falcon/
Create the source and target cluster entity definitions, if they do not exist.
See "Creating a Cluster Entity Definition" in Creating Falcon Entity Definitions for more information.
For snapshot mirroring, you must also enable the snapshot capability on the source and target directories.
You must be logged in as the HDFS Service user, and the source and target directories must be owned by the user submitting the job.
For example:
[ambari-qa@bash ~]$ su - hdfs
## Run the following command on the target cluster
[hdfs@bash ~]$ hdfs dfsadmin -allowSnapshot /apps/falcon/snapshots/target
## Run the following command on the source cluster
[hdfs@bash ~]$ hdfs dfsadmin -allowSnapshot /apps/falcon/snapshots/source
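For reference, the cluster entity definitions mentioned above are XML documents. The sketch below shows only the general shape of one; every hostname, port, path, and version is a placeholder, and the authoritative field list is in "Creating a Cluster Entity Definition."

```xml
<?xml version="1.0"?>
<!-- Minimal cluster entity sketch; all endpoints and paths are placeholders. -->
<cluster name="primaryCluster" description="Primary cluster" colo="datacenter1"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- Read-only and write access to HDFS -->
    <interface type="readonly" endpoint="hftp://nn.example.com:50070" version="2.7.0"/>
    <interface type="write" endpoint="hdfs://nn.example.com:8020" version="2.7.0"/>
    <!-- ResourceManager, Oozie, and the Falcon message broker -->
    <interface type="execute" endpoint="rm.example.com:8050" version="2.7.0"/>
    <interface type="workflow" endpoint="http://oozie.example.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://falcon.example.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <!-- HDFS directories Falcon uses for staging and working files -->
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
</cluster>
```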
Mirror File System Data Using the Web UI
You can use the Falcon web UI to quickly define and start a mirror job on HDFS.
Prerequisites
Your environment must meet the HDP versioning requirements described in "Replication Between HDP Versions" in Creating Falcon Entity Definitions.
Steps
Ensure that you have set permissions correctly and defined required entities as described in Preparing to Mirror Data.
At the top of the Falcon web UI page, click Create > Mirror > File System.
On the New HDFS Mirror page, specify the values for the following properties:
Table 8.1. General HDFS Mirror Properties
Mirror Name
Name of the mirror job. The naming criteria are as follows:
Must be unique
Must start with a letter
Is case sensitive
Can contain a maximum of 40 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces
Tags
Enter the key/value pair for metadata tagging to assist in entity search in Falcon. The criteria are as follows:
Can contain 1 to 100 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces
Table 8.2. Source and Target Mirror Properties
Source Location
Specify whether the source data is local on HDFS, or in the cloud on Microsoft Azure or Amazon S3. If your target is Azure or S3, you can only use HDFS for the source.

Source Cluster
Select an existing cluster entity.

Source Path
Enter the path to the source data.

Target Location
Specify whether the mirror target is local on HDFS, or in the cloud on Microsoft Azure or Amazon S3. If your target is Azure or S3, you can only use HDFS for the source.

Target Cluster
Select an existing cluster entity to serve as target for the mirrored data.

Target Path
Enter the path to the directory that will contain the mirrored data.

Run job here
Choose whether to execute the job on the source or on the target cluster.

Validity Start and End
Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as run duration.

Frequency
How often the process is generated. Valid frequency types are minutes, hours, days, and months.

Timezone
The timezone associated with the validity start and end times. Default timezone is UTC.

Send alerts to
A comma-separated list of email addresses to which alerts are sent, in the format name@company.com.

Table 8.3. Advanced HDFS Mirror Properties
Max Maps for DistCp
The maximum number of maps used during replication. This setting impacts performance and throttling. Default is 5.

Max Bandwidth (MB)
The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.

Retry Policy
Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final.

Delay
The time period after which a retry attempt is made. For example, an Attempts value of 3 and a Delay value of 10 minutes cause workflow retries to occur 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.

Attempts
How many times the retry policy should be applied before the job fails. Default is 3.

Access Control List
Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

Click Next to view a summary of your entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the entity.
Type the entity name in the Falcon web UI Search field and press Enter.
If the entity name appears in the search results, it was successfully created.
For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
Schedule the entity.
In the search results, click the checkbox next to an entity name with a status of Submitted.
Click Schedule.
After a few seconds a success message displays.
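The Delay and Attempts retry properties in Table 8.3 combine as described above; this small sketch (plain bash, using the example values from the table, not part of Falcon itself) prints the resulting retry schedule:

```shell
# Periodic retry policy: with Delay=10 minutes and Attempts=3, retries
# fire 10, 20, and 30 minutes after the workflow start time.
delay=10     # minutes, from the Delay property
attempts=3   # from the Attempts property
for i in $(seq 1 "$attempts"); do
  echo "retry $i: $((i * delay)) minutes after workflow start"
done
```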
Mirror Hive Data Using the Web UI
You can quickly mirror Apache Hive databases or tables between source and target clusters with HiveServer2 endpoints. You can also enable TDE encryption for your mirror.
Ensure that you have set permissions correctly and defined required entities as described in Preparing to Mirror Data.
At the top of the Falcon web UI page, click Create > Mirror > Hive.
On the New Hive Mirror page, specify the values for the following properties:
Table 8.4. General Hive Mirror Properties
Mirror Name
Name of the mirror job. The naming criteria are as follows:
Must be unique
Must start with a letter
Is case sensitive
Can contain 2 to 40 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces
Tags
Enter the key/value pair for metadata tagging to assist in entity search in Falcon. The criteria are as follows:
Can contain 1 to 100 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces
Table 8.5. Source and Target Hive Mirror Properties
Cluster, Source & Target
Select existing cluster entities, one to serve as source for the mirrored data and one to serve as target. Cluster entities must be available in Falcon before a mirror job can be created.

HiveServer2 Endpoint, Source & Target
Enter the location of the data to be mirrored on the source and the location of the mirrored data on the target. The format is hive2://localhost:10000.

Hive2 Kerberos Principal, Source & Target
This field is automatically populated with the value of the service principal for the metastore Thrift server, and must be unique. The value is displayed in Ambari at Hive > Config > Advanced > Advanced hive-site > hive.metastore.kerberos.principal.

Meta Store URI, Source & Target
Used by the metastore client to connect to the remote metastore. The value is displayed in Ambari at Hive > Config > Advanced > General > hive.metastore.uris.

Kerberos Principal, Source & Target
This field is automatically populated with Property=dfs.namenode.kerberos.principal and Value=nn/_HOST@EXAMPLE.COM, and must be unique.

Run job here
Choose whether to execute the job on the source cluster or on the target cluster.

I want to copy
Select whether to copy one or more Hive databases, or one or more tables from a single database. You must identify the specific databases and tables to be copied.

Validity Start and End
Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as run duration.

Frequency
Determines how often the process is generated. Valid frequency types are minutes, hours, days, and months.

Timezone
The timezone associated with the validity start and end times. Default timezone is UTC.

Send alerts to
A comma-separated list of email addresses to which alerts are sent. The format is name@xyz.com.

Table 8.6. Advanced Hive Mirror Properties
TDE Encryption
Enables encryption of data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Features for more information.

Max Maps for DistCp
The maximum number of maps used during replication. This setting impacts performance and throttling. Default is 5.

Max Bandwidth (MB)
The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.

Retry Policy
Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final.

Delay
The time period after which a retry attempt is made. For example, an Attempts value of 3 and a Delay value of 10 minutes cause workflow retries to occur 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.

Attempts
How many times the retry policy should be applied before the job fails. Default is 3.

Access Control List
Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

Click Next to view a summary of your entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the entity.
Type the entity name in the Falcon web UI Search field and press Enter.
If the entity name appears in the search results, it was successfully created.
For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
Schedule the entity to run the mirror job.
In the search results, click the checkbox next to an entity name with a status of Submitted.
Click Schedule.
After a few seconds a success message displays.
Mirror Data Using Snapshots
Snapshot-based mirroring is an efficient data backup method because only updated content is actually transferred during the mirror job. You can mirror snapshots from a single source directory to a single target directory. The destination directory is the target for the backup job.
Prerequisites
Source and target clusters must run Hadoop 2.7.0 or higher.
Falcon does not validate versions.
The source and target clusters should either both be secure or both be unsecured.
This is a recommendation, not a requirement.
Source and target clusters must have snapshot capability enabled (the default is "enabled").
The user submitting the mirror job must have access permissions on both the source and target clusters.
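Because Falcon does not validate Hadoop versions, you can check the 2.7.0 requirement yourself before configuring snapshot mirroring. A minimal sketch (the version string below is a stand-in; on a live cluster replace it with the real output of hadoop version):

```shell
# Check whether the cluster's Hadoop version supports snapshot mirroring.
# On a live cluster, replace the first line with: ver=$(hadoop version | head -1)
ver="Hadoop 2.7.3"            # example first line of "hadoop version" output
num=${ver#Hadoop }            # strip the leading word -> 2.7.3
major=${num%%.*}              # -> 2
rest=${num#*.}
minor=${rest%%.*}             # -> 7
if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 7 ]; }; then
  echo "snapshot mirroring supported"
else
  echo "Hadoop 2.7.0 or higher required"
fi
```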
To mirror snapshot data with the Falcon web UI:
Ensure that you have set permissions correctly, enabled snapshot mirroring, and defined required entities as described in Preparing to Mirror Data.
At the top of the Falcon web UI page, click Create > Mirror > Snapshot.
On the New Snapshot Based Mirror page, specify the values for the following properties:
Table 8.7. Source and Target Snapshot Mirror Properties
Source, Cluster
Select an existing source cluster entity. At least one cluster entity must be available in Falcon.

Target, Cluster
Select an existing target cluster entity. At least one cluster entity must be available in Falcon.

Source, Directory
Enter the path to the source data.

Source, Delete Snapshot After
Specify the time period after which the mirrored snapshots are deleted from the source cluster. Snapshots are retained past this date if the number of snapshots is less than the Keep Last setting.

Source, Keep Last
Specify the number of snapshots to retain on the source cluster, even if the delete time has been reached. Upon reaching the number specified, the oldest snapshot is deleted when the next job is run.

Target, Directory
Enter the path to the location on the target cluster in which the snapshot is stored.

Target, Delete Snapshot After
Specify the time period after which the mirrored snapshots are deleted from the target cluster. Snapshots are retained past this date if the number of snapshots is less than the Keep Last setting.

Target, Keep Last
Specify the number of snapshots to retain on the target cluster, even if the delete time has been reached. Upon reaching the number specified, the oldest snapshot is deleted when the next job is run.

Run job here
Choose whether to execute the job on the source or on the target cluster.

Run Duration Start and End
Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as validity time.

Frequency
How often the process is generated. Valid frequency types are minutes, hours, days, and months.

Timezone
Default timezone is UTC.

Table 8.8. Advanced Snapshot Mirror Properties
TDE Encryption
Enable to encrypt data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Features for more information.

Retry Policy
Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final.

Delay
The time period after which a retry attempt is made. For example, an Attempts value of 3 and a Delay value of 10 minutes cause workflow retries to occur 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.

Attempts
How many times the retry policy should be applied before the job fails. Default is 3.

Max Maps
The maximum number of maps used during DistCp replication. This setting impacts performance and throttling. Default is 5.

Max Bandwidth (MB)
The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.

Send alerts to
A comma-separated list of email addresses to which alerts are sent, in the format name@xyz.com.

Access Control List
Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

Click Next to view a summary of your entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the entity.
Type the entity name in the Falcon web UI Search field and press Enter.
If the entity name appears in the search results, it was successfully created.
For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
Schedule the entity.
In the search results, click the checkbox next to an entity name with a status of Submitted.
Click Schedule.
After a few seconds a success message displays.
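The interaction between the Delete Snapshot After and Keep Last properties in Table 8.7 can be sketched as follows. This is a plain-bash illustration of the retention rule described above, not Falcon code; the snapshot ages are hypothetical:

```shell
# Retention rule sketch: a snapshot is deleted only if it is older than
# "Delete Snapshot After" AND more than "Keep Last" snapshots would
# still remain afterward. (Requires bash for arrays.)
keep_last=3                          # Keep Last property
delete_after=30                      # Delete Snapshot After, in days
ages=(40 35 20 10 5)                 # hypothetical snapshot ages, oldest first
total=${#ages[@]}
for i in "${!ages[@]}"; do
  remaining=$((total - i))           # snapshots from this one to the newest
  if [ "${ages[$i]}" -gt "$delete_after" ] && [ "$remaining" -gt "$keep_last" ]; then
    echo "delete snapshot aged ${ages[$i]} days"
  else
    echo "keep snapshot aged ${ages[$i]} days"
  fi
done
```

With these values the two oldest snapshots (40 and 35 days) are deleted, while the newest three survive even though the 20-day one has no age pressure at all: Keep Last acts as a floor under the delete rule.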
Mirror File System Data Using APIs
In HDP 2.6, Falcon client-side recipes were deprecated and replaced with more extensible server-side extensions. Existing Falcon workflows that use client-side recipes are still supported, but any new mirror job must use the server-side extensions.
See the Hortonworks Community Connection (HCC) article DistCp options supported in Falcon mirroring jobs in HDP 2.5 for instructions about using server-side extensions for HDFS and Hive mirroring.
See the Apache Falcon website for additional information about using APIs to mirror data.
Supported DistCp Options for HDFS Mirroring
distcpMaxMaps
distcpMapBandwidth
overwrite
ignoreErrors
skipChecksum
removeDeletedFiles
preserveBlockSize
preserveReplicationNumber
preservePermission
preserveUser
preserveGroup
preserveChecksumType
preserveAcl
preserveXattr
preserveTimes
Following is an example of using extensions to schedule an HDFS mirror job.
falcon extension -submitAndSchedule -extensionName hdfs-mirroring -file sales-monthly.properties

Content of the sales-monthly.properties file:

jobName=sales-monthly
jobValidityStart=2016-06-30T00:00Z
jobValidityEnd=2099-12-31T11:59Z
jobFrequency=minutes(45)
jobTimezone=UTC
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
sourceDir=/user/ambari-qa/sales-monthly/input
targetDir=/user/ambari-qa/sales-monthly/output
removeDeletedFiles=true
skipChecksum=false
preservePermission=true
preserveUser=true
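Before submitting, you can sanity-check that a properties file defines the core job keys. The sketch below is a generic bash check (the key list is taken from the example above and may need adjusting for your job); it writes a sample file to a temp path only so the sketch is self-contained:

```shell
# Sketch: verify a mirroring properties file defines the core keys
# before handing it to "falcon extension -submitAndSchedule".
file=$(mktemp)
cat > "$file" <<'EOF'
jobName=sales-monthly
jobValidityStart=2016-06-30T00:00Z
jobValidityEnd=2099-12-31T11:59Z
jobFrequency=minutes(45)
sourceCluster=primaryCluster
targetCluster=backupCluster
sourceDir=/user/ambari-qa/sales-monthly/input
targetDir=/user/ambari-qa/sales-monthly/output
EOF
missing=0
for key in jobName jobValidityStart jobValidityEnd jobFrequency \
           sourceCluster targetCluster sourceDir targetDir; do
  grep -q "^${key}=" "$file" || { echo "missing: $key"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "properties file looks complete"
rm -f "$file"
```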
Supported DistCp Options for Hive Mirroring
distcpMaxMaps
distcpMapBandwidth
Following is an example of using extensions to schedule an Apache Hive mirror job.
falcon extension -submitAndSchedule -extensionName hive-mirroring -file hive-sales-monthly.properties

Content of the hive-sales-monthly.properties file:

jobName=hive-sales-monthly
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
jobValidityStart=2016-07-19T00:02Z
jobValidityEnd=2018-05-25T11:02Z
jobFrequency=minutes(30)
jobRetryPolicy=periodic
jobRetryDelay=minutes(30)
jobRetryAttempts=3
distcpMaxMaps=1
distcpMapBandwidth=100
maxEvents=-1
replicationMaxMaps=5
sourceDatabases=default
sourceTables=*
sourceHiveServer2Uri=hive2://primary:10000
targetHiveServer2Uri=hive2://backup:10000