Chapter 8. Mirroring Data with Falcon
You can mirror data between on-premises clusters or between an on-premises HDFS cluster and a cluster in the cloud on Microsoft Azure or Amazon S3.
Prepare to Mirror Data
Mirroring data produces an exact copy of the data and keeps both copies synchronized. You can use Falcon to mirror HDFS directories, Hive tables, and snapshots.
Before creating a mirror, complete the following actions:
Set permissions to allow read, write, and execute access to the source and target directories.
You must be logged in as the owner of the directories.
Example: If the source directory were /user/ambari-qa/falcon, type the following:

[bash ~]$ su - root
[root@bash ~]$ su - ambari-qa
[ambari-qa@bash ~]$ hadoop fs -chmod 755 /user/ambari-qa/falcon/
Create the source and target cluster entity definitions, if they do not exist.
See "Creating a Cluster Entity Definition" in Creating Falcon Entity Definitions for more information.
For snapshot mirroring, you must also enable the snapshot capability on the source and target directories.
You must be logged in as the HDFS Service user, and the source and target directories must be owned by the user submitting the job.
For example:
[ambari-qa@bash ~]$ su - hdfs
## Run the following command on the target cluster
[hdfs@bash ~]$ hdfs dfsadmin -allowSnapshot /apps/falcon/snapshots/target
## Run the following command on the source cluster
[hdfs@bash ~]$ hdfs dfsadmin -allowSnapshot /apps/falcon/snapshots/source
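For reference, the cluster entity definitions mentioned above are XML documents. The sketch below shows only the general shape of one; every hostname, port, path, and version is a placeholder, and the authoritative field list is in "Creating a Cluster Entity Definition."

```xml
<?xml version="1.0"?>
<!-- Minimal cluster entity sketch; all endpoints and paths are placeholders. -->
<cluster name="primaryCluster" description="Primary cluster" colo="datacenter1"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- Read-only and write access to HDFS -->
    <interface type="readonly" endpoint="hftp://nn.example.com:50070" version="2.7.0"/>
    <interface type="write" endpoint="hdfs://nn.example.com:8020" version="2.7.0"/>
    <!-- ResourceManager, Oozie, and the Falcon message broker -->
    <interface type="execute" endpoint="rm.example.com:8050" version="2.7.0"/>
    <interface type="workflow" endpoint="http://oozie.example.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://falcon.example.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <!-- HDFS directories Falcon uses for staging and working files -->
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
</cluster>
```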
Mirror File System Data Using the Web UI
You can use the Falcon web UI to quickly define and start a mirror job on HDFS.
Prerequisites
Your environment must meet the HDP versioning requirements described in "Replication Between HDP Versions" in Creating Falcon Entity Definitions.
Steps
Ensure that you have set permissions correctly and defined required entities as described in Preparing to Mirror Data.
At the top of the Falcon web UI page, click Create > Mirror > File System.
On the New HDFS Mirror page, specify the values for the following properties:
Table 8.1. General HDFS Mirror Properties
Mirror Name
Name of the mirror job. The naming criteria are as follows:
Must be unique
Must start with a letter
Is case sensitive
Can contain a maximum of 40 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces
Tags
Enter the key/value pair for metadata tagging to assist in entity search in Falcon. The criteria are as follows:
Can contain 1 to 100 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces
Table 8.2. Source and Target Mirror Properties
Source Location
Specify whether the source data is local on HDFS, or in the cloud on Microsoft Azure or Amazon S3. If your target is Azure or S3, you can only use HDFS for the source.

Source Cluster
Select an existing cluster entity.

Source Path
Enter the path to the source data.

Target Location
Specify whether the mirror target is local on HDFS, or in the cloud on Microsoft Azure or Amazon S3. If your target is Azure or S3, you can only use HDFS for the source.

Target Cluster
Select an existing cluster entity to serve as target for the mirrored data.

Target Path
Enter the path to the directory that will contain the mirrored data.

Run job here
Choose whether to execute the job on the source or on the target cluster.

Validity Start and End
Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as run duration.

Frequency
How often the process is generated. Valid frequency types are minutes, hours, days, and months.

Timezone
The timezone associated with the validity start and end times. Default timezone is UTC.

Send alerts to
A comma-separated list of email addresses to which alerts are sent, in the format name@company.com.

Table 8.3. Advanced HDFS Mirror Properties
Max Maps for DistCp
The maximum number of maps used during replication. This setting impacts performance and throttling. Default is 5.

Max Bandwidth (MB)
The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.

Retry Policy
Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final.

Delay
The time period after which a retry attempt is made. For example, an Attempts value of 3 and a Delay value of 10 minutes cause workflow retries to occur 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.

Attempts
How many times the retry policy should be applied before the job fails. Default is 3.

Access Control List
Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

Click Next to view a summary of your entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the entity.
Type the entity name in the Falcon web UI Search field and press Enter.
If the entity name appears in the search results, it was successfully created.
For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
Schedule the entity.
In the search results, click the checkbox next to an entity name with a status of Submitted.
Click Schedule.
After a few seconds a success message displays.
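The Delay and Attempts retry properties in Table 8.3 combine as described above; this small sketch (plain bash, using the example values from the table, not part of Falcon itself) prints the resulting retry schedule:

```shell
# Periodic retry policy: with Delay=10 minutes and Attempts=3, retries
# fire 10, 20, and 30 minutes after the workflow start time.
delay=10     # minutes, from the Delay property
attempts=3   # from the Attempts property
for i in $(seq 1 "$attempts"); do
  echo "retry $i: $((i * delay)) minutes after workflow start"
done
```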
Mirror Hive Data Using the Web UI
You can quickly mirror Apache Hive databases or tables between source and target clusters with HiveServer2 endpoints. You can also enable TDE encryption for your mirror.
Ensure that you have set permissions correctly and defined required entities as described in Preparing to Mirror Data.
At the top of the Falcon web UI page, click Create > Mirror > Hive.
On the New Hive Mirror page, specify the values for the following properties:
Table 8.4. General Hive Mirror Properties
Mirror Name
Name of the mirror job. The naming criteria are as follows:
Must be unique
Must start with a letter
Is case sensitive
Can contain 2 to 40 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces
Tags
Enter the key/value pair for metadata tagging to assist in entity search in Falcon. The criteria are as follows:
Can contain 1 to 100 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces
Table 8.5. Source and Target Hive Mirror Properties
Cluster, Source & Target
Select existing cluster entities, one to serve as source for the mirrored data and one to serve as target. Cluster entities must be available in Falcon before a mirror job can be created.

HiveServer2 Endpoint, Source & Target
Enter the location of the data to be mirrored on the source and the location of the mirrored data on the target. The format is hive2://localhost:10000.

Hive2 Kerberos Principal, Source & Target
This field is automatically populated with the value of the service principal for the metastore Thrift server, and must be unique. The value is displayed in Ambari at Hive > Config > Advanced > Advanced hive-site > hive.metastore.kerberos.principal.

Meta Store URI, Source & Target
Used by the metastore client to connect to the remote metastore. The value is displayed in Ambari at Hive > Config > Advanced > General > hive.metastore.uris.

Kerberos Principal, Source & Target
This field is automatically populated with Property=dfs.namenode.kerberos.principal and Value=nn/_HOST@EXAMPLE.COM, and must be unique.

Run job here
Choose whether to execute the job on the source cluster or on the target cluster.

I want to copy
Select whether to copy one or more Hive databases, or one or more tables from a single database. You must identify the specific databases and tables to be copied.

Validity Start and End
Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as run duration.

Frequency
Determines how often the process is generated. Valid frequency types are minutes, hours, days, and months.

Timezone
The timezone associated with the validity start and end times. Default timezone is UTC.

Send alerts to
A comma-separated list of email addresses to which alerts are sent. The format is name@xyz.com.

Table 8.6. Advanced Hive Mirror Properties
TDE Encryption
Enables encryption of data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Features for more information.

Max Maps for DistCp
The maximum number of maps used during replication. This setting impacts performance and throttling. Default is 5.

Max Bandwidth (MB)
The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.

Retry Policy
Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final.

Delay
The time period after which a retry attempt is made. For example, an Attempts value of 3 and a Delay value of 10 minutes cause workflow retries to occur 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.

Attempts
How many times the retry policy should be applied before the job fails. Default is 3.

Access Control List
Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

Click Next to view a summary of your entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the entity.
Type the entity name in the Falcon web UI Search field and press Enter.
If the entity name appears in the search results, it was successfully created.
For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
Schedule the entity to run the mirror job.
In the search results, click the checkbox next to an entity name with a status of Submitted.
Click Schedule.
After a few seconds a success message displays.
Mirror Data Using Snapshots
Snapshot-based mirroring is an efficient data backup method because only updated content is actually transferred during the mirror job. You can mirror snapshots from a single source directory to a single target directory. The destination directory is the target for the backup job.
Prerequisites
Source and target clusters must run Hadoop 2.7.0 or higher.
Falcon does not validate versions.
The source and target clusters should either both be secure or both be unsecured.
This is a recommendation, not a requirement.
Source and target clusters must have snapshot capability enabled (the default is "enabled").
The user submitting the mirror job must have access permissions on both the source and target clusters.
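Because Falcon does not validate Hadoop versions, you can check the 2.7.0 requirement yourself before configuring snapshot mirroring. A minimal sketch (the version string below is a stand-in; on a live cluster replace it with the real output of hadoop version):

```shell
# Check whether the cluster's Hadoop version supports snapshot mirroring.
# On a live cluster, replace the first line with: ver=$(hadoop version | head -1)
ver="Hadoop 2.7.3"            # example first line of "hadoop version" output
num=${ver#Hadoop }            # strip the leading word -> 2.7.3
major=${num%%.*}              # -> 2
rest=${num#*.}
minor=${rest%%.*}             # -> 7
if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 7 ]; }; then
  echo "snapshot mirroring supported"
else
  echo "Hadoop 2.7.0 or higher required"
fi
```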
To mirror snapshot data with the Falcon web UI:
Ensure that you have set permissions correctly, enabled snapshot mirroring, and defined required entities as described in Preparing to Mirror Data.
At the top of the Falcon web UI page, click Create > Mirror > Snapshot.
On the New Snapshot Based Mirror page, specify the values for the following properties:
Table 8.7. Source and Target Snapshot Mirror Properties
Source, Cluster
Select an existing source cluster entity. At least one cluster entity must be available in Falcon.

Target, Cluster
Select an existing target cluster entity. At least one cluster entity must be available in Falcon.

Source, Directory
Enter the path to the source data.

Source, Delete Snapshot After
Specify the time period after which the mirrored snapshots are deleted from the source cluster. Snapshots are retained past this date if the number of snapshots is less than the Keep Last setting.

Source, Keep Last
Specify the number of snapshots to retain on the source cluster, even if the delete time has been reached. Upon reaching the number specified, the oldest snapshot is deleted when the next job is run.

Target, Directory
Enter the path to the location on the target cluster in which the snapshot is stored.

Target, Delete Snapshot After
Specify the time period after which the mirrored snapshots are deleted from the target cluster. Snapshots are retained past this date if the number of snapshots is less than the Keep Last setting.

Target, Keep Last
Specify the number of snapshots to retain on the target cluster, even if the delete time has been reached. Upon reaching the number specified, the oldest snapshot is deleted when the next job is run.

Run job here
Choose whether to execute the job on the source or on the target cluster.

Run Duration Start and End
Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as validity time.

Frequency
How often the process is generated. Valid frequency types are minutes, hours, days, and months.

Timezone
Default timezone is UTC.

Table 8.8. Advanced Snapshot Mirror Properties
TDE Encryption
Enable to encrypt data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Features for more information.

Retry Policy
Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final.

Delay
The time period after which a retry attempt is made. For example, an Attempts value of 3 and a Delay value of 10 minutes cause workflow retries to occur 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.

Attempts
How many times the retry policy should be applied before the job fails. Default is 3.

Max Maps
The maximum number of maps used during DistCp replication. This setting impacts performance and throttling. Default is 5.

Max Bandwidth (MB)
The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.

Send alerts to
A comma-separated list of email addresses to which alerts are sent, in the format name@xyz.com.

Access Control List
Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

Click Next to view a summary of your entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the entity.
Type the entity name in the Falcon web UI Search field and press Enter.
If the entity name appears in the search results, it was successfully created.
For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
Schedule the entity.
In the search results, click the checkbox next to an entity name with a status of Submitted.
Click Schedule.
After a few seconds a success message displays.
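The interaction between the Delete Snapshot After and Keep Last properties in Table 8.7 can be sketched as follows. This is a plain-bash illustration of the retention rule described above, not Falcon code; the snapshot ages are hypothetical:

```shell
# Retention rule sketch: a snapshot is deleted only if it is older than
# "Delete Snapshot After" AND more than "Keep Last" snapshots would
# still remain afterward. (Requires bash for arrays.)
keep_last=3                          # Keep Last property
delete_after=30                      # Delete Snapshot After, in days
ages=(40 35 20 10 5)                 # hypothetical snapshot ages, oldest first
total=${#ages[@]}
for i in "${!ages[@]}"; do
  remaining=$((total - i))           # snapshots from this one to the newest
  if [ "${ages[$i]}" -gt "$delete_after" ] && [ "$remaining" -gt "$keep_last" ]; then
    echo "delete snapshot aged ${ages[$i]} days"
  else
    echo "keep snapshot aged ${ages[$i]} days"
  fi
done
```

With these values the two oldest snapshots (40 and 35 days) are deleted, while the newest three survive even though the 20-day one has no age pressure at all: Keep Last acts as a floor under the delete rule.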
Mirror File System Data Using APIs
In HDP 2.6, Falcon client-side recipes were deprecated and replaced with more extensible server-side extensions. Existing Falcon workflows that use client-side recipes are still supported, but any new mirror job must use the server-side extensions.
See the Hortonworks Community Connection (HCC) article DistCp options supported in Falcon mirroring jobs in HDP 2.5 for instructions about using server-side extensions for HDFS and Hive mirroring.
See the Apache Falcon website for additional information about using APIs to mirror data.
Supported DistCp Options for HDFS Mirroring
distcpMaxMaps
distcpMapBandwidth
overwrite
ignoreErrors
skipChecksum
removeDeletedFiles
preserveBlockSize
preserveReplicationNumber
preservePermission
preserveUser
preserveGroup
preserveChecksumType
preserveAcl
preserveXattr
preserveTimes
Following is an example of using extensions to schedule an HDFS mirror job.
falcon extension -submitAndSchedule -extensionName hdfs-mirroring -file sales-monthly.properties

Content of the sales-monthly.properties file:

jobName=sales-monthly
jobValidityStart=2016-06-30T00:00Z
jobValidityEnd=2099-12-31T11:59Z
jobFrequency=minutes(45)
jobTimezone=UTC
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
sourceDir=/user/ambari-qa/sales-monthly/input
targetDir=/user/ambari-qa/sales-monthly/output
removeDeletedFiles=true
skipChecksum=false
preservePermission=true
preserveUser=true
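Before submitting, you can sanity-check that a properties file defines the core job keys. The sketch below is a generic bash check (the key list is taken from the example above and may need adjusting for your job); it writes a sample file to a temp path only so the sketch is self-contained:

```shell
# Sketch: verify a mirroring properties file defines the core keys
# before handing it to "falcon extension -submitAndSchedule".
file=$(mktemp)
cat > "$file" <<'EOF'
jobName=sales-monthly
jobValidityStart=2016-06-30T00:00Z
jobValidityEnd=2099-12-31T11:59Z
jobFrequency=minutes(45)
sourceCluster=primaryCluster
targetCluster=backupCluster
sourceDir=/user/ambari-qa/sales-monthly/input
targetDir=/user/ambari-qa/sales-monthly/output
EOF
missing=0
for key in jobName jobValidityStart jobValidityEnd jobFrequency \
           sourceCluster targetCluster sourceDir targetDir; do
  grep -q "^${key}=" "$file" || { echo "missing: $key"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "properties file looks complete"
rm -f "$file"
```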
Supported DistCp Options for Hive Mirroring
distcpMaxMaps
distcpMapBandwidth
Following is an example of using extensions to schedule an Apache Hive mirror job.
falcon extension -submitAndSchedule -extensionName hive-mirroring -file hive-sales-monthly.properties

Content of the hive-sales-monthly.properties file:

jobName=hive-sales-monthly
sourceCluster=primaryCluster
targetCluster=backupCluster
jobClusterName=primaryCluster
jobValidityStart=2016-07-19T00:02Z
jobValidityEnd=2018-05-25T11:02Z
jobFrequency=minutes(30)
jobRetryPolicy=periodic
jobRetryDelay=minutes(30)
jobRetryAttempts=3
distcpMaxMaps=1
distcpMapBandwidth=100
maxEvents=-1
replicationMaxMaps=5
sourceDatabases=default
sourceTables=*
sourceHiveServer2Uri=hive2://primary:10000
targetHiveServer2Uri=hive2://backup:10000