Data Movement and Integration

Chapter 9. Mirroring Data with HiveDR in a Secure Environment

From the Apache Falcon UI, you can configure your cluster for disaster recovery by mirroring data in Hive databases, tables, or partitions. Before setting up a HiveDR job in Falcon, you should configure a secure environment.

Important

Before implementing HiveDR with Falcon, you should discuss your plans with Hortonworks Professional Services and Product Management to ensure proper component configuration and adequate protection of your data.

Important Considerations

Before you implement HiveDR, thoroughly consider the impact of the following items on your design and whether your environment would be appropriate for a HiveDR implementation.

  • Recovery point objective (RPO) is a minimum of 24 hours

  • The following features are not supported:

    • External tables and non-native tables

    • Renaming of partitions or tables

    • ACID tables

  • The following are not copied as part of the replication process:

    • Hive Views

    • Hive statistics

    • UDFs and security policies

Also review the "Considerations for Using Falcon".

Prerequisites

  • The Kerberized source and target clusters must exist.

  • You must be running the following Apache components:

    • Hive 1.2.0 or later

    • Oozie 4.2.0 or later

    • Sqoop 1.4.6 or later

      Falcon uses Apache Sqoop for import and export operations. Sqoop requires a database driver to connect to the relational database. Refer to the Sqoop documentation if you need more information.

  • Ensure that the database driver .jar file is copied into the Oozie share library for Sqoop.
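
    A minimal sketch of this step is shown below, assuming a MySQL driver and the default HDP Oozie share library layout; the jar name, share library path, and Oozie URL are examples, so substitute the values for your environment.

    # On a Kerberized cluster, first authenticate (kinit) as a user with write access to the Oozie share library.
    # Copy the JDBC driver jar into the Sqoop directory of the Oozie share library (example paths).
    sudo -u oozie hdfs dfs -put /usr/share/java/mysql-connector-java.jar \
        /user/oozie/share/lib/lib_<timestamp>/sqoop/

    # Tell the Oozie server to pick up the updated share library (example Oozie URL).
    oozie admin -oozie http://oozie-server.example.com:11000/oozie -sharelibupdate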

About this task

Database replication definitions can be set up only on preexisting databases. Therefore, before you can replicate a database the first time, you must create a target database. To prevent unintentional replication of private data, Hive does not allow the replication of a database without a pre-existing target.

After the database target is created, you must export all tables in the source database and import them in the target database.

Steps

  1. Create a Hive database on the target cluster

    You only need to perform this task if you are replicating an entire database. See the Data Access guide or the Apache Hive documentation.

  2. Create a Hive table/partition on the source cluster

    You only need to perform this task if the table or partition you want to replicate does not already exist. See the Data Access guide or the Apache Hive documentation.

  3. "Configuring for High Availability"

  4. "Preparing to Mirror Data"

  5. "Prepare for Disaster Recovery"

  6. "Configure Properties for HiveDR"

  7. "Initialize Data for Hive Replication"

  8. "Mirror Hive Data Using the Web UI"

Prepare for Disaster Recovery

Before creating the required entity definitions in Falcon, you must modify several configuration files. These files can be modified from Ambari or from the command line. To access the service properties in Ambari, click the Services menu and select the service you want, then click the Configs tab. For some services, you must also click the Advanced tab.

auth-to-local Setting

For Kerberos principals to be mapped correctly across clusters, changes must be made to the hadoop.security.auth_to_local setting in the core-site.xml file on both clusters. These entries should be made for the ambari-qa, hbase, and hdfs principals.

For example, to add the principal mapping for CLUSTER02 into CLUSTER01, add the following lines to the hadoop.security.auth_to_local field in CLUSTER01. Note that these lines must be inserted before the wildcard line RULE:[1:$1@$0](.*@HADOOP)s/@.*//:

RULE:[1:$1@$0](ambari-qa-CLUSTER02@HADOOP)s/.*/ambari-qa/
RULE:[1:$1@$0](hbase-CLUSTER02@HADOOP)s/.*/hbase/
RULE:[1:$1@$0](hdfs-CLUSTER02@HADOOP)s/.*/hdfs/
Important

Ensure the hadoop.security.auth_to_local setting in the HDFS core-site.xml file is consistent across all clusters. Inconsistencies in rules for hadoop.security.auth_to_local can lead to issues with delegation token renewals.
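
To confirm that the rules resolve as intended, you can test a mapping from the command line on either cluster. The following is a minimal sketch using the example CLUSTER02 hdfs principal from above.

# Print how the current auth_to_local rules map a principal to a local user name.
hadoop org.apache.hadoop.security.HadoopKerberosName hdfs-CLUSTER02@HADOOP
# With the rule above in place, the output should show the principal mapping to the local hdfs user.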

Proxy User

To allow the Falcon Oozie workflows to query the primary status of both clusters, the Oozie server hosts must be whitelisted as proxy hosts for cross-site Oozie access. On both clusters, the hadoop.proxyuser.oozie.hosts setting in the HDFS core-site.xml must contain the hostnames of all the Oozie servers in both clusters. The hostnames of all the Oozie servers must also be added to the hadoop.proxyuser.hive.hosts setting, because the Oozie server proxies as the Hive user during replication.

Also, to allow cross-site Apache Hive access for Hive replication, Hive and HCatalog hosts for both clusters must be whitelisted as proxy hosts. On both clusters, the hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hcat.hosts settings in the HDFS core-site.xml must contain the hostnames of all the Hive Metastore and Hive Server2 hosts for Hive and WebHCat hosts for HCatalog.

You can change these settings through Ambari.
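
After updating the settings, you can confirm the values that each cluster actually picked up. The following sketch uses the property names above; the returned values are comma-separated lists of hostnames specific to your clusters.

# Show the proxy-host whitelists as seen by the local Hadoop client configuration.
hdfs getconf -confKey hadoop.proxyuser.oozie.hosts
hdfs getconf -confKey hadoop.proxyuser.hive.hosts
hdfs getconf -confKey hadoop.proxyuser.hcat.hosts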

Nameservices

For cross-site DistCp to function, both clusters must be able to resolve the nameservice of the other cluster. In Ambari, modify the following properties in the hdfs-site section to add the appropriate cluster names.

dfs.internal.nameservices=CLUSTER01
dfs.client.failover.proxy.provider.CLUSTER02=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.namenodes.CLUSTER02=nn1,nn2 
dfs.namenode.http-address.CLUSTER02.nn1=namenode01.hostname:50070 
dfs.namenode.http-address.CLUSTER02.nn2=namenode02.hostname:50070 
dfs.namenode.https-address.CLUSTER02.nn1=namenode01.hostname:50470 
dfs.namenode.https-address.CLUSTER02.nn2=namenode02.hostname:50470 
dfs.namenode.rpc-address.CLUSTER02.nn1=namenode01.hostname:8020 
dfs.namenode.rpc-address.CLUSTER02.nn2=namenode02.hostname:8020
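
Before relying on the nameservices for replication, you can confirm that the remote nameservice resolves. The following sketch is run from CLUSTER01; the test path is an example and must already exist on the source cluster for the DistCp check.

# If the CLUSTER02 nameservice is resolvable, this lists the root of the remote cluster.
hdfs dfs -ls hdfs://CLUSTER02/

# Cross-cluster DistCp using nameservices (rather than specific NameNode hosts) should then also succeed.
hadoop distcp hdfs://CLUSTER01/tmp/falcon-dr-test hdfs://CLUSTER02/tmp/falcon-dr-test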

Configure Properties for HiveDR

You must configure several Hadoop and Hive properties to set up a secure environment before you can implement HiveDR jobs.

  1. In Ambari, on the source cluster, navigate to HDFS > Configs > Advanced.

  2. Scroll to the Custom core-site section and specify the host names for each of the following attributes:

    Property: hadoop.proxyuser.hdfs.hosts, Value: The host name is located at HDFS > Configs > Advanced > NameNode.
    Property: hadoop.proxyuser.yarn.hosts, Value: The host name is located at YARN > Configs > Advanced > Resource Manager.
    Property: hadoop.proxyuser.hive.hosts, Value: The host name is located at Hive > Configs > Advanced > Hive Metastore.
    Property: hadoop.proxyuser.oozie.hosts, Value: The host name is located at Oozie > Configs > Oozie Server.
    Property: hadoop.proxyuser.hcat.hosts, Value: The host name is located at Hive > Configs > Advanced > Hive Metastore.
    Property: hadoop.proxyuser.http.hosts, Value: The host name is located at Hive > Configs > Advanced > Hive Metastore.
    Property: hadoop.proxyuser.falcon.hosts, Value: The host name is located at Falcon > Configs > Falcon Server.

    You can use an asterisk (*) in place of a host name to indicate that any host can connect.

  3. Navigate to Hive > Configs > Advanced.

  4. Scroll to the Custom hive-site section and click Add Property.

  5. Add the indicated values for the following properties:

    Property: hive.metastore.event.listeners, Value: org.apache.hive.hcatalog.listener.DbNotificationListener
    Property: hive.metastore.dml.events, Value: true
    Property: hive.metastore.event.db.listener.timetolive, Value: 432000s
    Property: hive.exim.strict.repl.tables, Value: true
    Property: hive.server2.transport.mode, Value: binary
  6. Restart the impacted services on the cluster. A quick way to confirm that the new values took effect after the restart is shown after these steps.

  7. Repeat steps 1 through 6 on the target cluster.
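
As a quick sanity check after the services restart on each cluster, you can confirm that the new hive-site values were picked up from a Hive CLI or Beeline session. The sketch below checks two of the properties; each set command echoes the effective value, which should match what you configured (true and org.apache.hive.hcatalog.listener.DbNotificationListener, respectively).

hive> set hive.metastore.dml.events;
hive> set hive.metastore.event.listeners;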

What's Next?

You need to create both the source and the destination cluster entities from the Falcon UI.

"Creating a Cluster Entity Definition Using the Web UI"

Initialize Data for Hive Replication

Before running a HiveDR job, you must create an initial copy of any table or database that you intend to replicate on an ongoing basis. This initialization process is sometimes referred to as bootstrapping. Falcon uses the initialized data on the target cluster as a baseline to do incremental replications.

You perform an initialization by using the export and import commands. When you export the source data, you assign a base event ID. The event ID is incremented each time a new replication job runs. When the exported data is imported, a check is done to determine if a table or database already exists that has the same base event ID. If the incremented event ID of the imported data indicates that the imported data is newer, then the existing data with the same base ID is replaced. If the imported data is older, no update is performed and no error is generated.

Prerequisites

  • For database replication, the target database must exist on the target cluster.

  • The Hive tables or partitions must exist on the source cluster.

  • The Falcon cluster entities must exist on both the source cluster and the destination cluster.

  • You need the full path to any databases and tables you want to replicate.

In the following steps, the replaceable values in the code examples, such as database names, table names, and paths, are examples that you should modify for your environment.

  1. Export the table or partition data to an output location on the source, using the for replication(eventid) tag.

    This ID tag must not contain numbers.

    For example, to export a table named global_sales (a partition-level example follows these steps):

    hive> export table global_sales to '/user/tmp/global_sales' for replication('july_global_sales');

  2. Run the distcp command to copy the exported data to the target Hive data warehouse on the destination cluster.

    $ hadoop distcp hdfs://machine-1-1.openstacklocal:8020/user/tmp/global_sales hdfs://machine-2-1.openstacklocal:8020/user/hive_imports/global_sales

  3. On the destination cluster, perform an import on the exported data.

    hive> import table global_sales from '/user/hive_imports/global_sales' location '/user/sales/global_sales';

For more information, see the Apache Hive LanguageManual ImportExport wiki page.
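
If you are bootstrapping a single partition rather than an entire table, the export takes a partition specification. The following is a minimal sketch; the table name, partition column and value, target path, and event ID are all examples. The DistCp and import steps are then the same as for a full table.

hive> export table global_sales partition (country='us') to '/user/tmp/global_sales_us' for replication('july_global_sales_us');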

Mirror Hive Data Using the Web UI

You can quickly mirror Apache Hive databases or tables between source and target clusters with HiveServer2 endpoints. You can also enable TDE encryption for your mirror.

  1. Ensure that you have set permissions correctly and defined required entities as described in Preparing to Mirror Data.

  2. At the top of the Falcon web UI page, click Create > Mirror > Hive.

  3. On the New Hive Mirror page, specify the values for the following properties:

    Table 9.1. General Hive Mirror Properties

    Mirror Name: Name of the mirror job. The naming criteria are as follows:

    • Must be unique

    • Must start with a letter

    • Is case sensitive

    • Can contain 2 to 40 characters

    • Can include numbers

    • Can use a dash (-) but no other special characters

    • Cannot contain spaces

    Tags: Enter the key/value pair for metadata tagging to assist in entity search in Falcon. The criteria are as follows:

    • Can contain 1 to 100 characters

    • Can include numbers

    • Can use a dash (-) but no other special characters

    • Cannot contain spaces


    Table 9.2. Source and Target Hive Mirror Properties

    Source, Cluster: Select an existing cluster entity to serve as the source for the mirrored data. At least one cluster entity must be available in Falcon.
    Source, HiveServer2 Endpoint: The HiveServer2 endpoint; this is the location of the data to be mirrored. The format is hive2://localhost:10000.
    Source, Hive2 Kerberos Principal: The service principal for the metastore Thrift server. The field is automatically populated. The value is displayed in Ambari at Hive > Configs > Advanced > Advanced hive-site > hive.metastore.kerberos.principal.
    Source, Meta Store URI: Used by the metastore client to connect to the remote metastore. The value is displayed in Ambari at Hive > Configs > Advanced > General > hive.metastore.uris.
    Source, Kerberos Principal: The field is automatically populated. Property=dfs.namenode.kerberos.principal and Value=nn/_HOST@EXAMPLE.COM.
    Target, Cluster: Select an existing cluster entity to serve as the target for the mirrored data. At least one cluster entity must be available in Falcon.
    Target, HiveServer2 Endpoint: The HiveServer2 endpoint; this is the location of the mirrored data. The format is hive2://localhost:10000.
    Target, Hive2 Kerberos Principal: The service principal for the metastore Thrift server. The field is automatically populated. The value is displayed in Ambari at Hive > Configs > Advanced > Advanced hive-site > hive.metastore.kerberos.principal.
    Target, Meta Store URI: Used by the metastore client to connect to the remote metastore. The value is displayed in Ambari at Hive > Configs > Advanced > General > hive.metastore.uris.
    Target, Kerberos Principal: The field is automatically populated. Property=dfs.namenode.kerberos.principal and Value=nn/_HOST@EXAMPLE.COM.
    Run job here: Choose whether to execute the job on the source cluster or on the target cluster.
    I want to copy: Select whether to copy one or more Hive databases, or one or more tables from a single database. You must identify the specific databases and tables to be copied.
    Validity Start and End: Combined with the frequency value, these determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the scheduled time and when all the inputs are available. The workflow ends before the specified end time, so there is no workflow instance at the end time. Also known as run duration.
    Frequency: How often the process is generated. Valid frequency types are minutes, hours, days, and months.
    Timezone: The timezone associated with the validity start and end times. The default timezone is UTC.
    Send alerts to: A comma-separated list of email addresses to which alerts are sent. The format is name@xyz.com.

    Table 9.3. Advanced Hive Mirror Properties

    TDE Encryption: Enables encryption of data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Features for more information.
    Max Maps for DistCp: The maximum number of maps used during replication. This setting impacts performance and throttling. The default is 5.
    Max Bandwidth (MB): The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. The default is 100 MB.
    Retry Policy: Defines how workflow failures are handled. Options are Periodic, Exponential Backoff, and Final.
    Delay: The time period after which a retry attempt is made. For example, an Attempts value of 3 and a Delay value of 10 minutes cause the workflow to be retried 10, 20, and 30 minutes after the start time of the workflow. The default is 30 minutes.
    Attempts: The number of times the retry policy is applied before the job fails. The default is 3.
    Access Control List: Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

  4. Click Next to view a summary of your entity definition.

  5. (Optional) Click Preview XML to review or edit the entity definition in XML.

  6. After verifying the entity definition, click Save.

    The entity is automatically submitted for verification, but it is not scheduled to run.

  7. Verify that you successfully created the entity.

    1. Type the entity name in the Falcon web UI Search field and press Enter.

    2. If the entity name appears in the search results, it was successfully created.

      For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.

  8. Schedule the entity.

    1. In the search results, click the checkbox next to an entity name with status of Submitted.

    2. Click Schedule.

      After a few seconds a success message displays.
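
If you also manage entities from the command line, the same scheduling step can be performed with the Falcon CLI. This is a sketch that assumes the saved mirror job is available as a process entity named salesMirror; the entity name is an example.

# Schedule the submitted mirror job from the Falcon CLI (entity name is an example).
falcon entity -type process -schedule -name salesMirror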