Chapter 10. Mirroring Data with HiveDR in a Secure Environment
From the Apache Falcon UI, you can configure your cluster for disaster recovery by mirroring data in Hive databases, tables, or partitions. Before setting up a HiveDR job in Falcon, you should configure a secure environment.
Important: Before implementing HiveDR with Falcon, you should discuss your plans with Hortonworks Professional Services and Product Management to ensure proper component configuration and adequate protection of your data.
Important Considerations
Before you implement HiveDR, thoroughly consider the impact of the following items on your design and whether your environment would be appropriate for a HiveDR implementation.
Recovery point objective (RPO) is a minimum of 24 hours
The following features are not supported:
External tables and non-native tables
Renaming of partitions or tables
ACID tables
The following are not copied as part of the replication process:
Hive Views
Hive statistics
UDFs and security policies
Also review the "Considerations for Using Falcon".
Prerequisites
The Kerberized source and target clusters must exist.
You must be running the following Apache components:
Hive 1.2.0 or later
Oozie 4.2.0 or later
Sqoop 1.4.6 or later
Falcon uses Apache Sqoop for import and export operations. Sqoop requires a database driver to connect to the relational database. Refer to the Sqoop documentation if you need more information.
Ensure that the database driver .jar file is copied into the Oozie share library for Sqoop.
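For example, a minimal sketch of copying a MySQL JDBC driver into the Oozie share library and refreshing it (the driver file name, share library timestamp directory, and Oozie server URL below are placeholders, not values from your cluster):

# Copy the JDBC driver .jar into the Sqoop directory of the current Oozie share library (paths are placeholders)
$ hdfs dfs -put /usr/share/java/mysql-connector-java.jar /user/oozie/share/lib/lib_20160101000000/sqoop/
# Refresh the share library so the Oozie server picks up the new .jar (Oozie URL is a placeholder)
$ oozie admin -oozie http://oozie-server.example.com:11000/oozie -sharelibupdate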
About this task
Database replication definitions can be set up only on preexisting databases. Therefore, before you can replicate a database the first time, you must create a target database. To prevent unintentional replication of private data, Hive does not allow the replication of a database without a pre-existing target.
After the database target is created, you must export all tables in the source database and import them in the target database.
Steps
Create a Hive database on the target cluster
You only need to perform this task if you are replicating an entire database. See the Data Access guide or the Apache Hive documentation.
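For example, a minimal sketch of creating an empty target database (the database name is a placeholder for illustration):

hive> create database if not exists sales_target;  -- database name is a placeholder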
Create a Hive table/partition on the source cluster
You only need to perform this task if the table or partition you want to replicate does not already exist. See the Data Access guide or the Apache Hive documentation.
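For example, a sketch of creating a partitioned table on the source cluster (the table name, columns, and partition key are placeholders for illustration; note that ACID tables are not supported for HiveDR):

hive> create table global_sales (id int, amount double) partitioned by (country string) stored as orc;  -- table layout is a placeholder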
Prepare for Disaster Recovery
Before creating the required entity definitions in Falcon, you must modify several configuration files. These files can be modified from Ambari or from the command line. To access the service properties in Ambari, click the Services menu and select the service you want, then click the Configs tab. For some services, you must also click the Advanced tab.
auth-to-local Setting
For Kerberos principals to be mapped correctly across clusters, changes must be made to the hadoop.security.auth_to_local setting in the core-site.xml file on both clusters. These entries should be made for the ambari-qa, hbase, and hdfs principals.

For example, to add the principal mapping for CLUSTER02 into CLUSTER01, add the following lines to the hadoop.security.auth_to_local field in CLUSTER01. Note that these lines must be inserted before the wildcard line RULE:[1:$1@$0](.*@HADOOP)s/@.*//:

RULE:[1:$1@$0](ambari-qa-CLUSTER02@HADOOP)s/.*/ambari-qa/
RULE:[1:$1@$0](hbase-CLUSTER02@HADOOP)s/.*/hbase/
RULE:[1:$1@$0](hdfs-CLUSTER02@HADOOP)s/.*/hdfs/
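As a sketch, the complete hadoop.security.auth_to_local field on CLUSTER01 might then look similar to the following; your field may contain additional rules and typically ends with a DEFAULT entry:

RULE:[1:$1@$0](ambari-qa-CLUSTER02@HADOOP)s/.*/ambari-qa/
RULE:[1:$1@$0](hbase-CLUSTER02@HADOOP)s/.*/hbase/
RULE:[1:$1@$0](hdfs-CLUSTER02@HADOOP)s/.*/hdfs/
RULE:[1:$1@$0](.*@HADOOP)s/@.*//
DEFAULT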
Important: Ensure that the hadoop.security.auth_to_local setting in the HDFS core-site.xml file is consistent across all clusters.
Proxy User
To allow the Falcon Oozie workflows to query the primary status of both clusters, the Oozie server hosts must be whitelisted as proxy hosts to allow cross-site Oozie access. On both clusters, the hadoop.proxyuser.oozie.hosts setting in the HDFS core-site.xml file must contain the hostnames of all the Oozie servers in both clusters. The hostnames of all the Oozie servers must also be added to the hadoop.proxyuser.hive.hosts setting, because the Oozie server proxies as the Hive user during replication.

Also, to allow cross-site Apache Hive access for Hive replication, the Hive and HCatalog hosts for both clusters must be whitelisted as proxy hosts. On both clusters, the hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hcat.hosts settings in the HDFS core-site.xml file must contain the hostnames of all the Hive Metastore and HiveServer2 hosts for Hive, and of the WebHCat hosts for HCatalog.

You can change these settings through Ambari.
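As an illustration only (the host names below are placeholders, not values taken from your clusters), the resulting entries in the core-site.xml of both clusters might resemble the following:

hadoop.proxyuser.oozie.hosts=oozie01.cluster01.example.com,oozie01.cluster02.example.com
hadoop.proxyuser.hive.hosts=metastore01.cluster01.example.com,hiveserver01.cluster01.example.com,metastore01.cluster02.example.com,hiveserver01.cluster02.example.com,oozie01.cluster01.example.com,oozie01.cluster02.example.com
hadoop.proxyuser.hcat.hosts=webhcat01.cluster01.example.com,webhcat01.cluster02.example.com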
Nameservices
For cross-site DistCp to function, both clusters must be able to resolve the nameservice of the other cluster. In Ambari, modify the following properties in the hdfs-site section to add the appropriate cluster names.
dfs.internal.nameservices: CLUSTER01
dfs.client.failover.proxy.provider.CLUSTER02=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.namenodes.CLUSTER02=nn1,nn2
dfs.namenode.http-address.CLUSTER02.nn1=namenode01.hostname:50070
dfs.namenode.http-address.CLUSTER02.nn2=namenode02.hostname:50070
dfs.namenode.https-address.CLUSTER02.nn1=namenode01.hostname:50470
dfs.namenode.https-address.CLUSTER02.nn2=namenode02.hostname:50470
dfs.namenode.rpc-address.CLUSTER02.nn1=namenode01.hostname:8020
dfs.namenode.rpc-address.CLUSTER02.nn2=namenode02.hostname:8020
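After the properties are saved and the affected services have been restarted, you can confirm from either cluster that the remote nameservice resolves; for example (the nameservice and path below are placeholders):

$ hadoop fs -ls hdfs://CLUSTER02/user    # remote nameservice and path are placeholders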
Configure Properties for HiveDR
You must configure several Hadoop and Hive properties to set up a secure environment before you can implement HiveDR jobs.
In Ambari, on the source cluster, navigate to HDFS > Configs > Advanced.
Scroll to the Custom core-site section and specify the host names for each of the following attributes:
Property: hadoop.proxyuser.hdfs.hosts, Value: The host name is located at HDFS > Configs > Advanced > NameNode.
Property: hadoop.proxyuser.yarn.hosts, Value: The host name is located at YARN > Configs > Advanced > Resource Manager.
Property: hadoop.proxyuser.hive.hosts, Value: The host name is located at Hive > Configs > Advanced > Hive Metastore.
Property: hadoop.proxyuser.oozie.hosts, Value: The host name is located at Oozie > Configs > Oozie Server.
Property: hadoop.proxyuser.hcat.hosts, Value: The host name is located at Hive > Configs > Advanced > Hive Metastore.
Property: hadoop.proxyuser.http.hosts, Value: The host name is located at Hive > Configs > Advanced > Hive Metastore.
Property: hadoop.proxyuser.falcon.hosts, Value: The host name is located at Falcon > Configs > Falcon Server.

You can use an asterisk (*) in place of a host name to indicate that any host can connect.
Navigate to Hive > Configs > Advanced.
Scroll to the Custom hive-site section and click Add Property.
Add the indicated values for the following properties:
Property: hive.metastore.transactional.event.listeners, Value: org.apache.hive.hcatalog.listener.DbNotificationListener
Property: hive.metastore.dml.events, Value: true
Property: hive.metastore.event.db.listener.timetolive, Value: 432000s
Property: hive.exim.strict.repl.tables, Value: true
Property: hive.server2.transport.mode, Value: binary

Restart the impacted services on the cluster.
Repeat steps 1 through 6 on the target cluster.
What's Next?
You need to create both the source and the destination cluster entities from the Falcon UI.
Initialize Data for Hive Replication
Before running a HiveDR job, you must create an initial copy of any table or database that you intend to replicate on an ongoing basis. This initialization process is sometimes referred to as bootstrapping. Falcon uses the initialized data on the target cluster as a baseline to do incremental replications.
You perform an initialization by using the export and import commands. When you export the source data, you assign a base event ID. The event ID is incremented each time a new replication job runs. When the exported data is imported, a check is done to determine whether a table or database already exists that has the same base event ID. If the incremented event ID of the imported data indicates that the imported data is newer, the existing data with the same base ID is replaced. If the imported data is older, no update is performed and no error is generated.
Prerequisites
For database replication, the target database must exist on the target cluster.
The Hive tables or partitions must exist on the source cluster.
The Falcon cluster entities must exist on both the source cluster and the destination cluster.
You need the full path to any databases and tables you want to replicate.
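If you do not know the HDFS location of a table, one way to look it up is to describe the table and check its Location field; for example (the table name matches the sample used in the steps below):

hive> describe formatted global_sales;  -- the Location row in the output shows the full HDFS path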
In the following steps, replace the sample table names, event IDs, and paths shown in the code examples with values appropriate for your environment.
Export the table or partition data to an output location on the source, using the for replication(eventid) tag.
This ID tag must not contain numbers.
For example, to export an SQL table named global_sales:

hive> export table global_sales to '/user/tmp/global_sales' for replication(july_global_sales);

Run the distcp command to copy the exported data to the target Hive data warehouse on the destination cluster.
$ hadoop distcp hdfs://machine-1-1.openstacklocal:8020/user/tmp/global_sales hdfs://machine-2-1.openstacklocal:8020/user/hive_imports/global_sales
On the destination cluster, perform an import on the exported data.
hive> import table global_sales from '/user/hive_imports/global_sales' location '/user/sales/global_sales';
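To confirm that the bootstrap completed, you can, for example, verify that the imported table exists and is readable on the destination cluster (the table name follows the earlier example):

hive> show tables like 'global_sales';   -- should list the imported table
hive> select count(*) from global_sales; -- should return the same row count as the source table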
For more information, see the Apache Hive LanguageManual ImportExport wiki page.
Mirror Hive Data Using the Web UI
You can quickly mirror Apache Hive databases or tables between source and target clusters with HiveServer2 endpoints. You can also enable TDE encryption for your mirror.
Ensure that you have set permissions correctly and defined required entities as described in Preparing to Mirror Data.
At the top of the Falcon web UI page, click Create > Mirror > Hive.
On the New Hive Mirror page, specify the values for the following properties:
Table 10.1. General Hive Mirror Properties
Mirror Name: Name of the mirror job. The naming criteria are as follows:
Must be unique
Must start with a letter
Is case sensitive
Can contain 2 to 40 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces

Tags: Enter the key/value pair for metadata tagging to assist in entity search in Falcon. The criteria are as follows:
Can contain 1 to 100 characters
Can include numbers
Can use a dash (-) but no other special characters
Cannot contain spaces
Table 10.2. Source and Target Hive Mirror Properties
Source, Cluster: Select an existing cluster entity to serve as the source for the mirrored data. At least one cluster entity must be available in Falcon.
Source, HiveServer2 Endpoint: The Hive2 server endpoint. This is the location of the data to be mirrored. The format is hive2://localhost:10000.
Source, Hive2 Kerberos Principal: The service principal for the metastore Thrift server. The field is automatically populated. The value is displayed in Ambari at Hive > Config > Advanced > Advanced hive-site > hive.metastore.kerberos.principal.
Source, Meta Store URI: Used by the metastore client to connect to the remote metastore. The value is displayed in Ambari at Hive > Config > Advanced > General > hive.metastore.uris.
Source, Kerberos Principal: The field is automatically populated. Property=dfs.namenode.kerberos.principal and Value=nn/_HOST@EXAMPLE.COM.
Target, Cluster: Select an existing cluster entity to serve as the target for the mirrored data. At least one cluster entity must be available in Falcon.
Target, HiveServer2 Endpoint: The Hive2 server endpoint. This is the location of the mirrored data. The format is hive2://localhost:10000.
Target, Hive2 Kerberos Principal: The service principal for the metastore Thrift server. The field is automatically populated. The value is displayed in Ambari at Hive > Config > Advanced > Advanced hive-site > hive.metastore.kerberos.principal.
Target, Meta Store URI: Used by the metastore client to connect to the remote metastore. The value is displayed in Ambari at Hive > Config > Advanced > General > hive.metastore.uris.
Target, Kerberos Principal: The field is automatically populated. Property=dfs.namenode.kerberos.principal and Value=nn/_HOST@EXAMPLE.COM.
Run job here: Choose whether to execute the job on the source cluster or on the target cluster.
I want to copy: Select whether to copy one or more Hive databases, or one or more tables from a single database. You must identify the specific databases and tables to be copied.
Validity Start and End: Combined with the frequency value, determines the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the schedule time and when all the inputs are available. The workflow ends before the specified end time, so there is not a workflow instance at end time. Also known as run duration.
Frequency: How often the process is generated. Valid frequency types are minutes, hours, days, and months.
Timezone: The timezone associated with the validity start and end times. The default timezone is UTC.
Send alerts to: A comma-separated list of email addresses to which alerts are sent. The format is name@xyz.com.

Table 10.3. Advanced Hive Mirror Properties
TDE Encryption: Enables encryption of data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Features for more information.
Max Maps for DistCp: The maximum number of maps used during replication. This setting impacts performance and throttling. Default is 5.
Max Bandwidth (MB): The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.
Retry Policy: Defines how workflow failures should be handled. Options are Periodic, Exponential Backoff, and Final.
Delay: The time period after which a retry attempt is made. For example, an Attempts value of 3 and a Delay value of 10 minutes would cause the workflow retries to occur 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.
Attempts: How many times the retry policy should be implemented before the job fails. Default is 3.
Access Control List: Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).

Click Next to view a summary of your entity definition.
(Optional) Click Preview XML to review or edit the entity definition in XML.
After verifying the entity definition, click Save.
The entity is automatically submitted for verification, but it is not scheduled to run.
Verify that you successfully created the entity.
Type the entity name in the Falcon web UI Search field and press Enter.
If the entity name appears in the search results, it was successfully created.
For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
Schedule the entity.
In the search results, click the checkbox next to an entity name with a status of Submitted, and then click Schedule.
After a few seconds a success message displays.