Replication management and metrics

As a Data Engineer, you can use aggregated Hive replication metrics to manage replication. While replication is running, you can query its progress as shown in the examples in this section. When the volume of managed table data being replicated reaches a certain threshold, you also need to know how to check the progress of the data copy and get information from the YARN UI.

You can extract replication metrics, correlate replication metrics with scheduled queries, track scheduled executions, and look at execution details.

Replication metrics

You can get metrics about a replication policy by providing the correct replication policy name in a select query, as shown in the following example.

select * from sys.replication_metrics where policy_name='repl<policy name>'
order by scheduled_execution_id desc limit 1;

The output, not shown here, is a text-formatted table of rows from sys.replication_metrics. Each row contains the columns described below (name, type, description):

scheduled_execution_id
Long
The execution ID of the scheduler. Every query scheduled using the Hive scheduler has an execution ID, which can be any long value, for example 14. You can obtain it by querying information_schema: select scheduled_execution_id from information_schema.scheduled_executions where schedule_name =<>
policy_name
String
The scheduled policy name. Every query scheduled using the Hive scheduler has a unique combination of cluster namespace and policy name.
dump_execution_id
Long
Empty for a dump and set to an actual value for a load. This value ties the dump and the load together so that you can get consolidated metrics. It is the scheduled_execution_id of the corresponding repl dump policy running on the source cluster.
metadata
Json(>MySql 5.7) Varchar
Compressed JSON that contains the following fields:
  • Dbname/Policy Name : The name of the database on which repl dump is called (repl dump <dbname>).
  • Replication Type : Bootstrap/Incremental
  • Staging Directory : The current staging directory where the data is dumped to or loaded from. This is derived from the hive.repl.rootdir config, with the current dump directory appended to the path. Both the source and target clusters must be able to access this directory. The file system based acknowledgment is also done in this directory.
  • Last Repl Id : The last replication ID of this dump/load for the database.
progress
Json(>MySql 5.7) Varchar
Status : SUCCESS/FAILED/IN_PROGRESS/FAILED_ADMIN
Stages : A JSON array (compressed). Each stage contains the following fields:
  • Name : One of Ranger Export/Ranger Import/Atlas Export/Atlas Import/Dump/Load
  • Status : Success/Failed. Indicates whether this stage succeeded, failed, or is currently in progress.
  • Metrics Name : Progress (for example, Table : 1/10, Function : 2/2, Events : 1/10). This is updated to show the progress information.
  • Start time
  • End time
  • ErrorLogPath : Path where the non-recoverable error marker is stored.
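
The progress column described above can be summarized programmatically once its value has been fetched and decompressed to JSON text. The following Python sketch is illustrative only: the JSON key names (status, stages, name, metrics, currentCount, totalCount) and the sample value are assumptions made for this example and are not taken from the product; adjust them to match the JSON your cluster actually returns.

```python
import json


def summarize_progress(progress_json: str) -> dict:
    """Summarize a progress value as overall status plus per-stage progress.

    Key names are assumptions for illustration, not a documented schema.
    """
    progress = json.loads(progress_json)
    summary = {"status": progress.get("status"), "stages": {}}
    for stage in progress.get("stages", []):
        # Render each metric in the "Name : current/total" style used above.
        counts = ", ".join(
            f"{m['name']} : {m['currentCount']}/{m['totalCount']}"
            for m in stage.get("metrics", [])
        )
        summary["stages"][stage["name"]] = (stage.get("status"), counts)
    return summary


# Hypothetical sample value, shaped after the field descriptions above.
sample_progress = json.dumps({
    "status": "IN_PROGRESS",
    "stages": [{
        "name": "Dump",
        "status": "IN_PROGRESS",
        "metrics": [
            {"name": "Table", "currentCount": 1, "totalCount": 10},
            {"name": "Function", "currentCount": 2, "totalCount": 2},
        ],
    }],
})

progress_summary = summarize_progress(sample_progress)
```

Keeping the summary as a plain dictionary makes it straightforward to log or compare across executions.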

Correlating replication metrics and scheduled queries

The following examples show how to get information about a scheduled policy, track scheduled executions, and check the execution details of an execution ID:

Getting scheduled policy information

The following example creates a scheduled policy:

create scheduled query pol1 every 1 minute as repl dump src01
with('hive.repl.rootdir'='hdfs://ayushsource-1.ayushsource.root.hwx.site:8020/tmp/rootDir3',
'hive.repl.replica.external.table.base.dir'='hdfs://ayushsource-1.ayushsource.root.hwx.site:8020/tmp/ext8',
'hive.repl.run.data.copy.tasks.on.target'='false');

The following example gets information about scheduled queries:

select * from information_schema.scheduled_queries;

Checking any number of scheduled policy executions

select * from information_schema.scheduled_executions
where scheduled_executions.schedule_name='pol1'
order by scheduled_executions.scheduled_execution_id desc limit 10;

Checking a single scheduled execution

select * from sys.replication_metrics where replication_metrics.scheduled_execution_id=9034;
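
The queries in this section share the scheduled_execution_id column, so their results can be correlated on the client side. The following Python sketch assumes the rows of both queries have already been fetched as dictionaries keyed by column name (the fetching itself, the helper name, and the sample values are hypothetical); it attaches the matching replication metrics row, if any, to each scheduled execution.

```python
def correlate_executions(executions, metrics):
    """Pair each scheduled execution with its replication metrics row.

    executions: rows from information_schema.scheduled_executions
    metrics:    rows from sys.replication_metrics
    Returns a list of (execution, metrics_row_or_None) tuples.
    """
    by_id = {row["scheduled_execution_id"]: row for row in metrics}
    return [(e, by_id.get(e["scheduled_execution_id"])) for e in executions]


# Hypothetical rows for illustration only.
sample_executions = [
    {"scheduled_execution_id": 9034, "schedule_name": "pol1"},
    {"scheduled_execution_id": 9035, "schedule_name": "pol1"},
]
sample_metrics = [
    {"scheduled_execution_id": 9034, "policy_name": "pol1"},
]

correlated = correlate_executions(sample_executions, sample_metrics)
```

An execution that has no metrics row yet is paired with None, which makes it easy to spot executions that have not reported metrics.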

Checking data copy through the YARN UI

When the volume of managed table data reaches a certain threshold, the distCp process copies the data from the source cluster to the target cluster. DistCp submits a MapReduce application, so you can check the details of the distCp jobs launched by replication in the YARN UI. The application name of each distCp job launched through replication contains the scheduled policy name, which you can use to filter the applications for a particular replication policy. In the UI you can track the following information:
  • Number of jobs launched for a particular policy in one run
  • Bytes copied
  • Number of mappers launched
  • Status of each task along with other related information
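
The same filtering by policy name can be applied outside the UI. The following Python sketch assumes you have already retrieved and parsed the JSON application list from the YARN ResourceManager REST API (the /ws/v1/cluster/apps endpoint); the helper name, application IDs, and application names in the sample are hypothetical, and the exact naming of distCp jobs on your cluster may differ.

```python
def apps_for_policy(response: dict, policy_name: str) -> list:
    """Return applications whose name contains the given policy name.

    response is the parsed JSON body of the ResourceManager application
    list, shaped as {"apps": {"app": [...]}}; "apps" is null when the
    cluster reports no applications.
    """
    apps = (response.get("apps") or {}).get("app", [])
    return [app for app in apps if policy_name in app.get("name", "")]


# Hypothetical response for illustration only.
sample_response = {
    "apps": {
        "app": [
            {"id": "application_1_0001", "name": "distcp pol1 copy", "state": "FINISHED"},
            {"id": "application_1_0002", "name": "distcp pol2 copy", "state": "RUNNING"},
            {"id": "application_1_0003", "name": "other job", "state": "FINISHED"},
        ]
    }
}

pol1_apps = apps_for_policy(sample_response, "pol1")
```

Because this is a plain substring match, a policy named pol1 also matches applications of a policy named pol10; choose distinctive policy names or tighten the match if that matters.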