Replication management and metrics
As a data engineer, you can use aggregated Hive replication metrics to manage replication. While replication is running, you can query its progress, as shown in the examples in this topic. When the volume of managed table data being replicated becomes large, you also need to know how to check progress and get information from the YARN UI.
You can extract replication metrics, correlate replication metrics with scheduled queries, track scheduled executions, and look at execution details.
You can get metrics about a replication policy by providing the correct replication policy name in a select query, as shown in the following example.
select * from sys.replication_metrics where policy_name='repl<policy name>' order by scheduled_execution_id desc limit 1;
The output, not shown here, is a text-formatted table from sys.replication_metrics that contains the metrics described below:
- scheduled_execution_id: The execution ID of the scheduler. Every query scheduled through the Hive scheduler has an execution ID, which is a long value, for example 14. You can obtain it by querying the information schema: select scheduled_execution_id from information_schema.scheduled_executions where schedule_name =<>
- policy_name: The scheduled policy name. Every query scheduled through the Hive scheduler has a unique combination of cluster namespace and policy name.
- Dump execution ID: Empty for a dump; holds the actual value for a load. It is the scheduled_execution_id of the corresponding repl dump policy running on the source, and it ties the dump and load together so that you can get consolidated metrics.
- Replication metadata (Json(>MySql 5.7) Varchar, compressed Json):
  - Dbname/Policy Name: The database name on which repl dump is called (repl dump <dbname>).
  - Replication Type: Bootstrap/Incremental
  - Staging Directory: The current staging directory where the data is dumped or loaded from. It is derived from the hive.repl.dir configuration, with the current dump directory appended to that path. Both the source and target clusters must be able to access this directory. The file system based acknowledgment is also written in this directory.
  - Last Repl Id: The last replication ID of this dump or load for the database.
- Replication progress (Json(>MySql 5.7) Varchar):
  - Status: SUCCESS/FAILED/IN_PROGRESS/FAILED_ADMIN
  - Stages: A JSON array (compressed). Each stage contains the following fields:
    - Name: One of Ranger Export, Ranger Import, Atlas Export, Atlas Import, Dump, or Load.
    - Status: Success/Failed. Indicates whether the stage succeeded, failed, or is still in progress.
    - Metrics Name: Progress (for example, Table: 1/10, Function: 2/2, Events: 1/10). This value is updated to show progress information.
    - Start time
    - End time
    - ErrorLogPath: The path where the non-recoverable error marker is stored.
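To look at only the fields described above for the most recent runs of one policy, a query along the following lines may help. This is a sketch under assumptions: the column names dump_execution_id, metadata, and progress are not confirmed by this topic, so verify them first with describe sys.replication_metrics. The policy name pol1 is a placeholder.

```sql
-- Assumed column names: dump_execution_id, metadata, progress.
-- Confirm them with: describe sys.replication_metrics;
-- 'pol1' is a placeholder policy name.
select scheduled_execution_id,
       policy_name,
       dump_execution_id,
       metadata,
       progress
from sys.replication_metrics
where policy_name = 'pol1'
order by scheduled_execution_id desc
limit 5;
```

On the source cluster the dump execution ID is expected to be empty; on the target cluster it should match a scheduled_execution_id reported for the dump policy.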
Correlating replication metrics and scheduled queries
The following examples show how to get information about a scheduled policy, track scheduled executions, and check the execution details of an execution ID:
Getting scheduled policy information
The following example creates a scheduled policy:
create scheduled query pol1 every 1 minute as repl dump src01 with('hive.repl.rootdir'='hdfs://ayushsource-1.ayushsource.root.hwx.site:8020/tmp/rootDir3', 'hive.repl.replica.external.table.base.dir'='hdfs://ayushsource-1.ayushsource.root.hwx.site:8020/tmp/ext8','hive.repl.run.data.copy.tasks.on.target'='false');
The following example gets information about scheduled queries:
select * from information_schema.scheduled_queries;
Checking any number of scheduled policy executions
select * from information_schema.scheduled_executions where scheduled_executions.schedule_name='pol1' sort by scheduled_executions.scheduled_execution_id DESC LIMIT 10;
Checking a single scheduled execution
select * from sys.replication_metrics where replication_metrics.scheduled_execution_id=9034;
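The two lookups above can also be combined into one query that correlates scheduled executions with their replication metrics. The following is a minimal sketch, not a definitive recipe: it assumes both tables are queried on the same cluster and that scheduled_execution_id has the same meaning in both, and it reuses pol1 from the earlier example as the schedule name.

```sql
-- Join each recent execution of schedule 'pol1' to its replication metrics row.
-- Uses only columns that appear in the queries above:
-- schedule_name and scheduled_execution_id.
select e.schedule_name,
       m.*
from information_schema.scheduled_executions e
join sys.replication_metrics m
  on m.scheduled_execution_id = e.scheduled_execution_id
where e.schedule_name = 'pol1'
order by m.scheduled_execution_id desc
limit 10;
```

An inner join returns only executions that have a metrics row. Switching to a left outer join would also list executions with no metrics row.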
Checking data copy through the YARN UI
You can check the following information about the data copy in the YARN UI:
- Number of jobs launched for a particular policy in one run
- Bytes copied
- Number of mappers launched
- Status of each task along with other related information