Monitoring the performance of Hive/Impala replication policies
You can monitor the progress of a Hive/Impala replication policy using performance data
that you download as a CSV file from Replication Manager.
This file contains information about the tables and partitions being replicated, the average
throughput, and other details that can help diagnose performance issues during Hive/Impala
replications. You can view this performance data for running Hive/Impala replication jobs
and for completed jobs. The performance data is collected every two minutes. Therefore, no
data is available during the initial execution of a replication job because not enough
samples are available to estimate throughput and other reported data.
To view the performance data for a running Hive/Impala replication policy,
perform the following steps:
Go to the Cloudera Manager > Replication > Replication Policies page.
Locate and select the replication policy. Click Actions > Show History.
Click Download CSV for the HDFS Replication
Report field, and choose one of the following performance reports to
download:
Performance file contains a summary report about the
performance of the replication job which includes the last performance sample for
each mapper working on the replication job.
Full Performance file contains the complete performance
report about the job which includes all the samples taken for all mappers during
the full run of the replication job.
Open the file in a spreadsheet program such as Microsoft Excel.
The following columns appear in the CSV file:
Timestamp when the performance data was collected.
Host where the YARN or MapReduce job was running.
Number of Bytes Copied for the file currently being
copied.
Time Elapsed (ms) for the copy operation of the file
currently being copied.
Number of Files Copied.
Avg Throughput (KB/s) since the start of the file
currently being copied, in kilobytes per second.
Last File (bytes) is the size of the last file copied.
Last File Time (ms) is the time taken to copy the last file.
Last file throughput (KB/s) is the throughput for the last
file copied, in kilobytes per second.
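The following Python sketch shows one way to summarize a downloaded performance file per host instead of reading it in a spreadsheet. It is only an illustration: the header names in the sample (for example Host and Avg Throughput (KB/s)) are assumed to match the column descriptions above and may differ in an actual export, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict

# Invented sample data. The header names are an assumption based on the
# column descriptions in this topic; check them against a real export.
SAMPLE = """Timestamp,Host,Number of Bytes Copied,Time Elapsed (ms),Number of Files Copied,Avg Throughput (KB/s),Last File (bytes),Last File Time (ms),Last File Throughput (KB/s)
2024-01-01 10:00:00,host-a,1048576,2000,3,400.0,1048576,2000,512.0
2024-01-01 10:02:00,host-a,2097152,4000,5,600.0,2097152,4000,512.0
2024-01-01 10:02:00,host-b,524288,1000,2,512.0,524288,1000,512.0
"""


def summarize_by_host(csv_text):
    """Return {host: (files copied in the latest row, mean Avg Throughput)}."""
    throughputs = defaultdict(list)
    files_copied = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        host = row["Host"]
        throughputs[host].append(float(row["Avg Throughput (KB/s)"]))
        # Later rows overwrite earlier ones, so the last row per host wins.
        files_copied[host] = int(row["Number of Files Copied"])
    return {
        host: (files_copied[host], sum(values) / len(values))
        for host, values in throughputs.items()
    }


summary = summarize_by_host(SAMPLE)
for host, (files, mean_kbps) in sorted(summary.items()):
    print(f"{host}: {files} files, mean throughput {mean_kbps:.1f} KB/s")
```

To use the sketch with a real download, read the file contents into a string and pass it to summarize_by_host. If a host runs several mappers, the rows for that host are mixed together here; separating them would need a mapper identifier, which this topic does not describe.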
Download the following CSV reports to view more information about the replication
job:
Listing report contains the list of files and directories
copied during the replication job.
Status report contains the full status report of the
files, where the replication status is shown as one of the following:
ERROR if an error occurred during replication and the file
was not copied.
DELETED for deleted files.
SKIPPED for up-to-date files that were not
replicated.
Error Status Only report contains the status report of
all copied files with errors. The file lists the status, path, and message for the
copied files with errors.
Deleted Status Only report contains the status report of
all deleted files. The file lists the status, path, and message for the databases
and tables that were deleted.
Skipped Status Only report contains the status report of
all skipped files. The file lists the status, path, and message for the databases
and tables that were skipped.
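As an illustration of working with the Status report outside a spreadsheet, the following Python sketch counts rows by status and collects the paths and messages of the ERROR rows. The header names Status, Path, and Message are assumptions derived from the description above (status, path, and message), and the sample rows are invented.

```python
import csv
import io
from collections import Counter

# Invented sample. Header names are assumed; a real report may label
# the status, path, and message columns differently.
SAMPLE_STATUS = """Status,Path,Message
ERROR,/warehouse/db1/t1/part-0,Permission denied
SKIPPED,/warehouse/db1/t1/part-1,
DELETED,/warehouse/db1/t2/part-0,
ERROR,/warehouse/db1/t3/part-0,File not found
"""


def status_counts(csv_text):
    """Return (Counter of statuses, list of (path, message) for ERROR rows)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(row["Status"] for row in rows)
    errors = [(row["Path"], row["Message"]) for row in rows
              if row["Status"] == "ERROR"]
    return counts, errors


counts, errors = status_counts(SAMPLE_STATUS)
print(dict(counts))
for path, message in errors:
    print(f"ERROR {path}: {message}")
```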
To view the performance data for a completed Hive/Impala replication policy,
perform the following steps:
Go to the Cloudera Manager > Replication > Replication Policies page.
Locate and select the replication policy. Click Actions > Show History.
Click Download CSV for the Hive External Table
Replication Report field, and choose one of the following
reports to download in CSV format:
Results file contains a listing of replicated
tables.
Performance file contains a summary report about the
performance of the replication job.
Open the file in a spreadsheet program such as Microsoft Excel.
The following columns appear in the CSV file:
Timestamp when the performance data was collected.
Host where the YARN or MapReduce job was running.
DbName is the database name.
TableName is the table name.
TotalElapsedTimeSecs is the number of seconds elapsed
from the start of the copy operation.
TotalTableCount is the total number of tables to be
copied. The value of the column shows -1 for replications where
Cloudera Manager cannot determine the number of tables being changed.
TotalPartitionCount is the total number of partitions to
be copied. If the source cluster is running Cloudera Manager 5.9 or lower, this
column shows -1 because older releases do not report this
information.
DbCount is the current number of databases copied.
DbErrorCount is the number of failed database copy
operations.
TableCount is the total number of tables for all
databases copied so far.
CurrentTableCount is the total number of tables copied
for the current database.
TableErrorCount is the total number of failed table copy
operations.
PartitionCount is the total number of partitions copied
so far for all tables.
CurrPartitionCount is the total number of partitions
copied for the current table.
PartitionSkippedCount is the number of partitions skipped
because they were copied in the previous run of the replication job.
IndexCount is the total number of index files copied for
all databases.
CurrIndexCount is the total number of index files copied
for the current database.
IndexSkippedCount is the number of index files skipped
because they were not altered. Due to a bug in Hive, this value is always
zero.
HiveFunctionCount is the number of Hive functions
copied.
ImpalaObjectCount is the number of Impala objects
copied.
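The counters above can be combined into a rough progress figure. The following Python sketch divides TableCount by TotalTableCount for a single row of the performance file and treats -1 as "unknown", as described for TotalTableCount. The function name table_progress is hypothetical, and the sketch assumes the CSV headers match the column names listed above.

```python
def table_progress(row):
    """Return the percentage of tables copied, or None if the total is unknown.

    `row` is a mapping of column name to string value, such as a row
    produced by csv.DictReader. Column names are assumed to match the
    names listed in this topic.
    """
    total = int(row["TotalTableCount"])
    copied = int(row["TableCount"])
    if total <= 0:
        # -1 means Cloudera Manager could not determine the table count.
        return None
    return 100.0 * copied / total


print(table_progress({"TotalTableCount": "40", "TableCount": "10"}))
print(table_progress({"TotalTableCount": "-1", "TableCount": "10"}))
```

The same pattern could be applied to PartitionCount and TotalPartitionCount, which also reports -1 when the source cluster does not provide the information.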
Note the following limitations and
known issues about the replication reports:
If you click the CSV download too soon after the replication job starts, Cloudera
Manager returns an empty file or a CSV file that has column headers only, along with a
message to try again later, after performance data has been collected.
If you employ a proxy user with the form user@domain,
performance data is not available through the links.
If the replication job only replicates small files that can be transferred in less than
a few minutes, no performance statistics are collected.
If you specify the Dynamic Replication Strategy during replication policy
creation, statistics regarding the last file transferred by a MapReduce job hide previous
transfers performed by that MapReduce job.
Only the last trace per MapReduce job is reported in the CSV file.
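Because a download made too early can be empty or contain only column headers, a script that processes these files may want to check for data rows first. The following Python sketch shows one such check. The function name has_samples is hypothetical, and the check assumes the first row of a non-empty file is the header row.

```python
import csv
import io


def has_samples(csv_text):
    """Return True if the CSV text has at least one non-blank data row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # Skip the header row, if there is one.
    return any(any(cell.strip() for cell in row) for row in reader)


print(has_samples(""))                               # empty download
print(has_samples("Timestamp,Host\n"))               # headers only
print(has_samples("Timestamp,Host\n2024,host-a\n"))  # one data row
```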