Monitoring the performance of Hive/Impala replication policies
You can monitor the progress of a Hive/Impala replication policy using performance data that you download as a CSV file from the Cloudera Manager Admin console.
This file contains information about the tables and partitions being replicated, the average throughput, and other details that can help diagnose performance issues during Hive/Impala replications. You can view this performance data for running Hive/Impala replication jobs and for completed jobs.
- From Cloudera Manager, select Replication > Replication Policies.
- Locate the row for the policy, select the policy, and click Actions. Select Show History.
- Click Download CSV for HDFS Replication Report, and then choose one of the following options to view the performance report:
- Performance. Click to download a CSV file which contains a summary report about the performance of the replication job. The performance summary report includes the last performance sample for each mapper that is working on the replication job.
- Full Performance. Click to download the CSV file that contains the performance report of the job. The complete performance report includes all the samples taken for all mappers during the full execution of the replication job.
- To view the data, import the file into a spreadsheet program such as Microsoft Excel.
In addition to the performance reports, you can view the reports of files with errors, files that are deleted, and files that are skipped during the replication job. To view the reports, perform the following steps:
- On the Replication Policies page, locate the policy and click Actions > Show History.
The Replication History page for the replication policy appears. Expand to view the replication job details.
- Click Download CSV for the following options:
- Listing - Click to download the CSV file that contains the replication report. The file lists the files and directories copied during the replication job.
- Status - Click to download the CSV file that contains the complete status report. The report covers the files whose replication status is one of the following:
- ERROR – An error occurred and the file was not copied.
- DELETED – A deleted file.
- SKIPPED – A file where the replication was skipped because it was up-to-date.
- Error Status Only - Click to download the CSV file that contains the status report of all copied files with errors. The file lists the status, path, and message for the copied files with errors.
- Deleted Status Only - Click to download the CSV file that contains the status report of all deleted files. The file lists the status, path, and message for the databases and tables that were deleted.
- Skipped Status Only - Click to download the CSV file that contains the status report of all skipped files. The file lists the status, path, and message for the databases and tables that were skipped.
- Performance - Click to download a CSV file which contains a summary report about the performance of the running replication job. The performance summary report includes the last performance sample for each mapper that is working on the replication job.
- Full Performance - Click to download the CSV file that contains the performance report of the job. The performance report shows the samples taken for all the mappers during the full execution of the replication job.
To view the data, open the file in a spreadsheet program such as Microsoft Excel.
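As an alternative to a spreadsheet, a downloaded status report can be summarized with a short script. The following is a minimal sketch, not an official tool: it assumes the CSV has a header row whose status column is named `Status` (the actual header names in the report may differ, so the column name is a parameter), and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def tally_statuses(csv_file, status_column="Status"):
    """Count rows per replication status (ERROR, DELETED, SKIPPED, ...).

    csv_file is any open text file or file-like object.
    status_column is an assumed header name; adjust it to match the report.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        status = (row.get(status_column) or "").strip().upper()
        if status:
            counts[status] += 1
    return counts


# Hypothetical sample data standing in for a downloaded status report.
sample = io.StringIO(
    "Status,Path,Message\n"
    "ERROR,/warehouse/db1/t1/part-0,Permission denied\n"
    "SKIPPED,/warehouse/db1/t2/part-0,Up to date\n"
    "SKIPPED,/warehouse/db1/t2/part-1,Up to date\n"
)
for status, count in sorted(tally_statuses(sample).items()):
    print(status, count)
```

To run it against a real download, pass an opened file instead of the sample, for example `tally_statuses(open("status.csv", newline=""))`, where `status.csv` is a placeholder file name.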
The performance data is collected every two minutes. Therefore, no data is available during the initial execution of a replication job because not enough samples are available to estimate throughput and other reported data.
- From Cloudera Manager, select Replication > Replication Policies.
- Locate the row for the policy, select the policy, and click Actions. Select Show History.
- To view performance of the Hive phase, click Download CSV next to
the Hive Replication Report label and select one of the following
options:
- Results - Downloads a listing of replicated tables in a CSV file.
- Performance - Downloads a performance report for the Hive replication in a CSV file.
See Table 1 for a description of the data in the HDFS performance reports.
- To view the data, open the file in a spreadsheet program such as Microsoft Excel.
The performance data is collected every two minutes. Therefore, no data is available during the initial execution of a replication job because not enough samples are available to estimate throughput and other reported data.
Hive Performance Data Columns | Description |
---|---|
Timestamp | Time when the performance data was collected. |
Host | Name of the host where the YARN or MapReduce job was running. |
DbName | Name of the database. |
TableName | Name of the table. |
TotalElapsedTimeSecs | Number of seconds elapsed from the start of the copy operation. |
TotalTableCount | Total number of tables to be copied. The value of the column will be |
TotalPartitionCount | Total number of partitions to be copied. If the source cluster is running Cloudera Manager 5.9 or lower, this column contains a value of |
DbCount | Current number of databases copied. |
DbErrorCount | Number of failed database copy operations. |
TableCount | Total number of tables (for all databases) copied so far. |
CurrentTableCount | Total number of tables copied for current database. |
TableErrorCount | Total number of failed table copy operations. |
PartitionCount | Total number of partitions copied so far (for all tables). |
CurrPartitionCount | Total number of partitions copied for the current table. |
PartitionSkippedCount | Number of partitions skipped because they were copied in the previous run of the replication job. |
IndexCount | Total number of index files copied (for all databases). |
CurrIndexCount | Total number of index files copied for the current database. |
IndexSkippedCount | Number of index files skipped because they were not altered. Due to a bug in Hive, this value is always zero. |
HiveFunctionCount | Number of Hive functions copied. |
ImpalaObjectCount | Number of Impala objects copied. |
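The columns above can also be read programmatically to estimate how far a replication job has progressed. The following is a rough sketch under stated assumptions: the CSV header names match the column names in the table exactly, the rows appear in chronological order so that the last row is the most recent sample, and the counters are plain integers. None of these is confirmed here, and the sample values are hypothetical.

```python
import csv
import io


def latest_progress(csv_file):
    """Summarize the most recent sample of a Hive performance report.

    Returns None when the file contains no data rows yet.
    """
    rows = list(csv.DictReader(csv_file))
    if not rows:
        return None
    last = rows[-1]  # assumption: rows are in chronological order
    copied = int(last["TableCount"])
    total = int(last["TotalTableCount"])
    elapsed = int(last["TotalElapsedTimeSecs"])
    return {
        "tables_copied": copied,
        "tables_total": total,
        # Guard against a non-positive total, which cannot yield a percentage.
        "percent_tables": round(100.0 * copied / total, 1) if total > 0 else None,
        "table_errors": int(last["TableErrorCount"]),
        "tables_per_minute": round(60.0 * copied / elapsed, 2) if elapsed > 0 else None,
    }


# Hypothetical samples taken two minutes apart.
sample = io.StringIO(
    "Timestamp,TotalElapsedTimeSecs,TotalTableCount,TableCount,TableErrorCount\n"
    "2024-01-01 10:02:00,120,120,12,0\n"
    "2024-01-01 10:04:00,240,120,30,1\n"
)
print(latest_progress(sample))
```

Only the last row is used because the counters described in the table are cumulative ("copied so far"), so the most recent sample already reflects all earlier work.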
- If you click the CSV download too soon after the replication job starts, Cloudera Manager returns an empty file or a CSV file that has column headers only and a message to try later when performance data has actually been collected.
- If you employ a proxy user with the form user@domain, performance data is not available through the links.
- If the replication job only replicates small files that can be transferred in less than a few minutes, no performance statistics are collected.
- For replication policies that specify the Dynamic Replication Strategy, statistics regarding the last file transferred by a MapReduce job hide previous transfers performed by that MapReduce job.
- Only the last trace of each MapReduce job is reported in the CSV file.
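Because a download made too early can be empty or contain headers only, a script that processes these files may want to check for data rows first. The sketch below is one possible check, not documented Cloudera behavior: it treats the first non-blank row as the header and counts a later row as data only if it has the same number of fields, on the assumption that any "try later" message would not match the header width.

```python
import csv
import io


def has_performance_samples(csv_text):
    """Return True if the CSV text holds at least one data row after the header."""
    rows = [
        row
        for row in csv.reader(io.StringIO(csv_text))
        if any(cell.strip() for cell in row)  # ignore blank lines
    ]
    if len(rows) < 2:
        return False  # empty file, or header row only
    header_width = len(rows[0])
    # Assumption: real samples have exactly as many fields as the header.
    return any(len(row) == header_width for row in rows[1:])


# Hypothetical contents of three downloads.
print(has_performance_samples(""))
print(has_performance_samples("Timestamp,Host,DbName\n"))
print(has_performance_samples("Timestamp,Host,DbName\n2024-01-01 10:02:00,host1,db1\n"))
```

If the check returns False, retrying the download after a few minutes is consistent with the two-minute sampling interval described earlier.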