Troubleshooting Hive replication using REPL
You need to know how to recover from the FAILED_ADMIN state that stops the replication process.
Problem: A non-recoverable error appears for a replication job and the status says FAILED_ADMIN. How do you recover a schedule from the FAILED_ADMIN state?
Solution: Perform the following steps to recover a replication schedule from this state:
- Navigate to the error log path.
- Search for the file _non_recoverable.
- Open the file, and look for information about an error that caused the replication failed.
- Fix the error.
- Delete the _non_recoverable file.
The_non_recoverable file from the last replication command execution must be deleted; otherwise your replication attempt will malfunction.
Problem: Notification events are missing in the metastore.
Solution: If notification events are not present in the metastore during replication, the replication might be in a FAILED_ADMIN status. When this occurs, notifications are deleted in the metastore. In this case, the workaround is to start a fresh bootstrap phase of replication, as follows:
- Drop the target database using beeline.
- Remove the dump directory on HDFS for the required policy. The path of _non_recoverable
error file path has the dump directory path.
The replication continues where it stopped.