Troubleshooting Hive replication using REPL

You need to know how to recover from the FAILED_ADMIN state that stops the replication process.

Problem: A non-recoverable error appears for a replication job and the status says FAILED_ADMIN. How do you recover a schedule from the FAILED_ADMIN state?

Solution: Perform the following steps to recover a replication schedule from this state:

  1. Navigate to the error log path.
  2. Search for the file _non_recoverable.
  3. Open the file, and look for information about an error that caused the replication failed.
  4. Fix the error.
  5. Delete the _non_recoverable file.

    The_non_recoverable file from the last replication command execution must be deleted; otherwise your replication attempt will malfunction.

Problem: Notification events are missing in the metastore.

Solution: If notification events are not present in the metastore during replication, the replication might be in a FAILED_ADMIN status. When this occurs, notifications are deleted in the metastore. In this case, the workaround is to start a fresh bootstrap phase of replication, as follows:

  1. Drop the target database using beeline.
  2. Remove the dump directory on HDFS for the required policy. The path of _non_recoverable error file path has the dump directory path.

    The replication continues where it stopped.