Troubleshooting Kudu Replication
Learn where to find error logs, how to resolve common replication errors, and how to identify the cause of a stalled replication job.
Error log locations
The replication job embeds the Kudu Java client directly inside the Flink TaskManagers and JobManager. Kudu-level errors, such as scan failures or write errors, appear as standard Java exceptions in the Flink logs rather than in the Kudu master or tablet server logs.
Check the following locations for specific errors:
- TaskManager logs: These logs contain errors that occur during data reading (scan RPCs) or writing (upsert or delete RPCs).
- JobManager logs: These logs contain errors related to the enumerator, such as split discovery and diff scan scheduling, as well as checkpoint coordination and job-level failures.
To access these logs, you can use the YARN Resource Manager user interface (UI) or run the following command in the command-line interface (CLI):
yarn logs -applicationId <yarn-application-id>
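Because Kudu-level errors surface as Java exceptions embedded in the Flink logs, a small filter can speed up triage of a dumped log. The following is a minimal sketch, assuming you have saved the output of the `yarn logs` command to a file; the exception patterns are illustrative and should be adjusted to your deployment.

```python
import re

# Signatures that commonly indicate Kudu-level failures in Flink logs
# (illustrative patterns, not an exhaustive list).
KUDU_ERROR_PATTERNS = [
    r"org\.apache\.kudu",
    r"Snapshot too old",
    r"Timed out: deadline exceeded",
    r"GSSAPI",
    r"ClassCastException",
]

def find_kudu_errors(log_lines):
    """Return the log lines that match any known Kudu error signature."""
    combined = re.compile("|".join(KUDU_ERROR_PATTERNS))
    return [line for line in log_lines if combined.search(line)]

# Example: filter a log previously dumped with
#   yarn logs -applicationId <yarn-application-id> > app.log
sample = [
    "INFO  o.a.flink.runtime.taskmanager.Task - Source started",
    "ERROR o.a.kudu.client.AsyncKuduClient - Timed out: deadline exceeded",
]
print(find_kudu_errors(sample))  # prints only the ERROR line
```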
Common errors and remediation reference
The following table describes common error messages and their respective solutions:
| Error message | Cause | Remediation |
|---|---|---|
| Snapshot too old | The Multi-Version Concurrency Control (MVCC) history retention window was exceeded. | Stop the job, delete the checkpoint directory, and restart without the `-s` flag. Increase the `--tablet_history_max_age_sec` property. |
| Not found: The table does not exist | The sink table was not created and `job.createTable` is set to `false`. | Create the sink table manually or restart the job with `--job.createTable true`. |
| Timed out: deadline exceeded (writes) | The sink cluster is overloaded or unreachable. | Check the health of the sink tablet server. Increase the `writer.operationTimeout` property. |
| Timed out: deadline exceeded (scans) | The source cluster is slow or overloaded. | Check the health of the source tablet server. Increase the `reader.scanRequestTimeout` property. |
| Authentication error or GSSAPI failures | The Kerberos ticket expired or is unavailable. | Renew the ticket and resubmit the job, or use Flink keytab-based authentication. |
| ClassCastException | There is a conflict between the Kudu and Hadoop Kerberos classloaders. | Add the `-Dclassloader.parent-first-patterns.additional=org.apache.kudu` argument to the submission command. |
Investigating a stalled replication job
If the lastEndTimestamp metric stops increasing, the enumerator is not completing discovery cycles. To resolve this, perform the following actions:
- Check the TaskManager logs for scan timeout errors or hanging RPCs. A non-zero pendingCount metric that does not drain indicates a stuck reader.
- Check the JobManager logs for checkpoint failure messages, such as Checkpoint expired before completing.
- Review the Flink Web UI for repeated Job restarting entries and identify the root exception in the TaskManager logs.
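If you sample the lastEndTimestamp metric periodically (for example, via the Flink REST API or the Web UI), the stall check above can be automated. The following is a small sketch under the assumption that you already collect successive readings of the metric; the function itself is hypothetical, not part of the replication job.

```python
def is_stalled(samples, min_samples=3):
    """Given successive lastEndTimestamp readings (oldest first), report a
    stall when the metric has not increased across the last readings."""
    if len(samples) < min_samples:
        return False  # not enough data to judge
    recent = samples[-min_samples:]
    # A healthy job keeps advancing; a stalled one repeats the same value.
    return all(value == recent[0] for value in recent)

print(is_stalled([100, 160, 220]))  # advancing: False
print(is_stalled([220, 220, 220]))  # stalled: True
```

Once a stall is confirmed this way, proceed to the TaskManager and JobManager log checks described above to find the root cause.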
Resolving startup failures
If the job fails immediately after submission, you must verify the following configurations:
- Ensure that the job.checkpointingIntervalMillis property is strictly less than the product of job.discoveryIntervalSeconds and 1000.
- Verify that the checkpoint directory path exists and is accessible.
- Confirm that the source and sink master addresses are correct and that the masters are reachable.
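The interval constraint above can be validated before submission. The following is a minimal sketch, assuming the job properties are available as a plain dictionary of strings; the validation helper is illustrative, not part of the tooling.

```python
def validate_job_config(props):
    """Check the startup constraint described above.
    Returns a list of human-readable problems (empty if the config is OK)."""
    problems = []
    checkpoint_ms = int(props["job.checkpointingIntervalMillis"])
    discovery_s = int(props["job.discoveryIntervalSeconds"])
    # Checkpointing must be strictly more frequent than discovery.
    if checkpoint_ms >= discovery_s * 1000:
        problems.append(
            "job.checkpointingIntervalMillis (%d) must be strictly less "
            "than job.discoveryIntervalSeconds * 1000 (%d)"
            % (checkpoint_ms, discovery_s * 1000)
        )
    return problems

# A 60 s checkpoint interval violates a 30 s discovery interval.
print(validate_job_config({
    "job.checkpointingIntervalMillis": "60000",
    "job.discoveryIntervalSeconds": "30",
}))
```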
