Troubleshooting Kudu Replication
Learn where to find error logs, how to resolve common replication errors, and how to identify the cause of a stalled replication job.
Error log locations
The replication job embeds the Kudu Java client directly inside the Flink TaskManagers and JobManager. Kudu-level errors, such as scan failures or write errors, appear as standard Java exceptions in the Flink logs rather than in the Kudu master or tablet server logs.
Check the following locations for specific errors:
- TaskManager logs: These logs contain errors that occur during data reading (scan RPCs) or writing (upsert or delete RPCs).
- JobManager logs: These logs contain errors related to the enumerator, such as split discovery and diff scan scheduling, as well as checkpoint coordination and job-level failures.
To access these logs, you can use the YARN Resource Manager user interface (UI) or run the following command in the command-line interface (CLI):
yarn logs -applicationId <yarn-application-id>
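Because Kudu-level errors surface as Java exceptions embedded in the Flink logs, a small filter can speed up triage of a dumped log. The following is a minimal sketch, assuming you have saved the output of the `yarn logs` command to a file; the exception patterns are illustrative and should be adjusted to your deployment.

```python
import re

# Signatures that commonly indicate Kudu-level failures in Flink logs
# (illustrative patterns, not an exhaustive list).
KUDU_ERROR_PATTERNS = [
    r"org\.apache\.kudu",
    r"Snapshot too old",
    r"Timed out: deadline exceeded",
    r"GSSAPI",
    r"ClassCastException",
]

def find_kudu_errors(log_lines):
    """Return the log lines that match any known Kudu error signature."""
    combined = re.compile("|".join(KUDU_ERROR_PATTERNS))
    return [line for line in log_lines if combined.search(line)]

# Example: filter a log previously dumped with
#   yarn logs -applicationId <yarn-application-id> > app.log
sample = [
    "INFO  o.a.flink.runtime.taskmanager.Task - Source started",
    "ERROR o.a.kudu.client.AsyncKuduClient - Timed out: deadline exceeded",
]
print(find_kudu_errors(sample))  # prints only the ERROR line
```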
Common errors and remediation reference
The following table describes common error messages and their respective solutions:
| Error message | Cause | Remediation |
|---|---|---|
| Snapshot too old | The Multi-Version Concurrency Control (MVCC) history retention window was exceeded. | Stop the job, delete the checkpoint directory, and restart without the `-s` flag. Increase the `--tablet_history_max_age_sec` property. |
| Not found: The table does not exist | The sink table was not created and `job.createTable` is set to `false`. | Create the sink table manually or restart the job with `--job.createTable true`. |
| Timed out: deadline exceeded (writes) | The sink cluster is overloaded or unreachable. | Check the health of the sink tablet server. Increase the `writer.operationTimeout` property. |
| Timed out: deadline exceeded (scans) | The source cluster is slow or overloaded. | Check the health of the source tablet server. Increase the `reader.scanRequestTimeout` property. |
| Authentication error or GSSAPI failures | The Kerberos ticket expired or is unavailable. | Renew the ticket and resubmit the job, or use Flink keytab-based authentication. |
| ClassCastException | There is a conflict between the Kudu and Hadoop Kerberos classloaders. | Add the `-Dclassloader.parent-first-patterns.additional=org.apache.kudu` argument to the submission command. |
Investigating a stalled replication job
If the lastEndTimestamp metric stops increasing, the enumerator is not completing discovery cycles. To resolve this, perform the following actions:
- Check the TaskManager logs for scan timeout errors or hanging RPCs. A non-zero pendingCount metric that does not drain indicates a stuck reader.
- Check the JobManager logs for checkpoint failure messages, such as Checkpoint expired before completing.
- Review the Flink Web UI for repeated Job restarting entries and identify the root exception in the TaskManager logs.
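If you sample the lastEndTimestamp metric periodically (for example, via the Flink REST API or the Web UI), the stall check above can be automated. The following is a small sketch under the assumption that you already collect successive readings of the metric; the function itself is hypothetical, not part of the replication job.

```python
def is_stalled(samples, min_samples=3):
    """Given successive lastEndTimestamp readings (oldest first), report a
    stall when the metric has not increased across the last readings."""
    if len(samples) < min_samples:
        return False  # not enough data to judge
    recent = samples[-min_samples:]
    # A healthy job keeps advancing; a stalled one repeats the same value.
    return all(value == recent[0] for value in recent)

print(is_stalled([100, 160, 220]))  # advancing: False
print(is_stalled([220, 220, 220]))  # stalled: True
```

Once a stall is confirmed this way, proceed to the TaskManager and JobManager log checks described above to find the root cause.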
Resolving startup failures
If the job fails immediately after submission, you must verify the following configurations:
- Ensure that the job.checkpointingIntervalMillis property is strictly less than the product of job.discoveryIntervalSeconds and 1000.
- Verify that the checkpoint directory path exists and is accessible.
- Confirm that the source and sink master addresses are correct and that the masters are reachable.
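The interval constraint above can be validated before submission. The following is a minimal sketch, assuming the job properties are available as a plain dictionary of strings; the validation helper is illustrative, not part of the tooling.

```python
def validate_job_config(props):
    """Check the startup constraint described above.
    Returns a list of human-readable problems (empty if the config is OK)."""
    problems = []
    checkpoint_ms = int(props["job.checkpointingIntervalMillis"])
    discovery_s = int(props["job.discoveryIntervalSeconds"])
    # Checkpointing must be strictly more frequent than discovery.
    if checkpoint_ms >= discovery_s * 1000:
        problems.append(
            "job.checkpointingIntervalMillis (%d) must be strictly less "
            "than job.discoveryIntervalSeconds * 1000 (%d)"
            % (checkpoint_ms, discovery_s * 1000)
        )
    return problems

# A 60 s checkpoint interval violates a 30 s discovery interval.
print(validate_job_config({
    "job.checkpointingIntervalMillis": "60000",
    "job.discoveryIntervalSeconds": "30",
}))
```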
