Backing up data in Kudu

You can back up all of your Kudu data using the Kudu backup tool, kudu-backup-tools.jar.

The Kudu backup tool runs a Spark job that builds the backup data files and writes them to HDFS or AWS S3, depending on what you specify. Note that if you are backing up to S3, you must provide S3 credentials to spark-submit as described in Specifying Credentials to Access S3 from Spark.
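
For example, assuming you use the S3A connector, one way to supply the credentials is to pass the fs.s3a access and secret keys as Hadoop properties on the spark-submit command line. This is only a sketch with illustrative placeholder values; see Specifying Credentials to Access S3 from Spark for all supported methods:

spark-submit --class org.apache.kudu.backup.KuduBackup \
--conf spark.hadoop.fs.s3a.access.key=<your AWS access key> \
--conf spark.hadoop.fs.s3a.secret.key=<your AWS secret key> \
<path to kudu-backup2_2.11-1.12.0.jar> \
--kuduMasterAddresses <addresses of Kudu masters> \
--rootPath s3a://kudu-backup/ \
<table_name>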

The Kudu backup tool creates a full backup of your data on the first run. Subsequently, the tool creates incremental backups.

Run the following command to start the backup process:
spark-submit --class org.apache.kudu.backup.KuduBackup <path to kudu-backup2_2.11-1.12.0.jar> \
--kuduMasterAddresses <addresses of Kudu masters> \
--rootPath <path to store the backed up data> \
<table_name>
where
  • --kuduMasterAddresses is used to specify the addresses of the Kudu masters as a comma-separated list. For example, master1-host,master2-host,master3-host, where these are the actual hostnames of the Kudu masters.
  • --rootPath is used to specify the path to store the backed up data. It accepts any Spark-compatible path.
    • Example for HDFS: hdfs:///kudu-backups
    • Example for AWS S3: s3a://kudu-backup/

    If you are backing up to S3 and see the “Exception in thread "main" java.lang.IllegalArgumentException: path must be absolute” error, ensure that the S3 path ends with a forward slash (/).

  • <table_name> is the name of the table, or a list of tables, to be backed up (see the examples below).

Example:

spark-submit --class org.apache.kudu.backup.KuduBackup /opt/cloudera/parcels/CDH-7.2.1-1.cdh7.2.1.p0.4041380/lib/kudu/kudu-backup2_2.11.jar \
--kuduMasterAddresses cluster-1.cluster_name.root.hwx.site,cluster-2.cluster_name.root.hwx.site \
--rootPath hdfs:///kudu-backups \
my_table
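
The first run of this command takes a full backup of my_table; running the same command again with the same --rootPath takes an incremental backup. To back up more than one table in a single run, list the tables at the end of the command. In the following sketch, the second table name is illustrative:

spark-submit --class org.apache.kudu.backup.KuduBackup /opt/cloudera/parcels/CDH-7.2.1-1.cdh7.2.1.p0.4041380/lib/kudu/kudu-backup2_2.11.jar \
--kuduMasterAddresses cluster-1.cluster_name.root.hwx.site,cluster-2.cluster_name.root.hwx.site \
--rootPath hdfs:///kudu-backups \
my_table my_other_table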