Restoring Kudu data into the target CDP cluster

Once you have backed up your data in Kudu, you can copy the data to the target CDP cluster and then restore it using the Kudu backup tool.

  • If you applied any custom Kudu configurations in your old clusters, you must manually apply those configurations in your target cluster.
  • Copy the backed up Kudu data to the target CDP cluster.
  • While scanning or reading data from Kudu tables using Impala (for example, through impala-shell or Hue) to verify the records in the destination table, remember that the Impala table might point to a Kudu table with a different name, which is defined by the kudu.table_name property. The backup and restore tools do not depend on Impala table names; they operate on the actual Kudu table names.
    To get the information on the kudu.table_name property for a table, you can use the SHOW CREATE TABLE statement in impala-shell or Hue:
    > SHOW CREATE TABLE my_kudu_table;
    CREATE TABLE my_kudu_table (
      id BIGINT,
      name STRING,
      PRIMARY KEY(id)
    )
    PARTITION BY HASH PARTITIONS 16
    STORED AS KUDU
    TBLPROPERTIES (
      'kudu.table_name' = 'my_kudu_table_renamed'
    )
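    If you need the underlying Kudu table name in a script, you can parse it out of the SHOW CREATE TABLE output. The sketch below is a minimal example; the DDL string is a stand-in for what impala-shell would print, and the table names are hypothetical:

```shell
# The DDL below is a stand-in for the output of SHOW CREATE TABLE.
ddl="CREATE TABLE my_kudu_table (
  id BIGINT,
  name STRING,
  PRIMARY KEY(id)
)
PARTITION BY HASH PARTITIONS 16
STORED AS KUDU
TBLPROPERTIES (
  'kudu.table_name' = 'my_kudu_table_renamed'
)"

# Extract the value of the kudu.table_name property.
kudu_name=$(printf '%s\n' "$ddl" | sed -n "s/.*'kudu.table_name' = '\([^']*\)'.*/\1/p")
echo "$kudu_name"
```

    The printed name (here, my_kudu_table_renamed) is the one the backup and restore tools operate on.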
  1. Run the following command to restore the backup on the target cluster:
    spark-submit --class org.apache.kudu.backup.KuduRestore <path to kudu-backup2_2.11-1.12.0.jar> \
    --kuduMasterAddresses <addresses of Kudu masters> \
    --rootPath <path to the stored backed up data> \
    <table_name>
    • --kuduMasterAddresses is used to specify the addresses of the Kudu masters as a comma-separated list. For example, master1-host,master2-host,master3-host, where each entry is the actual hostname of a Kudu master.
    • --rootPath is used to specify the path at which you stored the backed up data. It accepts any Spark-compatible path.
      • Example for HDFS: hdfs:///kudu-backups
      • Example for AWS S3: s3a://kudu-backup/

      If you backed up to S3 and see the “Exception in thread "main" java.lang.IllegalArgumentException: path must be absolute” error, ensure that the S3 path ends with a forward slash (/).

    • <table_name> can be a table or a list of tables to be restored.
    • Optional: --tableSuffix, if set, appends the given suffix to the restored table names. It can only be used when the createTables property is true.
    • Optional: --timestampMs is a UNIX timestamp in milliseconds that defines the latest time to use when selecting restore candidates. Its default value is System.currentTimeMillis().
    sudo -u hdfs spark-submit --class org.apache.kudu.backup.KuduRestore /opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.3758356/lib/kudu/kudu-backup2_2.11.jar \
    --kuduMasterAddresses <addresses of Kudu masters> \
    --rootPath hdfs:///kudu/kudu-backups \
    <table_name>
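    The optional flags described above slot into the same command line. The sketch below only assembles and prints the command for review rather than running it; all hostnames, paths, and table names are placeholders, not values from a real cluster:

```shell
# Assemble (but do not run) a KuduRestore invocation that uses the
# optional --tableSuffix flag. Every value here is a placeholder.
masters="master1-host,master2-host,master3-host"
jar="/opt/cloudera/parcels/CDH/lib/kudu/kudu-backup2_2.11.jar"
root="hdfs:///kudu/kudu-backups"

cmd="spark-submit --class org.apache.kudu.backup.KuduRestore $jar"
cmd="$cmd --kuduMasterAddresses $masters"
cmd="$cmd --rootPath $root"
cmd="$cmd --tableSuffix _restored"   # only valid when createTables is true
cmd="$cmd my_table"

# Print the command so it can be reviewed before running it manually.
echo "$cmd"
```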
  2. Restart the Kudu service in Cloudera Manager.