Prepare for the recovery

It is crucial to make sure that the master node is truly dead and does not accidentally restart while you are preparing for the recovery.

  1. If the cluster was configured without DNS aliases perform the following steps. Otherwise move on to step 2:
    1. Establish a maintenance window (one hour should be sufficient). During this time the Kudu cluster will be unavailable.
    2. Shut down all Kudu tablet server processes in the cluster.
  2. Ensure that the dead master is well and truly dead. Take whatever steps needed to prevent it from accidentally restarting; this can be quite dangerous for the cluster post-recovery.
  3. Choose one of the remaining live masters to serve as a basis for recovery. The rest of this workflow will refer to this master as the "reference" master.
  4. Choose an unused machine in the cluster where the new master will live. The master generates very little load so it can be co-located with other data services or load-generating processes, though not with another Kudu master from the same configuration. The rest of this workflow will refer to this master as the "replacement" master.
  5. Perform the following preparatory steps for the replacement master:
    • Ensure Kudu is installed on the machine, either via system packages (in which case the kudu and kudu-master packages should be installed), or via some other means.

    • Choose and record the directory where the master’s data will live.

  6. Perform the following preparatory steps for each live master:
    • Identify and record the directory where the master’s data lives. If using Kudu system packages, the default value is /var/lib/kudu/master, but it may be customized via the fs_wal_dir and fs_data_dirs configuration parameter. Please note if you’ve set fs_data_dirs to some directories other than the value of fs_wal_dir, it should be explicitly included in every command below where fs_wal_dir is also included.

    • Identify and record the master’s UUID. It can be fetched using the following command:

      $ sudo -u kudu kudu fs dump uuid --fs_wal_dir=<master_wal_dir> [--fs_data_dirs=<master_data_dir>] 2>/dev/null
      master_data_dir

      live master’s previously recorded data directory

      Example
      $ sudo -u kudu kudu fs dump uuid --fs_wal_dir=/data/kudu/master/wal --fs_data_dirs=/data/kudu/master/data 2>/dev/null
      80a82c4b8a9f4c819bab744927ad765c
  7. Perform the following preparatory steps for the reference master:
    • Identify and record the directory where the master’s data lives. If using Kudu system packages, the default value is /var/lib/kudu/master, but it may be customized using the fs_wal_dir and fs_data_dirs configuration parameter. If you have set fs_data_dirs to some directories other than the value of fs_wal_dir, it should be explicitly included in every command below where fs_wal_dir is also included.

    • Identify and record the UUIDs of every master in the cluster, using the following command:

      $ sudo -u kudu kudu local_replica cmeta print_replica_uuids --fs_wal_dir=<master_data_dir> <tablet_id> 2>/dev/null
      master_data_dir

      The reference master’s previously recorded data directory.

      tablet_id

      Must be set to the string, 00000000000000000000000000000000.

      For example
      $ sudo -u kudu kudu local_replica cmeta print_replica_uuids --fs_wal_dir=/data/kudu/master/wal --fs_data_dirs=/data/kudu/master/data  00000000000000000000000000000000 2>/dev/null
      80a82c4b8a9f4c819bab744927ad765c 2a73eeee5d47413981d9a1c637cce170 1c3f3094256347528d02ec107466aef3
  8. Using the two previously-recorded lists of UUIDs (one for all live masters and one for all masters), determine and record (by process of elimination) the UUID of the dead master.