Step 1: Recover files appended during ZDU

This section describes a scenario in which a file appended during a rolling upgrade is reported as corrupt after a rollback, and explains the root cause.

Scenario

  1. Create a file and close it.
  2. Perform the rolling upgrade. Do not finalize the upgrade.
  3. Append to the same file created before the upgrade.
  4. Roll back the upgrade to the older version.
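
The file operations in steps 1 and 3 can be sketched with the standard HDFS shell, as below. This is only an illustration: HDFS_BIN, TARGET, and the local file names are hypothetical, and steps 2 and 4 (the upgrade and the rollback) are assumed to be driven by your cluster's upgrade tooling, so they are not scripted here.

```shell
#!/bin/sh
# Hypothetical settings: HDFS_BIN and TARGET are illustrative names only.
HDFS_BIN="${HDFS_BIN:-hdfs}"
TARGET="${TARGET:-/dummy/appendfile}"

# Step 1: create the file. -put closes the file once the copy completes.
create_file() {
    "$HDFS_BIN" dfs -put "$1" "$TARGET"
}

# Step 3: append to the same file while the upgrade is not yet finalized.
append_file() {
    "$HDFS_BIN" dfs -appendToFile "$1" "$TARGET"
}
```

For example, run create_file before.txt before starting the upgrade, and append_file during.txt after the upgrade but before it is finalized.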

Problem

After the rollback, fsck reports the appended file as corrupt:

     # hdfs fsck /
     /dummy/appendfile: CORRUPT blockpool BP-1519352817-172.27.13.76-1692314082449 block blk_1073747034
     /dummy/appendfile: CORRUPT 1 blocks of total size 3302 B.
     Status: CORRUPT
     Number of data-nodes:	5
     Number of racks:		1
     Total dirs:			6
     Total symlinks:		0
     
     Replicated Blocks:
     Total size:	6604 B (Total open files size: 490 B)
     Total files:	2 (Files currently being written: 2)
     Total blocks (validated):	2 (avg. block size 3302 B) (Total open file blocks (not validated): 2)
     ********************************
     UNDER MIN REPL'D BLOCKS:	1 (50.0 %)
     MINIMAL BLOCK REPLICATION:	1
     CORRUPT FILES:	1
     CORRUPT BLOCKS: 	1
     CORRUPT SIZE:		3302 B
     ********************************
     Minimally replicated blocks:	1 (50.0 %)
     Over-replicated blocks:	0 (0.0 %)
     Under-replicated blocks:	0 (0.0 %)
     Mis-replicated blocks:		0 (0.0 %)
     Default replication factor:	3
     Average block replication:	1.5
     Missing blocks:		0
     Corrupt blocks:		1
     Missing replicas:		0 (0.0 %)
     Blocks queued for replication:	0
     Erasure Coded Block Groups:
     Total size:	0 B
     Total files:	0
     Total block groups (validated):	0
     Minimally erasure-coded block groups:	0
     Over-erasure-coded block groups:	0
     Under-erasure-coded block groups:	0
     Unsatisfactory placement block groups:	0
     Average block group size:	0.0
     Missing block groups:		0
     Corrupt block groups:		0
     Missing internal blocks:	0
     Blocks queued for replication:	0
     FSCK ended at Wed Aug 23 00:50:49 UTC 2023 in 23 milliseconds
     The filesystem under path '/' is CORRUPT 
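
To pull the affected files and block IDs out of a report like the one above, a small filter can help. The following is a minimal sketch, assuming the per-block line format shown in this output and file paths without spaces; the function name corrupt_blocks is hypothetical.

```shell
#!/bin/sh
# Reads "hdfs fsck" output on stdin and prints "<path> <block id>" for every
# line of the form "<path>: CORRUPT blockpool <pool id> block <block id>".
# Assumption: paths contain no whitespace.
corrupt_blocks() {
    awk '/: CORRUPT blockpool / {
        path = $1
        sub(/:$/, "", path)   # drop the trailing colon after the path
        print path, $NF       # the last field is the block ID
    }'
}
```

A possible use is hdfs fsck / | corrupt_blocks, which for the report above would print /dummy/appendfile blk_1073747034.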
    

Append operations performed during the rolling upgrade create block files with new generation stamps (genstamps). After the rollback, the NameNode is restored to its previous state and is unaware of the newer genstamps created during the rolling upgrade.

Because the rolled-back NameNode does not recognize these newer genstamps, it treats the blocks that carry them as corrupt. As a result, the NameNode remains in safe mode when you restart it during the rollback.

While in safe mode, the NameNode reports a message similar to the following:

     The reported blocks 4563 has reached the threshold 0.9990 of total blocks 4566. The number of live datanodes 6 has reached the minimum number 1. Name node detected blocks with generation stamps in future. This means that Name node metadata is inconsistent. This can happen if Name node metadata files have been manually replaced. Exiting safe mode will cause loss of 807114959 byte(s). Please restart name node with right metadata or use "hdfs dfsadmin -safemode forceExit" if you are certain that the NameNode was started with the correct FsImage and edit logs. If you encountered this during a rollback, it is safe to exit with -safemode forceExit.
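
The safe mode message above states how many bytes would be lost by exiting safe mode. As a rough sketch, the following filter detects the future-genstamp condition in a saved copy of that message and prints the byte count. The function name future_genstamp_bytes is hypothetical, and the pattern assumes the message wording shown here, on a single line.

```shell
#!/bin/sh
# Reads NameNode safe mode text on stdin. If it reports blocks with
# generation stamps in the future, prints the number of bytes that exiting
# safe mode would lose; otherwise prints nothing and returns non-zero.
future_genstamp_bytes() {
    sed -n 's/.*blocks with generation stamps in future.*Exiting safe mode will cause loss of \([0-9][0-9]*\) byte(s).*/\1/p' | grep .
}
```

For the message shown above, the function prints 807114959. Review this value before deciding whether to run the forceExit command that the message names.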