Recover boot volume in case of fsck errors

Sometimes, for example after a sudden shutdown, the boot volume of an instance reports errors such as:

.....
Failure: File system check of the root filesystem failed
The root filesystem on /dev/vda1 requires a manual fsck
.....

To fix the error you would need to detach the volume, mount it on some other server, run fsck and finally re-attach it to the original instance. The problem is that when the affected volume is the boot volume, it cannot be detached from the instance.

In this case you need to contact the Cloud Admins, who will:

  • copy the Ceph volume:

    rbd -p cinder-ceph-ct1-cl1 cp volume-<UUID> cinder-ceph-ct1-cl1/backup_<some_meaningful_name>
    
  • fix the new volume:

    • source the appropriate environment variables for the tenant

    • execute:

      cinder manage --name=<some_meaningful_name>_systemvol cinder-ct1-cl1@cinder-ceph-ct1-cl1#cinder-ceph-ct1-cl1 backup_<some_meaningful_name>
      
    • from the OpenStack GUI, go to the “Volumes” tab and attach the volume called “<some_meaningful_name>_systemvol” to an auxiliary server

    • on the auxiliary server:

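      mkdir /tmp/disk
      # mounting and then unmounting is useful in some cases, as it replays the filesystem journal
      # (nouuid is an XFS-only mount option, drop "-o nouuid" for ext4)
      mount -o nouuid /dev/vdb1 /tmp/disk
      # identify the filesystem type
      grep vdb1 /proc/mounts
      # the filesystem must be unmounted before it is checked or repaired
      umount /tmp/disk
      # for ext4 execute:
      fsck /dev/vdb1
      # for xfs execute:
      xfs_repair /dev/vdb1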
      
    • from the OpenStack GUI, remove the attachment for the volume “<some_meaningful_name>_systemvol”

    • if “<some_meaningful_name>_systemvol” is a system volume, make sure the “bootable” flag is set

  • inform the user that they should create a new instance based on the new volume called “<some_meaningful_name>_systemvol”, and then move any other attachment from the old instance to the new one
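
The filesystem-specific part of the repair step on the auxiliary server can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function name repair_command is hypothetical, the device /dev/vdb1 is the one from the example above, and the helper only prints the command to run rather than executing it. On a real system the type could be read with "blkid -o value -s TYPE /dev/vdb1" instead of being passed by hand.

```shell
#!/bin/sh
# Hypothetical helper: print the repair command for a filesystem type.
# It only prints the command and never touches the device.
repair_command() {
    fstype="$1"
    device="$2"
    case "$fstype" in
        ext2|ext3|ext4)
            echo "fsck $device"
            ;;
        xfs)
            echo "xfs_repair $device"
            ;;
        *)
            echo "unsupported filesystem type: $fstype" >&2
            return 1
            ;;
    esac
}

# Example: the device name is the one used in the steps above.
repair_command ext4 /dev/vdb1
repair_command xfs /dev/vdb1
```

Printing the command instead of running it keeps the decision with the operator, who still has to make sure the partition is unmounted before the repair.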

Alternative procedure

An alternative procedure to manipulate disk volumes, provided they are not encrypted, is the following. It requires only Ceph admin privileges; no OpenStack credentials are needed, because all the work is done directly on the Ceph side. The instance must be powered off.

  • copy the Ceph volume and rename the original volume, just in case something goes wrong:

    rbd -p cinder-pool cp volume-12345678-abcd-1234-ab12-90abcdef1234 cinder-pool/copy-12345678-abcd-1234-ab12-90abcdef1234
    rbd -p cinder-pool rename volume-12345678-abcd-1234-ab12-90abcdef1234 cinder-pool/volume-12345678-abcd-1234-ab12-90abcdef1234-original
    
  • disable some Ceph features on the copy (they will be re-enabled later, with the only exception of deep-flatten, which can only be set at creation time):

    rbd -p cinder-pool feature disable copy-12345678-abcd-1234-ab12-90abcdef1234 object-map fast-diff deep-flatten
    
  • map the new volume; this creates a device entry under /dev/rbd/<pool_name>/<volume_name>, plus one entry for each existing partition, which you can later mount on some local path. For example:

    rbd -p cinder-pool map copy-12345678-abcd-1234-ab12-90abcdef1234 --name client.admin
    
  • now you can either mount and inspect a partition, or repair it (the partition must be unmounted before running the repair):

    mkdir /tmp/test
    mount /dev/rbd/cinder-pool/copy-12345678-abcd-1234-ab12-90abcdef1234-part1 /tmp/test/
    umount /tmp/test
    fsck.ext4 /dev/rbd/cinder-pool/copy-12345678-abcd-1234-ab12-90abcdef1234-part1
    
  • while working on the volume you may also want to reset the root password so that you can later log in from the GUI. Mount the partition holding /, edit the file /etc/shadow on that partition (for example /tmp/test/etc/shadow, not the /etc/shadow of the machine you are working on) and wipe the second field from the root line (REMEMBER to set a strong, random password as soon as possible!) so the entry looks more or less like:

    ...
    root::17947:0:99999:7:::
    ...
    
  • once you are done, unmount any mounted partition and unmap the volume:

    rbd unmap cinder-pool/copy-12345678-abcd-1234-ab12-90abcdef1234
    
  • add back some Ceph features, rebuild the object map, verify everything is OK and finally rename the new volume so that it gets the name associated with the original volume:

    rbd -p cinder-pool feature enable copy-12345678-abcd-1234-ab12-90abcdef1234 object-map fast-diff
    rbd -p cinder-pool object-map rebuild copy-12345678-abcd-1234-ab12-90abcdef1234
    rbd -p cinder-pool info copy-12345678-abcd-1234-ab12-90abcdef1234
    rbd -p cinder-pool rename copy-12345678-abcd-1234-ab12-90abcdef1234 cinder-pool/volume-12345678-abcd-1234-ab12-90abcdef1234
    
  • go to the OpenStack GUI and start your instance. If you reset the root password earlier, remember to set a new one.
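
Because the commands above hard-code one pool and one volume UUID, it can help to generate them for a given volume and review them before running anything. The following is a minimal sketch, not a tested tool: the function name print_rbd_plan is hypothetical, it takes the pool name and the volume UUID as arguments, and it only prints the rbd commands from the steps above without executing them. The mount, repair and unmount steps in between still have to be done by hand.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print, but do not execute, the rbd commands
# of the alternative procedure for a given pool and volume UUID.
print_rbd_plan() {
    pool="$1"
    uuid="$2"
    vol="volume-${uuid}"
    copy="copy-${uuid}"

    cat <<EOF
# 1. copy the volume and keep the original under a different name
rbd -p ${pool} cp ${vol} ${pool}/${copy}
rbd -p ${pool} rename ${vol} ${pool}/${vol}-original
# 2. disable features before mapping, then map the copy
rbd -p ${pool} feature disable ${copy} object-map fast-diff deep-flatten
rbd -p ${pool} map ${copy} --name client.admin
# 3. mount, inspect or repair /dev/rbd/${pool}/${copy}-part1 by hand, then unmount
# 4. unmap, restore features, rebuild the object map and verify
rbd unmap ${pool}/${copy}
rbd -p ${pool} feature enable ${copy} object-map fast-diff
rbd -p ${pool} object-map rebuild ${copy}
rbd -p ${pool} info ${copy}
# 5. give the repaired copy the original volume name
rbd -p ${pool} rename ${copy} ${pool}/${vol}
EOF
}

# Example with the pool and UUID used in the steps above.
print_rbd_plan cinder-pool 12345678-abcd-1234-ab12-90abcdef1234
```

Reviewing the printed plan first makes it easier to spot a wrong pool name or UUID before any volume is renamed.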