Note: This is a fairly rambling explanation of recent events. I assume that you have at least a passing knowledge of LVM and its terminology. This was written to prove that it is possible to recover from
pvmove failing (in certain cases) due to the way it performs its operations and backs up metadata.
I recently received a new hard drive, and so I excitedly partitioned and formatted it, extended my LVM VGs and merrily moved my LVs completely across to the new drive using
pvmove. So far, so wonderful. I was now ready to rebuild my other two drives so that only critical things were covered by RAID and everything else was just in a large VG that spanned multiple drives. I now had three drives, all Samsung SpinPoint:
- 500GB
- 1TB
- 1.5TB (sdc, and brand-new)
The partition setup I wanted:
- 48MiB RAID-1 across all drives, for /boot (already set up across the two older drives)
- 4.8GiB RAID-1 across all drives, for / (already set up across the two older drives)
- 50GiB RAID-1 across all drives, for LVM
- 50GiB RAID-5 across all drives, for /home (giving 100GiB of usable space)
- The rest of each drive divided into 50/100GiB partitions, to be spread among my large data VG, containing films, music, backups, and other large data.
I moved all of the existing LVs to the new drive, which had plenty of space. Next, I sorted out the partitions for the two older drives, and initialised the RAIDs for my sys and safe VGs:
# initialise the RAID for LVM
mdadm -C /dev/md2 -l1 -n3 /dev/sda3 /dev/sdb3 missing
# set up RAID-5 across /dev/sda5, /dev/sdb5 and (later) /dev/sdc5
mdadm -C /dev/md3 -l5 -n3 /dev/sda5 /dev/sdb5 missing
I then expanded the VGs and started to copy data across. I began with the safe VG that contained /home, as I wanted to be sure it was indeed safe. I have all of the photos I’ve ever taken, my email backups and archives, and various private keys and other important data, as well as things that could be replaced due to duplication elsewhere (including working copies of most of my projects). Being LVM, I could move the LVs around without needing to unmount them or reboot into a LiveCD:
vgextend sys /dev/md2
vgextend safe /dev/md3
# move all data from /dev/sdc2 to other volumes in the group
pvmove /dev/sdc2
# ... disaster struck before the second pvmove
Then, disaster struck: pvmove spat a set of I/O errors at me… but kept running! I wasn’t paying attention (as moving 40GiB of data takes quite a while) and so I had just left it to work away.
pvmove: dev_open(/dev/sda1) called while suspended
pvmove: dev_open(/dev/sda2) called while suspended
pvmove: dev_open(/dev/sda3) called while suspended
pvmove: dev_open(/dev/sda5) called while suspended
pvmove: dev_open(/dev/md2) called while suspended
pvmove: dev_open(/dev/md3) called while suspended
It was only after it had completed and I started getting I/O and “permission denied” errors while trying to access my home directory that I first realised anything had gone wrong. After some frantic searching on the Internet, it seemed that I had to wave goodbye to all of my personal data, including those bits that weren’t backed up (though I’ve learned my lesson about backups now!).
I was determined not to give in to data loss, and so I gave myself a few recovery options:
- Create an image of the partition where the data used to reside.
- Try to recover pictures (the most critical thing for me) via foremost, which is a non-destructive operation.
- Try to rebuild the file system with a repair tool.
During my investigations and poking around, I discovered that there was another option: I could try to rebuild the VG from an old copy of its metadata.
Introduction to recovery
It turns out that many LVM utilities store a backup of the VG’s metadata before and after performing various operations, so as long as your
/etc/lvm directory is fine, you have a fighting chance. (Actually, it is theoretically possible to retrieve old versions of the metadata from the LVM partition itself, but this gets much more complex and I won’t deal with it here.)
The reason for this is that
pvmove works by creating a temporary target LV, mirroring data from the existing LV to the temporary target (making checkpoints along the way), and only removing the original once the mirroring has completed. This operation can then be interrupted and resumed at any time without loss of data. This also has the advantage that a full copy of the data on the original LV is still available; only the metadata explaining where it begins and ends has changed.
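The interrupt-and-resume behaviour described above can be illustrated with a toy checkpointed copy. This is not LVM code, just a sketch of the idea; the /tmp file names are made up for illustration:

```shell
src=/tmp/pvmove_src
dst=/tmp/pvmove_dst
ckpt=/tmp/pvmove_ckpt

# make 64 KiB of source data, an empty destination, and a zeroed checkpoint
dd if=/dev/urandom of="$src" bs=1024 count=64 2>/dev/null
: > "$dst"
echo 0 > "$ckpt"

chunk=16   # copy 16 KiB at a time
total=4    # 4 chunks in all

# each pass copies one chunk, then records the checkpoint; if the loop is
# killed at any point, re-running it resumes from the last recorded chunk
while read -r n < "$ckpt" && [ "$n" -lt "$total" ]; do
    dd if="$src" of="$dst" bs=1024 count="$chunk" \
       skip=$((n * chunk)) seek=$((n * chunk)) conv=notrunc 2>/dev/null
    echo $((n + 1)) > "$ckpt"
done

cmp -s "$src" "$dst" && echo "copy complete"
```

The key property, as with pvmove's mirroring, is that the source is never modified: only once the copy is verifiably complete would you drop the original.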
To access the old versions of this metadata, investigate the contents of your
/etc/lvm/archive directory as root. You should find one or more files of the form volgroup_nnnnn.vg, where
volgroup is the name of the relevant VG and
nnnnn is a number. These files contain a backup of the volume group’s metadata, such as which partitions it resides on and where each LV is located. The contents may look something like this:
contents = "Text Format Volume Group"
version = 1

description = "Created *before* executing 'vgextend safe /dev/md/3'"

creation_host = "hostname"  # Linux hostname 2.6.31-gentoo-r3 #3 SMP Mon Nov 23 11:31:59 GMT 2009 i686
creation_time = 1262624964  # Mon Jan 4 17:09:24 2010

safe {
    id = "zbSAI8-ExUy-PxJw-SHX7-y9PL-AL2y-kiTk8g"
    seqno = 10
    status = ["RESIZEABLE", "READ", "WRITE"]
    flags = []
    extent_size = 8192  # 4 Megabytes
    max_lv = 0
    max_pv = 0

    physical_volumes {

        pv0 {
            id = "Vw3IF5-U92i-V2aL-9we1-A4E7-7j1P-4G142u"
            device = "/dev/sdc1"  # Hint only
            status = ["ALLOCATABLE"]
            flags = []
            dev_size = 199993122  # 95.3642 Gigabytes
            pe_start = 384
            pe_count = 24413  # 95.3633 Gigabytes
        }
    }

    logical_volumes {

        home {
            id = "glU5zb-l9I7-SzGg-aXx5-lvgL-on4r-Faqkes"
            status = ["READ", "WRITE", "VISIBLE"]
            flags = []
            segment_count = 1

            segment1 {
                start_extent = 0
                extent_count = 12799  # 49.9961 Gigabytes

                type = "striped"
                stripe_count = 1  # linear

                stripes = [
                    "pv0", 0
                ]
            }
        }
    }
}
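As a quick sanity check on the example, the human-readable sizes in the comments can be reproduced from the raw numbers, remembering that extent_size is measured in 512-byte sectors:

```shell
# extent_size is in 512-byte sectors: 8192 sectors -> MiB per extent
echo $(( 8192 * 512 / (1024 * 1024) ))   # prints 4
# home's extent_count: 12799 extents of 4 MiB each, in MiB
echo $(( 12799 * 4 ))                    # prints 51196, i.e. ~49.996 GiB
```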
I went through the backups for my
safe volume group, and found the one before I extended the volume group to the (now corrupt)
/dev/md3, which is actually the example given above.
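Trawling the backups by hand can be shortened by grepping the description field of each archive file. A minimal sketch of that search, using a throwaway directory and entirely made-up entries rather than the real /etc/lvm/archive:

```shell
# fabricated stand-ins for real /etc/lvm/archive entries, just to show the search
arch=/tmp/lvm_archive
mkdir -p "$arch"
echo 'description = "Created *before* executing '\''lvcreate -L 50G -n home safe'\''"' \
    > "$arch/safe_00005.vg"
echo 'description = "Created *before* executing '\''vgextend safe /dev/md/3'\''"' \
    > "$arch/safe_00006.vg"
# list each backup's description to spot the last-known-good state
grep -H 'description' "$arch"/safe_*.vg
```

In my case, the backup taken before the vgextend onto the broken RAID was the one to restore.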
Recovering the volume group
It was at this point that I rebooted and held my breath. I brought the system into runlevel 1 (single-user mode) and logged in as root. Out of curiosity I ran some SMART tests on the drives —
sda was reporting no partitions, and internal testing eventually reported a persistent read error at about 80% into the drive. It looked like that drive was toast, which would explain why writing to the RAID-5 had failed.
smartctl --all /dev/sda
After verifying that it was a hardware issue, I decided to try to point the VG at its previous location (just
/dev/sdc1), thereby getting it to forget that the
home LV was on the now-inaccessible RAID-5. This could be accomplished by the magic of vgcfgrestore:
vgcfgrestore -f /etc/lvm/archive/safe_00006.vg safe
# re-scan for VGs, and "safe" should show up as inactive
vgscan
# activate the VG and any LVs inside
vgchange -ay safe
# re-scan for LVs and verify that safe/home exists
lvscan
# not taking any chances -- mount it read-only
mount -o ro /dev/safe/home /home
Once the mount succeeded, I began to copy all of the data to my large partition, where it should be safe. I also copied the major bits I didn’t want to lose to a USB flash drive, just to be certain. Once that was done, I took a deep breath and rebooted again, this time back to the full system.
Clearing up afterwards
Fortunately, everything worked fine — it was as if nothing had ever gone wrong! All that was left now was to rebuild the broken RAID section.
I decided for safety that I would re-initialise the
safe VG’s RAID as RAID-1 rather than RAID-5, so that any two of the drives could fail without losing my precious data:
mdadm -S /dev/md3
# remove the RAID descriptors from the partitions
mdadm --zero-superblock /dev/sda5
mdadm --zero-superblock /dev/sdb5
# create as RAID-1
mdadm -C /dev/md3 -l1 -n3 /dev/sda5 /dev/sdb5 missing
What I hadn’t anticipated was that this would rebuild the RAID-1 using the data from the previous RAID-5… including the LVM descriptors and corrupt content. Once the RAID was online, I ran
pvscan and received a warning that the VG metadata was inconsistent. For one heart-stopping moment, it also appeared that LVM thought my home partition was on the corrupt RAID! I quickly stopped the array and thought again.
After some further investigation, I discovered that the first 256 sectors of a partition are used by LVM to hold various information (including multiple copies of the VG metadata). All I had to do to fix this was to destroy that information:
mdadm -S /dev/md3
# zero the first 256 sectors
dd if=/dev/zero of=/dev/sda5 bs=512 count=256
dd if=/dev/zero of=/dev/sdb5 bs=512 count=256
# recreate the array -- it may complain that they are already in an array
mdadm -C /dev/md3 -l1 -n3 /dev/sda5 /dev/sdb5 missing
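The effect of that dd wipe can be demonstrated safely on a file-backed stand-in (the image path here is made up). Note conv=notrunc, which stops dd truncating the file after the first 256 sectors; that concern doesn't arise with a real block device, but matters for a file:

```shell
# a 1 MiB file standing in for one of the real partitions
img=/tmp/fake_md_member.img
dd if=/dev/urandom of="$img" bs=512 count=2048 2>/dev/null
# zero the first 256 sectors, as done on /dev/sda5 and /dev/sdb5 above
dd if=/dev/zero of="$img" bs=512 count=256 conv=notrunc 2>/dev/null
# the first 256 sectors should now compare equal to zeroes
cmp -s <(head -c $((256 * 512)) "$img") <(head -c $((256 * 512)) /dev/zero) \
    && echo "label area wiped"
```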
Once this was done, I could run pvcreate /dev/md3 and proceed to extend the VG and pvmove data as before, without worrying that sda would give out on me and cause everything to fail again.
I hope this helps someone!