Recovering from pvmove failure

by Dave Ingram

Note: This is a fairly rambling explanation of recent events. I assume that you have at least a passing knowledge of LVM and its terminology. It was written to show that it is possible (in certain cases) to recover from a pvmove failure, thanks to the way pvmove performs its operations and backs up its metadata.

The background

I recently received a new hard drive, and so I excitedly partitioned and formatted it, extended my LVM VGs and merrily moved my LVs completely across to the new drive using pvmove. So far, so wonderful — I was now ready to rebuild my other two drives so that only critical things were covered by RAID and everything else was just in a large VG that spanned multiple drives. I now had three drives, all Samsung SpinPoint:

  1. 500GB (sdb)
  2. 1TB (sda)
  3. 1.5TB (sdc, and brand-new)

The partition setup I wanted: RAID-1 across sda3, sdb3 and sdc3 for the sys VG, RAID-5 across sda5, sdb5 and sdc5 for the safe VG, and everything else in one large VG spanning the remaining space on the drives.

The disaster

I moved all of the existing LVs to the new drive, which had plenty of space. Next, I sorted out the partitions for the two older drives, and initialised the RAIDs for my sys and safe VGs:

# set up RAID-1 across /dev/sda3, /dev/sdb3 and (later) /dev/sdc3
mdadm -C /dev/md2 -l1 -n3 /dev/sda3 /dev/sdb3 missing
# initialise the RAID for LVM
pvcreate /dev/md2
# set up RAID-5 across /dev/sda5, /dev/sdb5 and (later) /dev/sdc5
mdadm -C /dev/md3 -l5 -n3 /dev/sda5 /dev/sdb5 missing
pvcreate /dev/md3
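
Before going any further it is worth a quick sanity check that the degraded arrays and their PVs look right; something like the following (using the same device names as above) should do:

# both arrays should appear, each degraded with one slot still "missing"
cat /proc/mdstat
# LVM should now list /dev/md2 and /dev/md3 as physical volumes
pvs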

I then expanded the VGs and started to copy data across. I began with the safe VG that contained /home, as I wanted to be sure it was indeed safe. I have all of the photos I’ve ever taken, my email backups and archives, and various private keys and other important data, as well as things that could be replaced due to duplication elsewhere (including working copies of most of my projects). Because this was LVM, I could move the LVs around without needing to unmount them or reboot into a LiveCD:

# extend the volume groups
vgextend sys /dev/md2
vgextend safe /dev/md3
# move all data from /dev/sdc2 to the other PVs in the group
pvmove /dev/sdc2
# ... disaster struck before the second pvmove
pvmove /dev/sdc1

Then, disaster struck: pvmove spat a set of I/O errors at me… but kept running! I wasn’t paying attention (as moving 40GiB of data takes quite a while) and so I had just left it to work away.

pvmove: dev_open(/dev/sda) called while suspended
pvmove: dev_open(/dev/sda1) called while suspended
pvmove: dev_open(/dev/sda2) called while suspended
pvmove: dev_open(/dev/sda3) called while suspended
pvmove: dev_open(/dev/sda5) called while suspended
pvmove: dev_open(/dev/md2) called while suspended
pvmove: dev_open(/dev/md3) called while suspended

It was only after it had completed and I started getting I/O and “permission denied” errors while trying to access my home directory that I first realised anything had gone wrong. After some frantic searching on the Internet, it seemed that I had to wave goodbye to all of my personal data, including those bits that weren’t backed up (though I’ve learned my lesson about backups now!).

I was determined not to give in to data loss, and so I gave myself a few recovery options (roughly sketched below):

  1. Create an image of the partition where the data used to reside.
  2. Try to recover pictures (the most critical thing for me) via foremost, which is a non-destructive operation.
  3. Try to rebuild the file system using reiserfsck --rebuild-tree.
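
For the record, those three options would have looked roughly like the sketch below; the device and path names are illustrative, and I never actually needed to run the last two:

# 1. image the partition that used to hold the data (ddrescue copes with read errors)
ddrescue /dev/sdc1 /mnt/big/home.img /mnt/big/home.map
# 2. carve photos out of the image with foremost, which never touches the source
foremost -t jpg -i /mnt/big/home.img -o /mnt/big/recovered/
# 3. last resort: attempt to rebuild the ReiserFS tree in place (destructive!)
reiserfsck --rebuild-tree /dev/safe/home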

During my investigations and poking around, I discovered that there was another option: I could try to rebuild the VG from an old copy of its metadata.

Introduction to recovery

It turns out that many LVM utilities store a backup of the VG’s metadata before and after performing various operations, so as long as your /etc/lvm directory is fine, you have a fighting chance. (Actually, it is theoretically possible to retrieve old versions of the metadata from the LVM partition itself, but this gets much more complex and I won’t deal with it here.)

Recovery is possible because pvmove works by creating a temporary target LV, mirroring data from the existing LV to the temporary target (making checkpoints along the way), and only removing the original once the mirroring has completed. This operation can then be interrupted and resumed at any time without loss of data. It also means that a full copy of the data is still present on the original PV; only the metadata describing where the LV’s extents live has changed.
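
In practical terms, that means a half-finished move can simply be picked up again or backed out entirely; a minimal sketch of the two choices:

# with no arguments, pvmove restarts any moves that were interrupted
pvmove
# or abandon the move and leave the data where it was on the source PV
pvmove --abort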

To access the old versions of this metadata, investigate the contents of your /etc/lvm/archive directory as root. You should find one or more files of the form volgroup_nnnnn.vg, where volgroup is the name of the relevant VG and nnnnn is a number. These files contain a backup of the volume group’s metadata, such as which PVs it resides on and where each LV’s extents are located. The contents may look something like this:

# Generated by LVM2 version 2.02.56(1) (2009-11-24): Mon Jan  4 17:09:24 2010

contents = "Text Format Volume Group"
version = 1

description = "Created *before* executing 'vgextend safe /dev/md/3'"

creation_host = "hostname"  # Linux hostname 2.6.31-gentoo-r3 #3 SMP Mon Nov 23 11:31:59 GMT 2009 i686
creation_time = 1262624964  # Mon Jan  4 17:09:24 2010

safe {
    id = "zbSAI8-ExUy-PxJw-SHX7-y9PL-AL2y-kiTk8g"
    seqno = 10
    status = ["RESIZEABLE", "READ", "WRITE"]
    flags = []
    extent_size = 8192    # 4 Megabytes
    max_lv = 0
    max_pv = 0

    physical_volumes {

        pv0 {
            id = "Vw3IF5-U92i-V2aL-9we1-A4E7-7j1P-4G142u"
            device = "/dev/sdc1"  # Hint only

            status = ["ALLOCATABLE"]
            flags = []
            dev_size = 199993122  # 95.3642 Gigabytes
            pe_start = 384
            pe_count = 24413  # 95.3633 Gigabytes
        }
    }

    logical_volumes {

        home {
            id = "glU5zb-l9I7-SzGg-aXx5-lvgL-on4r-Faqkes"
            status = ["READ", "WRITE", "VISIBLE"]
            flags = []
            segment_count = 1

            segment1 {
                start_extent = 0
                extent_count = 12799  # 49.9961 Gigabytes

                type = "striped"
                stripe_count = 1  # linear

                stripes = [
                    "pv0", 0
                ]
            }
        }
    }
}

I went through the backups for my safe volume group, and found the one before I extended the volume group to the (now corrupt) /dev/md3, which is actually the example given above.
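
If you are not sure which archive file is the right one, the description and creation_time lines in each backup tell you which operation it was taken just before, so a quick look narrows it down:

# list the archived metadata for the VG, oldest first
ls -ltr /etc/lvm/archive/safe_*.vg
# show which operation each backup was taken just before
grep -H description /etc/lvm/archive/safe_*.vg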

Recovering the volume group

It was at this point that I rebooted and held my breath. I brought the system into runlevel 1 (single-user mode) and logged in as root. Out of curiosity I ran some SMART tests on the drives — sda was reporting no partitions, and internal testing eventually reported a persistent read error at about 80% into the drive. It looked like that drive was toast, which would explain why writing to the RAID-5 had failed.

# run a short SMART self-test on the suspect drive
smartctl -t short /dev/sda
# give the test a couple of minutes to finish
sleep 180
# then read back the results (attributes, error log and self-test log)
smartctl --all /dev/sda

After verifying that it was a hardware issue, I decided to try to point the VG at its previous location (just /dev/sdc1), thereby getting it to forget that the home LV was on the now-inaccessible RAID-5. This could be accomplished by the magic of vgcfgrestore:

# restore the VG metadata for the "safe" VG from the given file
vgcfgrestore -f /etc/lvm/archive/safe_00006.vg safe
# re-scan for VGs, and "safe" should show up as inactive
vgscan
# activate the VG and any LVs inside
vgchange -ay safe
# re-scan for LVs and verify that safe/home exists
lvscan
# not taking any chances -- mount it read-only
mount -o ro /dev/safe/home /home

After the mount succeeded, I began to copy all of the data to my large partition, where it should be safe. I also copied the major bits I didn’t want to lose to a USB flash drive, just to be certain. Once that was done, I took a deep breath and rebooted again, this time back to the full system.
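
The copying itself was nothing special; a sketch along these lines, where the source and destination directories are stand-ins for whatever layout and spare space you have:

# copy everything, preserving ownership, permissions, hard links and xattrs
rsync -aHAX --progress /home/ /mnt/store/home-rescue/
# and a second copy of the truly irreplaceable bits onto the USB stick
rsync -aH --progress /home/photos/ /mnt/usb/photos/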

Clearing up afterwards

Fortunately, everything worked fine — it was as if nothing had ever gone wrong! All that was left now was to rebuild the broken RAID section.

I decided for safety that I would re-initialise the safe VG’s RAID as RAID-1 rather than RAID-5, so that any two of the drives could fail without losing my precious data:

# stop the RAID array
mdadm -S /dev/md3
# remove the RAID descriptors from the partitions
mdadm --zero-superblock /dev/sda5
mdadm --zero-superblock /dev/sdb5
# create as RAID-1
mdadm -C /dev/md3 -l1 -n3 /dev/sda5 /dev/sdb5 missing

What I hadn’t anticipated was that this would rebuild the RAID-1 using the data from the previous RAID-5… including the LVM descriptors and corrupt content. Once the RAID was online, I ran pvscan and received a warning that the VG metadata was inconsistent. For one heart-stopping moment, it also appeared that LVM thought my home partition was on the corrupt RAID! I quickly stopped the array and thought again.

After some further investigation, I discovered that the first 256 sectors of a partition are used by LVM to hold various information (including multiple copies of the VG metadata). All I had to do to fix this was to destroy that information:

# ensure the array is stopped
mdadm -S /dev/md3
# zero the first 256 sectors
dd if=/dev/zero of=/dev/sda5 bs=512 count=256
dd if=/dev/zero of=/dev/sdb5 bs=512 count=256
# recreate the array -- it may complain that they are already in an array
mdadm -C /dev/md3 -l1 -n3 /dev/sda5 /dev/sdb5 missing
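
A quick check at this point, before re-creating the PV, makes sure the stale metadata really has gone:

# the rebuilt array should no longer show up as an existing (inconsistent) PV
pvscan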

Once this was done, I could run pvcreate /dev/md3 and proceed to extend the VG and pvmove data as before, without worrying that sda would give out on me and cause everything to fail again.
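
For completeness, the remaining steps were just the same commands as before, roughly (with /dev/sdc1 as the source, as in the restored metadata):

# turn the rebuilt array back into an LVM physical volume
pvcreate /dev/md3
# add it back into the volume group
vgextend safe /dev/md3
# and move the data across onto the (now RAID-1) array
pvmove /dev/sdc1 /dev/md3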

I hope this helps someone!


5 Responses to Recovering from pvmove failure

David Durant says: January 8, 2010 at 00:42

This is why JungleDisk (or equivalent) is your friend. We used to cross-backup between Annie’s and my PC overnight but now we both use JD to push the diffs up to EC2. Seamless, works on Windows and Linux, zero effort and about $5 a month.

Dave says: January 8, 2010 at 13:14

I’ve got a sync system (s3cmd) — it’s just that the upload speed here (~5KiB/s gusting up to 20KiB/s) sucks enough that the initial upload would take forever. Just uploading 5400 photos (~4GiB) has so far been running continuously since shortly after I finished recovery on the evening of the 4th… with 800 still to go as of this moment.

David Durant says: January 8, 2010 at 15:11

*nods* – My initial upload with JD took circa 4-5 days. For people with serious amounts of data (I’m only syncing ~400 Meg probably) it would be quicker if the service took DVDs by post (“never underestimate the bandwidth of a Volvo full of DVDs” 🙂 ).

Oh – and Katie still hasn’t replied to David P.

Dave says: January 8, 2010 at 18:46

Indeed it would. I believe S3 do accept hard drives in some situations (http://aws.amazon.com/importexport/) but that depends on cost- and time-effectiveness.

I’ve just prodded Katie about the email, and she’s looking into it now. Her excuse is that she’s been ill (very true) and that nothing requiring intellect has been done.

Turbo Fredriksson says: September 18, 2010 at 09:29

I had almost the same problem, with the exception that the pvmove failed, removing the temporary LV and leaving the PV 'empty'.

The cfg restore wasn't enough for me. My PV was still showing up as empty. But I did the vgcfgrestore, vgscan et al. The PV still showed as empty… But I also did a 'pvmove --abort' (even though it seems that there WAS no move in progress). Then I went to bed, as I didn't want to risk doing something worse due to fatigue :). When I woke up, I checked the PV. It was full again!!

So give it some time… Don’t panic! This is actually the most important part!! DO. NOT. PANIC.

Oh, and don’t reboot if you can avoid it (I didn’t).

This is what my (failed) pvmove gave me:

celia:~# pvmove -v /dev/md6 /dev/md9
Finding volume group "movies"
Archiving volume group "movies" metadata (seqno 86).
Creating logical volume pvmove0
Moving 119234 extents of logical volume movies/movies
Found volume group "movies"
Updating volume group metadata
Creating volume group backup "/etc/lvm/backup/movies" (seqno 87).
Found volume group "movies"
Found volume group "movies"
Suspending movies-movies (253:1) with device flush
Found volume group "movies"
Found volume group "movies"
Creating movies-pvmove0
device-mapper: create ioctl failed: Enhet eller resurs upptagen [Device or resource busy]
Loading movies-movies table
device-mapper: reload ioctl failed: Ogiltigt argument [Invalid argument]
Checking progress every 15 seconds
WARNING: dev_open(/dev/md6) called while suspended
WARNING: dev_open(/dev/md2) called while suspended
WARNING: dev_open(/dev/md3) called while suspended
WARNING: dev_open(/dev/md4) called while suspended
WARNING: dev_open(/dev/md5) called while suspended
WARNING: dev_open(/dev/md8) called while suspended
WARNING: dev_open(/dev/md9) called while suspended
WARNING: dev_open(/dev/md6) called while suspended
WARNING: dev_open(/dev/md2) called while suspended
WARNING: dev_open(/dev/md3) called while suspended
WARNING: dev_open(/dev/md4) called while suspended
WARNING: dev_open(/dev/md5) called while suspended
WARNING: dev_open(/dev/md8) called while suspended
WARNING: dev_open(/dev/md9) called while suspended
WARNING: dev_open(/dev/md6) called while suspended
WARNING: dev_open(/dev/md2) called while suspended
WARNING: dev_open(/dev/md3) called while suspended
WARNING: dev_open(/dev/md4) called while suspended
WARNING: dev_open(/dev/md5) called while suspended
WARNING: dev_open(/dev/md8) called while suspended
WARNING: dev_open(/dev/md9) called while suspended
/dev/md6: Moved: 100,0%
Found volume group "movies"
Found volume group "movies"
Found volume group "movies"
Found volume group "movies"
Suspending movies-pvmove0 (253:0) with device flush
Found volume group "movies"
Creating movies-pvmove0
device-mapper: create ioctl failed: Enhet eller resurs upptagen [Device or resource busy]
Unable to reactivate logical volume "pvmove0"
Found volume group "movies"
Loading movies-movies table
Resuming movies-movies (253:1)
Found volume group "movies"
Removing movies-pvmove0 (253:0)
Found volume group "movies"
Removing temporary pvmove LV
Writing out final volume group after pvmove
Creating volume group backup "/etc/lvm/backup/movies" (seqno 89).
celia:~# pvmove -v /dev/md6 /dev/md9
Finding volume group "movies"
Archiving volume group "movies" metadata (seqno 89).
Creating logical volume pvmove0
No data to move for movies
celia:~# pvdisplay /dev/md6
--- Physical volume ---
PV Name /dev/md6
VG Name movies
PV Size 465,76 GB / not usable 1,44 MB
Allocatable yes
PE Size (KByte) 4096
Total PE 119234
Free PE 119234
Allocated PE 0
PV UUID O4Stdr-Izas-1rwv-aPoF-2aNu-15ZR-kfxi2Y
celia:~# less pvdisplay.txt
[…]
--- Physical volume ---
PV Name /dev/md6
VG Name movies
PV Size 465,76 GB / not usable 1,44 MB
Allocatable yes (but full)
PE Size (KByte) 4096
Total PE 119234
Free PE 0
Allocated PE 119234
PV UUID O4Stdr-Izas-1rwv-aPoF-2aNu-15ZR-kfxi2Y

The pvdisplay.txt output was captured before I did the pvmove…
