[[PageOutline]]

= RAID5 recovery on 2010/05/08 =

== Target ==

 * 500 GB x 6 RAID5 array

== Timeline ==

=== Cause ===

 1. While installing an OS onto a different disk, one of the six disks of the RAID5 array was mistakenly left connected, and part of it (most likely the md superblock) was wiped without anyone noticing.
  * sudo fdisk -l /dev/sdg
{{{
/dev/sdg1               1       60789   488287611   fd  Linux raid autodetect
}}}
  * sudo mdadm --examine /dev/sdg1
{{{
mdadm: No md superblock detected on /dev/sdg1.
}}}
  * cat /proc/mdstat
{{{
md2 : inactive sdj1[4](S) sdh1[5](S) sdd1[1](S) sdf1[0](S) sdi1[3](S)
      2441437440 blocks
}}}

=== Degraded start ===

 1. Start the array degraded (5/6) with mdadm --run.
  * sudo mdadm --run /dev/md2 --verbose
{{{
mdadm: started /dev/md2
}}}
  * cat /proc/mdstat
{{{
md2 : active raid5 sdj1[4] sdh1[5] sdd1[1] sdf1[0] sdi1[3]
      2441437440 blocks level 5, 64k chunk, algorithm 2 [6/5] [UU_UUU]
}}}
 1. While running degraded, yet another disk hit I/O errors, and several disks saw "hard resetting link" events.
  * => [./kern.log#fail]
  * The SATA interface and these disks apparently do not work well together; this had happened occasionally before, but this time it was fatal.
  * cat /proc/mdstat
{{{
md2 : active raid5 sdj1[6](F) sdh1[7](F) sdd1[1] sdf1[0] sdi1[8](F)
      2441437440 blocks level 5, 64k chunk, algorithm 2 [6/2] [UU____]
}}}
 1. The array could not be restarted.
  * sudo mdadm --run /dev/md2 --verbose
{{{
mdadm: failed to run array /dev/md2: Device or resource busy
}}}

=== OS reboot ===

 1. System reboot.
 1. After the reboot the array still would not start: 2/6 failed.
  * cat /proc/mdstat
{{{
md2 : inactive sdc1[3](S) sdb1[5](S) sdd1[4](S) sdh1[1](S) sdj1[0](S)
      2441437440 blocks
}}}
  * sudo mdadm --run /dev/md2 --verbose
{{{
mdadm: failed to run array /dev/md2: Input/output error
}}}
  * dmesg | tail -n 30
{{{
[   78.297675] md: kicking non-fresh sdb1 from array!
[   78.297685] md: unbind<sdb1>
[   78.321292] md: export_rdev(sdb1)
[   78.410382] raid5: device sdc1 operational as raid disk 3
[   78.410384] raid5: device sdd1 operational as raid disk 4
[   78.410386] raid5: device sdh1 operational as raid disk 1
[   78.410388] raid5: device sdj1 operational as raid disk 0
[   78.410861] raid5: allocated 6386kB for md2
[   78.410907] 3: w=1 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
[   78.410910] 4: w=2 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
[   78.410912] 1: w=3 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
[   78.410914] 0: w=4 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
[   78.410916] raid5: not enough operational devices for md2 (2/6 failed)
[   78.411098] RAID5 conf printout:
[   78.411100]  --- rd:6 wd:4
[   78.411102]  disk 0, o:1, dev:sdj1
[   78.411103]  disk 1, o:1, dev:sdh1
[   78.411105]  disk 3, o:1, dev:sdc1
[   78.411107]  disk 4, o:1, dev:sdd1
[   78.411529] raid5: failed to run raid set md2
[   78.411651] md: pers->run() failed ...
}}}
 1. /dev/sdb1 has disappeared from mdstat.
  * cat /proc/mdstat
{{{
md2 : inactive sda1[6](S) sdc1[3] sdd1[4] sdh1[1] sdj1[0]
      2441437440 blocks
}}}
 1. Re-add /dev/sdb1 and restart the array.
  * sudo mdadm /dev/md2 -a /dev/sdb1
{{{
mdadm: re-added /dev/sdb1
}}}
  * sudo mdadm --run /dev/md2 --verbose
{{{
mdadm: started /dev/md2
}}}
  * cat kern.log
{{{
May  8 23:07:16 HOSTNAME kernel: [  308.856084] md: bind<sdb1>
May  8 23:07:19 HOSTNAME kernel: [  311.836915] raid5: device sdb1 operational as raid disk 5
May  8 23:07:19 HOSTNAME kernel: [  311.836923] raid5: device sdc1 operational as raid disk 3
May  8 23:07:19 HOSTNAME kernel: [  311.836929] raid5: device sdd1 operational as raid disk 4
May  8 23:07:19 HOSTNAME kernel: [  311.836934] raid5: device sdh1 operational as raid disk 1
May  8 23:07:19 HOSTNAME kernel: [  311.836939] raid5: device sdj1 operational as raid disk 0
May  8 23:07:19 HOSTNAME kernel: [  311.838484] raid5: allocated 6386kB for md2
May  8 23:07:19 HOSTNAME kernel: [  311.838789] 5: w=1 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
May  8 23:07:19 HOSTNAME kernel: [  311.838796] 3: w=2 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
May  8 23:07:19 HOSTNAME kernel: [  311.838801] 4: w=3 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
May  8 23:07:19 HOSTNAME kernel: [  311.838807] 1: w=4 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
May  8 23:07:19 HOSTNAME kernel: [  311.838812] 0: w=5 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
May  8 23:07:19 HOSTNAME kernel: [  311.838818] raid5: raid level 5 set md2 active with 5 out of 6 devices, algorithm 2
May  8 23:07:19 HOSTNAME kernel: [  311.852170] RAID5 conf printout:
May  8 23:07:19 HOSTNAME kernel: [  311.852174]  --- rd:6 wd:5
May  8 23:07:19 HOSTNAME kernel: [  311.852179]  disk 0, o:1, dev:sdj1
May  8 23:07:19 HOSTNAME kernel: [  311.852184]  disk 1, o:1, dev:sdh1
May  8 23:07:19 HOSTNAME kernel: [  311.852188]  disk 3, o:1, dev:sdc1
May  8 23:07:19 HOSTNAME kernel: [  311.852192]  disk 4, o:1, dev:sdd1
May  8 23:07:19 HOSTNAME kernel: [  311.852196]  disk 5, o:1, dev:sdb1
May  8 23:07:19 HOSTNAME kernel: [  311.852306] md2: detected capacity change from 0 to 2500031938560
May  8 23:07:19 HOSTNAME kernel: [  311.852642] md2:RAID5 conf printout:
May  8 23:07:19 HOSTNAME kernel: [  311.853380]  --- rd:6 wd:5
May  8 23:07:19 HOSTNAME kernel: [  311.853386]  disk 0, o:1, dev:sdj1
May  8 23:07:19 HOSTNAME kernel: [  311.853390]  disk 1, o:1, dev:sdh1
May  8 23:07:19 HOSTNAME kernel: [  311.853394]  disk 2, o:1, dev:sda1
May  8 23:07:19 HOSTNAME kernel: [  311.853398]  disk 3, o:1, dev:sdc1
May  8 23:07:19 HOSTNAME kernel: [  311.853402]  disk 4, o:1, dev:sdd1
May  8 23:07:19 HOSTNAME kernel: [  311.853406]  disk 5, o:1, dev:sdb1
May  8 23:07:19 HOSTNAME kernel: [  311.853513]  unknown partition table
May  8 23:07:19 HOSTNAME kernel: [  311.855855] md: recovery of RAID array md2
May  8 23:07:19 HOSTNAME kernel: [  311.855863] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
May  8 23:07:19 HOSTNAME kernel: [  311.855868] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May  8 23:07:19 HOSTNAME kernel: [  311.855883] md: using 128k window, over a total of 488287488 blocks.
}}}
  * cat /proc/mdstat
{{{
md2 : active raid5 sdb1[5] sda1[6] sdc1[3] sdd1[4] sdh1[1] sdj1[0]
      2441437440 blocks level 5, 64k chunk, algorithm 2 [6/5] [UU_UUU]
      [>....................]  recovery =  0.0% (117376/488287488) finish=415.8min speed=19562K/sec
}}}
 1. Once again a disk hit I/O errors, and several disks saw "hard resetting link" events.
  * => [./kern.log#fail2]

=== Physical relocation ===

 1. Gave up on the problematic SATA interface and reconnected all six disks to a different motherboard.
  * cat /proc/mdstat
{{{
md2 : inactive sdd1[5](S) sdg1[7](S) sdf1[0](S) sde1[4](S) sdc1[1](S) sdh1[3](S)
      2929724928 blocks
}}}
 1. This time the array would not start, with 4/6 failed.
  * sudo mdadm --run /dev/md2
{{{
mdadm: failed to run array /dev/md2: Input/output error
}}}
  * dmesg | tail -n 30
{{{
[  128.378868] md: kicking non-fresh sdd1 from array!
[  128.378876] md: unbind<sdd1>
[  128.400016] md: export_rdev(sdd1)
[  128.400096] md: kicking non-fresh sde1 from array!
[  128.400101] md: unbind<sde1>
[  128.430012] md: export_rdev(sde1)
[  128.430082] md: kicking non-fresh sdh1 from array!
[  128.430087] md: unbind<sdh1>
[  128.500012] md: export_rdev(sdh1)
[  128.564040] raid5: device sdf1 operational as raid disk 0
[  128.564043] raid5: device sdc1 operational as raid disk 1
[  128.564449] raid5: allocated 6386kB for md2
[  128.564469] 0: w=1 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
[  128.564471] 1: w=2 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
[  128.564472] raid5: not enough operational devices for md2 (4/6 failed)
[  128.564720] RAID5 conf printout:
[  128.564722]  --- rd:6 wd:2
[  128.564723]  disk 0, o:1, dev:sdf1
[  128.564725]  disk 1, o:1, dev:sdc1
[  128.564917] raid5: failed to run raid set md2
[  128.565090] md: pers->run() failed ...
}}}
 1. Re-add the devices that were unbound.
  * sudo mdadm /dev/md2 -a /dev/sdc1
{{{
mdadm: Cannot open /dev/sdc1: Device or resource busy
}}}
  * sudo mdadm /dev/md2 -a /dev/sdd1
{{{
mdadm: re-added /dev/sdd1
}}}
  * sudo mdadm /dev/md2 -a /dev/sde1
{{{
mdadm: re-added /dev/sde1
}}}
  * sudo mdadm /dev/md2 -a /dev/sdf1
{{{
mdadm: Cannot open /dev/sdf1: Device or resource busy
}}}
  * sudo mdadm /dev/md2 -a /dev/sdg1
{{{
mdadm: Cannot open /dev/sdg1: Device or resource busy
}}}
  * sudo mdadm /dev/md2 -a /dev/sdh1
{{{
mdadm: re-added /dev/sdh1
}}}
  * sudo mdadm /dev/md2 -a /dev/sdi1
{{{
mdadm: cannot find /dev/sdi1: No such file or directory
}}}
  * sudo mdadm /dev/md2 -r /dev/sdg1
{{{
mdadm: hot removed /dev/sdg1
}}}
  * sudo mdadm /dev/md2 -a /dev/sdg1
{{{
mdadm: re-added /dev/sdg1
}}}
 1. Restart the array.
  * sudo mdadm --run /dev/md2
{{{
mdadm: started /dev/md2
}}}
 1. Rebuilding.
  * cat /proc/mdstat
{{{
md2 : active raid5 sdg1[7] sdh1[3] sde1[4] sdd1[5] sdf1[0] sdc1[1]
      2441437440 blocks level 5, 64k chunk, algorithm 2 [6/5] [UU_UUU]
      [>....................]  recovery =  1.0% (5284864/488287488) finish=128.6min speed=62558K/sec
}}}
 1. Rebuild completed.
  * cat /proc/mdstat
{{{
md2 : active raid5 sdg1[2] sdh1[3] sde1[4] sdd1[5] sdf1[0] sdc1[1]
      2441437440 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
}}}
  * => [./kern.log#recover]
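Throughout the log above, the array's health is read off the `[n/m] [UU_UUU]` fields in /proc/mdstat: n is the number of member slots, m the number of working members, and each `_` marks a down slot. A minimal Python sketch (a hypothetical helper, not part of the recovery procedure itself) that decodes those fields:

```python
import re

def parse_mdstat_status(status_line):
    """Extract (total_slots, working_slots, failed_slot_indices) from an
    mdstat status line such as
    '2441437440 blocks level 5, 64k chunk, algorithm 2 [6/5] [UU_UUU]'."""
    m = re.search(r"\[(\d+)/(\d+)\]\s*\[([U_]+)\]", status_line)
    if m is None:
        raise ValueError("no [n/m] [UU...] status fields found")
    total, working = int(m.group(1)), int(m.group(2))
    # Each '_' in the slot map is a missing/failed member.
    failed = [i for i, c in enumerate(m.group(3)) if c == "_"]
    return total, working, failed

# The degraded state seen after the first mdadm --run:
line = "2441437440 blocks level 5, 64k chunk, algorithm 2 [6/5] [UU_UUU]"
print(parse_mdstat_status(line))  # prints: (6, 5, [2])
```

Here slot 2 is the member whose superblock was wiped, which matches the `disk 2, o:1, dev:sda1` entry in the RAID5 conf printout once /dev/sda1 was added back as the rebuild target.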