wiki:TipAndDoc/storage/RAID/failure/20100508

Version 1 (modified by mitty, 14 years ago)

--

RAID5 recovery on 2010/05/08

Target

  • 500GB x 6 RAID5 array

Sequence of events

Cause

  1. While installing an OS onto a different disk, one of the six RAID5 member disks was mistakenly left connected, and its contents (most likely the md superblock) were wiped without anyone noticing
    • sudo fdisk -l /dev/sdg
      /dev/sdg1               1       60789   488287611   fd  Linux raid autodetect
      
    • sudo mdadm --examine /dev/sdg1
      mdadm: No md superblock detected on /dev/sdg1.
      
    • cat /proc/mdstat
      md2 : inactive sdj1[4](S) sdh1[5](S) sdd1[1](S) sdf1[0](S) sdi1[3](S)
            2441437440 blocks
      

Degraded start

  1. Degraded start (5/6) with mdadm --run
    • sudo mdadm --run /dev/md2 --verbose
      mdadm: started /dev/md2
      
    • cat /proc/mdstat
      md2 : active raid5 sdj1[4] sdh1[5] sdd1[1] sdf1[0] sdi1[3]
            2441437440 blocks level 5, 64k chunk, algorithm 2 [6/5] [UU_UUU]
      
  2. While running degraded, yet another disk hit I/O errors, and several disks went through "hard resetting link"
    • => kern.log
    • The SATA interface appears to be a poor match for these disks; this had happened from time to time before, but this time it was fatal
    • cat /proc/mdstat
      md2 : active raid5 sdj1[6](F) sdh1[7](F) sdd1[1] sdf1[0] sdi1[8](F)
            2441437440 blocks level 5, 64k chunk, algorithm 2 [6/2] [UU____]
      
  3. The array could not be restarted
    • sudo mdadm --run /dev/md2 --verbose
      mdadm: failed to run array /dev/md2: Device or resource busy
      
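The [6/5] [UU_UUU] string in the mdstat output above encodes per-slot health: 'U' for an operational member, '_' for a missing one. A small sketch of reading it (plain bash text processing on the string copied from above — no mdadm or live array needed):

```shell
# Parse an mdstat status string such as "[UU_UUU]": 'U' = up, '_' = down.
status="[UU_UUU]"
s=${status//[\[\]]/}            # strip the brackets -> "UU_UUU"
down_slots=""
for (( i = 0; i < ${#s}; i++ )); do
    [ "${s:$i:1}" = "_" ] && down_slots+="$i "
done
echo "down slots: ${down_slots}"
```

Here slot 2 is the one reported down, matching the member whose superblock had been erased.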

OS reboot

  1. system reboot
  2. After the reboot the array still would not start: 2 of 6 members were considered failed
    • cat /proc/mdstat
      md2 : inactive sdc1[3](S) sdb1[5](S) sdd1[4](S) sdh1[1](S) sdj1[0](S)
            2441437440 blocks
      
    • sudo mdadm --run /dev/md2 --verbose
      mdadm: failed to run array /dev/md2: Input/output error
      
    • dmesg | tail -n 30
      [   78.297675] md: kicking non-fresh sdb1 from array!
      [   78.297685] md: unbind<sdb1>
      [   78.321292] md: export_rdev(sdb1)
      [   78.410382] raid5: device sdc1 operational as raid disk 3
      [   78.410384] raid5: device sdd1 operational as raid disk 4
      [   78.410386] raid5: device sdh1 operational as raid disk 1
      [   78.410388] raid5: device sdj1 operational as raid disk 0
      [   78.410861] raid5: allocated 6386kB for md2
      [   78.410907] 3: w=1 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
      [   78.410910] 4: w=2 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
      [   78.410912] 1: w=3 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
      [   78.410914] 0: w=4 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
      [   78.410916] raid5: not enough operational devices for md2 (2/6 failed)
      [   78.411098] RAID5 conf printout:
      [   78.411100]  --- rd:6 wd:4
      [   78.411102]  disk 0, o:1, dev:sdj1
      [   78.411103]  disk 1, o:1, dev:sdh1
      [   78.411105]  disk 3, o:1, dev:sdc1
      [   78.411107]  disk 4, o:1, dev:sdd1
      [   78.411529] raid5: failed to run raid set md2
      [   78.411651] md: pers->run() failed ...
      
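The "not enough operational devices for md2 (2/6 failed)" line follows from the basic RAID5 constraint: one member's worth of capacity holds distributed parity, so an n-member set can tolerate losing at most one member and needs at least n-1 operational devices to start. A minimal sketch of that check:

```shell
# RAID5 survives exactly one missing member: it needs >= n-1 operational
# devices. Here 2 of 6 had failed, so md refused to run the set.
n=6
failed=2
operational=$(( n - failed ))
if (( operational >= n - 1 )); then
    verdict="startable"
else
    verdict="not startable ($failed/$n failed)"
fi
echo "$verdict"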
  3. /dev/sdb1 disappeared from mdstat
    • cat /proc/mdstat
      md2 : inactive sda1[6](S) sdc1[3] sdd1[4] sdh1[1] sdj1[0]
            2441437440 blocks
      
  4. Re-added /dev/sdb1 and restarted the array
    • sudo mdadm /dev/md2 -a /dev/sdb1
      mdadm: re-added /dev/sdb1
      
    • sudo mdadm --run /dev/md2 --verbose
      mdadm: started /dev/md2
      
      • cat kern.log
        May  8 23:07:16 HOSTNAME kernel: [  308.856084] md: bind<sdb1>
        May  8 23:07:19 HOSTNAME kernel: [  311.836915] raid5: device sdb1 operational as raid disk 5
        May  8 23:07:19 HOSTNAME kernel: [  311.836923] raid5: device sdc1 operational as raid disk 3
        May  8 23:07:19 HOSTNAME kernel: [  311.836929] raid5: device sdd1 operational as raid disk 4
        May  8 23:07:19 HOSTNAME kernel: [  311.836934] raid5: device sdh1 operational as raid disk 1
        May  8 23:07:19 HOSTNAME kernel: [  311.836939] raid5: device sdj1 operational as raid disk 0
        May  8 23:07:19 HOSTNAME kernel: [  311.838484] raid5: allocated 6386kB for md2
        May  8 23:07:19 HOSTNAME kernel: [  311.838789] 5: w=1 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
        May  8 23:07:19 HOSTNAME kernel: [  311.838796] 3: w=2 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
        May  8 23:07:19 HOSTNAME kernel: [  311.838801] 4: w=3 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
        May  8 23:07:19 HOSTNAME kernel: [  311.838807] 1: w=4 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
        May  8 23:07:19 HOSTNAME kernel: [  311.838812] 0: w=5 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
        May  8 23:07:19 HOSTNAME kernel: [  311.838818] raid5: raid level 5 set md2 active with 5 out of 6 devices, algorithm 2
        May  8 23:07:19 HOSTNAME kernel: [  311.852170] RAID5 conf printout:
        May  8 23:07:19 HOSTNAME kernel: [  311.852174]  --- rd:6 wd:5
        May  8 23:07:19 HOSTNAME kernel: [  311.852179]  disk 0, o:1, dev:sdj1
        May  8 23:07:19 HOSTNAME kernel: [  311.852184]  disk 1, o:1, dev:sdh1
        May  8 23:07:19 HOSTNAME kernel: [  311.852188]  disk 3, o:1, dev:sdc1
        May  8 23:07:19 HOSTNAME kernel: [  311.852192]  disk 4, o:1, dev:sdd1
        May  8 23:07:19 HOSTNAME kernel: [  311.852196]  disk 5, o:1, dev:sdb1
        May  8 23:07:19 HOSTNAME kernel: [  311.852306] md2: detected capacity change from 0 to 2500031938560
        May  8 23:07:19 HOSTNAME kernel: [  311.852642]  md2:RAID5 conf printout:
        May  8 23:07:19 HOSTNAME kernel: [  311.853380]  --- rd:6 wd:5
        May  8 23:07:19 HOSTNAME kernel: [  311.853386]  disk 0, o:1, dev:sdj1
        May  8 23:07:19 HOSTNAME kernel: [  311.853390]  disk 1, o:1, dev:sdh1
        May  8 23:07:19 HOSTNAME kernel: [  311.853394]  disk 2, o:1, dev:sda1
        May  8 23:07:19 HOSTNAME kernel: [  311.853398]  disk 3, o:1, dev:sdc1
        May  8 23:07:19 HOSTNAME kernel: [  311.853402]  disk 4, o:1, dev:sdd1
        May  8 23:07:19 HOSTNAME kernel: [  311.853406]  disk 5, o:1, dev:sdb1
        May  8 23:07:19 HOSTNAME kernel: [  311.853513]  unknown partition table
        May  8 23:07:19 HOSTNAME kernel: [  311.855855] md: recovery of RAID array md2
        May  8 23:07:19 HOSTNAME kernel: [  311.855863] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
        May  8 23:07:19 HOSTNAME kernel: [  311.855868] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
        May  8 23:07:19 HOSTNAME kernel: [  311.855883] md: using 128k window, over a total of 488287488 blocks.
        
    • cat /proc/mdstat
      md2 : active raid5 sdb1[5] sda1[6] sdc1[3] sdd1[4] sdh1[1] sdj1[0]
            2441437440 blocks level 5, 64k chunk, algorithm 2 [6/5] [UU_UUU]
            [>....................]  recovery =  0.0% (117376/488287488) finish=415.8min speed=19562K/sec
      
  5. Disk I/O errors and "hard resetting link" on multiple disks occurred yet again
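The sizes in the logs above are self-consistent and worth a quick sanity check: each member contributes 488287488 1K-blocks of data area, a 6-disk RAID5 keeps 5 members' worth of data, and that times 1024 is exactly the byte figure the kernel printed at "detected capacity change from 0 to 2500031938560":

```shell
# Usable size of an n-disk RAID5 = (n - 1) * per-member size: one member's
# worth of capacity goes to distributed parity.
per_member_kib=488287488      # data area per member, from the md logs above
n=6
usable_kib=$(( (n - 1) * per_member_kib ))
usable_bytes=$(( usable_kib * 1024 ))
echo "$usable_kib KiB = $usable_bytes bytes"
```

The KiB value also matches the "2441437440 blocks" line that mdstat prints for md2.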

Physical relocation

  1. Gave up on the problematic SATA interface and reconnected all six disks to a different motherboard
    • cat /proc/mdstat
      md2 : inactive sdd1[5](S) sdg1[7](S) sdf1[0](S) sde1[4](S) sdc1[1](S) sdh1[3](S)
            2929724928 blocks
      
  2. This time 4/6 members were failed and the array would not start
    • sudo mdadm --run /dev/md2
      mdadm: failed to run array /dev/md2: Input/output error
      
    • dmesg | tail -n 30
      [  128.378868] md: kicking non-fresh sdd1 from array!
      [  128.378876] md: unbind<sdd1>
      [  128.400016] md: export_rdev(sdd1)
      [  128.400096] md: kicking non-fresh sde1 from array!
      [  128.400101] md: unbind<sde1>
      [  128.430012] md: export_rdev(sde1)
      [  128.430082] md: kicking non-fresh sdh1 from array!
      [  128.430087] md: unbind<sdh1>
      [  128.500012] md: export_rdev(sdh1)
      [  128.564040] raid5: device sdf1 operational as raid disk 0
      [  128.564043] raid5: device sdc1 operational as raid disk 1
      [  128.564449] raid5: allocated 6386kB for md2
      [  128.564469] 0: w=1 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
      [  128.564471] 1: w=2 pa=0 pr=6 m=1 a=2 r=6 op1=0 op2=0
      [  128.564472] raid5: not enough operational devices for md2 (4/6 failed)
      [  128.564720] RAID5 conf printout:
      [  128.564722]  --- rd:6 wd:2
      [  128.564723]  disk 0, o:1, dev:sdf1
      [  128.564725]  disk 1, o:1, dev:sdc1
      [  128.564917] raid5: failed to run raid set md2
      [  128.565090] md: pers->run() failed ...
      
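"kicking non-fresh sdX1 from array!" means that member's superblock event counter lags the newest one in the set, so md refuses to trust its contents. A sketch of that comparison with hypothetical event counts (on a real system the values come from the "Events" field of `sudo mdadm --examine /dev/sdX1`; these numbers are made up for illustration):

```shell
# Members and hypothetical event counters; any member whose counter is
# below the maximum is what md kicks as "non-fresh".
members=(sdc1 sdd1 sde1 sdf1 sdg1 sdh1)
events=(1042 1038 1038 1042 990 1038)
max=0
for e in "${events[@]}"; do (( e > max )) && max=$e; done
nonfresh=""
for i in "${!members[@]}"; do
    (( events[i] < max )) && nonfresh+="${members[i]} "
done
echo "non-fresh: ${nonfresh}(events < $max)"
```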
  3. Re-added the unbound devices
    • sudo mdadm /dev/md2 -a /dev/sdc1
      mdadm: Cannot open /dev/sdc1: Device or resource busy
      
    • sudo mdadm /dev/md2 -a /dev/sdd1
      mdadm: re-added /dev/sdd1
      
    • sudo mdadm /dev/md2 -a /dev/sde1
      mdadm: re-added /dev/sde1
      
    • sudo mdadm /dev/md2 -a /dev/sdf1
      mdadm: Cannot open /dev/sdf1: Device or resource busy
      
    • sudo mdadm /dev/md2 -a /dev/sdg1
      mdadm: Cannot open /dev/sdg1: Device or resource busy
      
    • sudo mdadm /dev/md2 -a /dev/sdh1
      mdadm: re-added /dev/sdh1
      
    • sudo mdadm /dev/md2 -a /dev/sdi1
      mdadm: cannot find /dev/sdi1: No such file or directory
      
    • sudo mdadm /dev/md2 -r /dev/sdg1
      mdadm: hot removed /dev/sdg1
      
    • sudo mdadm /dev/md2 -a /dev/sdg1
      mdadm: re-added /dev/sdg1
      
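The pattern in the re-add attempts above: "Device or resource busy" comes back for members that are still bound to the array (whether operational or failed), while the kicked, unbound ones re-add cleanly; a bound-but-failed member (sdg1 here) has to be hot-removed with -r before -a succeeds. The sequence can be sketched as follows (this only builds and prints the commands, it does not run mdadm):

```shell
# Kicked (unbound) members re-add directly; a bound-but-failed member must
# be hot-removed first. Device names follow the session above.
kicked=(sdd1 sde1 sdh1)        # unbound ("kicking non-fresh") -> plain -a
bound_failed=sdg1              # still bound but failed -> -r, then -a
plan=""
for d in "${kicked[@]}"; do
    plan+="mdadm /dev/md2 -a /dev/$d"$'\n'
done
plan+="mdadm /dev/md2 -r /dev/$bound_failed"$'\n'
plan+="mdadm /dev/md2 -a /dev/$bound_failed"$'\n'
printf '%s' "$plan"
```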
  4. Restarted the array
    • sudo mdadm --run /dev/md2
      mdadm: started /dev/md2
      
  5. rebuilding
    • cat /proc/mdstat
      md2 : active raid5 sdg1[7] sdh1[3] sde1[4] sdd1[5] sdf1[0] sdc1[1]
            2441437440 blocks level 5, 64k chunk, algorithm 2 [6/5] [UU_UUU]
            [>....................]  recovery =  1.0% (5284864/488287488) finish=128.6min speed=62558K/sec
      
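The finish= estimate in the recovery line is simply remaining blocks divided by the current speed. Checking it against the snapshot above (5284864 of 488287488 1K-blocks done, at 62558 KB/sec):

```shell
# finish= in /proc/mdstat ~= (total - done) / speed, converted to minutes.
done_blocks=5284864
total_blocks=488287488
speed_kbs=62558
remaining=$(( total_blocks - done_blocks ))
eta_min=$(( remaining / speed_kbs / 60 ))
echo "ETA ~ $eta_min min"     # the mdstat line above showed finish=128.6min
```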
  6. rebuild completed
    • cat /proc/mdstat
      md2 : active raid5 sdg1[2] sdh1[3] sde1[4] sdd1[5] sdf1[0] sdc1[1]
            2441437440 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      
    • => kern.log