XFS on x86_64
- 革命の日々! XFS on x86_64
XFS uses as much as 3.5K of stack just to write out a single page. The kernel stack is two pages, so it is only 8K.
- Kazuho@Cybozu Labs: MySQL and the XFS stack overflow problem
- LKML: Dave Chinner: Re: XFS status update for May 2012
IOWs, the only solution that would fix the problem was to split allocations into a different stack so that we have the approximately 4k of stack space needed for the worst case XFS stack usage (double btree split requiring metadata IO) and still have enough space left for the DM/MD/SCSI stack underneath it...
- LKML: John Berthels: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
- An example of the stack being overrun
- LKML: Eric Sandeen: Re: Stack overrun in 3.5.0-rc7 w/ cfq (another example)
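The figures quoted above can be put side by side with a little arithmetic. This is only a rough sketch: `stack_headroom` is a made-up helper name, and the inputs are the numbers from the quotes (8K stack, 3.5K for one page writeout, roughly 4K for the worst case), not measurements.

```shell
#!/bin/sh
# Hypothetical helper: bytes left on a kernel stack of $1 bytes after
# a filesystem code path has consumed $2 bytes.
stack_headroom() {
    echo $(( $1 - $2 ))
}

# 8K stack (two 4096-byte pages) minus the 3.5K quoted for one page writeout.
stack_headroom $((4096 * 2)) 3584
# 8K stack minus the roughly 4K worst case quoted from Dave Chinner.
stack_headroom 8192 4096
```

Either way only about half of the stack remains for the DM/MD/SCSI layers underneath.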
CONFIG_4KSTACKS
- 革命の日々! さよなら CONFIG_4KSTACKS
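In the aftermath of the XFS stack overflow problem, CONFIG_4KSTACKS has been removed. That said, this is not the old 8K stack: the kernel moved to a hybrid stack of 8K per process plus 4K for IRQs.
However, on non-x86 architectures, which never had 4K stacks, the stack size has not changed, so this is not a fundamental fix at all.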
- x86 Linux のメモリモデル、プロセス空間切り替え、カーネルスタック - naoyaのはてなダイアリー
The Linux kernel has the implementation characteristic that it does not provide an interrupt stack.
- Has this changed?
disable dirty page writeback
- twitter:kosaki55tea/status/183995247800500225
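@kosaki55tea What is the trend on the Linux side? > FileSystem
@osada What I wanted to say was that the XFS stack problem has already been solved, so you can choose according to your own preference.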
- https://twitter.com/kosaki55tea/status/167843479936970753
@kosaki55tea @m_bird @rkmathi http://mkosaki.blog46.fc2.com/blog-entry-1097.html … Is this the one?
@naota344 @m_bird @rkmathi Ah, that one has been solved. Writeback is no longer done as an extension of memory reclaim.
- (RFC PATCH 0/5) Reduce filesystem writeback from page reclaim (again)
Patch 1 disables writeback of filesystem pages from direct reclaim entirely. Anonymous pages are still written
Patch 2 disables writeback of filesystem pages from kswapd unless the priority is raised to the point where kswapd is considered to be in trouble.
Patch 3 throttles reclaimers if too many dirty pages are being encountered and the zones or backing devices are congested.
Patch 4 invalidates dirty pages found at the end of the LRU so they are reclaimed quickly after being written back rather than waiting for a reclaimer to find them
Patch 5 tries to prioritise inodes backing dirty pages found at the end of the LRU.
- 革命の日々! さよなら CONFIG_4KSTACKS
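Current XFS has a different problem: David Chinner disabled write page during memory reclaim without much thought, so if you apply load while there are many dirty pages, the OOM Killer kills processes even though reclaimable cache remains.
- Could the patch above be causing this side effect?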
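For watching how many dirty pages exist while reproducing this kind of load, the counters in /proc/vmstat may help. A minimal sketch, assuming a Linux /proc and that the `nr_dirty` and `nr_writeback` fields are present; `dirty_counters` is a hypothetical helper name, and this only observes the counters, it does not confirm the cause discussed above.

```shell
#!/bin/sh
# Print the nr_dirty and nr_writeback counters (in pages) from
# /proc/vmstat-style input on stdin. Exact name matches only, so
# related fields such as nr_writeback_temp are skipped.
dirty_counters() {
    awk '$1 == "nr_dirty" || $1 == "nr_writeback" { print $1, $2 }'
}

# On a Linux machine, show the current values.
if [ -r /proc/vmstat ]; then
    dirty_counters < /proc/vmstat
fi
```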
An example where disabling dirty page writeback does not seem to avoid the problem
- LKML: Eric Sandeen: Re: Stack overrun in 3.5.0-rc7 w/ cfq
- In 3.5 the stack still appears to be overrun; what actually happened here?
- Wasn't the patch supposed to avoid this?
Other
- LKML: Ted Ts'o: Re: ext4 deep stack with mark_page_dirty reclaim
- An example of a stack overflow on ext4?
- In this example, the stack frames whose symbols contain ext4 appear to consume 2440 bytes
- Couldn't other filesystems, not just XFS, also overrun the stack?
misc
- duplicate UUID
[358274.571175] XFS (dm-2): Filesystem has duplicate UUID - can't mount
- Linux Tips - XFS Filesystem has duplicate UUID problem
mount -o nouuid /dev/sdb7 disk-7
xfs_admin -U generate /dev/sdb7
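Before reaching for `nouuid`, it can help to see which devices actually share a UUID. A small sketch that reads `blkid`-style lines on stdin; `duplicate_uuids` is a hypothetical name, and the first two device lines in the sample are invented for illustration.

```shell
#!/bin/sh
# Read blkid-style lines on stdin and print every filesystem UUID that
# occurs more than once. The leading space in the pattern keeps
# PARTUUID="..." fields from matching.
duplicate_uuids() {
    sed -n 's/.* UUID="\([^"]*\)".*/\1/p' | sort | uniq -d
}

# Invented sample input; on a real system: sudo blkid | duplicate_uuids
duplicate_uuids <<'EOF'
/dev/sdb7: UUID="11111111-2222-3333-4444-555555555555" TYPE="xfs"
/dev/dm-2: UUID="11111111-2222-3333-4444-555555555555" TYPE="xfs"
/dev/sdb1: UUID="e47dbfff-97ca-43b8-9a29-89700956a7dd" TYPE="xfs"
EOF
```

Only the UUID shared by the first two sample lines is printed.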
- http://lwn.net/Articles/476263/ XFS: the filesystem of the future? [LWN.net]
testing xfs_growfs
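- Conclusion
- To grow toward the end of the disk, recreate the partition without changing its start sector, remount, and run xfs_growfs
- To grow toward the start of the disk, the partition contents must first be moved with dd or a similar tool
- If the total amount of data is small enough, it is probably less error-prone to create a separate partition at the front beforehand, cp the data over, delete the old partition, and then grow the new partition toward the end
- Using LVM would probably make this simpler
- However, LVM itself adds work (recovery procedures after a failure become more complex)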
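The first two conclusions boil down to a check on the partition's start sector. A rough sketch of that decision; `grow_plan` is a hypothetical helper, and the third branch simply marks a case these notes do not test.

```shell
#!/bin/sh
# Hypothetical helper: given the old and new start sector of a partition,
# say which of the procedures from these notes applies.
grow_plan() {
    old_start=$1
    new_start=$2
    if [ "$new_start" -eq "$old_start" ]; then
        echo "in-place: recreate the partition, mount, xfs_growfs"
    elif [ "$new_start" -lt "$old_start" ]; then
        echo "move-first: copy the data to the new start, then mount and xfs_growfs"
    else
        echo "not-covered: the start would move into existing data"
    fi
}

grow_plan 2120580 2120580   # growing toward the end of the disk
grow_plan 2120580 63        # growing toward the start of the disk
```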
expanding /dev/sdb1
- mkfs
- sudo fdisk /dev/sdb -l
Disk /dev/sdb: 4294 MB, 4294967296 bytes
255 heads, 63 sectors/track, 522 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/sdb doesn't contain a valid partition table
- sudo fdisk /dev/sdb
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-522, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-522, default 522): +1G
/dev/sdb1               1         132     1060258+  83  Linux
- sudo mkfs.xfs /dev/sdb1
meta-data=/dev/sdb1              isize=256    agcount=4, agsize=66266 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=265064, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
- sudo blkid /dev/sdb1
/dev/sdb1: UUID="e47dbfff-97ca-43b8-9a29-89700956a7dd" TYPE="xfs"
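The numbers in the fdisk and mkfs.xfs output above are consistent with each other, which the following sketch checks. It assumes the partition starts at sector 63 (as the `fdisk -lu` listing later in these notes shows) and that the filesystem uses whole 4096-byte blocks; `cyl_sectors` and `fs_blocks` are hypothetical helper names.

```shell
#!/bin/sh
# Sectors in a range of whole cylinders, for the 255-head, 63-sector
# geometry fdisk reports (255 * 63 = 16065 sectors per cylinder).
cyl_sectors() {   # first_cyl last_cyl
    echo $(( ($2 - $1 + 1) * 255 * 63 ))
}

# Whole filesystem blocks that fit in a number of 512-byte sectors.
fs_blocks() {     # sectors block_size
    echo $(( $1 * 512 / $2 ))
}

# sdb1 spans cylinders 1-132 but begins at sector 63, after the first track.
sectors=$(( $(cyl_sectors 1 132) - 63 ))
echo "sectors:          $sectors"
echo "xfs blocks:       $(fs_blocks "$sectors" 4096)"
echo "agcount * agsize: $(( 4 * 66266 ))"
```

Both block counts come out as 265064, matching `blocks=265064` in the mkfs.xfs output.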
- mount
- sudo mkdir /mnt/sdb1
- sudo mount /dev/sdb1 /mnt/sdb1
- df /mnt/sdb1/ -h
/dev/sdb1 1.1G 4.2M 1022M 1% /mnt/sdb1
- Prepare test data
- sudo dd if=/dev/urandom of=/mnt/sdb1/rand.10M bs=1024 count=10240
- cp /mnt/sdb1/rand.10M .
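The write-then-verify pattern used throughout these tests can be tried without a spare disk. A sketch under the assumption that `dd` and GNU `sha1sum` are available; it works in a temporary directory rather than on /mnt/sdb1, and `digest` is a hypothetical helper.

```shell
#!/bin/sh
# Print only the hex SHA-1 digest of a file.
digest() {
    sha1sum -b "$1" | cut -d' ' -f1
}

# Create a 10 MiB random file, keep a copy, and compare the digests.
workdir=$(mktemp -d)
dd if=/dev/urandom of="$workdir/rand.10M" bs=1024 count=10240 2>/dev/null
cp "$workdir/rand.10M" "$workdir/copy.10M"

if [ "$(digest "$workdir/rand.10M")" = "$(digest "$workdir/copy.10M")" ]; then
    echo "match"
else
    echo "MISMATCH"
fi
rm -rf "$workdir"
```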
- remake partition
- sudo umount /mnt/sdb1/
- sudo fdisk /dev/sdb
Command (m for help): d
Selected partition 1
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-522, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-522, default 522):
Using default value 522
/dev/sdb1               1         522     4192933+  83  Linux
- sudo blkid /dev/sdb1
/dev/sdb1: UUID="e47dbfff-97ca-43b8-9a29-89700956a7dd" TYPE="xfs"
- sudo mount /dev/sdb1 /mnt/sdb1
- df /mnt/sdb1/ -h
/dev/sdb1 1.1G 15M 1012M 2% /mnt/sdb1
- sha1sum -b rand.10M /mnt/sdb1/rand.10M
72d3a67ecd23e77bbc57eb959b945c6cbdae58e1 *rand.10M
72d3a67ecd23e77bbc57eb959b945c6cbdae58e1 */mnt/sdb1/rand.10M
- resize xfs filesystem
- sudo xfs_growfs /mnt/sdb1/
meta-data=/dev/sdb1              isize=256    agcount=4, agsize=66266 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=265064, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 265064 to 1048233
- df /mnt/sdb1/ -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             4.0G   15M  4.0G   1% /mnt/sdb1
- sha1sum -b rand.10M /mnt/sdb1/rand.10M
72d3a67ecd23e77bbc57eb959b945c6cbdae58e1 *rand.10M
72d3a67ecd23e77bbc57eb959b945c6cbdae58e1 */mnt/sdb1/rand.10M
- The filesystem was grown without problems
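The last line of the xfs_growfs output is the one worth recording. A minimal sketch that pulls the old and new block counts out of it and converts the difference to bytes; `growfs_delta` is a hypothetical helper, and the 4096-byte block size is taken from the `bsize=4096` field above.

```shell
#!/bin/sh
# Read xfs_growfs output on stdin and print "old new added_bytes" for the
# "data blocks changed from OLD to NEW" line. $1 is the block size.
growfs_delta() {
    bsize=${1:-4096}
    while read -r w1 w2 w3 w4 old w6 new rest; do
        if [ "$w1 $w2 $w3 $w4" = "data blocks changed from" ]; then
            echo "$old $new $(( (new - old) * bsize ))"
        fi
    done
}

echo "data blocks changed from 265064 to 1048233" | growfs_delta 4096
```

On a real run the input would come from `sudo xfs_growfs /mnt/sdb1/ | growfs_delta 4096`.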
test with 3 partitions
- make partitions
- sudo fdisk /dev/sdb
- snip
- sudo fdisk /dev/sdb -l
/dev/sdb1               1         132     1060258+  83  Linux
/dev/sdb2             133         394     2104515   83  Linux
/dev/sdb3             395         522     1028160   83  Linux
- mkfs
- sudo mkfs.xfs /dev/sdb1
meta-data=/dev/sdb1              isize=256    agcount=4, agsize=66266 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=265064, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
- sudo mkfs.xfs /dev/sdb2
meta-data=/dev/sdb2              isize=256    agcount=4, agsize=131532 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=526128, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
- sudo mkfs.xfs /dev/sdb3
meta-data=/dev/sdb3              isize=256    agcount=4, agsize=64260 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=257040, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=1200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
- sudo blkid /dev/sdb[123]
/dev/sdb1: UUID="9065c3c2-ab3d-4460-978e-e37c88f31f5f" TYPE="xfs"
/dev/sdb2: UUID="b9515b82-4d3c-457d-8d32-330cfea692df" TYPE="xfs"
/dev/sdb3: UUID="17374506-2cb1-4451-a5a8-b15cb1410870" TYPE="xfs"
- mount
- sudo mkdir /mnt/sdb1
- sudo mkdir /mnt/sdb2
- sudo mkdir /mnt/sdb3
- sudo mount /dev/sdb1 /mnt/sdb1
- sudo mount /dev/sdb2 /mnt/sdb2
- sudo mount /dev/sdb3 /mnt/sdb3
- df -h
/dev/sdb1             1.1G  4.2M 1022M   1% /mnt/sdb1
/dev/sdb2             2.0G  4.2M  2.0G   1% /mnt/sdb2
/dev/sdb3            1000M  4.2M  996M   1% /mnt/sdb3
- Prepare test data
- sudo dd if=/dev/urandom of=/mnt/sdb1/rand1.10M bs=1024 count=10240
- sudo dd if=/dev/urandom of=/mnt/sdb2/rand2.10M bs=1024 count=10240
- sudo dd if=/dev/urandom of=/mnt/sdb3/rand3.10M bs=1024 count=10240
- sha1sum -b /mnt/sdb*/rand*.10M
11b2836d9c88f3890224f5d4250e2dd86ddca3e6 */mnt/sdb1/rand1.10M
3b2fb16c6416b5c6314c396f03275fd20ff8f24b */mnt/sdb2/rand2.10M
0fb674e54da58858fa1a825f724f5e0c782a8a04 */mnt/sdb3/rand3.10M
Growing /dev/sdb2 toward the end of the disk
- remake partition
- sudo umount /dev/sdb[123]
- sudo fdisk /dev/sdb
/dev/sdb1               1         132     1060258+  83  Linux
/dev/sdb2             133         394     2104515   83  Linux
/dev/sdb3             395         522     1028160   83  Linux
Command (m for help): d
Partition number (1-4): 3
Command (m for help): d
Partition number (1-4): 2
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 2
First cylinder (133-522, default 133):
Using default value 133
Last cylinder, +cylinders or +size{K,M,G} (133-522, default 522):
Using default value 522
/dev/sdb1               1         132     1060258+  83  Linux
/dev/sdb2             133         522     3132675   83  Linux
- resize xfs filesystem
- sudo mount /dev/sdb2 /mnt/sdb2
- df -h
/dev/sdb2 2.0G 15M 2.0G 1% /mnt/sdb2
- sha1sum -b /mnt/sdb*/rand*.10M
3b2fb16c6416b5c6314c396f03275fd20ff8f24b */mnt/sdb2/rand2.10M
- sudo xfs_growfs /mnt/sdb2/
meta-data=/dev/sdb2              isize=256    agcount=4, agsize=131532 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=526128, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 526128 to 783168
- sudo blkid /dev/sdb[123]
/dev/sdb1: UUID="9065c3c2-ab3d-4460-978e-e37c88f31f5f" TYPE="xfs"
/dev/sdb2: UUID="b9515b82-4d3c-457d-8d32-330cfea692df" TYPE="xfs"
- df -h
/dev/sdb2 3.0G 15M 3.0G 1% /mnt/sdb2
- The filesystem was grown without problems
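The steps for growing toward the end of the disk can be summarized as a dry run. This sketch only prints the commands used above, it does not execute them; `print_grow_steps` is a hypothetical helper, and the fdisk step remains interactive.

```shell
#!/bin/sh
# Print (not run) the command sequence for growing a partition toward the
# end of the disk: unmount, recreate the partition, mount, grow.
print_grow_steps() {   # disk partition_number mount_point
    disk=$1
    num=$2
    mnt=$3
    echo "sudo umount $disk$num"
    echo "sudo fdisk $disk    # d $num, then n $num with the same first cylinder and a larger last cylinder"
    echo "sudo mount $disk$num $mnt"
    echo "sudo xfs_growfs $mnt"
}

print_grow_steps /dev/sdb 2 /mnt/sdb2
```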
Growing /dev/sdb2 toward the start of the disk
Unsuitable procedures
remake partition with name sdb1
- sudo umount /dev/sdb[123]
- sudo fdisk /dev/sdb
/dev/sdb1               1         132     1060258+  83  Linux
/dev/sdb2             133         522     3132675   83  Linux
Command (m for help): d
Partition number (1-4): 1
Command (m for help): d
Selected partition 2
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-522, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-522, default 522):
Using default value 522
/dev/sdb1               1         522     4192933+  83  Linux
- sudo mount /dev/sdb1 /mnt/sdb1
- sha1sum -b /mnt/sdb*/rand*.10M
11b2836d9c88f3890224f5d4250e2dd86ddca3e6 */mnt/sdb1/rand1.10M
- As expected, the contents of sdb1 are mounted rather than those of sdb2
remake partition with name sdb2
- sudo umount /dev/sdb[123]
- sudo fdisk /dev/sdb
/dev/sdb1               1         522     4192933+  83  Linux
Command (m for help): d
Selected partition 1
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 2
First cylinder (1-522, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-522, default 522):
Using default value 522
/dev/sdb2               1         522     4192933+  83  Linux
- sudo mount /dev/sdb2 /mnt/sdb2
- sha1sum -b /mnt/sdb*/rand*.10M
11b2836d9c88f3890224f5d4250e2dd86ddca3e6 */mnt/sdb2/rand1.10M
- Again, the contents of sdb1 are mounted
overwrite sdb1 with 0x00 and remake partition
- sudo umount /dev/sdb[123]
- sudo fdisk /dev/sdb
/dev/sdb2               1         522     4192933+  83  Linux
Command (m for help): d
Selected partition 2
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-522, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-522, default 522): 132
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 2
First cylinder (133-522, default 133):
Using default value 133
Last cylinder, +cylinders or +size{K,M,G} (133-522, default 522):
Using default value 522
/dev/sdb1               1         132     1060258+  83  Linux
/dev/sdb2             133         522     3132675   83  Linux
- sudo dd if=/dev/zero of=/dev/sdb1 bs=1024 count=10240
- sudo mount /dev/sdb1 /mnt/sdb1 -t xfs
[ 8848.630996] XFS: bad magic number
[ 8848.630996] XFS: SB validate failed
- sudo mount /dev/sdb2 /mnt/sdb2
- sha1sum -b /mnt/sdb*/rand*.10M
3b2fb16c6416b5c6314c396f03275fd20ff8f24b */mnt/sdb2/rand2.10M
- sudo umount /dev/sdb[123]
- sudo fdisk /dev/sdb
/dev/sdb1               1         132     1060258+  83  Linux
/dev/sdb2             133         522     3132675   83  Linux
Command (m for help): d
Partition number (1-4): 1
Command (m for help): d
Selected partition 2
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 2
First cylinder (1-522, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-522, default 522):
Using default value 522
/dev/sdb2               1         522     4192933+  83  Linux
- sudo mount /dev/sdb2 /mnt/sdb2 -t xfs
[ 9060.520973] XFS: bad magic number
[ 9060.520973] XFS: SB validate failed
- The partition can no longer be mounted at all
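All three failures are consistent with the filesystem being found purely by what sits at the partition's first sector: the partition number is only a label, and an XFS superblock begins with the magic bytes `XFSB`. A sketch for checking that on an image file or device; `has_xfs_magic` is a hypothetical helper and the demonstration uses a temporary file rather than /dev/sdb.

```shell
#!/bin/sh
# Succeed if the first four bytes of a file or block device are "XFSB"
# (hex 58 46 53 42), the magic number at the start of an XFS superblock.
has_xfs_magic() {
    magic=$(dd if="$1" bs=4 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "58465342" ]
}

# Demonstration on a scratch file instead of a real partition.
img=$(mktemp)
printf 'XFSB' > "$img"
has_xfs_magic "$img" && echo "superblock magic present"
# Overwrite the start with zeros, as the dd from /dev/zero did above.
dd if=/dev/zero of="$img" bs=1024 count=1 2>/dev/null
has_xfs_magic "$img" || echo "bad magic number"
rm -f "$img"
```

On a real disk the argument would be the partition device (for example /dev/sdb1) and the check needs root.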
A procedure that appears to be suitable
- mount
- sudo fdisk -l /dev/sdb
/dev/sdb1               1         132     1060258+  83  Linux
/dev/sdb2             133         522     3132675   83  Linux
- sudo fdisk -lu /dev/sdb
/dev/sdb1              63     2120579     1060258+  83  Linux
/dev/sdb2         2120580     8385929     3132675   83  Linux
- sudo mount /dev/sdb1 /mnt/sdb1
- sudo mount /dev/sdb2 /mnt/sdb2
- df -h
/dev/sdb1             1.1G   15M 1012M   2% /mnt/sdb1
/dev/sdb2             3.0G   15M  3.0G   1% /mnt/sdb2
- sha1sum -b /mnt/sdb*/rand*.10M
11b2836d9c88f3890224f5d4250e2dd86ddca3e6 */mnt/sdb1/rand1.10M
3b2fb16c6416b5c6314c396f03275fd20ff8f24b */mnt/sdb2/rand2.10M
- Prepare test data
- sudo dd if=/dev/urandom of=/mnt/sdb1/rand1.900M bs=512 count=921600 (note: with bs=512 this writes 450 MiB, not the 900 MiB the file name suggests)
- sudo dd if=/dev/urandom of=/mnt/sdb2/rand2.2900M bs=1024 count=2969600
- sudo dd if=/dev/urandom of=/mnt/sdb1/rand1.max bs=512
- sudo dd if=/dev/urandom of=/mnt/sdb2/rand2.max bs=512 &
- df -h
/dev/sdb1             1.1G  1.1G   20K 100% /mnt/sdb1
/dev/sdb2             3.0G  3.0G   20K 100% /mnt/sdb2
- sha1sum -b /mnt/sdb*/rand* > checksum.sha1
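The file names encode the intended sizes, so it is worth checking them against the dd arguments. A small sketch; `dd_bytes` and `to_mib` are hypothetical helpers.

```shell
#!/bin/sh
# Bytes written by dd for a given block size and block count.
dd_bytes() {   # bs count
    echo $(( $1 * $2 ))
}

# Convert bytes to whole MiB.
to_mib() {
    echo $(( $1 / 1048576 ))
}

to_mib "$(dd_bytes 1024 10240)"     # rand1.10M, rand2.10M
to_mib "$(dd_bytes 512 921600)"     # rand1.900M
to_mib "$(dd_bytes 1024 2969600)"   # rand2.2900M
```

This prints 10, 450 and 2900.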
- remake partition
- sudo umount /dev/sdb[123]
- sudo fdisk /dev/sdb
Command (m for help): d
Partition number (1-4): 1
Command (m for help): d
Selected partition 2
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-522, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-522, default 522):
Using default value 522
/dev/sdb1               1         522     4192933+  83  Linux
- sudo mount /dev/sdb1 /mnt/sdb1
- sha1sum -c checksum.sha1
/mnt/sdb1/rand1.10M: OK
/mnt/sdb1/rand1.900M: OK
/mnt/sdb1/rand1.max: OK
sha1sum: /mnt/sdb2/rand2.10M: No such file or directory
/mnt/sdb2/rand2.10M: FAILED open or read
sha1sum: /mnt/sdb2/rand2.2900M: No such file or directory
/mnt/sdb2/rand2.2900M: FAILED open or read
sha1sum: /mnt/sdb2/rand2.max: No such file or directory
/mnt/sdb2/rand2.max: FAILED open or read
sha1sum: WARNING: 3 of 6 listed files could not be read
- The contents of the old sdb1 are mounted
- copy sdb2 to sdb1 with dd
- sudo umount /dev/sdb[123]
- sudo dd if=/dev/sdb of=/dev/sdb bs=512 count=6265350 seek=63 skip=2120580 &
/dev/sdb1              63     2120579     1060258+  83  Linux
/dev/sdb2         2120580     8385929     3132675   83  Linux
- seek=BLOCKS
- skip BLOCKS obs-sized blocks at start of output
- skip=BLOCKS
- skip BLOCKS ibs-sized blocks at start of input
- count: 6265350 = 8385929 - 2120580 + 1 (the number of sectors in the old sdb2)
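The dd arguments follow directly from the sector numbers in the `fdisk -lu` listing. A sketch that derives them; `move_args` is a hypothetical helper and it only prints the arguments, it does not copy anything. Note that the copy overlaps itself; it appears to be safe here only because the destination starts before the source, so each sector is read before it can be overwritten.

```shell
#!/bin/sh
# Print dd arguments for copying sectors src_start..src_end of a disk to
# dst_start on the same disk, using 512-byte sectors.
move_args() {   # src_start src_end dst_start
    echo "bs=512 count=$(( $2 - $1 + 1 )) skip=$1 seek=$3"
}

# Old sdb2 occupied sectors 2120580-8385929; the new sdb1 starts at sector 63.
move_args 2120580 8385929 63
```

The output matches the arguments of the dd command above: count=6265350, skip=2120580, seek=63.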
- re-mount
- sudo mount /dev/sdb1 /mnt/sdb2
- df -h
/dev/sdb1 3.0G 3.0G 20K 100% /mnt/sdb2
- verify data
- sha1sum -c checksum.sha1
sha1sum: /mnt/sdb1/rand1.10M: No such file or directory
/mnt/sdb1/rand1.10M: FAILED open or read
sha1sum: /mnt/sdb1/rand1.900M: No such file or directory
/mnt/sdb1/rand1.900M: FAILED open or read
sha1sum: /mnt/sdb1/rand1.max: No such file or directory
/mnt/sdb1/rand1.max: FAILED open or read
/mnt/sdb2/rand2.10M: OK
/mnt/sdb2/rand2.2900M: OK
/mnt/sdb2/rand2.max: OK
sha1sum: WARNING: 3 of 6 listed files could not be read
- The contents are those of the old sdb2, with no problems
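With many files, the interleaved error lines make `sha1sum -c` output hard to scan. A sketch that tallies the results from stdin; `checksum_summary` is a hypothetical helper, and the sample input is an excerpt of the output above.

```shell
#!/bin/sh
# Count OK and FAILED entries in `sha1sum -c` output read from stdin.
# The "sha1sum: ..." diagnostic lines match neither pattern and are ignored.
checksum_summary() {
    awk '/: OK$/     { ok++ }
         /: FAILED/  { failed++ }
         END         { printf "ok=%d failed=%d\n", ok, failed }'
}

checksum_summary <<'EOF'
sha1sum: /mnt/sdb1/rand1.10M: No such file or directory
/mnt/sdb1/rand1.10M: FAILED open or read
/mnt/sdb2/rand2.10M: OK
/mnt/sdb2/rand2.2900M: OK
/mnt/sdb2/rand2.max: OK
EOF
```

On a real run: `sha1sum -c checksum.sha1 2>&1 | checksum_summary`.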
- resize xfs filesystem
- sudo xfs_growfs /mnt/sdb2
meta-data=/dev/sdb1              isize=256    agcount=6, agsize=131532 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=783168, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 783168 to 1048233
- df -h
/dev/sdb1 4.0G 3.0G 1.1G 75% /mnt/sdb2