ZFS (sync, async) R/W IOPS / throughput performance tuning

This article discusses techniques for tuning ZFS read/write IOPS and throughput, treating synchronous and asynchronous I/O separately.

Factors that affect performance
1. The performance of the underlying devices directly determines synchronous read/write IOPS and throughput. Asynchronous reads and writes depend mainly on the cache devices and configuration (ARC, L2ARC); see the sketch below.
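
For instance, when spare SSDs are available, asynchronous read performance can be helped by an L2ARC cache device, and synchronous write latency by a separate log (SLOG) device. A minimal sketch, assuming a pool named zp1 and hypothetical unused SSDs sde and sdf:
# zpool add zp1 cache sde    # SSD as L2ARC: extends the read cache beyond RAM
# zpool add zp1 log sdf      # SSD as SLOG: absorbs synchronous (ZIL) writes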

2. The redundancy layout of each vdev affects IOPS and throughput.
Because ZPOOL I/O is spread evenly across all vdevs, the more vdevs a pool has, the higher its IOPS and throughput (see the sketch below).
Within a single vdev, write performance ranks mirror > raidz1 > raidz2 > raidz3.
Read performance is determined by the number of data disks: a 3-disk raidz1, a 4-disk raidz2 and a 5-disk raidz3 each have two data disks, so they read about equally well and all beat a mirror. (raidz1(3) = raidz2(4) = raidz3(5) > mirror(n))
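
As an illustration, the two pools below use four disks each, but the first presents two vdevs for ZFS to stripe across while the second funnels everything through one raidz group (hypothetical device names):
# zpool create zp1 mirror sda sdb mirror sdc sdd    # 2 vdevs: I/O striped across both mirrors
# zpool create zp2 raidz1 sde sdf sdg sdh           # 1 vdev: a single raidz group handles all I/O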

3. I/O alignment with the underlying devices affects IOPS.
ashift must be specified when the vdev is created and can never be changed afterwards.
Devices within one vdev should ideally share the same sector size. If they differ, either set ashift to the largest sector size present, or place devices with different sector sizes in separate vdevs.
For example, if sda and sdb have 512-byte sectors while sdc and sdd have 4 KiB sectors:
zpool create -o ashift=9 zp1 mirror sda sdb
zpool add -o ashift=12 zp1 mirror sdc sdd
       ashift
           Pool  sector  size exponent, to the power of 2 (internally referred to as "ashift"). I/O operations will be
           aligned to the specified size boundaries. Additionally, the minimum (disk) write size will be  set  to  the
           specified  size,  so  this  represents a space vs. performance trade-off. The typical case for setting this
           property is when performance is important and the underlying disks use 4KiB sectors but report 512B sectors
           to the OS (for compatibility reasons); in that case, set ashift=12 (which is 1<<12 = 4096).

           For  optimal  performance,  the  pool sector size should be greater than or equal to the sector size of the
           underlying disks. Since the property cannot be changed after pool creation, if in a given  pool,  you  ever
           want to use drives that report 4KiB sectors, you must set ashift=12 at pool creation time.

           Keep in mind that the ashift is vdev specific and is not a pool global.  This means that when adding new
           vdevs to an existing pool you may need to specify the ashift.
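
Because ashift is per vdev, it is worth verifying what each vdev actually received. One way is zdb, which ships with ZFS on Linux:
# zdb zp1 | grep ashift    # expect one ashift line per vdev (e.g. 9 for the 512B mirror, 12 for the 4K mirror above)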

The ZFS on Linux source includes a table of sector sizes for common devices:
https://github.com/zfsonlinux/zfs/blob/master/cmd/zpool/zpool_vdev.c#L108
If the sector size of the underlying device is unknown, ashift=13 (8K) can be used to stay aligned.
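
On Linux, the sector sizes a device reports can also be queried directly, which helps pick ashift (hypothetical device name):
# blockdev --getss /dev/sda      # logical sector size as reported to the OS
# blockdev --getpbsz /dev/sda    # physical sector size; 4096 here would suggest ashift=12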
For example:
# zpool create -o ashift=13 zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e
# zpool create -o ashift=9 zp2 scsi-36c81f660eb17fb001b2c5ff465cff3ed
# zfs create -o mountpoint=/data01 zp1/data01
# zfs create -o mountpoint=/data02 zp2/data02

# date +%F%T; dd if=/dev/zero of=/data01/test.img bs=1024K count=8192 oflag=sync,noatime,nonblock; date +%F%T;
2014-06-2609:57:35
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 46.4277 s, 185 MB/s
2014-06-2609:58:22

# date +%F%T; dd if=/dev/zero of=/data02/test.img bs=1024K count=8192 oflag=sync,noatime,nonblock; date +%F%T;
2014-06-2609:58:32
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 43.9984 s, 195 MB/s
2014-06-2609:59:16

# zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zp1   3.62T  8.01G  3.62T     0%  1.00x  ONLINE  -
zp2   3.62T  8.00G  3.62T     0%  1.00x  ONLINE  -

Large sequential writes show no visible difference between the two pools. Small files are where ashift matters: a file smaller than the ashift size still consumes a full sector, which wastes space, lowers small-file write efficiency, and inflates cache usage.
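
The waste can be observed directly by comparing a tiny file's apparent size with its allocated size; a quick sketch on the ashift=13 pool created above:
# echo 1 > /data01/tiny
# ls -l /data01/tiny    # apparent size: 2 bytes
# du -h /data01/tiny    # allocated size: at least one full 8K sector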

4. Underlying device mode: JBOD or pass-through is recommended, so that ZFS manages the disks directly instead of going through the RAID controller's logic.
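
How to enable pass-through depends on the controller. On LSI MegaRAID cards, for instance, JBOD mode can usually be toggled with MegaCli; the exact syntax varies by controller and MegaCli version, so treat this only as a sketch:
# MegaCli64 -AdpSetProp -EnableJBOD -1 -aALL     # enable JBOD mode on all adapters
# MegaCli64 -PDMakeJBOD -PhysDrv[252:0] -aALL    # expose enclosure 252, slot 0 as a JBOD disk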

5. ZFS parameters directly affect IOPS and throughput.
5.1 recordsize
For database-style workloads (large files accessed as small, scattered records), choose a recordsize greater than or equal to the database block size. PostgreSQL, for example, uses an 8K block_size, so recordsize >= 8KB is a good fit there. In general, though, recordsize is best left alone; the default 128K satisfies most workloads.
       recordsize=size
           Specifies a suggested block size for files in the file system. This property is  designed  solely  for  use
           with  database  workloads  that  access  files  in  fixed-size records. ZFS automatically tunes block sizes
           according to internal algorithms optimized for typical access patterns.

           For databases that create very large files but access them in small random chunks, these algorithms may  be
           suboptimal.  Specifying a recordsize greater than or equal to the record size of the database can result in
           significant performance gains. Use of this property for general purpose file systems is strongly
           discouraged, and may adversely affect performance.

           The  size  specified  must  be  a  power  of two greater than or equal to 512 and less than or equal to 128
           Kbytes.

           Changing the file system’s recordsize affects only files created afterward; existing files are  unaffected.

           This property can also be referred to by its shortened column name, recsize.

Test:
# zpool create -o ashift=12 zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e
# zfs create -o mountpoint=/data01 -o recordsize=8K -o atime=off zp1/data01
# zfs create -o mountpoint=/data02 -o recordsize=128K -o atime=off zp1/data02
# zfs create -o mountpoint=/data03 -o recordsize=512 -o atime=off zp1/data03
Disable data caching so the ARC does not skew the results:
# zfs set primarycache=metadata zp1/data01
# zfs set primarycache=metadata zp1/data02
# zfs set primarycache=metadata zp1/data03
# mkdir -p /data01/pgdata
# mkdir -p /data02/pgdata
# mkdir -p /data03/pgdata
# chown postgres:postgres /data0*/pgdata
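
The exact pg_test_fsync invocation is not shown here; presumably it was run as the postgres user against a file on each mountpoint, along these lines (pg_test_fsync ships with PostgreSQL; -f names the test file):
# su - postgres
$ pg_test_fsync -f /data01/pgdata/testfile
$ pg_test_fsync -f /data02/pgdata/testfile
$ pg_test_fsync -f /data03/pgdata/testfile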

pg_test_fsync results: recordsize=512 is clearly the worst, while 8K and 128K perform about the same.
recordsize=512
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        fdatasync                         252.052 ops/sec    3967 usecs/op
        fsync                             248.701 ops/sec    4021 usecs/op
Non-Sync'ed 8kB writes:
        write                            7615.510 ops/sec     131 usecs/op
recordsize=8K
        fdatasync                         329.874 ops/sec    3031 usecs/op
        fsync                             329.008 ops/sec    3039 usecs/op
Non-Sync'ed 8kB writes:
        write                           83849.214 ops/sec      12 usecs/op
recordsize=128K
        fdatasync                         329.207 ops/sec    3038 usecs/op
        fsync                             328.739 ops/sec    3042 usecs/op
Non-Sync'ed 8kB writes:
        write                           76100.311 ops/sec      13 usecs/op


5.2 
