Sunday, January 23, 2011

limit linux background flush (dirty pages)

Background flushing in Linux happens when either too much written data is pending (adjustable via /proc/sys/vm/dirty_background_ratio) or a timeout for pending writes is reached (/proc/sys/vm/dirty_expire_centisecs). Until another limit is hit (/proc/sys/vm/dirty_ratio), more written data may be cached; beyond that limit, further writes block.
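The current thresholds and the amount of dirty data pending write-back can be inspected like this (a minimal sketch; the procfs paths are the ones named above):

    # ratio thresholds (percent of RAM; the *_bytes variants override them when set)
    cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
    # write-back timeout, in centiseconds
    cat /proc/sys/vm/dirty_expire_centisecs
    # dirty data currently waiting for write-back
    grep -E 'Dirty|Writeback' /proc/meminfo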

In theory, this should create a background process writing out dirty pages without disturbing other processes. In practice, it does disturb any process doing uncached reading or synchronous writing. Badly. This is because the background flush actually writes at 100% device speed, and any other device request issued at that time is delayed (because all queues and write-caches along the way are filled).

Is there any way to limit the number of requests per second the flushing process performs, or otherwise effectively prioritize other device I/O?

  • What is your average for Dirty in /proc/meminfo? This should not normally exceed your /proc/sys/vm/dirty_ratio. On a dedicated file server I have dirty_ratio set to a very high percentage of memory (90), as I will never exceed it. Your dirty_ratio is too low; when you hit it, everything craps out, so raise it (a sketch follows this comment thread).

    korkman : The problem is not processes being blocked when hitting dirty_ratio. I'm okay with that. But the "background" process writing out dirty data to the disks fills up queues without mercy and kills IOPS performance. I think it's called IO starvation. In fact, setting dirty_bytes extremely low (like 1 MB) helps a lot, because flushing will occur almost immediately and queues will be kept empty. The drawback is possibly lower sequential throughput, but that's okay.
    Luke : You turned off all elevators? What else did you tweak from a vanilla system?
    korkman : See my self-answer. The end of the story was to remove dirty caching and leave that part to the HW controller. Elevators are kinda irrelevant with HW write-cache in place. The controller has its own elevator algorithms, so having any elevator in software only adds overhead.
    From Luke
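    Raising dirty_ratio as suggested above would look like this (a sketch; 90 is just the example value from the comment, run as root):

    echo 90 > /proc/sys/vm/dirty_ratio
    # or, equivalently:
    sysctl -w vm.dirty_ratio=90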
  • After lots of benchmarking with sysbench, I came to this conclusion:

    To survive (performance-wise) a situation where

    • an evil copy process floods dirty pages
    • and hardware write-cache is present (possibly also without that)
    • and synchronous reads or writes per second (IOPS) are critical

    just dump all elevators, queues and dirty page caches. The correct place for dirty pages is in the RAM of that hardware write-cache.
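    In shell terms, "dump all elevators and queues" amounts to something like this (a sketch; sdX stands for each member disk, and the available scheduler names depend on the kernel build):

    # let the controller's write-cache do the reordering: use the no-op elevator
    echo noop > /sys/block/sdX/queue/scheduler

    # keep the software request queue short so it cannot pile up ahead of the controller
    echo 4 > /sys/block/sdX/queue/nr_requests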

    Adjust dirty_ratio (or the newer dirty_bytes) as low as possible, but keep an eye on sequential throughput. In my particular case, 15 MB was the optimum (echo 15000000 > /proc/sys/vm/dirty_bytes).
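    As a runtime and persistent sketch (the sysctl.conf entry is just one way to keep the setting across reboots):

    # cap the dirty page cache at ~15 MB; setting dirty_bytes makes the kernel ignore dirty_ratio
    echo 15000000 > /proc/sys/vm/dirty_bytes

    # e.g. in /etc/sysctl.conf:
    # vm.dirty_bytes = 15000000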

    This is more a hack than a solution, because gigabytes of RAM are now used for read caching only instead of dirty cache. For dirty cache to work out well in this situation, the Linux kernel's background flusher would need to average the speed at which the underlying device accepts requests and adjust background flushing accordingly. Not easy.


    Specs and benchmarks for comparison:

    Tested while dd'ing zeros to disk, sysbench showed a huge improvement, boosting 10-thread fsync writes at 16 kB from 33 to 700 IOPS (idle limit: 1500 IOPS) and single-thread writes from 8 to 400 IOPS.

    Without load, IOPS were unaffected (~1500) and throughput slightly reduced (from 251 MB/s to 216 MB/s).

    dd call:

    dd if=/dev/zero of=dumpfile bs=1024 count=20485672
    

    For sysbench, test_file.0 was prepared as an unsparse file with:

    dd if=/dev/zero of=test_file.0 bs=1024 count=10485672
    

    sysbench call for 10 threads:

    sysbench --test=fileio --file-num=1 --num-threads=10 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run
    

    sysbench call for 1 thread:

    sysbench --test=fileio --file-num=1 --num-threads=1 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run
    

    Smaller block sizes showed even more drastic numbers.

    --file-block-size=4096 with 1 GB dirty_bytes:

    sysbench 0.4.12:  multi-threaded system evaluation benchmark
    
    Running the test with following options:
    Number of threads: 1
    
    Extra file open flags: 0
    1 files, 10Gb each
    10Gb total file size
    Block size 4Kb
    Number of random requests for random IO: 0
    Read/Write ratio for combined random IO test: 1.50
    Calling fsync() after each write operation.
    Using synchronous I/O mode
    Doing random write test
    Threads started!
    Time limit exceeded, exiting...
    Done.
    
    Operations performed:  0 Read, 30 Write, 30 Other = 60 Total
    Read 0b  Written 120Kb  Total transferred 120Kb  (3.939Kb/sec)
          0.98 Requests/sec executed
    
    Test execution summary:
          total time:                          30.4642s
          total number of events:              30
          total time taken by event execution: 30.4639
          per-request statistics:
               min:                                 94.36ms
               avg:                               1015.46ms
               max:                               1591.95ms
               approx.  95 percentile:            1591.30ms
    
    Threads fairness:
          events (avg/stddev):           30.0000/0.00
          execution time (avg/stddev):   30.4639/0.00
    

    --file-block-size=4096 with 15 MB dirty_bytes:

    sysbench 0.4.12:  multi-threaded system evaluation benchmark
    
    Running the test with following options:
    Number of threads: 1
    
    Extra file open flags: 0
    1 files, 10Gb each
    10Gb total file size
    Block size 4Kb
    Number of random requests for random IO: 0
    Read/Write ratio for combined random IO test: 1.50
    Calling fsync() after each write operation.
    Using synchronous I/O mode
    Doing random write test
    Threads started!
    Time limit exceeded, exiting...
    Done.
    
    Operations performed:  0 Read, 13524 Write, 13524 Other = 27048 Total
    Read 0b  Written 52.828Mb  Total transferred 52.828Mb  (1.7608Mb/sec)
        450.75 Requests/sec executed
    
    Test execution summary:
          total time:                          30.0032s
          total number of events:              13524
          total time taken by event execution: 29.9921
          per-request statistics:
               min:                                  0.10ms
               avg:                                  2.22ms
               max:                                145.75ms
               approx.  95 percentile:              12.35ms
    
    Threads fairness:
          events (avg/stddev):           13524.0000/0.00
          execution time (avg/stddev):   29.9921/0.00
    

    --file-block-size=4096 with 15 MB dirty_bytes on idle system:

    sysbench 0.4.12: multi-threaded system evaluation benchmark

    Running the test with following options:
    Number of threads: 1
    
    Extra file open flags: 0
    1 files, 10Gb each
    10Gb total file size
    Block size 4Kb
    Number of random requests for random IO: 0
    Read/Write ratio for combined random IO test: 1.50
    Calling fsync() after each write operation.
    Using synchronous I/O mode
    Doing random write test
    Threads started!
    Time limit exceeded, exiting...
    Done.
    
    Operations performed:  0 Read, 43801 Write, 43801 Other = 87602 Total
    Read 0b  Written 171.1Mb  Total transferred 171.1Mb  (5.7032Mb/sec)
     1460.02 Requests/sec executed
    
    Test execution summary:
          total time:                          30.0004s
          total number of events:              43801
          total time taken by event execution: 29.9662
          per-request statistics:
               min:                                  0.10ms
               avg:                                  0.68ms
               max:                                275.50ms
               approx.  95 percentile:               3.28ms
    
    Threads fairness:
          events (avg/stddev):           43801.0000/0.00
          execution time (avg/stddev):   29.9662/0.00
    

    Test-System:

    • Adaptec 5405Z (that's 512 MB write-cache with protection)
    • Intel Xeon L5520
    • 6 GiB RAM @ 1066 MHz
    • Motherboard Supermicro X8DTN (5520 Chipset)
    • 12 Seagate Barracuda 1 TB disks
      • 10 in Linux software RAID 10
    • kernel 2.6.32
    • filesystem XFS
    • Debian unstable

    In sum, I am now sure this configuration will perform well in idle, high-load, and even full-load situations for database traffic that would otherwise have been starved by sequential traffic. Sequential throughput is higher than two gigabit links can deliver anyway, so reducing it a bit is no problem.

    Thanks for reading and comments!

    From korkman
