Green Mutexes

InnoDB implements its own mutex and rw-lock using pthread synchronization objects when compiled for Linux. These add a great feature -- SHOW INNODB STATUS can list the objects and source locations on which threads are blocked. Unfortunately this comes at a price: the InnoDB mutex uses much more CPU than a pthread mutex on workloads with mutex contention. Feature request 52806 is open for this, but its description is vague.

 

How much more CPU does it use?

 

I extracted all of the code needed to use an InnoDB mutex into a standalone file, including much of the sync array code. I then added benchmark functions that do the following in a loop: lock the mutex, increment a counter, sleep for a configurable amount of time, and then unlock the mutex. This was implemented for both pthread mutexes and the InnoDB mutex and run with a configurable number of threads on a test server that has 24 CPUs with hyperthreading enabled. The test was run with 256 threads and repeated for sleep times of 1, 10 and 50 microseconds, where the sleep time is how long a thread holds the mutex. On this hardware the maximum duration of an InnoDB mutex busy-wait is 31 microseconds with the default options, so these sleep times cover both the case where busy-waiting might succeed and the case where it definitely will not.

The performance summary: the InnoDB mutex uses much more CPU and does many more context switches, while response time was slightly better with the pthread mutex.

 

For a 1 microsecond sleep time:

  • the InnoDB mutex takes 60.7 usecs/loop, does 750k context switches/second and has 48% CPU utilization
  • the pthread mutex takes 53.4 usecs/loop, does 80k context switches/second and has 0% CPU utilization

For a 10 microsecond sleep time:

  • the InnoDB mutex takes 70.1 usecs/loop, does 750k context switches/second and has 37% CPU utilization
  • the pthread mutex takes 62.7 usecs/loop, does 66k context switches/second and has 0% CPU utilization

For a 50 microsecond sleep time:

  • the InnoDB mutex takes 110.9 usecs/loop, does 720k context switches/second and has 38% CPU utilization
  • the pthread mutex takes 109.5 usecs/loop, does 38k context switches/second and has 0% CPU utilization

All of these tests were repeated for 8, 10, 11, 12, 13 and 16 threads. The overhead from the InnoDB mutex begins to show at 10 threads in my case. Note that my test server has 12 cores, or 24 logical CPUs with hyperthreading enabled. At 11 threads the InnoDB mutex has 16% CPU utilization.

 

Why does it use more CPU?

 

There are a few reasons why the InnoDB mutex uses a lot more CPU. The first is that all waiting threads are woken when a mutex is unlocked. By comparison, the Linux pthread mutex wakes only one thread in that case. Since at most one of the woken threads can actually lock the mutex after it starts to run, spending the CPU time to wake all of them probably isn't a good idea.

 

The second reason is that the InnoDB implementation keeps all blocked threads in the sync array. Not only does this require more code, the code also has inefficiencies. The sync array is guarded by a global pthread mutex, which introduces more contention, and that mutex is locked and unlocked twice to record that a thread is sleeping. The code also uses an os_event, which is similar to a pthread condition variable and internally contains a pthread mutex. InnoDB creates one os_event per InnoDB mutex, and threads wait on it. Getting a thread to wait on the os_event requires two more pthread mutex lock/unlock calls, which uses even more CPU and adds more contention.
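A condensed sketch of what an os_event looks like on POSIX, based on the description above: a flag guarded by a pthread mutex plus a condition variable. The names follow InnoDB loosely and the details are simplified; the real code also handles event reset and signal counting:

```c
#include <pthread.h>
#include <stdbool.h>

/* Simplified os_event: each InnoDB mutex owns one of these. */
typedef struct {
    pthread_mutex_t mutex;   /* an extra lock/unlock pair per wait */
    pthread_cond_t  cond;
    bool            is_set;
} os_event_t;

void os_event_init(os_event_t *ev) {
    pthread_mutex_init(&ev->mutex, NULL);
    pthread_cond_init(&ev->cond, NULL);
    ev->is_set = false;
}

/* called on mutex unlock when there are waiters */
void os_event_set(os_event_t *ev) {
    pthread_mutex_lock(&ev->mutex);
    ev->is_set = true;
    pthread_cond_broadcast(&ev->cond);  /* wakes ALL waiters */
    pthread_mutex_unlock(&ev->mutex);
}

/* called by a thread that gave up busy-waiting */
void os_event_wait(os_event_t *ev) {
    pthread_mutex_lock(&ev->mutex);     /* more pthread locking here */
    while (!ev->is_set)
        pthread_cond_wait(&ev->cond, &ev->mutex);
    pthread_mutex_unlock(&ev->mutex);
}
```

Each wait therefore costs pthread mutex operations on the os_event's internal mutex on top of the two lock/unlock pairs on the global sync array mutex, which is the overhead the paragraph above describes.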

 

The third reason is that the InnoDB implementation uses its own busy-wait loop that retries mutex lock attempts for some time before going to sleep on the os_event. The manual has details on InnoDB spin wait configuration. In 5.1 the default is to retry 30 times with a maximum delay of 1 microsecond (on my hardware) between retries. So each thread spins for at most 30 microseconds (on my hardware) and then goes to sleep on the os_event. Note that many of the pthread mutexes used by MySQL also have a busy-wait loop, although the documentation for PTHREAD_MUTEX_ADAPTIVE_NP can be hard to find.

 

Does tuning fix this?

 

There are two problems. The first is high user CPU utilization from the busy-wait loop; its impact can be reduced by tuning the busy-wait my.cnf options. The second is high sys CPU utilization and a high context switch rate, and not much can be tuned to fix that. The cause is that all waiting threads are woken on mutex unlock (see os_event_set). I don't think the fix is as simple as changing pthread_cond_broadcast to pthread_cond_signal in os_event_set.

 

The manual has details on tuning the InnoDB mutex busy-wait loop. By default, innodb_spin_wait_delay=6 and innodb_sync_spin_loops=30. I used a different test server than in the experiments above, but it also has 24 cores. With the default settings the InnoDB mutex does 760k context switches/second with 36% CPU utilization (user=12%, sys=24%). After reducing innodb_sync_spin_loops from 30 to 10, the context switch rate drops to 720k/second and CPU utilization drops to 30% (user=6%, sys=24%).
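If you want to experiment with this, the two spin-wait knobs discussed above go in my.cnf. The values below are the reduced configuration from this test (the defaults are 6 and 30):

```ini
[mysqld]
# default is 6; upper bound on the random per-iteration spin delay
innodb_spin_wait_delay = 6
# default is 30; fewer spin loops means less user CPU burned per wait
innodb_sync_spin_loops = 10
```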



  • Inaam Rana Mark, nice job explaining the issue. I have tried using pthread_cond_signal but, as you said, it is not as simple as changing _broadcast to _signal. On the workloads we tested it on, it actually caused a regression. The problem seems to be that in an overloaded system the contention converges on one or two mutexes. Most of the waiting threads line up for these couple of hot mutexes. With pthread_cond_broadcast everyone is woken up and quite a few of them get a chance to acquire the mutex when they eventually get to the CPU. With pthread_cond_signal only one thread wakes up, waits for its turn on the CPU, finishes with the mutex and then wakes up another thread, thus serializing the time spent in the OS run queue.

    But I do agree that we need to look into this and come up with a better solution.
    Tuesday 9:30 AM
  • Mark Callaghan The busy-wait loop is an opportunity to overcome some of that latency. If you wake a few threads then one of them might get the mutex immediately and the others while running the busy-wait loop. But I am not clear on a few things -- like how long it takes to wake a thread on a contended server and how long it takes to put a thread to sleep.
    Tuesday 9:52 AM
  • Matthew Boehm Is this on the 5.1 built-in InnoDB or the InnoDB plugin provided in the 5.1 distro?
    Tuesday 11:58 AM
  • Bradley C Kuszmaul How do you sleep for 1 microsecond?
    Tuesday 4:46 PM
  • Mark Callaghan Numbers are from the 5.1 plugin. I extracted all the code needed into one file, so the perf tests are from my standalone code.

    Bradley C Kuszmaul - you are right. While I might have called usleep(1), the sleep probably wasn't 1 usec. Based on the perf results the sleep time was different than for usleep(10).
    Tuesday 4:49 PM
  • Mark Callaghan Would have been better to stay busy in a loop reading rdtsc until X usecs passed. Regardless, I think my point still stands that this code uses far too much CPU.
    Tuesday 4:54 PM
  • Bradley C Kuszmaul Can you publish your single file of code?
    Tuesday 6:34 PM
  • Mark Callaghan Shared with you
    Tuesday 6:45 PM
  • Vladislav Vaintroub Is PTHREAD_MUTEX_ADAPTIVE_NP worth using? Did you measure it?
    Yesterday 3:51 AM
  • Mark Callaghan I have results from using it and from not using it. It didn't make a difference in my tests. Perhaps it will make a difference if I change the code to use rdtsc and get more accurate lock-hold times.

    InnoDB uses PTHREAD_MUTEX_ADAPTIVE_NP for the pthread mutexes contained by the os_event structs it creates, but that option isn't used for the pthread mutex that guards access to the sync array.
    Yesterday 5:56 AM
  • Bradley C Kuszmaul Re: usleep precision. On my laptop (Sandy Bridge i7-2640M 2.8GHz, running FC17), I see the following numbers: usleep(1) gives sleep times of about 50-60us, usleep(10) gives 60-70us, and usleep(100) gives about 150us. I measured it with rdtsc and plotted it with gnuplot. http://people.csail.mit.edu/bradley/usleep/usleep.pdf (See http://people.csail.mit.edu/bradley/usleep/ for the code and plotting script)
    Yesterday 6:13 AM
  • Chang Chen Hi Mark. Can you share the single file of code? Thanks
