Linux Signal Handler Deadlocks

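A program on Linux can register a signal handler to process signals. A signal handler cannot call just any function, though: one careless call and the program deadlocks. We have stepped into this trap more than once, so this article walks through the signal handler deadlock problem on Linux.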

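A common use of a signal handler is to handle program exit. For example, a program running in the background can register a handler for SIGTERM and shut down when it receives that signal. I once came across code in a project structured roughly like this: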

#include <condition_variable>
#include <csignal>
#include <iostream>
#include <mutex>
#include <thread>

enum class Status {
    START = 0,
    RUNNING = 1,
    STOPPED = 2,
};


std::mutex lock;
std::condition_variable cond;
Status gStatus = Status::START;

void sigHandler(int signo) {
    std::unique_lock<std::mutex> lk(lock);  // may deadlock here
    gStatus = Status::STOPPED;
    cond.notify_all();
    std::cout << "signal handler called!" << std::endl;
}

int main(int argc, char **argv) {
    gStatus = Status::RUNNING;
    std::signal(SIGTERM, sigHandler);

    std::unique_lock<std::mutex> lk(lock);
    // Sleep for 4 s to simulate the program's initialization work
    std::this_thread::sleep_for(std::chrono::seconds(4));
    cond.wait(lk, []() {
        if (gStatus == Status::STOPPED) {
            std::cout << "exit now!" << std::endl;
            return true;
        }
        return false;
    });

    return 0;
}

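It uses a condition variable to control exit: the main thread blocks on the condition variable until gStatus == Status::STOPPED, and the signal handler sets gStatus = Status::STOPPED and notifies the main thread through the condition variable. In between there is the line std::this_thread::sleep_for(std::chrono::seconds(4));, which stands in for the initialization work a program does at startup. With this code, sending the program a SIGTERM within 4 s of startup deadlocks it immediately. Here is the process's stack: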

(gdb) thread apply all bt

Thread 1 (Thread 0x7f617b6ef3c0 (LWP 1663714) "sighdl"):
#0  futex_wait (private=0, expected=2, futex_word=0x55e01f691160 <lock>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x55e01f691160 <lock>, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f617b098002 in lll_mutex_lock_optimized (mutex=0x55e01f691160 <lock>) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x55e01f691160 <lock>) at ./nptl/pthread_mutex_lock.c:93
#4  0x000055e01f68e607 in __gthread_mutex_lock (__mutex=0x55e01f691160 <lock>) at /usr/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
#5  0x000055e01f68e6c4 in std::mutex::lock (this=0x55e01f691160 <lock>) at /usr/include/c++/11/bits/std_mutex.h:100
#6  0x000055e01f68ea37 in std::unique_lock<std::mutex>::lock (this=0x7ffeb101d780) at /usr/include/c++/11/bits/unique_lock.h:139
#7  0x000055e01f68e74b in std::unique_lock<std::mutex>::unique_lock (this=0x7ffeb101d780, __m=...) at /usr/include/c++/11/bits/unique_lock.h:69
#8  0x000055e01f68e34a in sigHandler (signo=15) at sighdl.cpp:19
#9  <signal handler called>
#10 0x00007f617b0e578a in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7ffeb101de10, rem=0x7ffeb101de10) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#11 0x00007f617b0ea677 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#12 0x000055e01f68e99f in std::this_thread::sleep_for<long, std::ratio<1l, 1l> > (__rtime=...) at /usr/include/c++/11/bits/this_thread_sleep.h:82
#13 0x000055e01f68e4a1 in main (argc=1, argv=0x7ffeb101dfa8) at sighdl.cpp:31
(gdb) thread
[Current thread is 1 (Thread 0x7f617b6ef3c0 (LWP 1663714))]
(gdb) f 5
#5  0x000055e01f68e6c4 in std::mutex::lock (this=0x55e01f691160 <lock>) at /usr/include/c++/11/bits/std_mutex.h:100
100          int __e = __gthread_mutex_lock(&_M_mutex);
(gdb) p *this
$1 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 1663714, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
      __size = "\002\000\000\000\000\000\000\000\342b\031\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}
(gdb)

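When a signal is sent to a process, Linux picks an arbitrary thread that does not have the signal blocked and runs the signal handler on it. Our sample program has only the main thread, so the main thread was chosen. Frame #9 shows the main thread being interrupted inside sleep_for() to run the signal handler. At that moment the main thread was already holding the mutex that guards the condition variable, and the handler, which needs the same mutex before it touches the condition variable, tried to lock it again. The gdb output confirms it: the mutex's __owner field is 1663714, the LWP of the very thread that is blocked in pthread_mutex_lock. The thread is waiting for a lock that it holds itself, so it deadlocks.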
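To see the same-thread relock in isolation without hanging the process, the sketch below uses an error-checking pthread mutex, which reports the second lock attempt instead of blocking. The function name relockSameThread is made up for illustration. The default mutex in the program above (__kind = 0 in the gdb output) would block forever at the same point. This only illustrates the failure mode; an error-checking mutex is still not safe to use in a signal handler.

```cpp
#include <pthread.h>
#include <cerrno>

// Hypothetical helper: locks a mutex twice from the same thread and
// returns the result of the second pthread_mutex_lock() call.
int relockSameThread() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // An error-checking mutex detects a relock by its owner.
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

    pthread_mutex_t m;
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);           // first lock: succeeds
    int rc = pthread_mutex_lock(&m);  // second lock by the owner: fails with EDEADLK

    pthread_mutex_unlock(&m);         // release the one lock we actually hold
    pthread_mutex_destroy(&m);
    return rc;
}
```

With a default mutex the second call never returns, which is the state the backtrace captured: sigHandler is the "second lock" and main is the "first lock", both on one thread.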

Mutexes must not be used in a signal handler. The mutex documentation, man pthread_mutex_lock, confirms this:

The mutex functions are not async-signal safe. What this means is that they should not be called from a signal handler. In particular, calling pthread_mutex_lock or pthread_mutex_unlock from a signal handler may deadlock the calling thread.

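This brings up the notion of async-signal-safe: a function is async-signal-safe if it can be called safely from inside a signal handler without problems such as deadlock. If a signal handler can use neither mutexes nor condition variables, how do we tell a program to exit safely? With a semaphore. The man sem_post page says: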

sem_post() is async-signal-safe: it may be safely called within a signal handler.
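As a rough sketch of that approach, the fragment below replaces the mutex and condition variable with a POSIX semaphore. The names gStopSem, onTerminate, installStopHandler and waitForStop are hypothetical, and it assumes an unnamed semaphore from <semaphore.h> suits the program.

```cpp
#include <semaphore.h>
#include <cerrno>
#include <csignal>

// Hypothetical global: posted once for each SIGTERM received.
static sem_t gStopSem;

// The handler calls only sem_post(), which is async-signal-safe.
extern "C" void onTerminate(int) {
    int savedErrno = errno;  // sem_post() may change errno; restore it for the interrupted code
    sem_post(&gStopSem);
    errno = savedErrno;
}

// Initializes the semaphore first, then installs the handler,
// so the handler can never post to an uninitialized semaphore.
bool installStopHandler() {
    if (sem_init(&gStopSem, 0, 0) != 0) {
        return false;
    }
    struct sigaction sa {};
    sa.sa_handler = onTerminate;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGTERM, &sa, nullptr) == 0;
}

// Blocks until the handler has posted; retries when sem_wait()
// is interrupted by the signal itself (EINTR).
void waitForStop() {
    while (sem_wait(&gStopSem) == -1 && errno == EINTR) {
    }
}
```

A main() would call installStopHandler() once at startup, do its initialization, and then call waitForStop(). A SIGTERM that arrives at any point, including during initialization, only increments the semaphore, so there is no lock for the handler to block on.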

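Run man signal-safety to see which other common functions are async-signal-safe on Linux. Beyond the usual Linux functions, other library functions also need care inside a signal handler. One example is glog's logging functions: part of their code acquires a mutex, which I only discovered after running into it. In short, keep a signal handler as short as possible and don't pile complex logic into it.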
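One common way to keep a handler that small, shown here as a sketch and not the only option, is to have it set a flag of type volatile std::sig_atomic_t and leave logging and cleanup to ordinary code. The names gStopRequested, onSignal and runUntilStopRequested are hypothetical.

```cpp
#include <csignal>
#include <iostream>

// Hypothetical flag: the only state the handler touches.
static volatile std::sig_atomic_t gStopRequested = 0;

// The handler does nothing but record that a stop was requested.
extern "C" void onSignal(int) {
    gStopRequested = 1;
}

// Calls step() repeatedly until the flag is set and returns how many
// times step() ran. Logging happens here, in normal context, where
// taking locks inside the logging library is harmless.
long runUntilStopRequested(void (*step)()) {
    long iterations = 0;
    while (!gStopRequested) {
        step();
        ++iterations;
    }
    std::cout << "stop requested, exiting" << std::endl;
    return iterations;
}
```

The trade-off is that the main loop must check the flag regularly; a loop that blocks indefinitely inside step() would not notice the request until step() returns.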
