Hi Christian
On Thu, Oct 24, 2019 at 14:57:32 CEST, Stefan Thöni wrote:
We encountered a problem with the port: the grpc server deadlocks when it uses the poll function of libc to poll sockets via the lwip or lxip plugin. We determined that poll calls Libc::suspend in task.cc, which in turn calls Pthreads::suspend_myself, where the deadlock occurs at myself.lock.lock();.
Note that the grpc server uses several threads, which apparently were all waiting/suspended when the problem occurred.
I suspect an interplay of pthread mutexes and Libc::suspend(). In the current runtime implementation, the only thread that is able to resume suspended pthreads is the main component thread. On the other hand, the current pthread-mutex implementation does not use the Libc::suspend() functionality but Genode::Lock. If the main thread now fails to grab a pthread mutex, it blocks at the Genode::Lock and is thus unable to process incoming signals and unblock suspended threads waiting for I/O progress. In this case, some code paths retain a pthread mutex across potentially blocking operations like poll().
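The suspected interplay can be modelled in plain standard C++ without any Genode code. The following is only a sketch under assumptions: std::mutex stands in for the Genode::Lock behind a pthread mutex, a condition variable stands in for Libc::suspend() and the resume done by the main thread, and all names (ProbeResult, simulate_suspend_with_held_mutex, app_mutex) are hypothetical. The main thread uses try_lock() so that the probe reports the would-be deadlock instead of actually hanging.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical result type for this sketch, not part of any Genode API.
struct ProbeResult {
    bool main_would_block;            // main could not take the mutex while the worker was suspended
    bool main_acquired_after_resume;  // main took the mutex once the worker had been resumed
};

ProbeResult simulate_suspend_with_held_mutex()
{
    std::mutex app_mutex;         // models a pthread mutex (assumed to block like Genode::Lock)
    std::mutex state_lock;        // protects the two flags below
    std::condition_variable cv;   // models suspend/resume; only "main" sets 'resumed'
    bool worker_suspended = false;
    bool resumed = false;

    std::thread worker([&] {
        // The worker retains the application mutex across a blocking operation.
        std::lock_guard<std::mutex> held(app_mutex);
        std::unique_lock<std::mutex> guard(state_lock);
        worker_suspended = true;
        cv.notify_all();
        // Stand-in for poll() -> Libc::suspend(): wait until main resumes us.
        cv.wait(guard, [&] { return resumed; });
    });

    ProbeResult result{false, false};

    {   // Wait until the worker is suspended while holding app_mutex.
        std::unique_lock<std::mutex> guard(state_lock);
        cv.wait(guard, [&] { return worker_suspended; });
    }

    // A real lock() here would block main forever, because main is the only
    // thread that can resume the worker. try_lock() merely detects that.
    if (app_mutex.try_lock())
        app_mutex.unlock();
    else
        result.main_would_block = true;

    {   // Main stays responsive, so it can resume the suspended worker.
        std::lock_guard<std::mutex> guard(state_lock);
        resumed = true;
    }
    cv.notify_all();
    worker.join();
    assert(!worker.joinable());

    // The worker released app_mutex on exit, so main can take it now.
    app_mutex.lock();
    result.main_acquired_after_resume = true;
    app_mutex.unlock();

    return result;
}
```

If the try_lock() were replaced by a plain lock(), the model would hang in the same way as described above: the worker waits for a resume that only main can deliver, and main waits for a mutex that only the worker can release.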
Could you please check my suspicion by inspecting the backtrace of thread 2 (which is the main thread) in your grpc component?
The main thread 2 is waiting at pthread_cond_timedwait, which blocks on a Genode semaphore.
How should we patch this?
Greets Stefan