Hello Genodians
We have ported Google Protocol Buffers and Google RPC (gRPC) to Genode and would like to contribute the port to Genode World.
We encountered a problem with the port in which the gRPC server deadlocks when using the poll function of the libc to poll sockets via the lwip or lxip plugin. We determined that poll calls Libc::suspend in task.cc, which in turn calls Pthreads::suspend_myself, where the deadlock occurs at myself.lock.lock();.
Note that the gRPC server uses several threads, which apparently are all waiting/suspended when the problem occurs.
You can check out our working branch at https://github.com/trimpim/genode-world/commit/18da8122f805de0a137b778a6d83c... and execute the scenario using the run/grpc run script to reproduce the problem.
Does anyone have any idea why a deadlock might occur in that situation?
Kind regards Stefan
Hello Stefan,
On Thu, Oct 24, 2019 at 14:57:32 CEST, Stefan Thöni wrote:
> We encountered a problem with the port in which the gRPC server deadlocks when using the poll function of the libc to poll sockets via the lwip or lxip plugin. We determined that poll calls Libc::suspend in task.cc, which in turn calls Pthreads::suspend_myself, where the deadlock occurs at myself.lock.lock();.
> Note that the gRPC server uses several threads, which apparently are all waiting/suspended when the problem occurs.
I suspect an interplay of pthread mutexes and Libc::suspend(). In the current runtime implementation, the only thread that is able to resume suspended pthreads is the main component thread. The current pthread-mutex implementation, however, does not use the Libc::suspend() functionality but Genode::Lock. If the main thread fails to grab a pthread mutex, it blocks at the Genode::Lock and is thus unable to process incoming signals and unblock suspended threads that are waiting for I/O progress. In this case, some code paths hold a pthread mutex across potentially blocking operations like poll().
Could you please check my suspicion by inspecting the backtrace of thread 2 (which is the main thread) in your grpc component?
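To illustrate the suspected pattern, here is a minimal sketch in plain POSIX C++. It is not taken from gRPC or the Genode libc; state_mutex, watched_fd, and both function names are hypothetical. The first function holds a mutex across poll(), the second copies the guarded state and releases the mutex before blocking.

```cpp
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* hypothetical shared state, guarded by state_mutex */
static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;
static int             watched_fd  = -1;

/* Suspected pattern: the mutex stays locked while poll() may block.
 * Any other thread that needs state_mutex in the meantime blocks too. */
int poll_holding_mutex(int timeout_ms)
{
    pthread_mutex_lock(&state_mutex);
    struct pollfd pfd = { watched_fd, POLLIN, 0 };
    int const ret = poll(&pfd, 1, timeout_ms);
    pthread_mutex_unlock(&state_mutex);
    return ret;
}

/* Alternative: copy the guarded state under the lock, then block
 * in poll() without holding the mutex. */
int poll_without_mutex(int timeout_ms)
{
    pthread_mutex_lock(&state_mutex);
    struct pollfd pfd = { watched_fd, POLLIN, 0 };
    pthread_mutex_unlock(&state_mutex);
    return poll(&pfd, 1, timeout_ms);
}
```

Whether the second form is applicable depends on what the mutex protects in the actual code path, which the backtrace should reveal.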
Greets
Hi Christian
The main thread (thread 2) is waiting in pthread_cond_timedwait, which blocks on a Genode semaphore.
How should we patch this?
Greets Stefan
Hello Stefan,
On Thu, Oct 24, 2019 at 15:41:46 CEST, Stefan Thöni wrote:
> The main thread (thread 2) is waiting in pthread_cond_timedwait, which blocks on a Genode semaphore.
> How should we patch this?
The deadlock potential described in my previous mail was identified some time ago but did not occur in our use cases. We are actually working on this topic as part of our libc overhaul but are not finished yet. What you could do in the meantime is to identify the cause of this situation and try to solve it outside the libc.
- Why is the mutex held during a blocking call to poll()? Could this be changed?
- How big is the timeout in pthread_cond_timedwait, and should the main thread wake up to resolve the lock situation? Could the timeout be reduced and accompanied by a loop?
- Could the pthread support (or rather, the use of multiple threads) be disabled in the RPC library?
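The second point could be sketched roughly as follows, using only POSIX calls. Waiter, deadline_in_ms, wait_in_slices, and the between_slices hook are hypothetical names for illustration. What the main thread would have to do between slices to make progress depends on the libc runtime and is not covered here.

```cpp
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* hypothetical wait object: a flag guarded by a mutex/condition pair */
struct Waiter
{
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            ready = false;

    Waiter()
    {
        pthread_mutex_init(&mutex, nullptr);
        pthread_cond_init(&cond, nullptr);
    }
};

/* absolute CLOCK_REALTIME deadline 'ms' milliseconds from now */
static timespec deadline_in_ms(long ms)
{
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec  += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}

/* Wait for 'ready' in short slices instead of one long timeout.
 * After each expired slice, the mutex is released and the hook
 * 'between_slices' runs before the next attempt. */
template <typename FN>
bool wait_in_slices(Waiter &w, long slice_ms, int max_slices,
                    FN const &between_slices)
{
    for (int i = 0; i < max_slices; i++) {
        pthread_mutex_lock(&w.mutex);
        timespec const deadline = deadline_in_ms(slice_ms);
        int err = 0;
        /* loop guards against spurious wakeups, stops on timeout or error */
        while (!w.ready && err == 0)
            err = pthread_cond_timedwait(&w.cond, &w.mutex, &deadline);
        bool const ready = w.ready;
        pthread_mutex_unlock(&w.mutex);

        if (ready)
            return true;

        between_slices();
    }
    return false;
}
```

A caller would pass a small slice_ms and a hook that gives the blocked party a chance to run, for example `wait_in_slices(w, 10, 100, [] { /* handle pending work */ });`.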
Greets