grpc server causes deadlock in libc


Stefan Thöni

24 Oct 2019

2:57 p.m.

Hello Genodians

We have ported Google Protocol Buffers and Google RPC (gRPC) to Genode and would like to contribute the port to Genode World.

We encountered a problem with the port: the gRPC server deadlocks when it uses the libc poll function to poll sockets via the lwip or lxip plugin. We determined that poll calls Libc::suspend in task.cc, which in turn calls Pthreads::suspend_myself, where the deadlock occurs at myself.lock.lock();.

Note that the gRPC server uses several threads, all of which were apparently waiting or suspended when the problem occurred.

You can check out our working branch at https://github.com/trimpim/genode-world/commit/18da8122f805de0a137b778a6d83c... and execute the scenario using the run/grpc run script to reproduce the problem.

Does anyone have an idea why a deadlock might occur in this situation?

Kind regards Stefan

-- Freundliche Grüsse Stefan Thöni Chairman of the Board Senior Security Architect +41 79 610 64 95 gapfruit AG Baarerstrasse 135 6300 Zug https://gapfruit.com


Christian Helmuth

24 Oct 2019

3:14 p.m.

Hallo Stefan,

On Thu, Oct 24, 2019 at 14:57:32 CEST, Stefan Thöni wrote:

...

> We encountered a problem with the port in which grpc server deadlocks when using the poll function of libc to poll sockets via the lwip or lxip plugin. We determined that poll calls Libc::suspend in task.cc which in turn calls Pthreads::suspend_myself where the deadlock occurs at myself.lock.lock();.
>
> Note that the grpc server uses several threads which apparently are all waiting/suspended when the problem occurred.

I suspect an interplay of pthread mutexes and Libc::suspend(). In the current runtime implementation the only thread that is able to resume suspended pthreads is the main component thread. On the other hand, the current pthread-mutex implementation does not use the Libc::suspend() functionality but Genode::Lock. If the main thread now fails to grab a pthread mutex it blocks at the Genode::Lock and, thus, is unable to process incoming signals and deblock suspended threads waiting for I/O progress. In this case, some code paths retain a pthread mutex across potentially blocking operations like poll().
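The interplay described above can be illustrated with a small, portable sketch. It uses standard C++ threads as stand-ins and does not use any Genode or gRPC API: `state_mutex` plays the role of a pthread mutex, `blocking_wait` stands in for a blocking poll(), and `dispatcher` stands in for the main thread that must deliver the wake-up. All names are hypothetical. The worker releases the mutex before it blocks; if it held the mutex across `blocking_wait`, the dispatcher would block on `state_mutex` and could never signal I/O progress.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-ins, not Genode or gRPC types.
struct Shared {
    std::mutex state_mutex;        // plays the role of a pthread mutex
    int pending_requests = 0;      // protected by state_mutex

    std::mutex io_mutex;           // plays the role of suspend/resume
    std::condition_variable io_cv;
    bool io_ready = false;         // protected by io_mutex
};

// Stand-in for a blocking poll(): returns only after the dispatcher
// has reported I/O progress.
void blocking_wait(Shared &s)
{
    std::unique_lock<std::mutex> lock(s.io_mutex);
    s.io_cv.wait(lock, [&] { return s.io_ready; });
}

// Worker thread: updates shared state under the mutex, but releases the
// mutex before it blocks.
int worker(Shared &s)
{
    {
        std::lock_guard<std::mutex> guard(s.state_mutex);
        ++s.pending_requests;
    }   // state_mutex is released here, before the blocking call

    blocking_wait(s);

    std::lock_guard<std::mutex> guard(s.state_mutex);
    return s.pending_requests;
}

// Dispatcher (role of the main thread): it needs state_mutex before it
// can report I/O progress. It would block here forever if the worker
// held state_mutex across blocking_wait().
void dispatcher(Shared &s)
{
    {
        std::lock_guard<std::mutex> guard(s.state_mutex);
        const int snapshot = s.pending_requests;  // inspect shared state
        (void)snapshot;
    }
    {
        std::lock_guard<std::mutex> lock(s.io_mutex);
        s.io_ready = true;
    }
    s.io_cv.notify_all();
}

// Runs one worker and the dispatcher; returns the value seen by the worker.
int run_demo()
{
    Shared s;
    int seen = 0;
    std::thread t([&] { seen = worker(s); });
    dispatcher(s);
    t.join();
    return seen;
}
```

Calling `run_demo()` terminates because no thread holds `state_mutex` while it is blocked.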

Could you please check my suspicion by inspecting the backtrace of thread 2 (which is the main thread) in your grpc component?

Greets

-- Christian Helmuth Genode Labs https://www.genode-labs.com/ · https://genode.org/ https://twitter.com/GenodeLabs · /ˈdʒiː.nəʊd/ Genode Labs GmbH · Amtsgericht Dresden · HRB 28424 · Sitz Dresden Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth

Stefan Thöni

3:41 p.m.

Hi Christian

...

> On Thu, Oct 24, 2019 at 14:57:32 CEST, Stefan Thöni wrote:
>
> ...
>> We encountered a problem with the port in which grpc server deadlocks when using the poll function of libc to poll sockets via the lwip or lxip plugin. We determined that poll calls Libc::suspend in task.cc which in turn calls Pthreads::suspend_myself where the deadlock occurs at myself.lock.lock();.
>>
>> Note that the grpc server uses several threads which apparently are all waiting/suspended when the problem occurred.
>
> I suspect an interplay of pthread mutexes and Libc::suspend(). In the current runtime implementation the only thread that is able to resume suspended pthreads is the main component thread. On the other hand, the current pthread-mutex implementation does not use the Libc::suspend() functionality but Genode::Lock. If the main thread now fails to grab a pthread mutex it blocks at the Genode::Lock and, thus, is unable to process incoming signals and deblock suspended threads waiting for I/O progress. In this case, some code paths retain a pthread mutex across potentially blocking operations like poll().
>
> Could you please check my suspicion by inspecting the backtrace of thread 2 (which is the main thread) in your grpc component?

The main thread (thread 2) is waiting in pthread_cond_timedwait, which blocks on a Genode semaphore.

How should we patch this?

Greets Stefan

Christian Helmuth

25 Oct 2019

11:49 a.m.

Hello Stefan,

On Thu, Oct 24, 2019 at 15:41:46 CEST, Stefan Thöni wrote:

...

> The main thread 2 is waiting at pthread_cond_timedwait which blocks at a genode semaphore.
>
> How should we patch this?

The deadlock potential described in my previous mail was identified some time ago but did not occur in our use cases. We are actually working on this topic during our libc overhaul but are not finished yet. What you could do in the meantime is to identify the cause of this situation and try to solve it outside the libc.

- Why is the mutex held during a blocking call to poll()? Could this be changed?
- How big is the timeout in pthread_cond_timedwait and should the main thread wake up to resolve the lock situation? Could the timeout be reduced and accompanied by a loop?
- Could the pthread support (resp. the use of multiple threads) be disabled in the RPC library?
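The second suggestion (a reduced timeout accompanied by a loop) might look roughly like the following sketch. It is written against the standard C++ library rather than the pthread or Genode APIs, and `wait_in_slices` and all of its parameters are hypothetical names chosen for illustration. The idea is that the waiting thread never blocks longer than one short slice, and between slices it releases the mutex and gets a chance to do other work, such as handling pending I/O signals.

```cpp
#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Hypothetical helper, not part of Genode, the libc, or gRPC.
// Waits until ready() returns true or `total` has elapsed, but never
// blocks longer than `slice` at a time. Between slices the mutex is
// released and between_slices() runs.
// Returns true if ready() became true, false on timeout.
bool wait_in_slices(std::condition_variable &cv,
                    std::mutex &mutex,
                    const std::function<bool()> &ready,
                    std::chrono::milliseconds total,
                    std::chrono::milliseconds slice,
                    const std::function<void()> &between_slices)
{
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + total;

    std::unique_lock<std::mutex> guard(mutex);
    while (!ready()) {
        const auto now = clock::now();
        if (now >= deadline)
            return false;               // overall timeout expired

        // Block for at most one slice, never past the overall deadline.
        cv.wait_until(guard, std::min(deadline, now + slice));
        if (ready())
            break;

        // Give the caller a chance to make progress without the mutex.
        guard.unlock();
        between_slices();
        guard.lock();
    }
    return true;
}
```

Whether this helps in the gRPC port depends on whether the main thread can actually resolve the lock situation once it wakes up, which is the open question raised in the list above.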

Greets


users@lists.genode.org
