Re: Deadlock in combination with pthread, lwip and grpc

11 May 2020

Hello Sid,

that's an impressive backtrace ;-) which raises the following
questions.

1) Does thread 13 really spin in the spinlock, or does it re-enter
   the spinlock acquisition quite rapidly?
2) Could this be data corruption due to invalid pointers or lifetime
   issues?
3) Does the C++ runtime support catching exceptions in Genode::Mutex
   while unwinding the occurrence of another exception (_Unwind_Find_FDE)?

Regarding your questions...
On Wed, May 06, 2020 at 16:14:22 CEST, Sid Hussmann wrote:
...
At this point, I also notice many threads being stuck after a `pthread_exit` call:
This is expected, as the exiting threads block until they are destroyed
in pthread_join() by another thread. A thread cannot destroy itself on
most of the supported kernels.
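
As a minimal illustration in plain POSIX terms (a sketch, not Genode's
libc internals; the names `worker` and `run_and_join` are made up for
this example), the exiting thread only terminates itself, and a second
thread has to join it before its resources can be released:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical worker: stores a result and terminates via pthread_exit(). */
static void *worker(void *arg)
{
    int *result = arg;
    *result = 42;

    /* The thread ends here, but it is not reclaimed until it is joined. */
    pthread_exit(result);
    return NULL; /* not reached */
}

/* Hypothetical helper: creates the worker and joins it from the caller. */
int run_and_join(void)
{
    pthread_t thread;
    int value = 0;
    void *ret = NULL;

    if (pthread_create(&thread, NULL, worker, &value) != 0)
        return -1;

    /* The joining thread collects the exit value and lets the library
       release the terminated thread. */
    if (pthread_join(thread, &ret) != 0)
        return -1;

    return *(int *)ret;
}
```

If nobody calls pthread_join() for such a thread, it stays around in
its exited state, which matches what a debugger shows as threads
"stuck" after `pthread_exit`.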
...
Despite our best efforts we have failed to pinpoint the problem so
far. A few questions:

Is the spinlock really a problem here?

The spinlock implementation should not be the problem, but the storage
location of the mutex might be.
...

I was able to reproduce this problem using the newest version of
gRPC [1] and the current `genodelabs/staging` branch. Reading [0]
makes me wonder if the problem can be seen as "yet another instance
where the VFS synchronization has bitten us".
This might be the case, and we are working towards a single-threaded
VFS usage in libc. Complex, multi-threaded scenarios like yours with
third-party monster libraries like gRPC are quite a challenge, which
we're ready to accept after this work is done.
...

Unlike `lwip`, `lxip` seems to give up after just one or two
connections with this warning: `_wait_and_dispatch_one_io_signal
called from non-entrypoint thread "pthread.5"`. May this issue be
related or should we just focus on `lwip`?
I'm afraid this stems from [1] and hints that the execution-model
assumptions of the current lxip port are violated. I have no quick fix
for this.
[1] https://github.com/genodelabs/genode/blob/dd899fde29448e16c96b2860c391ddcbf2...
...

In the GDB backtrace, e.g. of `thread 13` at `#5`, I notice a path
to `libgcc` that does not belong to my machine. Is this negligible
or does it mess with my GDB output?
Please see my question 3) above.
...
We just updated the `grpc` and `protobuf` libraries [1], which we
will publish shortly. The problem however persists with both
versions. To reproduce this issue, please checkout branch [2] and
run `grpc-server-lwip` under Linux. Starting multiple client
connections in parallel can be done by running
`./stress_parallel_execute.sh 10.0.2.55:8899` using [3]. Note that
the problem sometimes takes a few thousand parallel connections
to reproduce.
I must admit that I currently lack the capacity to look deeper into
this on the practical side.

Greets
-- 
Christian Helmuth
Genode Labs

https://www.genode-labs.com/ · https://genode.org/
https://twitter.com/GenodeLabs · /ˈdʒiː.nəʊd/

Genode Labs GmbH · Amtsgericht Dresden · HRB 28424 · Sitz Dresden
Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth
