Deadlock in combination with pthread, lwip and grpc

Mon May 11 11:45:28 CEST 2020

Hello Sid,

that's an impressive backtrace ;-) which raises the following
questions.

1) Does thread 13 really spin in the spinlock, or does it re-enter
   the spinlock acquisition quite rapidly?
2) Could this be data corruption due to invalid pointers or lifetime
   issues?
3) Does the C++ runtime support catching exceptions in Genode::Mutex
   while unwinding the occurrence of another exception (_Unwind_Find_FDE)?
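
To make question 3) concrete: in standard C++, throwing and catching an
exception inside a destructor is permitted while another exception is
unwinding the stack, as long as the inner exception does not escape the
destructor. The sketch below shows only this language-level behaviour.
The names `Cleanup` and `nested_exception_demo` are hypothetical, it
assumes C++17 for `std::uncaught_exceptions`, and it says nothing about
whether the runtime in question handles this case correctly.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper: its destructor throws and catches an exception
// of its own, possibly while an outer exception is already in flight.
struct Cleanup
{
    std::string &log;

    explicit Cleanup(std::string &l) : log(l) { }

    ~Cleanup()
    {
        // Non-zero while the stack is being unwound by another exception
        if (std::uncaught_exceptions() > 0)
            log += "unwinding;";

        try {
            throw std::runtime_error("inner");
        } catch (const std::runtime_error &) {
            // The inner exception is handled here and never leaves
            // the destructor
            log += "inner caught;";
        }
    }
};

// Throws an outer exception so that ~Cleanup runs during unwinding,
// then returns the recorded sequence of events.
std::string nested_exception_demo()
{
    std::string log;
    try {
        Cleanup cleanup(log);
        throw std::logic_error("outer");
    } catch (const std::logic_error &) {
        log += "outer caught";
    }
    return log;
}
```

If a small test like this behaves differently on the target, the
runtime's support for nested exception handling would be a suspect.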

Regarding your questions...

On Wed, May 06, 2020 at 16:14:22 CEST, Sid Hussmann wrote:
> At this point, I also notice many threads being stuck after a `pthread_exit` call:

This is expected, as the exiting threads block until they are destroyed
in pthread_join() by another thread. A thread cannot destroy itself on
most of the supported kernels.
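
As a minimal sketch of this pattern with the plain POSIX API: the worker
terminates itself with pthread_exit(), and it is only reaped once another
thread calls pthread_join(). The function names `worker` and
`run_and_join` are hypothetical and not taken from your scenario.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstdint>

// Worker: terminates itself via pthread_exit() and hands back a result.
static void *worker(void *arg)
{
    intptr_t const input = reinterpret_cast<intptr_t>(arg);
    pthread_exit(reinterpret_cast<void *>(input * 2));
}

// Starts one worker and reaps it from the calling thread.
// Returns the worker's exit value, or -1 on error.
long run_and_join(long input)
{
    pthread_t tid;
    void *arg = reinterpret_cast<void *>(static_cast<intptr_t>(input));
    if (pthread_create(&tid, nullptr, worker, arg) != 0)
        return -1;

    // Until this call returns, the exited worker still exists as a
    // joinable thread and its resources are not released.
    void *result = nullptr;
    if (pthread_join(tid, &result) != 0)
        return -1;

    return static_cast<long>(reinterpret_cast<intptr_t>(result));
}
```

Threads that show up after `pthread_exit` in a backtrace are therefore
not suspicious by themselves, as long as someone eventually joins them.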

> Despite our best efforts we have failed to pinpoint the problem so
> far. A few questions:
> 
> 1. Is the spinlock really a problem here?

The spinlock implementation should not be the problem, but the storage
location of the mutex might be.
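
To illustrate what I mean by storage location: a mutex that lives in
memory that is freed or reused while another thread still locks it leads
to exactly this kind of confusing behaviour. The sketch below uses
std::mutex and std::shared_ptr as stand-ins (an assumption, your code
may well use other primitives) to tie the lifetime of the mutex to all
of its users. `Shared_state` and `count_with_shared_ownership` are
hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical state: the mutex lives next to the data it protects.
struct Shared_state
{
    std::mutex mutex;
    int        counter = 0;
};

// Each thread holds its own shared_ptr copy, so the mutex cannot be
// destroyed while any thread may still lock it.
int count_with_shared_ownership(int num_threads, int increments)
{
    auto state = std::make_shared<Shared_state>();

    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([state, increments] {
            for (int j = 0; j < increments; ++j) {
                std::lock_guard<std::mutex> guard(state->mutex);
                ++state->counter;
            }
        });
    }

    for (auto &t : threads)
        t.join();

    std::lock_guard<std::mutex> guard(state->mutex);
    return state->counter;
}
```

Checking where the mutex in the backtrace is allocated and who may free
that memory would be the first thing I'd look at.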

> 2. I was able to reproduce this problem using the newest version of
> gRPC [1] and the current `genodelabs/staging` branch. Reading [0]
> makes me wonder if the problem can be seen as "yet another instance
> where the VFS synchronization has bitten us".

This might be the case, and we are working towards single-threaded VFS
usage in libc. Complex, multi-threaded scenarios like yours with
third-party monster libraries like gRPC are quite a challenge, one we're
ready to accept after this work is done.

> 3. Unlike `lwip`, `lxip` seems to give up after just one or two
> connections with this warning: `_wait_and_dispatch_one_io_signal
> called from non-entrypoint thread "pthread.5"`. May this issue be
> related or should we just focus on `lwip`?

I'm afraid this stems from [1] and hints that the execution-model
assumptions of the current lxip port are violated. I have no quick fix
for this.

[1] https://github.com/genodelabs/genode/blob/dd899fde29448e16c96b2860c391ddcbf2880a86/repos/dde_linux/src/lib/lxip/timer_handler.cc#L307-L322

> 4. In the GDB backtrace e.g. of `thread 13` at `#5` I notice a path
> to `libgcc` that does not belong to my machine. Is this negligible
> or does it mess with my GDB output?

Please see my question 3) above.

> We just updated the `grpc` and `protobuf` libraries [1], which we
> will publish shortly. The problem however persists with both
> versions. To reproduce this issue, please checkout branch [2] and
> run `grpc-server-lwip` under Linux. Starting multiple client
> connections in parallel can be done by running
> `./stress_parallel_execute.sh 10.0.2.55:8899` using [3]. Note that
> the problem sometimes takes a few thousand parallel connections
> to reproduce.

I must admit that I currently lack the capacity to look deeper into
this on the practical side.

Greets
-- 
Christian Helmuth
Genode Labs
