Moin Sid,
On Sat, May 16, 2020 at 10:10:27 CEST, Sid Hussmann wrote:
This does not yield any deadlock. After examining the problem for the past few days, I'm no longer sure whether throwing an exception has any influence on the deadlock at all.
Thanks for diving into this pit, I was just clutching at this straw but we should put this aside with your current findings.
It is quite a challenge to find the root of this problem, and a huge library like gRPC in the mix surely does not help. To eliminate any influence of gRPC, I created a scenario [1] that tries to boil the problem down to a combination of `pthread`, `lwip`, and dynamic heap allocation with `std::string`. The scenario has one `_server_thread` that listens for new TCP connections and creates a connection thread for each accepted connection. The connection thread allocates a `std::string` with 50 KiB of data, appends whatever was received, and sends the result back. After 10 incoming sessions, the threads that are done are cleaned up with `pthread_join()` from within the `_server_thread`. To stress this server, I use the same gRPC stress client [2] described in my last mail, since the data transferred shouldn't be relevant anyway.
After fewer than 100 connections this results in a deadlock, where all the threads seem to be in a consistent state:
[...]
Continuing and stopping the process over and over again yields the same backtraces for the threads. Looking at the backtraces, nothing seems unusual. A few are stopping, a few are reading, and one is writing. None of the `Genode::Lock`s point to the same address. I'm lost. Is there anything I'm missing? I'm starting to believe `lwip` is the root cause here...
I'm not sure how to interpret what you wrote above. Do you think the processing is stuck completely? Did you try ICMP pings or forced ARP requests against the lwip instance?
From the backtraces I extracted the following picture.
- The initial thread, the entrypoint and the signal-handler thread are in normal state and ready for serving incoming I/O signals to the IP stack or the libc.
- Threads 10, 11, and 12 are suspended in read(), which means no data was available for reading on the call to LWIP read. This could be expected, but should not hold for long if I understand your stress test correctly.
- Thread 8 suspends in write(), which could only happen on an insufficient-buffer condition. The question is: Does it leave this state if LWIP pbufs are freed for this operation?
To narrow down the issue further (I also suspect lwip to be the culprit), you may add another I/O-signal source to your test, for example a low-frequency periodic wakeup of the entrypoint for a diagnostic log message. If this works (as expected), the entrypoint may itself use some LWIP-related function like nonblocking send or recv, or even socket creation/destruction.
It's a pity that the most important information is hidden in the LWIP data structures like: Is there pbuf memory pressure? How long are the send queues? Are there incoming packets? Do ARP and ICMP still work?
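If the test can be built with a custom `lwipopts.h`, lwip can answer these questions itself via its statistics module (`stats_display()` from `lwip/stats.h`). A possible configuration fragment, assuming the lwip port allows overriding these options:

```c
/* lwipopts.h fragment: enable lwip's built-in statistics */
#define LWIP_STATS          1
#define LWIP_STATS_DISPLAY  1
#define MEM_STATS           1  /* heap usage */
#define MEMP_STATS          1  /* pool usage, including the pbuf pools */
#define TCP_STATS           1  /* segments sent/dropped, errors */
#define ETHARP_STATS        1  /* does ARP still work? */
#define ICMP_STATS          1  /* does ping still work? */

/* called e.g. from the periodic diagnostic wakeup: */
#include "lwip/stats.h"

void dump_lwip_stats(void)
{
	stats_display();  /* prints all enabled counters */
}
```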
Greets