Re: Deadlock in combination with pthread, lwip and grpc

21 May 2020

Sali Christian,
thank you for your feedback. I focused my debugging more towards lwIP this time.
On 5/18/20 10:20 AM, Christian Helmuth wrote:
...

The initial thread, the entrypoint and the signal-handler thread are
in normal state and ready for serving incoming I/O signals to the
IP stack or the libc.

I verified that the entrypoint thread is still running by printing out log messages every 10 seconds. More on that below.
...

Thread 10, 11, and 12 are suspended in read(), which means there was
no data available for reading on the call to LWIP read. This could
be expected, but should not hold for long if I understand your
stress test correctly.

Thread 8 suspends in write(), which could only happen on an
insufficient-buffer condition. The question is: Does it leave this
state if LWIP pbufs are freed for this operation?

To narrow down the issue further (and I also suspect lwip to be the
culprit), you may want to add another I/O-signal source to your test,
for example a low-frequency periodic wakeup of the entrypoint for a
diagnostic log message.
Good point. I played around with different memory settings in `lwipopts.h` such as `MEM_SIZE`, `MEM(P)_OVERFLOW_CHECK`, `MEM(P)_SANITY_CHECK` or the `*MBOX_SIZE` defines. I also increased the memory for `test-tcp_echo_server` in the run script. None of this had any effect on how many connections can be handled. I'm not sure whether I'm chasing a different problem now than in my first mail, but here a memory overflow does not seem to be the issue...
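For readers who want to repeat this experiment, the settings mentioned above correspond to options like the following. This is only a sketch: the option names are lwIP's, but every value is an illustrative assumption and not the configuration used in the test.

```c
/* lwipopts.h fragment: all values below are illustrative assumptions */

/* Size of the lwIP heap in bytes */
#define MEM_SIZE                    (4 * 1024 * 1024)

/* Overflow detection for heap and pool allocations
 * (0 = off, 1 = check on free, 2 = check on every alloc/free, slow) */
#define MEM_OVERFLOW_CHECK          2
#define MEMP_OVERFLOW_CHECK         2

/* Consistency checks of the heap and the pool lists after each free */
#define MEM_SANITY_CHECK            1
#define MEMP_SANITY_CHECK           1

/* Mailbox depths; only relevant when lwIP runs with an OS layer (NO_SYS == 0) */
#define TCPIP_MBOX_SIZE             128
#define DEFAULT_TCP_RECVMBOX_SIZE   128
#define DEFAULT_ACCEPTMBOX_SIZE     128
```

The overflow and sanity checks slow the stack down noticeably, so they are better suited to a debugging build than to a throughput test.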
...
If this works (as expected) the entrypoint may itself use
some LWIP related function like nonblocking send or recv or even
socket creation/destruction.
It's a pity that the most important information is hidden in the LWIP
data structures like: Is there pbuf memory pressure? How long are the
send queues? Are there incoming packets? Do ARP and ICMP still work?
Also a good point. To be able to talk to the actual lwIP stack, I removed the `nic_router` from `echo_server_lwip_tcp.run` in [1]. After the "deadlock" occurs, the lwIP stack responds to neither ARP nor ICMP. However, I noticed in Wireshark [0] that lwIP still tries to retransmit a packet a couple of times, even though the client has closed the connection with a RST (port 8899 -> 38868).
I then implemented `Lwip::tcp_poll_callback()` in `lwip/vfs.cc` [2], which gets called to notify the application (in this case the vfs) when a connection is idle. After the "deadlock", `tcp_poll_callback` keeps getting called repeatedly, so the lwIP stack does not seem to be completely stuck. However, the RST from the client has not been handled at all.
1. I'm trying to figure out how a pthread maps to data being sent via lwIP. For example, when `write()` is called from a pthread, does lwIP send the data in that same thread? If not, is it safe to ignore the hints from [4]?
2. In `Lwip::tcp_sent_callback()` [2] we ignore the length argument completely. Is this intended and handled with the `process_io` functionality?
3. I didn't find an easy way to learn more about the state of lwIP's queues. Given my current findings, does it still make sense to iterate through all of lwIP's linked lists?
4. Since lwIP still handles callbacks and retransmits packets after the deadlock, the problem seems to be `vfs` related. Would you agree?
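
Regarding question 3, iterating the lists could look roughly like the sketch below. It is a self-contained model, not lwIP code: `struct seg` and `struct pcb` are simplified stand-ins for lwIP's `struct tcp_seg` and `struct tcp_pcb`, and `measure_queue()` and `dump_pcbs()` are hypothetical helper names. In a real build the walk would presumably start at lwIP's `tcp_active_pcbs` list and follow each connection's `unsent` and `unacked` queues; the exact field names should be checked against the headers of the lwIP version in use.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for lwIP's struct tcp_seg and struct tcp_pcb.
 * Only the fields needed for the walk are modelled (assumption). */
struct seg {
    struct seg *next;          /* next segment in the queue */
    unsigned    len;           /* payload length in bytes */
};

struct pcb {
    struct pcb    *next;       /* next connection in the list */
    struct seg    *unsent;     /* queued, not yet transmitted */
    struct seg    *unacked;    /* transmitted, waiting for an ACK */
    unsigned short local_port;
    unsigned short remote_port;
};

struct queue_stats {
    unsigned segs;             /* number of segments in the queue */
    unsigned bytes;            /* sum of the segment lengths */
};

/* Walk one segment queue and count its segments and bytes. */
static struct queue_stats measure_queue(const struct seg *s)
{
    struct queue_stats st = { 0, 0 };
    for (; s != NULL; s = s->next) {
        st.segs++;
        st.bytes += s->len;
    }
    return st;
}

/* Print one line per connection; returns the number of connections seen. */
static unsigned dump_pcbs(const struct pcb *list, FILE *out)
{
    unsigned n = 0;
    for (const struct pcb *p = list; p != NULL; p = p->next) {
        struct queue_stats u = measure_queue(p->unsent);
        struct queue_stats a = measure_queue(p->unacked);
        fprintf(out, "%hu -> %hu: unsent %u segs/%u bytes, "
                     "unacked %u segs/%u bytes\n",
                p->local_port, p->remote_port,
                u.segs, u.bytes, a.segs, a.bytes);
        n++;
    }
    return n;
}
```

Calling such a dump from the periodic diagnostic wakeup would show whether the send queues keep growing or whether segments stay in the unacked queue after the client's RST.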
[0] https://cloud.gapfruit.com/s/BeGEZjw9qFMizbd
[1] https://github.com/sidhussmann/genode/tree/lwip-pthread-deadlock
[2] https://github.com/sidhussmann/genode/commit/6c924f5a33739d667a0ec7aa229ccf2...
[3] https://github.com/sidhussmann/genode/blob/6c924f5a33739d667a0ec7aa229ccf2c8...
[4] http://www.nongnu.org/lwip/2_1_x/multithreading.html
Cheers,
Sid
-- 
Sid Hussmann
CTO & Founder
gapfruit AG
Baarerstrasse 135
6300 Zug - Switzerland
sid.hussmann@gapfruit.com
