Hi Christian,
thank you for your feedback. I focused my debugging more towards lwIP this time.
On 5/18/20 10:20 AM, Christian Helmuth wrote:
- The initial thread, the entrypoint and the signal-handler thread are in normal state and ready for serving incoming I/O signals to the IP stack or the libc.
I verified that the entrypoint thread is still running by printing out log messages every 10 seconds. More on that below.
Thread 10, 11, and 12 are suspended in read(), which means there was no data available for reading on the call to LWIP read. This could be expected, but should not hold for long if I understand your stress test correctly.
Thread 8 suspends in write(), which could only happen on an insufficient-buffer condition. The question is: Does it leave this state if LWIP pbufs are freed for this operation?
To narrow down the issue further (and I also suspect lwIP to be the culprit), you may add another I/O-signal source to your test, for example a low-frequency periodic wakeup of the entrypoint for a diagnostic log message.
Good point. I played around with different memory settings in `lwipopts.h` such as `MEM_SIZE`, `MEM(P)_OVERFLOW_CHECK`, `MEM(P)_SANITY_CHECK` or the `*MBOX_SIZE` defines. I also increased the memory for `test-tcp_echo_server` in the run script. None of this had any effect on how many connections can be handled. I'm not sure whether I'm chasing a different problem now than in my first mail, but here memory overflow does not seem to be the issue...
If this works (as expected) the entrypoint may itself use some LWIP related function like nonblocking send or recv or even socket creation/destruction.
It's a pity that the most important information is hidden in the LWIP data structures like: Is there pbuf memory pressure? How long are the send queues? Are there incoming packets? Do ARP and ICMP still work?
Also a good point. To be able to talk to the actual lwIP stack, I removed the `nic_router` from `echo_server_lwip_tcp.run` in [1]. After the "deadlock" occurs, the lwIP stack responds to neither ARP nor ICMP. However, I noticed in Wireshark [0] that lwIP still tries to retransmit a packet a couple of times, even though the client has closed the connection with a RST (Port 8899 -> 38868). I then implemented `Lwip::tcp_poll_callback()` in `lwip/vfs.cc` [2], which gets called to notify the application (in this case vfs) when a connection is idle. After the "deadlock", the `tcp_poll_callback` function keeps getting called indefinitely. So the lwIP stack does not seem to be completely stuck. However, the RST from the client has not been handled at all.
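To make clear what I mean by the poll hook, here is a minimal sketch of a diagnostic callback with the same shape as lwIP's `tcp_poll_fn`. It is not the code from [2]: the `tcp_pcb` and `err_t` definitions are reduced stand-ins so the snippet builds without the lwIP headers, and `Poll_stats` and `diag_poll_callback` are hypothetical names of mine.

```cpp
#include <cstdint>
#include <cstdio>

// Stand-ins so the sketch builds without lwIP. In the real port these come
// from lwip/err.h and lwip/tcp.h, and struct tcp_pcb has many more fields.
typedef int8_t err_t;
static const err_t ERR_OK = 0;
struct tcp_pcb { uint16_t remote_port; uint16_t snd_queuelen; };

// Hypothetical per-connection diagnostics, handed to the callback as 'arg'.
struct Poll_stats
{
	unsigned calls         = 0;
	unsigned last_queuelen = 0;
};

// Same shape as lwIP's tcp_poll_fn: err_t (*)(void *arg, struct tcp_pcb *tpcb)
static err_t diag_poll_callback(void *arg, tcp_pcb *pcb)
{
	Poll_stats &stats = *static_cast<Poll_stats *>(arg);

	// Remember how often we were polled and how many pbufs were queued.
	stats.calls++;
	stats.last_queuelen = pcb->snd_queuelen;

	std::printf("poll #%u: remote port %u, %u pbufs queued for sending\n",
	            stats.calls, (unsigned)pcb->remote_port, stats.last_queuelen);

	return ERR_OK;
}
```

With the real headers, I would expect this to be registered via `tcp_arg()` and `tcp_poll()` on the pcb, with the poll interval given in coarse TCP timer ticks. A send queue length that never shrinks across polls would hint at segments that are never acknowledged or freed.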
1. I'm trying to figure out how a pthread maps to data being sent via lwIP. E.g. when calling `write()` from a pthread is it the same thread in which lwIP is sending the data? If not, is it safe to ignore the hints from [4]?
2. In `Lwip::tcp_sent_callback()` [2] we ignore the length argument completely. Is this intended and handled with the `process_io` functionality?
3. I didn't find an easy way to find out more about queues of lwIP. With my current findings, does it still make sense to iterate through all the linked lists of lwIP?
4. Since lwIP still handles callbacks and retransmits packets after the deadlock, the problem seems to be `vfs` related. Would you agree?
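Regarding question 3, this is roughly what I have in mind for walking the lists. It is only a sketch under assumptions: the structs are reduced stand-ins that mirror the `next`, `unsent`, `unacked` and `len` fields as I understand them from lwIP 2.1, and `Queue_summary`, `count_segments` and `summarize` are hypothetical helpers, not lwIP API. In the real stack the list head would be `tcp_active_pcbs`, and the walk would have to run in the lwIP context because of the threading rules in [4].

```cpp
#include <cstddef>
#include <cstdint>

// Reduced stand-ins for the lwIP structs (the real ones live in
// lwip/tcp.h and lwip/priv/tcp_priv.h and carry many more fields).
struct tcp_seg { tcp_seg *next; uint16_t len; };
struct tcp_pcb { tcp_pcb *next; tcp_seg *unsent; tcp_seg *unacked; };

// Hypothetical aggregate over all connections in a pcb list.
struct Queue_summary
{
	size_t pcbs          = 0;
	size_t unsent_segs   = 0;
	size_t unsent_bytes  = 0;
	size_t unacked_segs  = 0;
	size_t unacked_bytes = 0;
};

// Walk one segment queue and add its segment count and payload length.
static void count_segments(tcp_seg const *seg, size_t &segs, size_t &bytes)
{
	for (; seg; seg = seg->next) {
		segs++;
		bytes += seg->len;
	}
}

// Walk the pcb list and accumulate both queues of every connection.
static Queue_summary summarize(tcp_pcb const *head)
{
	Queue_summary sum;
	for (tcp_pcb const *pcb = head; pcb; pcb = pcb->next) {
		sum.pcbs++;
		count_segments(pcb->unsent,  sum.unsent_segs,  sum.unsent_bytes);
		count_segments(pcb->unacked, sum.unacked_segs, sum.unacked_bytes);
	}
	return sum;
}
```

Logging such a summary from the periodic entrypoint wakeup would at least show whether the queues still change after the "deadlock".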
[0] https://cloud.gapfruit.com/s/BeGEZjw9qFMizbd
[1] https://github.com/sidhussmann/genode/tree/lwip-pthread-deadlock
[2] https://github.com/sidhussmann/genode/commit/6c924f5a33739d667a0ec7aa229ccf2...
[3] https://github.com/sidhussmann/genode/blob/6c924f5a33739d667a0ec7aa229ccf2c8...
[4] http://www.nongnu.org/lwip/2_1_x/multithreading.html
Cheers, Sid