Hello Sid,
that's an impressive backtrace ;-) which raises the following questions.
1) Does thread 13 really spin in the spinlock, or does it re-enter the spinlock acquisition quite rapidly?
2) Could this be data corruption due to invalid pointers or life-time issues?
3) Does the C++ runtime support catching exceptions in Genode::Mutex while unwinding the occurrence of another exception (_Unwind_Find_FDE)?
Regarding your questions...
On Wed, May 06, 2020 at 16:14:22 CEST, Sid Hussmann wrote:
At this point, I also notice many threads being stuck after a `pthread_exit` call:
This is expected, as the exiting threads block until they are destroyed in pthread_join() by another thread. A thread cannot destroy itself on most of the supported kernels.
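As a minimal sketch of this exit/join relationship in plain POSIX terms (the names `worker` and `run_worker_and_join` are made up for illustration and are not part of Genode or gRPC): the worker calls pthread_exit(), and its resources are only reclaimed once another thread calls pthread_join() on it.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstdint>

// Hypothetical worker: terminates itself via pthread_exit() and hands back
// its argument plus one as exit value.
static void *worker(void *arg)
{
    std::intptr_t const value = reinterpret_cast<std::intptr_t>(arg);
    pthread_exit(reinterpret_cast<void *>(value + 1));
    return nullptr; // not reached
}

// Creates one worker and reaps it from the calling thread.
// Returns the worker's exit value, or -1 if create/join failed.
long run_worker_and_join(long input)
{
    pthread_t tid;
    void *arg = reinterpret_cast<void *>(static_cast<std::intptr_t>(input));

    if (pthread_create(&tid, nullptr, worker, arg) != 0)
        return -1;

    // Until this call returns, the exited worker still exists as a thread
    // that is waiting to be destroyed by someone else.
    void *result = nullptr;
    if (pthread_join(tid, &result) != 0)
        return -1;

    return static_cast<long>(reinterpret_cast<std::intptr_t>(result));
}
```

Calling `run_worker_and_join(41)` from a main thread yields 42. If nobody ever joins, the worker stays in the "stuck after pthread_exit" state visible in a backtrace.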
Despite our best efforts we have failed to pinpoint the problem so far. A few questions:
- Is the spinlock really a problem here?
The spinlock implementation should not be the problem, but the storage location of the mutex might be.
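To illustrate the kind of life-time issue I have in mind, here is a generic sketch using std::mutex (not Genode::Mutex, and not taken from gRPC; `Shared_counter` and `count_with_threads` are hypothetical names). The mutex lives in a heap object that every thread co-owns, so its storage cannot disappear while a thread may still lock it. A mutex placed on a stack frame or in an object that is freed too early would instead leave the lock operating on invalid memory.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state: the mutex is stored next to the data it guards.
struct Shared_counter
{
    std::mutex mutex;
    int        value = 0;
};

// Spawns 'num_threads' threads that each increment the counter 'increments'
// times under the mutex, then returns the final counter value.
int count_with_threads(int num_threads, int increments)
{
    auto state = std::make_shared<Shared_counter>();

    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; i++) {
        // Each thread captures its own shared_ptr copy, which keeps the
        // mutex storage alive for as long as the thread can touch it.
        threads.emplace_back([state, increments] {
            for (int j = 0; j < increments; j++) {
                std::lock_guard<std::mutex> guard(state->mutex);
                state->value++;
            }
        });
    }

    for (auto &thread : threads)
        thread.join();

    std::lock_guard<std::mutex> guard(state->mutex);
    return state->value;
}
```

For example, `count_with_threads(4, 1000)` returns 4000.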
- I was able to reproduce this problem using the newest version of gRPC [1] and the current `genodelabs/staging` branch. Reading [0] makes me wonder if the problem can be seen as "yet another instance where the VFS synchronization has bitten us".
This might be the case, and we are working towards single-threaded VFS usage in the libc. Complex, multi-threaded scenarios like yours with third-party monster libraries like gRPC are quite a challenge, which we're ready to accept after this work is done.
- Unlike `lwip`, `lxip` seems to give up after just one or two connections with this warning: `_wait_and_dispatch_one_io_signal called from non-entrypoint thread "pthread.5"`. May this issue be related or should we just focus on `lwip`?
I'm afraid this stems from [1] and hints that the execution-model assumptions of the current lxip port are violated. I have no quick fix for this.
[1] https://github.com/genodelabs/genode/blob/dd899fde29448e16c96b2860c391ddcbf2...
- In the GDB backtrace e.g. of `thread 13` at `#5` I notice a path to `libgcc` that does not belong to my machine. Is this negligible or does it mess with my GDB output?
Please see my question 3) above.
We just updated the `grpc` and `protobuf` libraries [1], which we will publish shortly. The problem however persists with both versions. To reproduce this issue, please check out branch [2] and run `grpc-server-lwip` under Linux. Starting multiple client connections in parallel can be done by running `./stress_parallel_execute.sh 10.0.2.55:8899` using [3]. Note that the problem sometimes takes a few thousand parallel connections to reproduce.
I must admit that I currently lack the capacity to look deeper into this on the practical side.
Greets