Deadlock in combination with pthread, lwip and grpc

Sid Hussmann sid.hussmann at gapfruit.com
Sat May 16 10:10:27 CEST 2020


Hi Christian,

thank you for your input and clarifications. Let me sum up my experience of the debugging hell of the last few days.

Regarding your questions:

> 1) Does the thread 13 really spin in the spinlock or does it reenter
>    the spinlcok aquisition quite rapidly?

You were right to question about whether thread 13 spins around the spinlock. It looks like I took a misleading snapshot. 

> 2) Could this be data curruption due to invalid pointers or life-time
>    issues?

I doubt that this is due to a gRPC bug. I created a simplified scenario [1] without gRPC where I could reproduce the/a deadlock (more information below).

> 3) Does the C++ runtime support catching exceptions in Genode::Mutex
>    while unwinding the occurence of another exception (_Unwind_Find_FDE)?

That's a good question. I tried to verify this by starting several pthreads that try to acquire the same mutex with following code snipped. 

```c++
try {
	log("waiting for mutex");
	Mutex::Guard guard(mutex());
	log("mutex aquired. sleeping.");
	usleep(2000*1000);
	log("throw exception");
	throw -1;
} catch (...) {
	log("caught exception");
}
```

This does not yield any deadlock. After examining the problem for the last days, I'm not so sure anymore if throwing of an exception has any influence on the deadlock at all.

It is quite a challenge to really find out what the root problem of this is. Surely, it does not help if there is a huge library like gRPC involved. To eliminate any influence of gRPC, I created a scenario [1] that tries to boil down the problem using a combination of `pthread`, `lwip` and dynamic heap allocation with `std::string`. The scenario has one `_server_thread` that listens for new tcp connections and creates a connection thread for each accepted connection. The connection thread then allocates a `std::string` with 50k data, appends whatever was received and is then sent back. After 10 incoming sessions the threads that are done are cleaned up with `pthread_join()` from within the `_server_thread`. To stress this server, I use the same gRPC stress client [2] described in my last mail, since the data transferred shouldn't be relevant anyways. 

After less than 100 connections this will result in a deadlock, where all the threads seem to be at a consistant state:

```gdb

(gdb) info threads 
  Id   Target Id             Frame 
* 1    LWP 76221 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  2    LWP 76227 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  3    LWP 76231 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  4    LWP 76239 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  5    LWP 76736 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  6    LWP 76742 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  7    LWP 76748 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  8    LWP 76749 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  9    LWP 76755 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  10   LWP 76761 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  11   LWP 76762 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
  12   LWP 76763 "ld.lib.so" pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
(gdb) thread apply all bt

Thread 12 (LWP 76763):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x40bfeb08, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x40bfeb08) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x000000000af520d6 in Genode::Lock::lock (this=this at entry=0x40bfeb08) at /home/user/genode/repos/base/include/base/lock.h:33
#6  0x000000000af60087 in Libc::Pthread_pool::suspend_myself (timeout_ms=<optimized out>, check=..., this=0xaff8878 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+16112>) at /home/user/genode/repos/libports/src/lib/libc/internal/pthread_pool.h:89
#7  Libc::Kernel::suspend (this=0xaff4988 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space>, check=..., timeout_ms=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/internal/kernel.h:506
#8  0x000000000af8477e in Libc::Vfs_plugin::read (this=0xaff6dc0 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+9272>, fd=<optimized out>, buf=0x40bfedd0, count=4096) at /home/user/genode/repos/libports/src/lib/libc/vfs_plugin.cc:885
#9  0x000000000af74178 in do_recvfrom (fd=<optimized out>, buf=0x40bfedd0, len=4096, src_addr=<optimized out>, src_addrlen=0x0, flags=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:756
#10 0x000000000af7424b in Libc::Socket_fs::Plugin::read (this=<optimized out>, fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:1086
#11 0x0000000001000709 in Echo_server::Server_main::connection_handler (arg=0x7ffff7f7a240) at /home/user/genode/repos/libports/src/test/tcp_echo_server/main.cc:65
#12 0x000000000af66fd1 in Libc::Pthread::Thread_object::entry (this=0x7ffff7f5dfe8) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#13 0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#14 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 11 (LWP 76762):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x40afeb08, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x40afeb08) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x000000000af520d6 in Genode::Lock::lock (this=this at entry=0x40afeb08) at /home/user/genode/repos/base/include/base/lock.h:33
#6  0x000000000af60087 in Libc::Pthread_pool::suspend_myself (timeout_ms=<optimized out>, check=..., this=0xaff8878 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+16112>) at /home/user/genode/repos/libports/src/lib/libc/internal/pthread_pool.h:89
#7  Libc::Kernel::suspend (this=0xaff4988 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space>, check=..., timeout_ms=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/internal/kernel.h:506
#8  0x000000000af8477e in Libc::Vfs_plugin::read (this=0xaff6dc0 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+9272>, fd=<optimized out>, buf=0x40afedd0, count=4096) at /home/user/genode/repos/libports/src/lib/libc/vfs_plugin.cc:885
#9  0x000000000af74178 in do_recvfrom (fd=<optimized out>, buf=0x40afedd0, len=4096, src_addr=<optimized out>, src_addrlen=0x0, flags=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:756
#10 0x000000000af7424b in Libc::Socket_fs::Plugin::read (this=<optimized out>, fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:1086
#11 0x0000000001000709 in Echo_server::Server_main::connection_handler (arg=0x7ffff7f7a1f0) at /home/user/genode/repos/libports/src/test/tcp_echo_server/main.cc:65
#12 0x000000000af66fd1 in Libc::Pthread::Thread_object::entry (this=0x7ffff7f5dde8) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#13 0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#14 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 10 (LWP 76761):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x409feb08, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x409feb08) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x000000000af520d6 in Genode::Lock::lock (this=this at entry=0x409feb08) at /home/user/genode/repos/base/include/base/lock.h:33
#6  0x000000000af60087 in Libc::Pthread_pool::suspend_myself (timeout_ms=<optimized out>, check=..., this=0xaff8878 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+16112>) at /home/user/genode/repos/libports/src/lib/libc/internal/pthread_pool.h:89
#7  Libc::Kernel::suspend (this=0xaff4988 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space>, check=..., timeout_ms=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/internal/kernel.h:506
#8  0x000000000af8477e in Libc::Vfs_plugin::read (this=0xaff6dc0 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+9272>, fd=<optimized out>, buf=0x409fedd0, count=4096) at /home/user/genode/repos/libports/src/lib/libc/vfs_plugin.cc:885
#9  0x000000000af74178 in do_recvfrom (fd=<optimized out>, buf=0x409fedd0, len=4096, src_addr=<optimized out>, src_addrlen=0x0, flags=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:756
#10 0x000000000af7424b in Libc::Socket_fs::Plugin::read (this=<optimized out>, fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:1086
#11 0x0000000001000709 in Echo_server::Server_main::connection_handler (arg=0x7ffff7f7a1b0) at /home/user/genode/repos/libports/src/test/tcp_echo_server/main.cc:65
#12 0x000000000af66fd1 in Libc::Pthread::Thread_object::entry (this=0x7ffff7f5dbd8) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#13 0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#14 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 9 (LWP 76755):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x408fecd0, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x408fecd0) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x00000000500753f0 in Genode::Lock::lock (this=<optimized out>) at /home/user/genode/repos/base/include/base/lock.h:33
#6  Genode::Blockade::block (this=<optimized out>) at /home/user/genode/repos/base/include/base/blockade.h:31
#7  Genode::sleep_forever () at /home/user/genode/repos/base/src/lib/base/sleep.cc:21
#8  0x000000000af66f72 in pthread_exit (value_ptr=value_ptr at entry=0x0) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:558
#9  0x00000000010007d7 in Echo_server::Server_main::connection_handler (arg=0x7ffff7f7a160) at /home/user/genode/repos/libports/src/test/tcp_echo_server/main.cc:91
#10 0x000000000af66fd1 in Libc::Pthread::Thread_object::entry (this=0x7ffff7f5d9d8) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#11 0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#12 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 8 (LWP 76749):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x407fe0a8, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x407fe0a8) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x000000000af520d6 in Genode::Lock::lock (this=this at entry=0x407fe0a8) at /home/user/genode/repos/base/include/base/lock.h:33
#6  0x000000000af60087 in Libc::Pthread_pool::suspend_myself (timeout_ms=<optimized out>, check=..., this=0xaff8878 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+16112>) at /home/user/genode/repos/libports/src/lib/libc/internal/pthread_pool.h:89
#7  Libc::Kernel::suspend (this=0xaff4988 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space>, check=..., timeout_ms=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/internal/kernel.h:506
#8  0x000000000af83346 in Libc::Vfs_plugin::write (this=0xaff6dc0 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+9272>, fd=0x7ffff7feaa70, buf=<optimized out>, count=50099) at /home/user/genode/repos/libports/src/lib/libc/vfs_plugin.cc:785
#9  0x000000000af71e27 in do_sendto (fd=<optimized out>, buf=0x7ffff7f2c010, len=50099, dest_addr=<optimized out>, dest_addrlen=<optimized out>, flags=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:176
#10 0x000000000af73668 in Libc::Socket_fs::Plugin::write (this=<optimized out>, fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:1099
#11 0x0000000001000b4f in Echo_server::Server_main::connection_handler (arg=0x7ffff7f7a120) at /home/user/genode/contrib/stdcxx-79ad8821269a443aab21bfc1d0e9b8913332b6d1/include/stdcxx/bits/basic_string.h:2313
#12 0x000000000af66fd1 in Libc::Pthread::Thread_object::entry (this=0x7ffff7f5d7c8) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#13 0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#14 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 7 (LWP 76748):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x406fecd0, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x406fecd0) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x00000000500753f0 in Genode::Lock::lock (this=<optimized out>) at /home/user/genode/repos/base/include/base/lock.h:33
#6  Genode::Blockade::block (this=<optimized out>) at /home/user/genode/repos/base/include/base/blockade.h:31
#7  Genode::sleep_forever () at /home/user/genode/repos/base/src/lib/base/sleep.cc:21
#8  0x000000000af66f72 in pthread_exit (value_ptr=value_ptr at entry=0x0) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:558
#9  0x00000000010007d7 in Echo_server::Server_main::connection_handler (arg=0x7ffff7f7a0d0) at /home/user/genode/repos/libports/src/test/tcp_echo_server/main.cc:91
#10 0x000000000af66fd1 in Libc::Pthread::Thread_object::entry (this=0x7ffff7f5d5c8) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#11 0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#12 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 6 (LWP 76742):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x405fecd0, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x405fecd0) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x00000000500753f0 in Genode::Lock::lock (this=<optimized out>) at /home/user/genode/repos/base/include/base/lock.h:33
#6  Genode::Blockade::block (this=<optimized out>) at /home/user/genode/repos/base/include/base/blockade.h:31
#7  Genode::sleep_forever () at /home/user/genode/repos/base/src/lib/base/sleep.cc:21
#8  0x000000000af66f72 in pthread_exit (value_ptr=value_ptr at entry=0x0) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:558
#9  0x00000000010007d7 in Echo_server::Server_main::connection_handler (arg=0x7ffff7f7a090) at /home/user/genode/repos/libports/src/test/tcp_echo_server/main.cc:91
#10 0x000000000af66fd1 in Libc::Pthread::Thread_object::entry (this=0x7ffff7f5d3b8) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#11 0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#12 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 5 (LWP 76736):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x418fecd0, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x418fecd0) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x00000000500753f0 in Genode::Lock::lock (this=<optimized out>) at /home/user/genode/repos/base/include/base/lock.h:33
#6  Genode::Blockade::block (this=<optimized out>) at /home/user/genode/repos/base/include/base/blockade.h:31
#7  Genode::sleep_forever () at /home/user/genode/repos/base/src/lib/base/sleep.cc:21
#8  0x000000000af66f72 in pthread_exit (value_ptr=value_ptr at entry=0x0) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:558
#9  0x00000000010007d7 in Echo_server::Server_main::connection_handler (arg=0x7ffff7f7a5e0) at /home/user/genode/repos/libports/src/test/tcp_echo_server/main.cc:91
#10 0x000000000af66fd1 in Libc::Pthread::Thread_object::entry (this=0x7ffff7f5ea18) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#11 0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#12 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 4 (LWP 76239):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x404ff1f8, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x404ff1f8) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x000000000af520d6 in Genode::Lock::lock (this=this at entry=0x404ff1f8) at /home/user/genode/repos/base/include/base/lock.h:33
#6  0x000000000af60087 in Libc::Pthread_pool::suspend_myself (timeout_ms=<optimized out>, check=..., this=0xaff8878 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+16112>) at /home/user/genode/repos/libports/src/lib/libc/internal/pthread_pool.h:89
#7  Libc::Kernel::suspend (this=0xaff4988 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space>, check=..., timeout_ms=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/internal/kernel.h:506
#8  0x000000000af8477e in Libc::Vfs_plugin::read (this=0xaff6dc0 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+9272>, fd=<optimized out>, buf=0x404ff640, count=511) at /home/user/genode/repos/libports/src/lib/libc/vfs_plugin.cc:885
#9  0x000000000af7a781 in Libc::Socket_fs::Context::_read_socket_path (this=0x7ffff7f567e0) at /home/user/genode/repos/os/include/os/path.h:239
#10 Libc::Socket_fs::Context::Context (handle_fd=27, proto=Libc::Socket_fs::Context::TCP, this=0x7ffff7f567e0) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:190
#11 socket_fs_accept (libc_fd=libc_fd at entry=4, addr=addr at entry=0x404ffbd0, addrlen=addrlen at entry=0x404ff95c) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:563
#12 0x000000000af7d5d6 in __sys_accept (libc_fd=libc_fd at entry=4, addr=addr at entry=0x404ffbd0, addrlen=addrlen at entry=0x404ff95c) at /home/user/genode/repos/libports/src/lib/libc/socket_operations.cc:80
#13 0x00000000010083ee in Echo_server::Server_main::start_server (envp=<optimized out>) at /home/user/genode/repos/libports/src/test/tcp_echo_server/main.cc:135
#14 0x000000000af66fd1 in Libc::Pthread::Thread_object::entry (this=0x7ffff7f7d068) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#15 0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#16 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 3 (LWP 76231):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004dee5 in lx_recvmsg (flags=0, msg=0x402ff9b0, sockfd=...) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:208
#2  Genode::ipc_call (dst=..., snd_msgbuf=..., rcv_msgbuf=...) at /home/user/genode/repos/base-linux/src/lib/base/ipc.cc:343
#3  0x0000000050091a47 in Genode::Capability<Genode::Signal_source>::_call<Genode::Signal_source::Rpc_wait_for_signal> (this=this at entry=0x501ba830 <unmanaged_singleton<Genode::Constructible<Signal_handler_thread>, 8>()::object_space+208>, args=...) at /home/user/genode/repos/base/include/base/native_capability.h:62
#4  0x0000000050091dad in Genode::Capability<Genode::Signal_source>::call<Genode::Signal_source::Rpc_wait_for_signal> (this=0x501ba830 <unmanaged_singleton<Genode::Constructible<Signal_handler_thread>, 8>()::object_space+208>) at /home/user/genode/repos/base/include/base/capability.h:163
#5  Genode::Rpc_client<Genode::Signal_source>::call<Genode::Signal_source::Rpc_wait_for_signal> (this=0x501ba828 <unmanaged_singleton<Genode::Constructible<Signal_handler_thread>, 8>()::object_space+200>) at /home/user/genode/repos/base/include/base/rpc_client.h:192
#6  Genode::Signal_source_client::wait_for_signal (this=0x501ba828 <unmanaged_singleton<Genode::Constructible<Signal_handler_thread>, 8>()::object_space+200>) at /home/user/genode/repos/base/src/include/signal_source/client.h:28
#7  Genode::Signal_receiver::dispatch_signals (signal_source=0x501ba828 <unmanaged_singleton<Genode::Constructible<Signal_handler_thread>, 8>()::object_space+200>) at /home/user/genode/repos/base/src/lib/base/signal.cc:304
#8  0x00000000500921c0 in Signal_handler_thread::entry (this=0x501ba760 <unmanaged_singleton<Genode::Constructible<Signal_handler_thread>, 8>()::object_space>) at /home/user/genode/repos/base/include/util/reconstructible.h:151
#9  0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#10 0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 2 (LWP 76227):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004fee4 in lx_epoll_wait (timeout=-1, maxevents=1, events=0x401ff8a4, epoll=...) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:262
#2  Genode::Native_thread::Epoll::poll (this=this at entry=0x401fffe8) at /home/user/genode/repos/base-linux/src/lib/base/native_thread.cc:80
#3  0x000000005004e5b9 in Genode::ipc_reply_wait (last_caller=..., exc=..., exc at entry=..., reply_msg=..., request_msg=...) at /home/user/genode/repos/base-linux/src/lib/base/ipc.cc:396
#4  0x0000000050067714 in Genode::Rpc_entrypoint::entry (this=0x500fcea8 <Genode::bootstrap_component()::startup+360>) at /home/user/genode/repos/base/src/lib/base/rpc_dispatch_loop.cc:66
#5  0x0000000050095f35 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#6  0x00000000500ab752 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62

Thread 1 (LWP 76221):
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004efdd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x400ffad0, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f183 in Genode::Cancelable_lock::lock (this=this at entry=0x400ffad0) at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x0000000050090c4a in Genode::Lock::lock (this=0x400ffad0) at /home/user/genode/repos/base/include/base/lock.h:33
#6  Genode::Blockade::block (this=0x400ffad0) at /home/user/genode/repos/base/include/base/blockade.h:31
#7  Genode::Semaphore::down (this=0x500fdaf0 <Genode::bootstrap_component()::startup+3504>, this at entry=0x400ffad0) at /home/user/genode/repos/base/include/base/semaphore.h:101
#8  Genode::Signal_receiver::block_for_signal (this=this at entry=0x500fdaf0 <Genode::bootstrap_component()::startup+3504>) at /home/user/genode/repos/base/src/lib/base/signal.cc:235
#9  0x00000000500480ca in Genode::Entrypoint::_process_incoming_signals (this=0x500fcea0 <Genode::bootstrap_component()::startup+352>) at /home/user/genode/repos/base/include/util/reconstructible.h:59
#10 0x0000000050048fcd in Genode::Entrypoint::Entrypoint (this=0x500fcea0 <Genode::bootstrap_component()::startup+352>, env=...) at /home/user/genode/repos/base/src/lib/base/entrypoint.cc:360
#11 0x000000005004061d in Genode::Startup::Startup (this=0x500fcd40 <Genode::bootstrap_component()::startup>) at /home/user/genode/repos/base/src/lib/base/component.cc:74
#12 Genode::bootstrap_component () at /home/user/genode/repos/base/src/lib/base/component.cc:254
#13 0x00000000500ab396 in _main () at /home/user/genode/repos/base/src/lib/startup/_main.cc:255
#14 0x0000000000000000 in ?? ()

```

Continuing and stopping the process over and over again results in the same backtraces of the threads. Looking at the backtraces, nothing seems unusual. A few are stopping, a few are reading and one is writing. None of the `Genode::Lock`s are pointing to the same address. I'm lost. Is there anything I'm missing? I start to believe `lwip` is the root cause here...


[1] https://github.com/sidhussmann/genode/tree/lwip-pthread-deadlock
[2] https://github.com/sidhussmann/grpc-connection-stress

Cheers,
Sid


-- 
Sid Hussmann
CTO & Founder
gapfruit AG
Baarerstrasse 135
6300 Zug - Switzerland
sid.hussmann at gapfruit.com


On 5/11/20 11:45 AM, Christian Helmuth wrote:
> Hello Sid,
> 
> that's an impressive backtrace ;-) which raises the following
> questions.
> 
> 1) Does the thread 13 really spin in the spinlock or does it reenter
>    the spinlcok aquisition quite rapidly?
> 2) Could this be data curruption due to invalid pointers or life-time
>    issues?
> 3) Does the C++ runtime support catching exceptions in Genode::Mutex
>    while unwinding the occurence of another exception (_Unwind_Find_FDE)?
> 
> Regarding your questions...
> 
> On Wed, May 06, 2020 at 16:14:22 CEST, Sid Hussmann wrote:
>> At this point, I also notice many threads being stuck after a `pthread_exit` call:
> 
> This is expected as the exiting threads block until they are destroyed
> in pthread_join() by another thread. A thread cannot destroy itself on
> most kernels supported.
> 
>> Despite our best efforts we have failed to pinpoint the problem so
>> far. A few questions:
>>
>> 1. Is the spinlock really a problem here?
> 
> The spinlock implementation should not be the problem, the storage
> location of the mutex might be.
> 
>> 2. I was able to reproduce this problem using the newest version of
>> gRPC [1] and the current `genodelabs/staging` branch. Reading [0]
>> makes me wonder if the problem can be seen as "yet another instance
>> where the VFS synchronization has bitten us".
> 
> This might be and we are working towards a single-threaded VFS usage
> in libc. Complex, multi-threaded scenarios like yours with third-party
> monster libraries like gRPC are quite a challenge we're ready to
> accept after this work is done.
> 
>> 3. Unlike `lwip`, `lxip` seems to give up after just one or two
>> connections with this warning: `_wait_and_dispatch_one_io_signal
>> called from non-entrypoint thread "pthread.5"`. May this issue be
>> related or should we just focus on `lwip`?
> 
> I'm afraid this stems from [1] and hints the execution-model
> assumptions of the current lxip port are violated. I have no quick fix
> for this.
> 
> [1] https://github.com/genodelabs/genode/blob/dd899fde29448e16c96b2860c391ddcbf2880a86/repos/dde_linux/src/lib/lxip/timer_handler.cc#L307-L322
> 
>> 4. In the GDB backtrace e.g. of `thread 13` at `#5` I notice  a path
>> to `libgcc` that does not belong to my machine. Is this neglegtable
>> or does it mess with my GDB output?
> 
> Please see my question 3) above.
> 
>> We just updated the `grpc` and `protobuf` libraries [1], which we
>> will publish shortly. The problem however persists with both
>> versions. To reproduce this issue, please checkout branch [2] and
>> run `grpc-server-lwip` under Linux. Starting multiple client
>> connections in parallel can be done by running
>> `./stress_parallel_execute.sh 10.0.2.55:8899` using [3]. Note that
>> the problem sometimes takes a few thousand of parallel connections
>> to reproduce.
> 
> I must admit that I currently lack capacities to look deeper into
> this on the practical side.
> 
> Greets
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://lists.genode.org/pipermail/users/attachments/20200516/fae06070/attachment-0001.sig>


More information about the users mailing list