Dear Genodians,
during stress testing a gRPC server with parallel connections I witness a race condition where the gRPC server component becomes unresponsive, while being able to respond to Genode heartbeats. Examining the different threads using GDB after the deadlock occurs, I notice one thread waiting at `spinlock_lock` after `Lwip::File_system::write()` threw an `Insufficient_buffer` exception:
```gdb
(gdb) t 13 [Switching to thread 13 (LWP 520409)] #0 spinlock_lock (lock_variable=lock_variable@entry=0x501cad40 Linker::mutex()::_mutex) at /home/user/genode/repos/base/include/spec/x86/cpu/atomic.h:51 51 return tmp == cmp_val; (gdb) bt #0 spinlock_lock (lock_variable=lock_variable@entry=0x501cad40 Linker::mutex()::_mutex) at /home/user/genode/repos/base/include/spec/x86/cpu/atomic.h:51 #1 Genode::Cancelable_lock::lock (this=this@entry=0x501cad40 Linker::mutex()::_mutex, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:71 #2 0x000000005004fa02 in Genode::Mutex::acquire (this=this@entry=0x501cad40 Linker::mutex()::_mutex) at /home/user/genode/repos/base/src/lib/base/mutex.cc:27 #3 0x00000000500bb050 in Genode::Mutex::Guard::Guard (mutex=..., this=<synthetic pointer>) at /home/user/genode/repos/base/include/base/mutex.h:40 #4 dl_iterate_phdr (callback=callback@entry=0x500a3a40 <_Unwind_IteratePhdrCallback>, data=data@entry=0x41ffa1b0) at /home/user/genode/repos/base/src/lib/ldso/exception.cc:41 #5 0x00000000500a4caf in _Unwind_Find_FDE (pc= 0xaf828f2 <Libc::Vfs_plugin::write(Libc::File_descriptor*, void const*, unsigned long)+722>, bases=bases@entry=0x41ffa418) at /data/genode/contrib/gcc-893d85767720fc2b3c695c56d737d2307b1f2229/src/noux-pkg/gcc/libgcc/unwind-dw2-fde-dip.c:469 #6 0x00000000500a12a3 in uw_frame_state_for (context=context@entry=0x41ffa370, fs=fs@entry=0x41ffa460) at /data/genode/contrib/gcc-893d85767720fc2b3c695c56d737d2307b1f2229/src/noux-pkg/gcc/libgcc/unwind-dw2.c:1257 #7 0x00000000500a2bdb in _Unwind_RaiseException (exc=0x501be8e8 Genode::init_cxx_heap(Genode::Env&)::initial_block+7528) at /data/genode/contrib/gcc-893d85767720fc2b3c695c56d737d2307b1f2229/src/noux-pkg/gcc/libgcc/unwind.inc:104 #8 0x00000000500a98c5 in __cxa_throw () at /data/genode/contrib/gcc-893d85767720fc2b3c695c56d737d2307b1f2229/src/noux-pkg/gcc/libgcc/unwind-dw2-fde.h:157 #9 0x000000000ad17404 in non-virtual thunk to Lwip::File_system::write(Vfs::Vfs_handle*, char const*, unsigned long long, unsigned long long&) () from vfs_lwip.lib.so #10 0x000000000af828f3 in Libc::Vfs_plugin::write ( this=0xaff6dc0 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+9272>, fd=0x7ffff7c586b0, buf=0x7ffff7abfd6e, count=3419) at /home/user/genode/repos/libports/src/lib/libc/vfs_plugin.cc:806 #11 0x000000000af711e7 in do_sendto (fd=<optimized out>, buf=0x7ffff7abfd6e, len=3419, dest_addr=<optimized out>, dest_addrlen=<optimized out>, flags=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:176 #12 0x000000000af72a28 in Libc::Socket_fs::Plugin::write (this=<optimized out>, fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:1099 #13 0x00000000011c9f60 in tcp_send (fd=84, msg=msg@entry=0x41ffb2f0, additional_flags=additional_flags@entry=0) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/tcp_posix.cc:958 #14 0x00000000011ca18a in tcp_flush (tcp=tcp@entry=0x7ffff7d7da80, error=error@entry=0x41fff238) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/tcp_posix.cc:1499 #15 0x00000000011ca56b in tcp_write (ep=0x7ffff7d7da80, buf=<optimized out>, cb=<optimized out>, arg=<optimized out>) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/tcp_posix.cc:1636 #16 0x000000000116c6d0 in write_action (gt=0x7ffff7e90e20) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/closure.h:110 #17 write_action_begin_locked (gt=0x7ffff7e90e20) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc:956 #18 0x00000000011b25c1 in grpc_combiner_continue_exec_ctx () at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/combiner.cc:252 #19 0x00000000011b9818 in grpc_core::ExecCtx::Flush (this=this@entry=0x41fff3b0) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/exec_ctx.cc:156 #20 0x0000000001217b64 in grpc_core::ExecCtx::~ExecCtx (this=0x41fff3b0, __in_chrg=<optimized out>) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/exec_ctx.h:124 #21 grpc_call_start_batch (call=<optimized out>, ops=0x41fff470, nops=3, tag=0x41fff8b0, reserved=<optimized out>) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/surface/call.cc:1973 #22 0x0000000001006eaa in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpServerSendStatus, grpc::internal::CallNoOp<4>, grpc::internal::CallNoOp<5>, grpc::internal::CallNoOp<6> >::ContinueFillOpsAfterInterception (this=0x41fff8b0) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/include/grpcpp/impl/codegen/call.h:72 #23 0x00000000010067ec in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpServerSendStatus, grpc::internal::CallNoOp<4>, grpc::internal::CallNoOp<5>, grpc::internal::CallNoOp<6> >::FillOps (this=0x41fff8b0, call=<optimized out>) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/include/grpcpp/impl/codegen/interceptor_common.h:235 #24 0x0000000001008aa4 in grpc::internal::Call::PerformOps (ops=0x41fff8b0, this=<optimized out>) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/include/grpcpp/impl/codegen/call.h:69 #25 grpc_impl::internal::RpcMethodHandler<helloworld::Greeter::Service, helloworld::HelloRequest, helloworld::HelloReply>::RunHandler (this=this@entry=0x7ffff7f55300, param=...) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/include/grpcpp/impl/codegen/method_handler_impl.h:90 #26 0x000000000126891a in grpc_impl::Server::SyncRequest::CallData::ContinueRunAfterInterception (this=this@entry=0x7ffff7cdb650) at /home/user/genode/contrib/stdcxx-79ad8821269a443aab21bfc1d0e9b8913332b6d1/include/stdcxx/bits/move.h:182 #27 0x000000000126916c in grpc_impl::Server::SyncRequest::CallData::Run (resources=true, global_callbacks=..., this=0x7ffff7cdb650) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/cpp/server/server_cc.cc:489 #28 grpc_impl::Server::SyncRequestThreadManager::DoWork (this=<optimized out>, tag=<optimized out>, ok=<optimized out>, resources=<optimized out>) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/cpp/server/server_cc.cc:900 #29 0x00000000012afa66 in grpc::ThreadManager::MainWorkLoop (this=0x7ffff7f4f460) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/cpp/thread_manager/thread_manager.cc:211 #30 0x00000000012afcb0 in grpc::ThreadManager::WorkerThread::Run (this=0x7ffff7eae6e0) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/cpp/thread_manager/thread_manager.cc:43 #31 0x00000000011ae0ac in grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::operator() (__closure=0x0, v=<optimized out>) at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/gprpp/thd_posix.cc:140 #32 grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::_FUN(void *) () at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/gprpp/thd_posix.cc:145 #33 0x000000000af66391 in Libc::Pthread::Thread_object::entry (this=0x7ffff7d374f8) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71 #34 0x0000000050096295 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54 #35 0x00000000500abab2 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62 (gdb) ```
At this point, I also notice many threads being stuck after a `pthread_exit` call:
```gdb (gdb) t 10 [Switching to thread 10 (LWP 520056)] #0 pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29 29 ret /* Return to caller. */ (gdb) bt #0 pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29 #1 0x000000005004f3fd in lx_futex (val=0, op=0, uaddr=<optimized out>) at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436 #2 thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62 #3 Genode::Cancelable_lock::lock (this=this@entry=0x426ffd60, myself=...) at /home/user/genode/repos/base/src/lib/base/lock.cc:128 #4 0x000000005004f5a3 in Genode::Cancelable_lock::lock (this=this@entry=0x426ffd60) at /home/user/genode/repos/base/src/lib/base/lock.cc:66 #5 0x0000000050075990 in Genode::Lock::lock (this=<optimized out>) at /home/user/genode/repos/base/include/base/lock.h:33 #6 Genode::Blockade::block (this=<optimized out>) at /home/user/genode/repos/base/include/base/blockade.h:31 #7 Genode::sleep_forever () at /home/user/genode/repos/base/src/lib/base/sleep.cc:21 #8 0x000000000af66332 in pthread_exit (value_ptr=0x0) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:558 #9 0x000000000af663a3 in Libc::Pthread::Thread_object::entry (this=0x7ffff7d7a078) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71 #10 0x0000000050096295 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54 #11 0x00000000500abab2 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62 (gdb)
```
Despite our best efforts we have failed to pinpoint the problem so far. A few questions:
1. Is the spinlock really a problem here? 2. I was able to reproduce this problem using the newest version of gRPC [1] and the current `genodelabs/staging` branch. Reading [0] makes me wonder if the problem can be seen as "yet another instance where the VFS synchronization has bitten us". 3. Unlike `lwip`, `lxip` seems to give up after just one or two connections with this warning: `_wait_and_dispatch_one_io_signal called from non-entrypoint thread "pthread.5"`. May this issue be related or should we just focus on `lwip`? 4. In the GDB backtrace e.g. of `thread 13` at `#5` I notice a path to `libgcc` that does not belong to my machine. Is this neglegtable or does it mess with my GDB output?
We just updated the `grpc` and `protobuf` libraries [1], which we will publish shortly. The problem however persists with both versions. To reproduce this issue, please checkout branch [2] and run `grpc-server-lwip` under Linux. Starting multiple client connections in parallel can be done by running `./stress_parallel_execute.sh 10.0.2.55:8899` using [3]. Note that the problem sometimes takes a few thousand of parallel connections to reproduce.
[0] https://github.com/genodelabs/genode/issues/3738#issuecomment-620496337 [1] https://github.com/trimpim/genode-world/tree/xxxx-update_protobuf_grpc_port [2] https://github.com/sidhussmann/genode-world/tree/grpc-server-race-condition [3] https://github.com/sidhussmann/grpc-connection-stress
Cheers, Sid