Deadlock in combination with pthread, lwip and grpc

Sid Hussmann sid.hussmann at gapfruit.com
Wed May 6 16:14:22 CEST 2020


Dear Genodians,

during stress testing a gRPC server with parallel connections I witness a race condition where the gRPC server component becomes unresponsive, while being able to respond to Genode heartbeats. Examining the different threads using GDB after the deadlock occurs, I notice one thread waiting at `spinlock_lock` after `Lwip::File_system::write()` threw an `Insufficient_buffer` exception:

```gdb

(gdb) t 13
[Switching to thread 13 (LWP 520409)]
#0  spinlock_lock (lock_variable=lock_variable at entry=0x501cad40 <Linker::mutex()::_mutex>)
    at /home/user/genode/repos/base/include/spec/x86/cpu/atomic.h:51
51                      return tmp == cmp_val;
(gdb) bt
#0  spinlock_lock (lock_variable=lock_variable at entry=0x501cad40 <Linker::mutex()::_mutex>)
    at /home/user/genode/repos/base/include/spec/x86/cpu/atomic.h:51
#1  Genode::Cancelable_lock::lock (this=this at entry=0x501cad40 <Linker::mutex()::_mutex>, myself=...)
    at /home/user/genode/repos/base/src/lib/base/lock.cc:71
#2  0x000000005004fa02 in Genode::Mutex::acquire (this=this at entry=0x501cad40 <Linker::mutex()::_mutex>)
    at /home/user/genode/repos/base/src/lib/base/mutex.cc:27
#3  0x00000000500bb050 in Genode::Mutex::Guard::Guard (mutex=..., this=<synthetic pointer>)
    at /home/user/genode/repos/base/include/base/mutex.h:40
#4  dl_iterate_phdr (callback=callback at entry=0x500a3a40 <_Unwind_IteratePhdrCallback>, data=data at entry=0x41ffa1b0)
    at /home/user/genode/repos/base/src/lib/ldso/exception.cc:41
#5  0x00000000500a4caf in _Unwind_Find_FDE (pc=
    0xaf828f2 <Libc::Vfs_plugin::write(Libc::File_descriptor*, void const*, unsigned long)+722>, bases=bases at entry=0x41ffa418)
    at /data/genode/contrib/gcc-893d85767720fc2b3c695c56d737d2307b1f2229/src/noux-pkg/gcc/libgcc/unwind-dw2-fde-dip.c:469
#6  0x00000000500a12a3 in uw_frame_state_for (context=context at entry=0x41ffa370, fs=fs at entry=0x41ffa460)
    at /data/genode/contrib/gcc-893d85767720fc2b3c695c56d737d2307b1f2229/src/noux-pkg/gcc/libgcc/unwind-dw2.c:1257
#7  0x00000000500a2bdb in _Unwind_RaiseException (exc=0x501be8e8 <Genode::init_cxx_heap(Genode::Env&)::initial_block+7528>)
    at /data/genode/contrib/gcc-893d85767720fc2b3c695c56d737d2307b1f2229/src/noux-pkg/gcc/libgcc/unwind.inc:104
#8  0x00000000500a98c5 in __cxa_throw ()
    at /data/genode/contrib/gcc-893d85767720fc2b3c695c56d737d2307b1f2229/src/noux-pkg/gcc/libgcc/unwind-dw2-fde.h:157
#9  0x000000000ad17404 in non-virtual thunk to Lwip::File_system::write(Vfs::Vfs_handle*, char const*, unsigned long long, unsigned long long&) () from vfs_lwip.lib.so
#10 0x000000000af828f3 in Libc::Vfs_plugin::write (
    this=0xaff6dc0 <unmanaged_singleton<Libc::Kernel, 8, Genode::Env&, Genode::Allocator&>(Genode::Env&, Genode::Allocator&)::object_space+9272>, fd=0x7ffff7c586b0, buf=0x7ffff7abfd6e, count=3419)
    at /home/user/genode/repos/libports/src/lib/libc/vfs_plugin.cc:806
#11 0x000000000af711e7 in do_sendto (fd=<optimized out>, buf=0x7ffff7abfd6e, len=3419, dest_addr=<optimized out>,
    dest_addrlen=<optimized out>, flags=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:176
#12 0x000000000af72a28 in Libc::Socket_fs::Plugin::write (this=<optimized out>, fd=<optimized out>, buf=<optimized out>,
    count=<optimized out>) at /home/user/genode/repos/libports/src/lib/libc/socket_fs_plugin.cc:1099
#13 0x00000000011c9f60 in tcp_send (fd=84, msg=msg at entry=0x41ffb2f0, additional_flags=additional_flags at entry=0)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/tcp_posix.cc:958
#14 0x00000000011ca18a in tcp_flush (tcp=tcp at entry=0x7ffff7d7da80, error=error at entry=0x41fff238)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/tcp_posix.cc:1499
#15 0x00000000011ca56b in tcp_write (ep=0x7ffff7d7da80, buf=<optimized out>, cb=<optimized out>, arg=<optimized out>)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/tcp_posix.cc:1636
#16 0x000000000116c6d0 in write_action (gt=0x7ffff7e90e20)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/closure.h:110
#17 write_action_begin_locked (gt=0x7ffff7e90e20)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc:956
#18 0x00000000011b25c1 in grpc_combiner_continue_exec_ctx ()
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/combiner.cc:252
#19 0x00000000011b9818 in grpc_core::ExecCtx::Flush (this=this at entry=0x41fff3b0)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/exec_ctx.cc:156
#20 0x0000000001217b64 in grpc_core::ExecCtx::~ExecCtx (this=0x41fff3b0, __in_chrg=<optimized out>)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/iomgr/exec_ctx.h:124
#21 grpc_call_start_batch (call=<optimized out>, ops=0x41fff470, nops=3, tag=0x41fff8b0, reserved=<optimized out>)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/surface/call.cc:1973
#22 0x0000000001006eaa in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpServerSendStatus, grpc::internal::CallNoOp<4>, grpc::internal::CallNoOp<5>, grpc::internal::CallNoOp<6> >::ContinueFillOpsAfterInterception (this=0x41fff8b0)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/include/grpcpp/impl/codegen/call.h:72
#23 0x00000000010067ec in grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpServerSendStatus, grpc::internal::CallNoOp<4>, grpc::internal::CallNoOp<5>, grpc::internal::CallNoOp<6> >::FillOps (this=0x41fff8b0, call=<optimized out>)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/include/grpcpp/impl/codegen/interceptor_common.h:235
#24 0x0000000001008aa4 in grpc::internal::Call::PerformOps (ops=0x41fff8b0, this=<optimized out>)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/include/grpcpp/impl/codegen/call.h:69
#25 grpc_impl::internal::RpcMethodHandler<helloworld::Greeter::Service, helloworld::HelloRequest, helloworld::HelloReply>::RunHandler (this=this at entry=0x7ffff7f55300, param=...)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/include/grpcpp/impl/codegen/method_handler_impl.h:90
#26 0x000000000126891a in grpc_impl::Server::SyncRequest::CallData::ContinueRunAfterInterception (this=this at entry=0x7ffff7cdb650)
    at /home/user/genode/contrib/stdcxx-79ad8821269a443aab21bfc1d0e9b8913332b6d1/include/stdcxx/bits/move.h:182
#27 0x000000000126916c in grpc_impl::Server::SyncRequest::CallData::Run (resources=true, global_callbacks=...,
    this=0x7ffff7cdb650)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/cpp/server/server_cc.cc:489
#28 grpc_impl::Server::SyncRequestThreadManager::DoWork (this=<optimized out>, tag=<optimized out>, ok=<optimized out>,
    resources=<optimized out>)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/cpp/server/server_cc.cc:900
#29 0x00000000012afa66 in grpc::ThreadManager::MainWorkLoop (this=0x7ffff7f4f460)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/cpp/thread_manager/thread_manager.cc:211
#30 0x00000000012afcb0 in grpc::ThreadManager::WorkerThread::Run (this=0x7ffff7eae6e0)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/cpp/thread_manager/thread_manager.cc:43
#31 0x00000000011ae0ac in grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::operator() (__closure=0x0,
    v=<optimized out>)
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/gprpp/thd_posix.cc:140
#32 grpc_core::(anonymous namespace)::ThreadInternalsPosix::<lambda(void*)>::_FUN(void *) ()
    at /home/user/genode/contrib/protobuf_grpc-3414c16ccc9b43fa990653590ac986adf8780fce/src/lib/grpc/src/core/lib/gprpp/thd_posix.cc:145
#33 0x000000000af66391 in Libc::Pthread::Thread_object::entry (this=0x7ffff7d374f8)
    at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#34 0x0000000050096295 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#35 0x00000000500abab2 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62
(gdb)
```


At this point, I also notice many threads being stuck after a `pthread_exit` call:


```gdb
(gdb) t 10
[Switching to thread 10 (LWP 520056)]
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
29              ret            /* Return to caller.  */
(gdb) bt
#0  pseudo_end () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_syscall.S:29
#1  0x000000005004f3fd in lx_futex (val=0, op=0, uaddr=<optimized out>)
    at /home/user/genode/repos/base-linux/src/lib/syscall/linux_syscalls.h:436
#2  thread_stop_myself (myself=<optimized out>) at /home/user/genode/repos/base-linux/src/include/base/internal/lock_helper.h:62
#3  Genode::Cancelable_lock::lock (this=this at entry=0x426ffd60, myself=...)
    at /home/user/genode/repos/base/src/lib/base/lock.cc:128
#4  0x000000005004f5a3 in Genode::Cancelable_lock::lock (this=this at entry=0x426ffd60)
    at /home/user/genode/repos/base/src/lib/base/lock.cc:66
#5  0x0000000050075990 in Genode::Lock::lock (this=<optimized out>) at /home/user/genode/repos/base/include/base/lock.h:33
#6  Genode::Blockade::block (this=<optimized out>) at /home/user/genode/repos/base/include/base/blockade.h:31
#7  Genode::sleep_forever () at /home/user/genode/repos/base/src/lib/base/sleep.cc:21
#8  0x000000000af66332 in pthread_exit (value_ptr=0x0) at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:558
#9  0x000000000af663a3 in Libc::Pthread::Thread_object::entry (this=0x7ffff7d7a078)
    at /home/user/genode/repos/libports/src/lib/libc/pthread.cc:71
#10 0x0000000050096295 in Genode::Thread::_thread_start () at /home/user/genode/repos/base/include/base/output.h:54
#11 0x00000000500abab2 in thread_start () at /home/user/genode/repos/base-linux/src/lib/syscall/spec/x86_64/lx_clone.S:62
(gdb)

```


Despite our best efforts we have failed to pinpoint the problem so far. A few questions:

1. Is the spinlock really a problem here?
2. I was able to reproduce this problem using the newest version of gRPC [1] and the current `genodelabs/staging` branch. Reading [0] makes me wonder if the problem can be seen as "yet another instance where the VFS synchronization has bitten us".
3. Unlike `lwip`, `lxip` seems to give up after just one or two connections with this warning: `_wait_and_dispatch_one_io_signal called from non-entrypoint thread "pthread.5"`. May this issue be related or should we just focus on `lwip`?
4. In the GDB backtrace e.g. of `thread 13` at `#5` I notice  a path to `libgcc` that does not belong to my machine. Is this neglegtable or does it mess with my GDB output?

We just updated the `grpc` and `protobuf` libraries [1], which we will publish shortly. The problem however persists with both versions. To reproduce this issue, please checkout branch [2] and run `grpc-server-lwip` under Linux. Starting multiple client connections in parallel can be done by running `./stress_parallel_execute.sh 10.0.2.55:8899` using [3]. Note that the problem sometimes takes a few thousand of parallel connections to reproduce.

[0] https://github.com/genodelabs/genode/issues/3738#issuecomment-620496337
[1] https://github.com/trimpim/genode-world/tree/xxxx-update_protobuf_grpc_port
[2] https://github.com/sidhussmann/genode-world/tree/grpc-server-race-condition
[3] https://github.com/sidhussmann/grpc-connection-stress

Cheers,
Sid


-- 
Sid Hussmann
CTO & Founder
gapfruit AG
Baarerstrasse 135
6300 Zug - Switzerland
sid.hussmann at gapfruit.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://lists.genode.org/pipermail/users/attachments/20200506/7ae79e1b/attachment.sig>


More information about the users mailing list