What I found is a deadlock caused by recursive acquisition of Linker::mutex.
- If an exception is thrown in some code (e.g. code that makes a NOVA syscall; in my case this is the attach_at() RPC call), it is then processed in the caller. In particular, during processing, the gcc-injected function _Unwind_Resume produces the following call stack. Note the function dl_iterate_phdr():
#0  Linker::mutex () at /home/tor/gen/20.08/repos/base/src/lib/ldso/main.cc:68
#1  0x0000000000124997 in dl_iterate_phdr (callback=0x119e7a0 <_Unwind_IteratePhdrCallback>, data=0x403fdde0) at /home/tor/gen/20.08/repos/base/src/lib/ldso/exception.cc:41
#2  0x000000000119fa0f in _Unwind_Find_FDE (pc=0x119dc76 <_Unwind_Resume+54>, bases=bases@entry=0x403fe128) at /home/tor/gen/20.08/contrib/gcc-20345a83596fa42a25a85938329aea54bb4b2146/src/noux-pkg/gcc/libgcc/unwind-dw2-fde-dip.c:469
#3  0x000000000119bfc3 in uw_frame_state_for (context=context@entry=0x403fe080, fs=fs@entry=0x403fdec0) at /home/tor/gen/20.08/contrib/gcc-20345a83596fa42a25a85938329aea54bb4b2146/src/noux-pkg/gcc/libgcc/unwind-dw2.c:1257
#4  0x000000000119cfe0 in uw_init_context_1 (context=context@entry=0x403fe080, outer_cfa=outer_cfa@entry=0x403fe2b0, outer_ra=0x1000bcd <Genode::Region_map::attach_at(Genode::Capability<Genode::Dataspace>, unsigned long, unsigned long, long)+259>) at /home/tor/gen/20.08/contrib/gcc-20345a83596fa42a25a85938329aea54bb4b2146/src/noux-pkg/gcc/libgcc/unwind-dw2.c:1586
#5  0x000000000119dc77 in _Unwind_Resume (exc=0x1b41a8 <Genode::init_cxx_heap(Genode::Env&)::initial_block+5256>) at /home/tor/gen/20.08/contrib/gcc-20345a83596fa42a25a85938329aea54bb4b2146/src/noux-pkg/gcc/libgcc/unwind.inc:235
#6  0x0000000001000bcd in Genode::Region_map::attach_at (this=0x1304068 <vm_reg0+8648>, ds=..., local_addr=0x80000000, size=0x40000, offset=0x0) at /home/tor/gen/20.08/repos/base/include/region_map/region_map.h:127
The code of dl_iterate_phdr():

    extern "C" int dl_iterate_phdr(int (*callback) (Phdr_info *info, size_t size, void *data), void *data)
    {
        int err = 0;
        Phdr_info info;

        Mutex::Guard guard(mutex());

        for (Object *e = obj_list_head(); e; e = e->next_obj()) {

            info.addr  = e->reloc_base();
            info.name  = e->name();
            info.phdr  = e->file()->phdr.phdr;
            info.phnum = e->file()->phdr.count;

            if (verbose_exception)
                log(e->name(), " reloc ", Hex(e->reloc_base()));

            if ((err = callback(&info, sizeof(Phdr_info), data)))
                break;
        }

        return err;
    }
Note that it takes the Linker::_mutex object (lock).
Inside, it calls the callback() function, which for the main C++ code resolves to _Unwind_IteratePhdrCallback from contrib/gcc-20345a83596fa42a25a85938329aea54bb4b2146/src/noux-pkg/gcc/libgcc/unwind-dw2-fde-dip.c. That function internally calls get_fde_encoding() and get_cie_encoding(), which contain this very simple line at /home/tor/gen/20.08/contrib/gcc-20345a83596fa42a25a85938329aea54bb4b2146/src/noux-pkg/gcc/libgcc/unwind-dw2-fde.c:300:
p = aug + strlen ((const char *)aug) + 1; /* Skip the augmentation string. */
strlen() is not inlined here. In the machine code it is a call to strlen@plt, which means that strlen is assumed to live in a shared library and has to be resolved by the linker's relocation code.
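To make the strlen@plt step concrete, here is a minimal sketch of lazy binding in plain C++. It is only a toy model, not the ldso code: strlen_slot, resolve_and_call, real_strlen and call_via_slot are hypothetical names, and a real PLT/GOT slot is patched by the dynamic linker rather than by ordinary C++ code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for one jump slot: a function pointer that initially
// points at a resolver stub instead of the real function.
using Strlen_fn = std::size_t (*)(char const *);

int resolver_calls = 0;  // counts how often the toy "linker" had to run

std::size_t real_strlen(char const *s) { return std::strlen(s); }

std::size_t resolve_and_call(char const *s);

Strlen_fn strlen_slot = resolve_and_call;

// Toy version of jump-slot resolution: look up the real function, patch the
// slot, then forward the original call.
std::size_t resolve_and_call(char const *s)
{
    ++resolver_calls;
    strlen_slot = real_strlen;
    assert(strlen_slot != resolve_and_call);
    return strlen_slot(s);
}

// What a call through strlen@plt amounts to: an indirect call via the slot.
std::size_t call_via_slot(char const *s) { return strlen_slot(s); }
```

Only the first call_via_slot() goes through the resolver; later calls reach real_strlen directly. In the real linker, that first call is the one that needs the linker lock.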
To find the code, it calls jmp_slot@PLT, which in turn calls the following function from src/lib/ldso/main.cc:294:

    Elf::Addr Ld::jmp_slot(Dependency const &dep, Elf::Size index)
    {
        Mutex::Guard guard(mutex());

        if (verbose_relocation)
        …
Note that it takes the same Linker::_mutex object (lock). Voila! We have a recursive acquisition of the same linker mutex and, as a result, a deadlock during exception processing.
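The re-entrant pattern can be reproduced in isolation with a small sketch. It assumes a toy non-recursive lock (Toy_mutex, not Genode's Mutex) whose try_acquire() reports failure at the point where a blocking acquire would wait forever. All names here are hypothetical stand-ins for the functions discussed above.

```cpp
#include <atomic>
#include <cassert>

// Toy non-recursive lock (hypothetical, not Genode's Mutex).
struct Toy_mutex
{
    std::atomic<bool> held { false };

    bool try_acquire()
    {
        bool expected = false;
        return held.compare_exchange_strong(expected, true);
    }

    void release()
    {
        assert(held.load());  // releasing a lock that is not held is a bug
        held.store(false);
    }
};

Toy_mutex linker_mutex;            // stands in for the single linker lock
bool      would_deadlock = false;  // set when a blocking acquire would hang

// Stands in for Ld::jmp_slot(): needs the linker lock to resolve a symbol.
void toy_jmp_slot()
{
    if (!linker_mutex.try_acquire()) {
        would_deadlock = true;     // a blocking lock would wait here forever
        return;
    }
    linker_mutex.release();
}

// Stands in for the unwinder callback that ends up calling strlen@plt.
int toy_callback() { toy_jmp_slot(); return 0; }

// Stands in for dl_iterate_phdr(): holds the lock while running the callback.
int toy_iterate(int (*callback)())
{
    bool const acquired = linker_mutex.try_acquire();
    assert(acquired);
    int const err = callback();
    linker_mutex.release();
    return err;
}
```

Calling toy_iterate(toy_callback) sets would_deadlock, while calling toy_jmp_slot() on its own does not. The problem only appears when symbol resolution happens inside the callback.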
The key problem here is definitely the use of the linker mutex in the Genode implementation of dl_iterate_phdr().
So, the question: how to fix this? Maybe we need different mutexes for Ld::jmp_slot and for dl_iterate_phdr?
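As an illustration of the two-mutex idea only, here is a sketch using std::mutex. The names object_list_mutex, relocation_mutex, toy_jmp_slot, toy_callback and toy_iterate are hypothetical. Whether such a split is actually safe inside ldso (for example, with respect to the object list changing while a symbol is being resolved) is not something this sketch can answer.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

std::mutex object_list_mutex;  // hypothetical: guards the iteration only
std::mutex relocation_mutex;   // hypothetical: guards symbol resolution only

int resolved_symbols = 0;

// Stands in for Ld::jmp_slot(): takes only the relocation lock.
void toy_jmp_slot()
{
    std::lock_guard<std::mutex> guard(relocation_mutex);
    ++resolved_symbols;
}

// Stands in for the unwinder callback; a non-zero result stops the
// iteration, mirroring the loop in dl_iterate_phdr().
int toy_callback(int object_id)
{
    toy_jmp_slot();
    return object_id == 2;
}

// Stands in for dl_iterate_phdr(): takes only the object-list lock, so the
// nested toy_jmp_slot() call does not contend with it.
int toy_iterate(std::vector<int> const &objects, int (*callback)(int))
{
    std::lock_guard<std::mutex> guard(object_list_mutex);

    int err = 0;
    for (int id : objects)
        if ((err = callback(id)))
            break;

    return err;
}
```

With distinct locks, the nested call completes instead of blocking. The price is that the two code paths are no longer serialized against each other.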
Sincerely, Alexander
Hello Alexander,
On 10/12/20 11:30 PM, Alexander Tormasov via users wrote:
> So, the question: how to fix this? Maybe we need different mutexes for Ld::jmp_slot and for dl_iterate_phdr?
The 'strlen' function should be provided by the cxx library (repos/base/src/lib/cxx/misc.cc) at link time and therefore should not produce a jump slot (i.e., strlen@plt). So the problem here is that the jump slot is created at all. Is there a way to reproduce this easily?
Regards,
Sebastian