Hello
In a recent project (using Genode on top of OKL4 v2.1 on an ARM platform) I ran into the problem that in one protection domain (PD) the exception Stack_alloc_failed was occasionally thrown. Tracing this back, it turned out that the cause was a Region_conflict exception thrown by the RM session client code in response to the return value given by core's RM session server. In that particular PD the main() function and the first started thread both created subthreads independently of each other. The exception occurred because an ATTACH command for the same thread stack address area was sent to core twice.

The thread stack allocation is done by the constructor Thread_base::Thread_base() in its 2nd initializer element, _context(_alloc_context(stack_size)). The method Thread_base::_alloc_context() handles the address assignment of a thread stack locally within a PD, which is done by the call _context_allocator()->alloc(this). Beginning at virtual address 0x4000.0000, a segment of 1 MB is reserved for each new stack, and the stack area of the requested size is allocated at the end of that segment. To find out whether a stack area is already in use, the so-called Context_allocator provides the method Thread_base::Context_allocator::_is_in_use().

The thread's context is maintained in the structure Thread_base::Context, which is written at the top of each allocated stack. The class Thread_base has the member _context, which points to this data. The Context_allocator itself is a static object instantiated on the first call to the function _context_allocator(). It maintains a linked list (member _threads) of all created threads. The method _is_in_use() walks through the thread list and checks whether any of the existing threads points to the context address of the stack area in question (that is, whether a thread's member _context equals the context address at the top of the stack area under consideration). If so, it returns true.
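The allocation scheme described above can be modelled in a few lines. This is only a sketch for illustration, not the Genode source: the type Thread, the free functions, and the constant CONTEXT_SIZE are hypothetical stand-ins, and only the base address, the 1 MB segment size, and the 256 MB area size are taken from this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Figures taken from the description; all names are hypothetical.
constexpr std::uintptr_t AREA_BASE    = 0x40000000; // start of the stack area
constexpr std::size_t    AREA_SIZE    = 0x10000000; // 256 MB
constexpr std::size_t    SEGMENT_SIZE = 0x100000;   // 1 MB per thread
constexpr std::size_t    CONTEXT_SIZE = 256;        // assumed size of the Context structure

// Stand-in for Thread_base: only the pointer to the context matters here.
struct Thread { std::uintptr_t context = 0; };      // 0 means "not yet assigned"

// Address of the Context structure at the top of the given segment.
constexpr std::uintptr_t context_addr(std::size_t segment)
{
    return AREA_BASE + (segment + 1) * SEGMENT_SIZE - CONTEXT_SIZE;
}

// Walk the thread list and check whether any thread points to 'addr'.
bool is_in_use(const std::vector<const Thread *> &threads, std::uintptr_t addr)
{
    for (const Thread *t : threads)
        if (t->context == addr)
            return true;
    return false;
}

// Find the first unused segment, register the thread, and return the context
// address (0 if the area is exhausted). As in the description, the thread is
// listed here, but thread.context is only assigned later by the caller.
std::uintptr_t alloc(std::vector<const Thread *> &threads, const Thread &thread)
{
    for (std::size_t seg = 0; seg < AREA_SIZE / SEGMENT_SIZE; ++seg) {
        const std::uintptr_t addr = context_addr(seg);
        if (!is_in_use(threads, addr)) {
            threads.push_back(&thread);
            return addr;
        }
    }
    return 0;
}
```

In this model, calling alloc() for a second thread before the first thread's context member has been assigned hands out the same address again, because is_in_use() only looks at the context members of the listed threads.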
The method Thread_base::Context_allocator::alloc() iterates through the stack segments until it finds an unused stack area. It inserts the new thread into the linked list (call _threads.insert(&thread_base->_list_element);) and returns the stack's Context address to the caller Thread_base::_alloc_context(). The problem is that at this point a decision about a new stack allocation has been made which is not yet visible in the list of threads: the new thread is already in the list, but its member _context is not yet set. The assignment happens only on the return of Thread_base::_alloc_context() to the constructor, and before that an IPC is made to core to register the new stack allocation.

On OKL4 each IPC invokes the scheduler, and what happens in the failure case is that another thread is scheduled which starts instantiating a further new thread. In this situation the method Thread_base::Context_allocator::_is_in_use() does not recognize the stack area of the previously created thread as occupied, and the caller tries to allocate the same stack area a second time. However, core detects the double allocation and returns a bad result code on the 2nd ATTACH command.

To fix the problem, a lock is required that covers the whole sequence from the beginning of the search for a free stack area up to the assignment of Thread_base::_context. The existing lock Thread_base::Context_allocator::_threads_lock, used in Thread_base::Context_allocator::alloc(), is not sufficient for this purpose. As a quick fix I inserted the following two lines at the beginning of Thread_base::_alloc_context():

    static Lock alloc_lock;
    Lock::Guard _lock_guard(alloc_lock);

However, this solution is not perfect, since the lock is released before the assignment to Thread_base::_context, which leaves a gap of a few machine instructions in which a preemption could still trigger the exception. In practice, it worked for my case.
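The complete fix, with one lock held from the search for a free stack area up to the assignment of the context, could look roughly like the following sketch. It uses std::mutex in place of Genode's Lock, and the class Context_allocator_model, the type Thread, and the constant CONTEXT_SIZE are hypothetical; where the attach request to core would be issued is only marked by a comment.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

constexpr std::uintptr_t AREA_BASE    = 0x40000000; // start of the stack area
constexpr std::size_t    AREA_SIZE    = 0x10000000; // 256 MB
constexpr std::size_t    SEGMENT_SIZE = 0x100000;   // 1 MB per thread
constexpr std::size_t    CONTEXT_SIZE = 256;        // assumed size of the Context structure

struct Thread { std::uintptr_t context = 0; };      // hypothetical stand-in for Thread_base

class Context_allocator_model
{
    std::mutex            _lock;     // stands in for Genode's Lock
    std::vector<Thread *> _threads;  // all threads with an assigned context

    bool _is_in_use(std::uintptr_t addr) const
    {
        for (const Thread *t : _threads)
            if (t->context == addr)
                return true;
        return false;
    }

public:
    // Search, context assignment, and list insertion form one critical
    // section, so no other thread can observe a half-registered allocation.
    bool alloc_and_assign(Thread &thread)
    {
        std::lock_guard<std::mutex> guard(_lock);

        for (std::size_t seg = 0; seg < AREA_SIZE / SEGMENT_SIZE; ++seg) {
            const std::uintptr_t addr =
                AREA_BASE + (seg + 1) * SEGMENT_SIZE - CONTEXT_SIZE;
            if (_is_in_use(addr))
                continue;

            // In the real code, the attach request to core would be issued
            // here, while the lock is still held.
            thread.context = addr;
            _threads.push_back(&thread);
            return true;
        }
        return false; // stack area exhausted
    }
};
```

Holding the lock across the IPC serializes thread creation within the PD, which seems an acceptable price given how rarely threads are created compared to how often they run.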
While analyzing this problem I began wondering about the overall design of the thread stack allocation. Obtaining the stack area is done by the call env_context_area_rm_session()->attach_at(ds_cap, attach_addr, ds_size) within the method Thread_base::_alloc_context(). The parameter attach_addr is not the complete stack base address (for instance 0x400f.c000), but the offset relative to the base address of the PD's stack area (for instance 0xf.c000). On the other hand, the function env_context_area_rm_session() instantiates a PD-wide RM session which attaches the whole address area of 256 MB at 0x4000.0000 for the PD. Doesn't that instruct the pager to provide memory on a page fault at any address between 0x4000.0000 and 0x4fff.ffff, thereby making the system unable to detect any stack overflow or underflow? Additionally, memory mappings are created for the stack offset addresses, which are not really used. Maybe I missed something. If so, please let me know.
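The relation between the offset passed as attach_addr and the resulting PD-local address can be checked with a little arithmetic. The helper names below are hypothetical, and the stack size of 16 KB (0x4000) is an assumption chosen because it reproduces the example values 0xf.c000 and 0x400f.c000 for the first 1 MB segment.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t AREA_BASE    = 0x40000000; // base of the PD's stack area
constexpr std::size_t    SEGMENT_SIZE = 0x100000;   // 1 MB per thread

// Offset, relative to the stack area base, at which a stack of 'ds_size'
// bytes is attached when it is placed at the end of the given segment.
constexpr std::uintptr_t attach_offset(std::size_t segment, std::size_t ds_size)
{
    return (segment + 1) * SEGMENT_SIZE - ds_size;
}

// PD-local virtual address that corresponds to an offset in the stack area.
constexpr std::uintptr_t pd_local_address(std::uintptr_t attach_addr)
{
    return AREA_BASE + attach_addr;
}

// Assumed 16 KB stack in the first segment reproduces the example values.
static_assert(attach_offset(0, 0x4000) == 0xfc000, "offset within the stack area");
static_assert(pd_local_address(0xfc000) == 0x400fc000, "address seen by the PD");
```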
Regards Frank
Hi Frank,
In that particular PD the /main()/ function and the first started thread both created subthreads independently of each other. The exception occurred because an /ATTACH/ command for the same thread stack address area was sent to /core/ twice.
thanks a lot for thoroughly analysing and describing the problem. Indeed, this is a race we need to fix. I will take your proposal as a starting point.
While analyzing this problem I began wondering about the overall design of the thread stack allocation. Obtaining the stack area is done by the call /env_context_area_rm_session()->attach_at(ds_cap, attach_addr, ds_size)/ within the method /Thread_base::_alloc_context()/. The parameter /attach_addr/ is not the complete stack base address (for instance 0x400f.c000), but the offset relative to the base address of the PD's stack area (for instance 0xf.c000). On the other hand, the function /env_context_area_rm_session()/ instantiates a PD-wide RM session which attaches the whole address area of 256 MB at 0x4000.0000 for the PD. Doesn't that instruct the pager to provide memory on a page fault at any address between 0x4000.0000 and 0x4fff.ffff, thereby making the system unable to detect any stack overflow or underflow? Additionally, memory mappings are created for the stack offset addresses, which are not really used. Maybe I missed something. If so, please let me know.
What you are seeing is the use of a managed dataspace. The complete thread context area (starting at address 0x40000000) is spanned by a single managed dataspace, which actually is another RM session (let's call it sub rm-session). A RAM dataspace (i.e., a thread context including the stack) attached at offset X inside the sub rm-session will appear at 0x40000000 + X in the PD's address space. But the empty parts of the sub rm-session are not populated with actual memory.

If a page fault occurs within a managed dataspace, core will traverse into the sub rm-session to find the actual backing-store dataspace for the fault offset within the sub rm-session. If there is no dataspace attached at the fault offset within the sub rm-session (e.g., if a stack overflows), core will print an error message and the faulting thread will be halted, which is the same behaviour as with any other unresolved page fault. So the thread context area is a sparsely populated part of the PD's address space. By using the managed dataspace, we prevent normal attachments (via env()->rm_session()) from colliding with the context area.
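To illustrate the idea, here is a toy model of such a sparsely populated sub rm-session. It is not core's implementation: the class Sub_rm_session_model and its methods are hypothetical, and regions are reduced to plain offset and size pairs.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

constexpr std::uintptr_t AREA_BASE = 0x40000000; // where the managed dataspace is attached
constexpr std::size_t    AREA_SIZE = 0x10000000; // 256 MB

class Sub_rm_session_model
{
    std::map<std::uintptr_t, std::size_t> _regions; // offset -> size of attached dataspace

public:
    // Attach a dataspace at 'offset'; returns false on a region conflict.
    bool attach_at(std::uintptr_t offset, std::size_t size)
    {
        if (size == 0 || offset >= AREA_SIZE || size > AREA_SIZE - offset)
            return false;
        for (const auto &r : _regions)
            if (offset < r.first + r.second && r.first < offset + size)
                return false;
        _regions[offset] = size;
        return true;
    }

    // True if a fault at the PD-local address 'fault_addr' can be resolved,
    // i.e., a dataspace is attached at the corresponding offset.
    bool is_backed(std::uintptr_t fault_addr) const
    {
        if (fault_addr < AREA_BASE || fault_addr - AREA_BASE >= AREA_SIZE)
            return false;
        const std::uintptr_t offset = fault_addr - AREA_BASE;

        // Find the last region that starts at or below the fault offset.
        auto it = _regions.upper_bound(offset);
        if (it == _regions.begin())
            return false;
        --it;
        return offset - it->first < it->second;
    }
};
```

With a single 16 KB stack attached at offset 0xfc000, a fault at 0x400fc000 is backed in this model, while a fault one byte below at 0x400fbfff (a stack overflow) is not. A second attach_at() for the same offset fails, which corresponds to the region conflict that core reported for the duplicate ATTACH.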
I hope this explanation clears things up a bit. Thank you again for pointing us to the context allocation problem! :-)
Best regards Norman