Hello,
I have repeatedly been asked about our plans to scale Genode towards hardware platforms with non-uniform memory architectures (NUMA), i.e., manycore systems. This posting is meant as a rough collection of ideas. It is neither an immediate call for action nor an actual plan. But maybe it will spawn a worthwhile discussion on the subject so that we can develop a tangible way forward together.
Until now, multi-core platforms have not received much attention in Genode because the framework has primarily been used on hardware with only a few CPU cores and the workloads carried by the framework have been relatively lightweight. This manifests in the current state of the implementation. For example, on Fiasco.OC, we use a single pager thread within core to resolve all page faults in the system, which implies costly inter-processor interrupts (IPIs) when page faults occur on CPUs remote from the pager thread. As another example of a current deficiency, several data structures within core are accessed in a serialized fashion. If threads on different CPUs need to access those data structures concurrently, those points of contention naturally become scalability bottlenecks.
Concurrent page-fault handling
------------------------------
On a multi-core system, CPU-local page-fault handling is desirable. Genode's core-internal page-fault handling could be changed relatively easily to a model where we use one page-fault handler per CPU. This way, the delivery of page-fault messages would not involve any IPIs. I think that this step is clearly beneficial. On NOVA, we already employ a scheme where each thread has a dedicated page-fault handler in core, so we have already implemented the fine-grained synchronization of core's data structures needed for that. Here, page faults caused by different threads are effectively handled in parallel. (We are not using multiple CPUs on NOVA yet, though.) Applying a similar scheme to other kernels such as Fiasco.OC would be a relatively small step.
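To make the per-CPU model a bit more concrete, here is a minimal, kernel-agnostic sketch in plain C++. The 'Fault' record and the queue-based delivery merely stand in for the kernel's fault IPC, and all names are made up for illustration. In core, each handler would be a pager thread pinned to its CPU so that fault delivery never has to cross CPUs.

  #include <thread>
  #include <mutex>
  #include <condition_variable>
  #include <queue>
  #include <vector>
  #include <cstdio>

  struct Fault { unsigned long addr; int thread_id; };

  struct Per_cpu_pager
  {
      std::mutex              mutex;
      std::condition_variable cond;
      std::queue<Fault>       pending;   /* faults raised on this CPU only */

      void submit(Fault f)
      {
          std::lock_guard<std::mutex> guard(mutex);
          pending.push(f);
          cond.notify_one();
      }

      void loop()
      {
          for (;;) {
              std::unique_lock<std::mutex> lock(mutex);
              cond.wait(lock, [&]{ return !pending.empty(); });
              Fault f = pending.front(); pending.pop();
              lock.unlock();

              if (f.addr == 0) return;   /* sentinel used to shut down the sketch */

              /* resolve the fault using only CPU-local state, no IPI involved */
              std::printf("resolved fault at 0x%lx of thread %d\n", f.addr, f.thread_id);
          }
      }
  };

  int main()
  {
      unsigned const num_cpus = 4;                 /* one pager per CPU */
      std::vector<Per_cpu_pager> pagers(num_cpus);

      std::vector<std::thread> threads;
      for (unsigned cpu = 0; cpu < num_cpus; cpu++)
          threads.emplace_back([&pagers, cpu] { pagers[cpu].loop(); });

      pagers[2].submit({ 0x1000, 7 });             /* fault raised on CPU 2 */

      for (auto &p : pagers) p.submit({ 0, 0 });   /* shut down all pagers */
      for (auto &t : threads) t.join();
  }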
However, the page-fault handlers would still operate on shared data structures such as the allocator for physical memory or the RM sessions, so those data structures still need to be synchronized. To completely localize page-fault handling and remove synchronization between different CPUs, we would need to localize the data structures as well. This is a much harder problem, which can be tackled by replicating or partitioning the data. Both approaches will ultimately increase the complexity of Genode's core and also introduce the need to feed core with platform parameters and policies. This is something that I'd like to avoid. After all, the low complexity of the Genode base system is one of its strongest points.
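As an illustration of the partitioning approach, the following sketch splits physical memory into per-CPU pools with a locked global pool as fallback, so the common allocation path never touches state shared with other CPUs. The names and the page-granular free-list representation are purely illustrative, not a proposal for core's actual allocator.

  #include <mutex>
  #include <vector>
  #include <cstdint>
  #include <cstdio>

  enum { PAGE_SIZE = 4096 };

  struct Page_pool
  {
      std::vector<std::uintptr_t> free_pages;

      void add_range(std::uintptr_t base, std::size_t num_pages)
      {
          for (std::size_t i = 0; i < num_pages; i++)
              free_pages.push_back(base + i*PAGE_SIZE);
      }

      bool alloc(std::uintptr_t &page)
      {
          if (free_pages.empty()) return false;
          page = free_pages.back();
          free_pages.pop_back();
          return true;
      }
  };

  struct Partitioned_allocator
  {
      std::vector<Page_pool> local;    /* one pool per CPU, accessed CPU-locally */
      Page_pool              global;   /* shared fallback pool */
      std::mutex             global_mutex;

      Partitioned_allocator(unsigned cpus) : local(cpus) { }

      bool alloc(unsigned cpu, std::uintptr_t &page)
      {
          /* fast path: CPU-local pool, no shared state touched */
          if (local[cpu].alloc(page)) return true;

          /* slow path: fall back to the contended global pool */
          std::lock_guard<std::mutex> guard(global_mutex);
          return global.alloc(page);
      }
  };

  int main()
  {
      Partitioned_allocator alloc(4);
      alloc.local[0].add_range(0x100000, 16);   /* CPU 0 gets 16 local pages */
      alloc.global.add_range(0x200000, 16);     /* shared reserve */

      std::uintptr_t page;
      if (alloc.alloc(0, page))
          std::printf("CPU 0 got page at 0x%lx\n", (unsigned long)page);
  }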
Challenges of NUMA
------------------
Looking a step further, at NUMA systems, we find that we also need to localize code. For example, the code and data involved in handling a page fault should always be close to the core on which the faulting thread runs. With this in mind, the system becomes even more complicated. These observations make me hesitant to extend core beyond a basic implementation of multi-threaded page-fault handling.
VCore - Virtualizing Genode's core
----------------------------------
However, I see an alternative way, which is actually pretty similar to the concept employed in Barrelfish: co-hosting multiple loosely coupled subsystems on one machine. We could try to leverage Genode's recursive nature to solve the problem atop core by introducing CPU-local branches of the Genode process tree. The basic idea would be to virtualize core's CPU, RAM, and RM services by using a new component (let's call it vcore for now). Any number of vcores can run at a time, with each vcore being responsible for a set of physical CPUs and their associated CPU-local memory resources. Each vcore is a runtime environment that can be supplied with a configuration describing the subsystem to execute, similar to Genode's init process. In addition, the configuration carries the information about the physical CPUs and memory ranges that the vcore instance should manage. For all Genode components running on top of a vcore instance, that instance looks just like core.
When a vcore instance is started, it will read its configuration to obtain the ranges of physical RAM it should manage. It will then allocate those ranges at core and map (and fault in) them locally. This step may be slow but it is done only once at the startup of vcore. So basically, the vcore instance sucks all the RAM that belongs to its resource partition out of core. With the current interface of Genode's core, this is not possible, so we need to slightly extend core to accommodate this use case. When starting its children, vcore will not hand out core's RAM and RM sessions to them but will implement those services itself. So each time a process of the subsystem performs a RAM allocation or attaches a dataspace to its RM session, the request is handled and monitored by vcore. By virtualizing the RM session, vcore can furthermore hook itself in as the page-fault handler of those processes. Hence, page faults are always handled locally by the corresponding vcore. On Genode, page-fault handling is actually implemented as a library, which in principle allows page faults to be processed outside of core (although we have never attempted to use this library outside of core so far). So this idea seems feasible to me.
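To sketch how the RAM part could look, here is a rough illustration: vcore reserves its configured physical ranges from core once at startup and afterwards serves all child allocations from this local pool. The classes and the 'reserve_from_core' call are placeholders for the yet-to-be-defined core extension; a real vcore would implement Genode's RAM- and RM-session interfaces instead.

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct Phys_range { std::uintptr_t base; std::size_t size; };

  /* placeholder for the (to-be-added) core interface for claiming raw ranges */
  static bool reserve_from_core(Phys_range const &r)
  {
      std::printf("reserve [0x%lx,0x%lx) at core\n",
                  (unsigned long)r.base, (unsigned long)(r.base + r.size));
      return true;
  }

  class Vcore_ram
  {
      private:

          std::vector<Phys_range> _pool;   /* CPU-local memory owned by this vcore */

      public:

          /* done once at vcore startup, driven by the vcore configuration */
          void init(std::vector<Phys_range> const &configured_ranges)
          {
              for (Phys_range const &r : configured_ranges)
                  if (reserve_from_core(r))
                      _pool.push_back(r);
          }

          /* child allocations are served locally, never hitting core again */
          bool alloc(std::size_t size, Phys_range &out)
          {
              for (Phys_range &r : _pool) {
                  if (r.size < size) continue;
                  out = { r.base, size };
                  r.base += size; r.size -= size;
                  return true;
              }
              return false;
          }
  };

  int main()
  {
      Vcore_ram ram;
      ram.init({ { 0x40000000, 64*1024*1024 } });   /* example: 64 MiB local range */

      Phys_range ds;
      if (ram.alloc(4096, ds))
          std::printf("child dataspace at 0x%lx\n", (unsigned long)ds.base);
  }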
With page faults handled locally within each vcore, we naturally eliminate cross-CPU talk. However, we need to consider cases where a process inside a vcore environment wants to access a dataspace provided by a process outside of its vcore, for example, when using a resource multiplexer such as the nitpicker GUI server. In this case, vcore is unable to resolve the page fault because it has not created the corresponding dataspace (it was created by nitpicker using core's RAM service). However, there is always the real core underneath all vcores, which has a complete view of the system. So vcore could forward such non-vcore-local page faults to core. Nifty, isn't it? Naturally, such non-local page faults will carry an overhead by taking a hop through vcore. But access to those non-local resources is expected to be slow (and rare) anyway. There is still one constellation that cannot be accommodated this way, which is the direct sharing of dataspaces between different vcores.
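The dispatch decision could be as simple as the following sketch: faults within dataspaces created by the vcore itself are resolved locally, everything else is forwarded to core. The names used here are illustrative only, not existing Genode interfaces.

  #include <cstdint>
  #include <map>
  #include <cstdio>

  struct Region { std::uintptr_t base; std::size_t size; };

  class Vcore_fault_handler
  {
      private:

          /* regions backed by dataspaces that this vcore created itself */
          std::map<std::uintptr_t, Region> _local_regions;

          void _resolve_locally(std::uintptr_t addr)
          {
              std::printf("local fault at 0x%lx, resolved by vcore\n", (unsigned long)addr);
          }

          void _forward_to_core(std::uintptr_t addr)
          {
              std::printf("non-local fault at 0x%lx, forwarded to core\n", (unsigned long)addr);
          }

      public:

          void add_local_region(Region r) { _local_regions[r.base] = r; }

          void handle_fault(std::uintptr_t addr)
          {
              auto it = _local_regions.upper_bound(addr);
              if (it != _local_regions.begin()) {
                  --it;
                  if (addr < it->second.base + it->second.size) {
                      _resolve_locally(addr);
                      return;
                  }
              }
              /* e.g., a dataspace provided by nitpicker outside this vcore */
              _forward_to_core(addr);
          }
  };

  int main()
  {
      Vcore_fault_handler handler;
      handler.add_local_region({ 0x10000000, 0x1000 });

      handler.handle_fault(0x10000004);   /* inside a vcore-created dataspace */
      handler.handle_fault(0x20000000);   /* outside: forwarded to core */
  }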
Is this feasible?
-----------------
Right now, the vcore idea is just a rough sketch. Admittedly, I cannot give a substantiated estimate of how successful it may be, as I lack experience in the domain of manycore systems, nor do I have access to a large NUMA machine. So I hope that one of you may step in to share actual experiences or to point out flaws in this idea.
I already see a few limitations. For example, even though the vcore idea looks like a nice solution for the problem space covered by Genode, there are problems lying outside the scope of Genode that may severely impede the system's scalability even if the concept works out as desired, namely the kernel. The kernel, too, needs to address the CPU-locality problem for its own operations, in particular IPC. How should we go about that problem?
Apart from that, I see additional challenges related to devices, such as the CPU-local handling of device interrupts or access to MMIO device resources.
However, if we go for the vcore concept, I see plenty of topics that could be pursued on top of it, for example the implementation of dynamic load balancing, dynamically changing vcore policies, or extending vcore towards power management.
What do you think? Would the vcore idea be worthwhile to explore? Those of you experienced in the field of manycore NUMA systems, do you see additional pitfalls? Or even better, does anyone have alternative ideas to explore? Also, I am very interested in ways to validate work in this domain. How can we measure our success?
Best regards
Norman