Hello,
I have repeatedly been asked about our plans to scale Genode towards hardware platforms with non-uniform memory access (NUMA), i.e., manycore systems. This posting is meant as a rough collection of ideas. It is not an immediate call for action, nor an actual plan. But maybe it will spawn a worthwhile discussion on the subject so that we can develop a tangible way forward together.
Until now, multi-core platforms have not received much attention in Genode because the framework has primarily been used on hardware with only a few CPU cores, and the workloads carried by the framework have been relatively lightweight. This shows in the current state of the implementation. For example, on Fiasco.OC, we use a single pager thread within core to resolve all page faults in the system, which implies costly inter-processor interrupts (IPIs) when page faults occur on CPUs remote to the pager thread. As another example of current deficiencies, several data structures within core are accessed in a serialized fashion. If threads on different CPUs need to access those data structures concurrently, those points of contention naturally become scalability bottlenecks.
Concurrent page-fault handling
------------------------------
On a multi-core system, CPU-local page-fault handling is desired. Genode's core-internal page-fault handling could be changed relatively easily to a model where we use one page-fault handler per CPU. This way, the delivery of page-fault messages would not involve any IPIs. I think that this step is clearly beneficial. On NOVA, we already employ a scheme where each thread has a dedicated page-fault handler in core, so we have already implemented the fine-grained synchronization of the data structures within core that is needed for that. Here, page faults caused by different threads are effectively handled in parallel. (We are not using multiple CPUs on NOVA yet, though.) Applying a similar scheme to other kernels such as Fiasco.OC would be a relatively small step.
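To make the idea a bit more concrete, here is a rough C++ sketch (purely illustrative, not Genode code, all names invented) of one page-fault handler per CPU, each serving a CPU-local fault queue so that fault delivery never has to cross CPUs. Real fault delivery would of course come from the kernel rather than from plain queues:

  #include <condition_variable>
  #include <cstdio>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  struct Fault { unsigned long addr; };   /* faulting virtual address */

  /* hypothetical stand-in for a kernel fault endpoint bound to one CPU */
  struct Fault_queue
  {
      std::queue<Fault>       faults;
      std::mutex              mtx;
      std::condition_variable cv;

      void submit(Fault f)
      {
          { std::lock_guard<std::mutex> g(mtx); faults.push(f); }
          cv.notify_one();
      }

      Fault wait()
      {
          std::unique_lock<std::mutex> l(mtx);
          cv.wait(l, [&]{ return !faults.empty(); });
          Fault f = faults.front(); faults.pop();
          return f;
      }
  };

  /* per-CPU pager loop, resolving only faults raised on 'cpu' */
  void pager(unsigned cpu, Fault_queue &queue)
  {
      for (;;) {
          Fault f = queue.wait();
          /* here: look up the dataspace and install the mapping */
          std::printf("CPU %u: resolved fault at 0x%lx\n", cpu, f.addr);
      }
  }

  int main()
  {
      unsigned cpus = std::thread::hardware_concurrency();
      if (cpus == 0) cpus = 1;   /* hardware_concurrency may be unknown */

      std::vector<Fault_queue> queues(cpus);
      std::vector<std::thread> pagers;
      for (unsigned cpu = 0; cpu < cpus; cpu++)
          pagers.emplace_back(pager, cpu, std::ref(queues[cpu]));

      queues[0].submit(Fault{ 0x1000ul });   /* simulate one fault on CPU 0 */

      for (auto &t : pagers) t.join();       /* the pagers run forever here */
  }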
However, the page-fault handlers would still operate on shared data structures such as the allocator for physical memory or RM sessions, so synchronization of those data structures is still needed. To completely localize page-fault handling and remove synchronization between different CPUs, we would need to localize the data structures as well. This is a much harder problem, which could be tackled by replicating or partitioning the data. Both approaches would ultimately increase the complexity of Genode's core and also introduce the need to feed core with platform parameters and policies. This is something I'd like to avoid. After all, the low complexity of the Genode base system is one of its strongest points.
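Just to illustrate what the partitioning option would mean in code terms, here is a tiny, purely hypothetical C++ sketch (not Genode code) contrasting a single globally locked allocator with per-CPU allocators over pre-partitioned ranges, where the common allocation path needs no cross-CPU synchronization:

  #include <cstddef>
  #include <cstdint>
  #include <mutex>

  /* trivial bump allocator over one contiguous (physical) range */
  struct Range_allocator
  {
      std::uintptr_t next = 0, end = 0;

      void *alloc(std::size_t size)
      {
          if (next + size > end) return nullptr;
          void *result = reinterpret_cast<void *>(next);
          next += size;
          return result;
      }
  };

  /* variant A: one shared allocator, all CPUs contend on one lock */
  struct Locked_allocator
  {
      Range_allocator range;
      std::mutex      mtx;

      void *alloc(std::size_t size)
      {
          std::lock_guard<std::mutex> guard(mtx);   /* point of contention */
          return range.alloc(size);
      }
  };

  /* variant B: one allocator per CPU over a CPU-local partition,
     no lock on the fast path (only needed when rebalancing partitions) */
  struct Per_cpu_allocators
  {
      enum { MAX_CPUS = 64 };
      Range_allocator cpu_range[MAX_CPUS];

      void *alloc(unsigned cpu, std::size_t size)
      {
          return cpu_range[cpu].alloc(size);
      }
  };

  int main()
  {
      Per_cpu_allocators pa;
      pa.cpu_range[0] = Range_allocator{ 0x100000, 0x200000 };
      return pa.alloc(0, 4096) ? 0 : 1;
  }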
Challenges of NUMA
------------------
Looking a step further, at NUMA systems, we find that we also need to localize code. For example, the code and data involved in page-fault handling should always be close to the core on which the faulting thread is running. With this in mind, the system gets even more complicated. These observations make me hesitant to extend core beyond a basic implementation of multi-threaded page-fault handling.
VCore - Virtualizing Genode's core
----------------------------------
However, I see an alternative way, which is actually pretty similar to the concept employed in Barrelfish: co-hosting multiple loosely coupled subsystems on one machine. We could try to leverage Genode's recursive nature to solve the problem on top of core by introducing CPU-local branches of the Genode process tree. The basic idea would be to virtualize core's CPU, RAM, and RM services by using a new component (let's call it vcore for now). Any number of vcores can run at a time, each responsible for a set of physical CPUs and their associated CPU-local memory resources. Each vcore is a runtime environment that can be supplied with a configuration describing the subsystem to execute, similar to Genode's init process. In addition, the configuration comes with information about the physical CPUs and memory ranges that the vcore instance should manage. To the Genode components running on top of a vcore instance, that vcore looks just like core.
When a vcore instance is started, it will read its configuration to obtain the ranges of physical RAM it should manage. It will then allocate those ranges at core and map (and fault in) them locally. This step may be slow, but it is done only once at the startup of vcore. So, basically, the vcore instance sucks all the RAM that belongs to its resource partition out of core. With the current interface of Genode's core, this is not possible, so we need to slightly extend core to accommodate this use case. When starting its children, vcore will not hand out core's RAM and RM sessions to them but will implement those services itself. So each time a process of the subsystem performs a RAM allocation or attaches a dataspace to its RM session, the request will be handled and monitored by vcore. By virtualizing the RM session, vcore can furthermore hook itself in as the page-fault handler of those processes. Hence, page faults are always handled locally by the corresponding vcore. In Genode, page-fault handling is actually implemented as a library, which in principle allows page faults to be processed outside of core (although we have never attempted to use this library outside of core so far). So this idea seems feasible to me.
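To sketch the intended division of labor (purely hypothetical interfaces, not Genode's actual API; the range-based allocation at core stands for the small core extension mentioned above), a vcore would drain its configured RAM partition from core once at startup and then serve its children's RAM requests entirely from that local pool:

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  /* hypothetical handle for a chunk of RAM obtained from core */
  struct Ram_chunk { std::uintptr_t base; std::size_t size; };

  /* hypothetical stand-in for core's RAM service, extended by the
     ability to allocate from a specific physical range */
  struct Core_ram
  {
      Ram_chunk alloc_from_range(std::uintptr_t base, std::size_t size)
      {
          /* real core would reserve the range and return a dataspace */
          return Ram_chunk{ base, size };
      }
  };

  struct Vcore
  {
      Core_ram              &core;
      std::vector<Ram_chunk> pool;        /* RAM owned by this vcore  */
      std::size_t            avail = 0;   /* unassigned bytes in pool */

      /* called once at startup for each range found in the config */
      void drain_partition(std::uintptr_t base, std::size_t size)
      {
          pool.push_back(core.alloc_from_range(base, size));
          avail += size;
      }

      /* RAM service as seen by the vcore's children; the common
         path never interacts with core */
      bool alloc_for_child(std::size_t size)
      {
          if (size > avail) return false;
          avail -= size;
          return true;
      }
  };

  int main()
  {
      Core_ram core;
      Vcore    vcore { core };

      vcore.drain_partition(0x80000000, 256*1024*1024);
      return vcore.alloc_for_child(4096) ? 0 : 1;
  }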
Now, with page faults handled locally within each vcore, we naturally eliminate cross-CPU talk. However, we need to consider cases where a process inside a vcore environment wants to access a dataspace provided by a process outside of its vcore, for example, when using a resource multiplexer such as the nitpicker GUI server. In this case, vcore is unable to resolve the page fault because it has not created the corresponding dataspace (it was created by nitpicker using core's RAM service). However, there is always the real core underneath all vcores, which has a complete view of the system. So vcore could forward such non-vcore-local page faults to core. Nifty, isn't it? Naturally, such non-local page faults will carry an overhead by taking a hop through vcore. But access to those non-local resources is expected to be slow (and rare) anyway. There is still one constellation that cannot be accommodated this way: the direct sharing of dataspaces between different vcores.
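The decision logic of such a vcore pager could look roughly like the following purely illustrative C++ sketch (all types and names are hypothetical, not actual Genode interfaces):

  #include <cstdint>
  #include <set>

  using Dataspace_id = std::uint64_t;   /* hypothetical dataspace handle */

  /* the real core underneath, with the complete view of the system */
  struct Core_pager
  {
      void resolve(std::uintptr_t) { /* map via core's global knowledge */ }
  };

  /* virtualized region map of the faulting process */
  struct Region_map
  {
      Dataspace_id lookup(std::uintptr_t addr) { return addr >> 12; /* placeholder */ }
  };

  struct Vcore_pager
  {
      Region_map             &rm;      /* RM session virtualized by this vcore */
      Core_pager             &core;    /* fallback path                        */
      std::set<Dataspace_id>  local;   /* dataspaces created by this vcore     */

      void resolve_locally(std::uintptr_t) { /* CPU-local mapping */ }

      void handle_fault(std::uintptr_t addr)
      {
          Dataspace_id ds = rm.lookup(addr);

          if (local.count(ds))
              resolve_locally(addr);   /* fast path, no cross-CPU traffic */
          else
              core.resolve(addr);      /* rare, slower hop through core   */
      }
  };

  int main()
  {
      Region_map  rm;
      Core_pager  core;
      Vcore_pager pager { rm, core, { 0x1 } };

      pager.handle_fault(0x1234);   /* dataspace 0x1 -> resolved locally */
      pager.handle_fault(0x9876);   /* unknown dataspace -> forwarded    */
  }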
Is this feasible?
-----------------
Right now, the vcore idea is just a rough sketch. Admittedly, I cannot give a substantiated estimate of how successful it may be, as I lack experience in the domain of manycore systems, nor do I have access to a large NUMA machine. So I hope that some of you may step in to share actual experience or point out flaws in this idea.
I already see a few limitations. For example, even though the vcore idea looks like a nice solution for the problem space covered by Genode, there are problems lying outside the scope of Genode that may severely impede the system's scalability even if the concept works out as desired, namely in the kernel. The kernel needs to address the CPU-locality problem for its own operations, in particular IPC, too. How do we go about that problem?
Apart from that, I see additional challenges related to devices, such as the CPU-local handling of device interrupts or access to MMIO device resources.
However, if we go for the vcore concept, I see plenty of topics that could be pursued on top of it, for example dynamic load balancing, dynamically changing vcore policies, or extending vcore towards power management.
What do you think? Would the vcore idea be worthwhile to explore? Those of you experienced in the field of manycore NUMA systems, do you see additional pitfalls? Or even better, does anyone have alternative ideas to explore? Also, I am very interested in ways to validate work in this domain. How can we measure our success?
Best regards Norman
On Mon, 18 Mar 2013 11:49:40 +0100 Norman Feske (NF) wrote:
[Details snipped]
NF> What do you think? Would the vcore idea be worthwhile to explore? Those
NF> of you experienced in the field of manycore NUMA systems, do you see
NF> additional pitfalls? Or even better, does anyone have alternative ideas
NF> to explore? Also, I am very interested in ways to validate work in this
NF> domain. How can we measure our success?
There are also use cases where you don't want to partition. One example is a multi-core VM, where each virtual CPU could run on a different physical core and yet all of those virtual CPUs share the same memory.
Rather than going for an extreme design point, where virtually nothing is shared (e.g., Barrelfish), I think it would be better to provide an interface where the user has precise control over what is shared and what isn't.
I'd go for concurrent invocation of services first. Then you'll know what data structures you have contention on. And then you can decide whether you want that data replicated (read-mostly) or shared (frequently written).
IMHO, dealing with replicas, distributed protocols, consensus and all that is a lot harder than implementing a few locks or atomic ops on pieces of shared memory. Especially now that we have HLE and TSX coming really soon.
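For illustration, a rough and purely hypothetical C++ sketch of the kind of TSX-based lock elision hinted at here, using Intel's RTM intrinsics (compile with -mrtm; real code would additionally check CPUID for RTM support before taking the transactional path):

  #include <immintrin.h>
  #include <atomic>

  static std::atomic<bool> fallback_lock(false);
  static long              shared_counter;

  void increment()
  {
      unsigned status = _xbegin();
      if (status == _XBEGIN_STARTED) {
          /* put the fallback lock into the read set so that a concurrent
             lock-based writer aborts this transaction */
          if (fallback_lock.load(std::memory_order_relaxed))
              _xabort(0xff);

          shared_counter++;   /* executed transactionally, no lock taken */
          _xend();
          return;
      }

      /* transaction aborted or not started: take an ordinary spin lock */
      while (fallback_lock.exchange(true, std::memory_order_acquire)) { }
      shared_counter++;
      fallback_lock.store(false, std::memory_order_release);
  }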
Cheers, Udo
Hi Norman,
I agree with Udo's comment about needing to be more flexible about resource partitioning across v-cores. Of course you might use configuration scripts to help configure v-cores initially, but it is important to be able to dynamically adjust partitions at run-time according to workload needs. You should strive to separate the concerns of enabling NUMA and the policies behind NUMA.
I think that you can only evaluate the success of your NUMA capabilities with real applications. Hopefully we can help you there.
Daniel
On 03/18/2013 04:40 AM, Udo Steinberg wrote:
[Details snipped]
Hi Udo,
> There are also use cases where you don't want to partition. One example is a multi-core VM, where each virtual CPU could run on a different physical core and yet all of those virtual CPUs share the same memory.
I agree that a vcore instance should definitely be able to manage sets of CPUs. The partitioning policy should be up to the user.
> Rather than going for an extreme design point, where virtually nothing is shared (e.g., Barrelfish), I think it would be better to provide an interface where the user has precise control over what is shared and what isn't.
> I'd go for concurrent invocation of services first. Then you'll know what data structures you have contention on. And then you can decide whether you want that data replicated (read-mostly) or shared (frequently written).
> IMHO, dealing with replicas, distributed protocols, consensus and all that is a lot harder than implementing a few locks or atomic ops on pieces of shared memory. Especially now that we have HLE and TSX coming really soon.
The vcore approach should indeed not hold us back from making core more scalable. The latter should be the ultimate goal and if new Intel technologies can help us, that's great.
One thing left me wondering: don't you see the different access latencies to local vs. remote memory in NUMA systems as a pressing problem that needs a solution by the OS? The consideration of memory locality was actually the driving motivation behind the vcore idea.
Cheers Norman
On Tue, 19 Mar 2013 14:36:17 +0100 Norman Feske (NF) wrote:
NF> One thing left me wondering: don't you see the different access
NF> latencies to local vs. remote memory in NUMA systems as a pressing
NF> problem that needs a solution by the OS? The consideration of memory
NF> locality was actually the driving motivation behind the vcore idea.
Definitely. But all cores that are on the same socket typically share the LLC and the memory controller and therefore belong to the same NUMA domain. For those cores shared memory is much less painful than if you go off-socket.
So for a multi-core VM, you would like to acquire physical cores that are all on the same socket. If that doesn't work for whatever reason, then you have to pay the price of going cross-socket (and likely into a different NUMA domain). The system should discourage, but not prevent that.
Applications probably want interfaces like:
* give me local memory for private use that is cheap to access
* give me memory that can be cheaply shared with cores X, Y, and Z
* give me globally shared memory
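In C++ terms, such an interface might look roughly like the following purely illustrative sketch (all names invented for illustration):

  #include <cstddef>
  #include <initializer_list>

  using Cpu_id = unsigned;

  struct Numa_allocator
  {
      /* cheap CPU-local memory for private use */
      virtual void *alloc_local(std::size_t size) = 0;

      /* memory placed such that the given CPUs can share it cheaply,
         e.g., backed by the NUMA domain common to those CPUs */
      virtual void *alloc_shared(std::initializer_list<Cpu_id> cpus,
                                 std::size_t size) = 0;

      /* globally shared memory, potentially expensive to access */
      virtual void *alloc_global(std::size_t size) = 0;

      virtual ~Numa_allocator() { }
  };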
I don't think you would want to educate every application about NUMA, core proximity, and the like. Only a few memory managers and schedulers in the system need to know about this stuff and can then make allocation and placement decisions based on their knowledge and the application requests they receive.
Cheers, Udo