Hello,
I have repeatedly been asked about our plans to scale Genode towards hardware platforms with non-uniform memory access (NUMA), i.e., manycore systems. This posting is meant as a rough collection of ideas. It is not an immediate call for action, nor an actual plan. But maybe it will spawn a worthwhile discussion on the subject so that we can develop a tangible way forward together.
Until now, multi-core platforms have not received much attention in Genode because the framework has primarily been used on hardware with only a few CPU cores, and the workloads carried by the framework have been relatively lightweight. This shows in the current state of the implementation. For example, on Fiasco.OC, we use a single pager thread within core to resolve all page faults in the system, which implies costly inter-processor interrupts (IPIs) when page faults occur on CPUs remote to the pager thread. As another example of current deficiencies, several data structures within core are accessed in a serialized fashion. If threads on different CPUs need to access those data structures concurrently, those points of contention naturally become scalability bottlenecks.
Concurrent page-fault handling
------------------------------
On a multi-core system, CPU-local page-fault handling is desired. Genode's core-internal page-fault handling could be changed relatively easily to a model where we use one page-fault handler per CPU. This way, the delivery of page-fault messages would not involve any IPIs. I think that this step is clearly beneficial. On NOVA, we already employ a scheme where each thread has a dedicated page-fault handler in core, so we have already implemented the fine-grained synchronization of the data structures within core that is needed for that. Here, page faults caused by different threads are effectively handled in parallel. (We are not using multiple CPUs on NOVA yet, though.) Applying a similar scheme to other kernels such as Fiasco.OC would be a relatively small step.
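To make the idea a bit more concrete, here is a rough C++ sketch (purely illustrative, not Genode code, all names invented) of one page-fault handler per CPU, each serving a CPU-local fault queue so that fault delivery never has to cross CPUs. Real fault delivery would of course come from the kernel rather than from plain queues:

  #include <condition_variable>
  #include <cstdio>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  struct Fault { unsigned long addr; };   /* faulting virtual address */

  /* hypothetical stand-in for a kernel fault endpoint bound to one CPU */
  struct Fault_queue
  {
      std::queue<Fault>       faults;
      std::mutex              mtx;
      std::condition_variable cv;

      void submit(Fault f)
      {
          { std::lock_guard<std::mutex> g(mtx); faults.push(f); }
          cv.notify_one();
      }

      Fault wait()
      {
          std::unique_lock<std::mutex> l(mtx);
          cv.wait(l, [&]{ return !faults.empty(); });
          Fault f = faults.front(); faults.pop();
          return f;
      }
  };

  /* per-CPU pager loop, resolving only faults raised on 'cpu' */
  void pager(unsigned cpu, Fault_queue &queue)
  {
      for (;;) {
          Fault f = queue.wait();
          /* here: look up the dataspace and install the mapping */
          std::printf("CPU %u: resolved fault at 0x%lx\n", cpu, f.addr);
      }
  }

  int main()
  {
      unsigned cpus = std::thread::hardware_concurrency();
      if (cpus == 0) cpus = 1;   /* hardware_concurrency may be unknown */

      std::vector<Fault_queue> queues(cpus);
      std::vector<std::thread> pagers;
      for (unsigned cpu = 0; cpu < cpus; cpu++)
          pagers.emplace_back(pager, cpu, std::ref(queues[cpu]));

      queues[0].submit(Fault{ 0x1000ul });   /* simulate one fault on CPU 0 */

      for (auto &t : pagers) t.join();       /* the pagers run forever here */
  }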
However, the page-fault handlers would still operate on shared data structures such as the allocator for physical memory or RM sessions, so synchronization of those data structures is still needed. To completely localize page-fault handling and remove synchronization between different CPUs, we would need to localize the data structures as well. This is a much harder problem, which could be tackled by replicating or partitioning the data. Both approaches would ultimately increase the complexity of Genode's core and also introduce the need to feed core with platform parameters and policies. This is something I'd like to avoid. After all, the low complexity of the Genode base system is one of its strongest points.
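Just to illustrate what the partitioning option would mean in code terms, here is a tiny, purely hypothetical C++ sketch (not Genode code) contrasting a single globally locked allocator with per-CPU allocators over pre-partitioned ranges, where the common allocation path needs no cross-CPU synchronization:

  #include <cstddef>
  #include <cstdint>
  #include <mutex>

  /* trivial bump allocator over one contiguous (physical) range */
  struct Range_allocator
  {
      std::uintptr_t next = 0, end = 0;

      void *alloc(std::size_t size)
      {
          if (next + size > end) return nullptr;
          void *result = reinterpret_cast<void *>(next);
          next += size;
          return result;
      }
  };

  /* variant A: one shared allocator, all CPUs contend on one lock */
  struct Locked_allocator
  {
      Range_allocator range;
      std::mutex      mtx;

      void *alloc(std::size_t size)
      {
          std::lock_guard<std::mutex> guard(mtx);   /* point of contention */
          return range.alloc(size);
      }
  };

  /* variant B: one allocator per CPU over a CPU-local partition,
     no lock on the fast path (only needed when rebalancing partitions) */
  struct Per_cpu_allocators
  {
      enum { MAX_CPUS = 64 };
      Range_allocator cpu_range[MAX_CPUS];

      void *alloc(unsigned cpu, std::size_t size)
      {
          return cpu_range[cpu].alloc(size);
      }
  };

  int main()
  {
      Per_cpu_allocators pa;
      pa.cpu_range[0] = Range_allocator{ 0x100000, 0x200000 };
      return pa.alloc(0, 4096) ? 0 : 1;
  }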
Challenges of NUMA
------------------
Looking a step further, at NUMA systems, we find that we also need to localize code. For example, the code and data involved in page-fault handling should always be close to the core on which the faulting thread is running. With this in mind, the system gets even more complicated. These observations make me hesitant to extend core beyond a basic implementation of multi-threaded page-fault handling.
VCore - Virtualizing Genode's core
----------------------------------
However, I see an alternative way, which is actually pretty similar to the concept employed in Barrelfish: co-hosting multiple loosely coupled subsystems on one machine. We could try to leverage Genode's recursive nature to solve the problem on top of core by introducing CPU-local branches of the Genode process tree. The basic idea would be to virtualize core's CPU, RAM, and RM services by using a new component (let's call it vcore for now). Any number of vcores can run at a time, each responsible for a set of physical CPUs and their associated CPU-local memory resources. Each vcore is a runtime environment that can be supplied with a configuration describing the subsystem to execute, similar to Genode's init process. In addition, the configuration comes with information about the physical CPUs and memory ranges that the vcore instance should manage. To the Genode components running on top of a vcore instance, that vcore looks just like core.
When a vcore instance is started, it will read its configuration to obtain the ranges of physical RAM it should manage. It will then allocate those ranges at core and map (and fault in) them locally. This step may be slow, but it is done only once at the startup of vcore. So, basically, the vcore instance sucks all the RAM that belongs to its resource partition out of core. With the current interface of Genode's core, this is not possible, so we need to slightly extend core to accommodate this use case. When starting its children, vcore will not hand out core's RAM and RM sessions to them but will implement those services itself. So each time a process of the subsystem performs a RAM allocation or attaches a dataspace to its RM session, the request will be handled and monitored by vcore. By virtualizing the RM session, vcore can furthermore hook itself in as the page-fault handler of those processes. Hence, page faults are always handled locally by the corresponding vcore. In Genode, page-fault handling is actually implemented as a library, which in principle allows page faults to be processed outside of core (although we have never attempted to use this library outside of core so far). So this idea seems feasible to me.
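To sketch the intended division of labor (purely hypothetical interfaces, not Genode's actual API; the range-based allocation at core stands for the small core extension mentioned above), a vcore would drain its configured RAM partition from core once at startup and then serve its children's RAM requests entirely from that local pool:

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  /* hypothetical handle for a chunk of RAM obtained from core */
  struct Ram_chunk { std::uintptr_t base; std::size_t size; };

  /* hypothetical stand-in for core's RAM service, extended by the
     ability to allocate from a specific physical range */
  struct Core_ram
  {
      Ram_chunk alloc_from_range(std::uintptr_t base, std::size_t size)
      {
          /* real core would reserve the range and return a dataspace */
          return Ram_chunk{ base, size };
      }
  };

  struct Vcore
  {
      Core_ram              &core;
      std::vector<Ram_chunk> pool;        /* RAM owned by this vcore  */
      std::size_t            avail = 0;   /* unassigned bytes in pool */

      /* called once at startup for each range found in the config */
      void drain_partition(std::uintptr_t base, std::size_t size)
      {
          pool.push_back(core.alloc_from_range(base, size));
          avail += size;
      }

      /* RAM service as seen by the vcore's children; the common
         path never interacts with core */
      bool alloc_for_child(std::size_t size)
      {
          if (size > avail) return false;
          avail -= size;
          return true;
      }
  };

  int main()
  {
      Core_ram core;
      Vcore    vcore { core };

      vcore.drain_partition(0x80000000, 256*1024*1024);
      return vcore.alloc_for_child(4096) ? 0 : 1;
  }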
Now, with page faults handled locally within each vcore, we naturally eliminate cross-CPU talk. However, we need to consider cases where a process inside a vcore environment wants to access a dataspace provided by a process outside of its vcore, for example, when using a resource multiplexer such as the nitpicker GUI server. In this case, vcore is unable to resolve the page fault because it has not created the corresponding dataspace (it was created by nitpicker using core's RAM service). However, there is always the real core underneath all vcores, which has a complete view of the system. So vcore could forward such non-vcore-local page faults to core. Nifty, isn't it? Naturally, such non-local page faults will carry an overhead by taking a hop through vcore. But access to those non-local resources is expected to be slow (and rare) anyway. There is still one constellation that cannot be accommodated this way: the direct sharing of dataspaces between different vcores.
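The decision logic of such a vcore pager could look roughly like the following purely illustrative C++ sketch (all types and names are hypothetical, not actual Genode interfaces):

  #include <cstdint>
  #include <set>

  using Dataspace_id = std::uint64_t;   /* hypothetical dataspace handle */

  /* the real core underneath, with the complete view of the system */
  struct Core_pager
  {
      void resolve(std::uintptr_t) { /* map via core's global knowledge */ }
  };

  /* virtualized region map of the faulting process */
  struct Region_map
  {
      Dataspace_id lookup(std::uintptr_t addr) { return addr >> 12; /* placeholder */ }
  };

  struct Vcore_pager
  {
      Region_map             &rm;      /* RM session virtualized by this vcore */
      Core_pager             &core;    /* fallback path                        */
      std::set<Dataspace_id>  local;   /* dataspaces created by this vcore     */

      void resolve_locally(std::uintptr_t) { /* CPU-local mapping */ }

      void handle_fault(std::uintptr_t addr)
      {
          Dataspace_id ds = rm.lookup(addr);

          if (local.count(ds))
              resolve_locally(addr);   /* fast path, no cross-CPU traffic */
          else
              core.resolve(addr);      /* rare, slower hop through core   */
      }
  };

  int main()
  {
      Region_map  rm;
      Core_pager  core;
      Vcore_pager pager { rm, core, { 0x1 } };

      pager.handle_fault(0x1234);   /* dataspace 0x1 -> resolved locally */
      pager.handle_fault(0x9876);   /* unknown dataspace -> forwarded    */
  }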
Is this feasible?
-----------------
Right now, the vcore idea is just a rough sketch. Admittedly, I cannot give a substantiated estimate of how successful it may be, as I lack experience in the domain of manycore systems, nor do I have access to a large NUMA machine. So I hope that some of you may step in to share actual experience or point out flaws in this idea.
I already see a few limitations. For example, even though the vcore idea looks like a nice solution for the problem space covered by Genode, there are problems lying outside the scope of Genode that may severely impede the system's scalability even if the concept works out as desired, namely in the kernel. The kernel needs to address the CPU-locality problem for its own operations, in particular IPC, too. How do we go about that problem?
Apart from that, I see additional challenges related to devices, such as the CPU-local handling of device interrupts or access to MMIO device resources.
However, if we go for the vcore concept, I see plenty of topics that could be pursued on top of it, for example dynamic load balancing, dynamically changing vcore policies, or extending vcore towards power management.
What do you think? Would the vcore idea be worthwhile to explore? Those of you experienced in the field of manycore NUMA systems, do you see additional pitfalls? Or even better, does anyone have alternative ideas to explore? Also, I am very interested in ways to validate work in this domain. How can we measure our success?
Best regards Norman
On Mon, 18 Mar 2013 11:49:40 +0100 Norman Feske (NF) wrote:
[Details snipped]
NF> What do you think? Would the vcore idea be worthwhile to explore? Those
NF> of you experienced in the field of manycore NUMA systems, do you see
NF> additional pitfalls? Or even better, does anyone have alternative ideas
NF> to explore? Also, I am very interested in ways to validate work in this
NF> domain. How can we measure our success?
There are also use cases where you don't want to partition. One example is a multi-core VM, where each virtual CPU could run on a different physical core and yet all of those virtual CPUs share the same memory.
Rather than going for an extreme design point, where virtually nothing is shared (e.g., Barrelfish), I think it would be better to provide an interface where the user has precise control over what is shared and what isn't.
I'd go for concurrent invocation of services first. Then you'll know what data structures you have contention on. And then you can decide whether you want that data replicated (read-mostly) or shared (frequently written).
IMHO, dealing with replicas, distributed protocols, consensus and all that is a lot harder than implementing a few locks or atomic ops on pieces of shared memory. Especially now that we have HLE and TSX coming really soon.
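For illustration, a rough and purely hypothetical C++ sketch of the kind of TSX-based lock elision hinted at here, using Intel's RTM intrinsics (compile with -mrtm; real code would additionally check CPUID for RTM support before taking the transactional path):

  #include <immintrin.h>
  #include <atomic>

  static std::atomic<bool> fallback_lock(false);
  static long              shared_counter;

  void increment()
  {
      unsigned status = _xbegin();
      if (status == _XBEGIN_STARTED) {
          /* put the fallback lock into the read set so that a concurrent
             lock-based writer aborts this transaction */
          if (fallback_lock.load(std::memory_order_relaxed))
              _xabort(0xff);

          shared_counter++;   /* executed transactionally, no lock taken */
          _xend();
          return;
      }

      /* transaction aborted or not started: take an ordinary spin lock */
      while (fallback_lock.exchange(true, std::memory_order_acquire)) { }
      shared_counter++;
      fallback_lock.store(false, std::memory_order_release);
  }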
Cheers, Udo
Hi Norman,
I agree with Udo's comment about needing to be more flexible about resource partitioning across v-cores. Of course you might use configuration scripts to help configure v-cores initially, but it is important to be able to dynamically adjust partitions at run-time according to workload needs. You should strive to separate the concerns of enabling NUMA and the policies behind NUMA.
I think that you can only evaluate the success of your NUMA capabilities with real applications. Hopefully we can help you there.
Daniel
On 03/18/2013 04:40 AM, Udo Steinberg wrote:
[Details snipped]
Hi Udo,
> There are also use cases where you don't want to partition. One example is a multi-core VM, where each virtual CPU could run on a different physical core and yet all of those virtual CPUs share the same memory.
I agree that a vcore instance should definitely be able to manage sets of CPUs. The partitioning policy should be up to the user.
> Rather than going for an extreme design point, where virtually nothing is shared (e.g., Barrelfish), I think it would be better to provide an interface where the user has precise control over what is shared and what isn't.
> I'd go for concurrent invocation of services first. Then you'll know what data structures you have contention on. And then you can decide whether you want that data replicated (read-mostly) or shared (frequently written).
> IMHO, dealing with replicas, distributed protocols, consensus and all that is a lot harder than implementing a few locks or atomic ops on pieces of shared memory. Especially now that we have HLE and TSX coming really soon.
The vcore approach should indeed not hold us back from making core more scalable. The latter should be the ultimate goal and if new Intel technologies can help us, that's great.
One thing left me wondering: don't you see the different access latencies to local vs. remote memory in NUMA systems as a pressing problem that needs a solution by the OS? The consideration of memory locality was actually the driving motivation behind the vcore idea.
Cheers Norman
On Tue, 19 Mar 2013 14:36:17 +0100 Norman Feske (NF) wrote:
NF> One thing left me wondering: don't you see the different access
NF> latencies to local vs. remote memory in NUMA systems as a pressing
NF> problem that needs a solution by the OS? The consideration of memory
NF> locality was actually the driving motivation behind the vcore idea.
Definitely. But all cores that are on the same socket typically share the LLC and the memory controller and therefore belong to the same NUMA domain. For those cores shared memory is much less painful than if you go off-socket.
So for a multi-core VM, you would like to acquire physical cores that are all on the same socket. If that doesn't work for whatever reason, then you have to pay the price of going cross-socket (and likely into a different NUMA domain). The system should discourage, but not prevent that.
Applications probably want interfaces like:
* give me local memory for private use that is cheap to access
* give me memory that can be cheaply shared with cores X, Y, and Z
* give me globally shared memory
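In C++ terms, such an interface might look roughly like the following purely illustrative sketch (all names invented for illustration):

  #include <cstddef>
  #include <initializer_list>

  using Cpu_id = unsigned;

  struct Numa_allocator
  {
      /* cheap CPU-local memory for private use */
      virtual void *alloc_local(std::size_t size) = 0;

      /* memory placed such that the given CPUs can share it cheaply,
         e.g., backed by the NUMA domain common to those CPUs */
      virtual void *alloc_shared(std::initializer_list<Cpu_id> cpus,
                                 std::size_t size) = 0;

      /* globally shared memory, potentially expensive to access */
      virtual void *alloc_global(std::size_t size) = 0;

      virtual ~Numa_allocator() { }
  };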
I don't think you would want to educate every application about NUMA, core proximity, and the like. Only a few memory managers and schedulers in the system need to know about this stuff and can then make allocation and placement decisions based on their knowledge and the application requests they receive.
Cheers, Udo