Hi Norman, thanks for the answer. Some thoughts below.
It is interesting to learn more about the context of your work with Go.
A couple of years ago I had a project to implement Docker support on a new microkernel OS. As a starting point I needed to try this on top of Genode (because the project owners did not have the OS sources available for me at that time). Initially I thought I would need to integrate partitioning with containers, i.e. a single container per OS partition. In the end I only needed to support a set of containers inside a single OS partition, using the Linux emulation layer provided by the OS. Later I found that the main problem was not fixing the kernel, drivers, etc. - the problem was that all Docker support is implemented in Go. So I had to port a couple of million lines of Go code (AKA Docker support), starting with the Go runtime itself (another 1.2M lines of Go and C, which touch ALL available syscalls/services of the underlying OS and require a good understanding of the small differences between different OS APIs). I got this work half-done for Genode and then switched back to the main OS (where I later finished everything - the port of the runtime and the port of Docker inside a single partition).
At the moment I have returned from the old project and want to finish the undone work for Genode, as a testbed for the initial idea of integrating Docker and OS partitions 1 <-> 1, probably using the libc port. I do not have formal customers for it, just my curiosity.
You said that you are not a Go programmer yourself. But do you happen to have users of your Go runtime to get their feedback?
About users - not sure; I published my patches only recently and have not received any feedback yet. Go is actively used by developers, so I hope it will be easy to bring some application software to Genode (e.g. different kinds of HTTP handling). Anyway, the current lack of customers will not stop me from the second part of my research.
I have to compile and run the Docker support code inside Genode - a couple of million lines of Go which heavily use the OS system API, including POSIX and its dialects.
So I am considering having "go build" run natively inside Genode inside QEMU. The first step was to have TCP support integrated with Go - done. The next will be native (non-cross) gccgo support.
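For illustration, the kind of loopback smoke test that exercises Go's net package (and thereby the underlying TCP plumbing) could look like the sketch below; it is a generic example, not the actual test used in the port.

  // tcp_check.go - generic loopback smoke test of Go's net package; just an
  // illustration of the kind of check that exercises the TCP integration,
  // not the actual test used in the port.
  package main

  import (
      "fmt"
      "net"
  )

  func main() {
      ln, err := net.Listen("tcp", "127.0.0.1:0")
      if err != nil {
          panic(err)
      }
      defer ln.Close()

      go func() {
          conn, err := ln.Accept()
          if err != nil {
              return
          }
          conn.Write([]byte("hello over loopback TCP\n"))
          conn.Close()
      }()

      conn, err := net.Dial("tcp", ln.Addr().String())
      if err != nil {
          panic(err)
      }
      defer conn.Close()

      buf := make([]byte, 64)
      n, _ := conn.Read(buf)
      fmt.Print(string(buf[:n]))
  }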
For example, namespace-based isolation (read: the ability to have the same names/ids/etc. in different domains for objects and for anything provided by Genode to user applications, together with additional related APIs). At least for application snapshotting, migration, and persistency this is a must. Namespaces are not strictly necessary for containers themselves - some platforms are supported without them, just as some are supported without a dedicated layered file system (unions and similar, like AuFS/btrfs/zfs/etc.) - although it is good to have one.
I think the two aspects OS-level virtualization and snapshotting/persistency should best be looked at separately.
Regarding OS-level virtualization, Genode's protection domains already provide the benefit of being light-weight - like namespaces when compared to virtual machines - while providing much stronger isolation. Each Genode component has its private capability space, after all, with no sharing by default. Hence, OS-level virtualization on Genode comes down to hosting two regular Genode subsystems side by side.
A general note. Initially, when we at SWsoft/Virtuozzo/Parallels created container-based OS virtualization (we called it "virtual environment" in 2000), we assumed three main pillars (keeping in mind that we wanted to use it as a basis for hosting in a hostile environment, with open, unlimited access from the Internet to the containers):
1. namespace virtualization - not only to isolate resources but to be able to have the same pid and related resources in different containers (for Unix, think of emulating an init process with the pre-defined pid 1, at least); a minimal Linux sketch of this kind of pid isolation follows after this list;
2. file-system virtualization to allow COW and transparent sharing of the same files (e.g. the apache executable between hundreds of container instances) in order to preserve kernel memory and object space (as opposed to VMs, where you cannot efficiently share files and data structures between different VM instances) - key to the high scalability and performance of containers, and for Docker also key to the "encapsulation of changes" paradigm. Sharing through a single kernel instance is a broad paradigm - it allows optimizing kernel structure allocation, resource sharing, a single instance of the memory allocator, etc.;
3. ALL resource limits on a per-container basis (we called this "user beancounters"), which prevent any attempt to mount a DoS attack from one container against another or against the host.
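As an aside, today's Linux exposes exactly this kind of pid isolation through PID namespaces. A minimal, Linux-only Go sketch (it requires root or CAP_SYS_ADMIN; all names here are illustrative) shows a child that sees itself as pid 1:

  // pidns.go - minimal Linux-only sketch: run a child in a fresh PID
  // namespace so that, inside the namespace, the child sees itself as pid 1
  // (the "init emulation" mentioned above). Requires root or CAP_SYS_ADMIN.
  package main

  import (
      "fmt"
      "os"
      "os/exec"
      "syscall"
  )

  func main() {
      if len(os.Args) > 1 && os.Args[1] == "child" {
          // Inside the new PID namespace this process is pid 1.
          fmt.Println("child sees its own pid as:", os.Getpid())
          return
      }

      // Re-exec ourselves as "child" inside a new PID namespace.
      cmd := exec.Command("/proc/self/exe", "child")
      cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
      cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: syscall.CLONE_NEWPID}
      if err := cmd.Run(); err != nil {
          fmt.Fprintln(os.Stderr, "run:", err)
          os.Exit(1)
      }
  }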
Every container was initially meant to be like a remotely accessible, complete instance of Linux with root access and an init process, but without the ability to load its own device drivers. We implemented this first for Linux, later for FreeBSD/Solaris (partially) and Windows (based on a hot-patching technique and their Terminal Server), and considered Mach/macOS Darwin (just experiments). For Linux and Windows it was a commercial-grade implementation and is still used by millions of customers.
Now all these features (except maybe the file systems, although zfs/btrfs/overlayfs/etc. offer something similar) have become part of most commercial OSes on the mass market. IMHO they can be implemented cheaply from the very beginning of an OS kernel's development - everything was in place, except the understanding of why this is necessary outside of simple testing environments for developers. Maybe it is also time for Genode to think in this direction?
Returning to Genode.
The reason for the existence of namespaces (ns) is not only isolation; it is a bit wider. One aspect is the ability to disallow the "manual construction" of object ids/handles/capabilities/references/etc. to access something that should not be visible at all.
For example, in a namespace-isolated container I should not be able to send a signal to an arbitrary process by name (pid in our case) even if it exists in the kernel. Or, vice versa, to use some pre-defined process ids to do something (e.g. Unix likes to attach orphans to pid 1 and later tries to enumerate them; this has to be emulated somehow when porting user-level software, and for Linux Docker this is important).
In the case of Genode, I can probably create and keep a capability (with a data pointer inside) and perform some operations with it, if I store it somewhere. If this capability were virtualized, we would have an additional level of control over it (by creating pre-defined caps and an explicit level of limitation, even if it is intentionally shared during an initialization procedure that could be part of legacy software being ported to Genode).
For better understanding, a use case: imagine you want to port an application that uses a third-party library which initializes some exotic file descriptors. A good example is Docker itself - when you exec a process inside a Docker container, you typically do not want it to inherit the descriptors opened by your main process, including stdin/stdout (typically this is achieved via the CLOEXEC flag - but let's consider its absence; you may simply not know the descriptors exist). Technically it is code in your application, but it was initialized by a third-party library linked into it, and you have no easy way to control it.
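A small Go sketch of this situation, assuming Linux and a made-up /tmp/leaky.log path: a raw descriptor opened without O_CLOEXEC (as legacy code linked into the application might do) is inherited by every exec'ed child unless it is explicitly marked close-on-exec.

  // cloexec.go - sketch of the descriptor-leak problem described above.
  // A raw descriptor opened without O_CLOEXEC is inherited by exec'ed
  // children unless it is explicitly marked close-on-exec. Unix only;
  // the path is illustrative.
  package main

  import (
      "os"
      "os/exec"
      "syscall"
  )

  func main() {
      // Stand-in for a descriptor opened deep inside a third-party library.
      fd, err := syscall.Open("/tmp/leaky.log", syscall.O_CREAT|syscall.O_WRONLY, 0644)
      if err != nil {
          panic(err)
      }
      defer syscall.Close(fd)

      // Without this call the descriptor would leak into the child below.
      syscall.CloseOnExec(fd)

      // The child now sees only its own descriptors plus stdin/stdout/stderr.
      cmd := exec.Command("ls", "-l", "/proc/self/fd")
      cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
      cmd.Run()
  }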
The ns implementation has simple rules for maintaining group isolation, and it is not considered unnecessary even in the Linux kernel with its own capability set. I think namespaces are a convenient way to handle such legacy-related questions, and they are worth having at the Genode level, where you already have wrappers around the native low-level kernel calls.
And for snapshotting (see the comment below) this is a must - I need to re-create all objects with the same ids even if they already exist in other threads/sessions/processes, because the ids could be stored in "numerical form" inside user thread memory.
As for a file system like overlayfs - not sure; I assume it is possible to port some known FS to Genode, but it is not a first-priority task (Windows Docker does not have it).
As for resource accounting and limits - I have not tackled this topic at all for Genode.
The snapshotting/persistency topic is not yet covered. But I see a rather clear path towards it, at least for applications based on Genode's libc. In fact, the libc already has the ability to replicate the state of its application as part of the fork mechanism. Right now, this mechanism is only used internally. But it could be taken as the basis for, e.g., serializing the application state into a snapshot file. Vice versa, similar to how a forked process obtains its state from the forking process, the libc could support the ability to import a snapshot file at startup. All this can be implemented in the libc without changing Genode's base framework.
That being said, there is an elephant in the room, namely how POSIX threads fit into the picture. How can the state of a multi-threaded application be serialized in a consistent way? That would be an interesting topic to research.
I think we can follow the ideas developed in the CRIU patch for Linux [1]; there is no need to invent something too complex. It can freeze a running container (or an individual application) and checkpoint its state to disk. The saved data can be used to restore the application and run it exactly as it was at the time of the freeze. Using this functionality, application or container live migration, snapshots, remote debugging, and many other things become possible.
In short, they utilize existing Linux kernel syscalls like ptrace and add a very small set of missing ones to enumerate process-related objects [2]. This does not mean that you need to have ptrace - it is just used as a kind of auxiliary interface to obtain info about processes and can be implemented in different ways.
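To make the idea concrete, here is a minimal Go sketch (Linux-only, not CRIU code) that gathers the same kind of per-process information purely from /proc; the pid is taken from the command line.

  // procdump.go - Linux-only sketch of the kind of per-process enumeration
  // CRIU relies on, done here purely through /proc. The pid comes from argv[1].
  package main

  import (
      "fmt"
      "os"
      "path/filepath"
  )

  func main() {
      pid := os.Args[1]

      // Open file descriptors: each entry in /proc/<pid>/fd is a symlink to
      // the underlying file, pipe or socket.
      fds, _ := os.ReadDir(filepath.Join("/proc", pid, "fd"))
      for _, fd := range fds {
          target, _ := os.Readlink(filepath.Join("/proc", pid, "fd", fd.Name()))
          fmt.Printf("fd %s -> %s\n", fd.Name(), target)
      }

      // Memory mappings: /proc/<pid>/maps lists every virtual memory area.
      maps, _ := os.ReadFile(filepath.Join("/proc", pid, "maps"))
      fmt.Print(string(maps))
  }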
To stop (freeze) a set of related processes (a tree), even with POSIX, they use a feature (which can be considered part of ns virtualization) known as cgroups [3]: the freezer allows the checkpoint code to obtain a consistent image of the tasks by attempting to force the tasks in a cgroup into a quiescent state. Once the tasks are quiescent, another task can walk /proc or invoke a kernel interface to gather information about the quiesced tasks. Checkpointed tasks can be restarted later should a recoverable error occur. This also allows the checkpointed tasks to be migrated between nodes in a cluster by copying the gathered information to another node and restarting the tasks there.
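A minimal sketch of driving the cgroup-v1 freezer from Go, assuming the freezer hierarchy is mounted at /sys/fs/cgroup/freezer and a group named "snapshot-demo" already contains the target tasks (both assumptions are illustrative):

  // freeze.go - sketch of driving the cgroup-v1 freezer referenced in [3].
  // Assumes the freezer hierarchy is mounted at /sys/fs/cgroup/freezer and
  // that a group "snapshot-demo" already contains the target tasks.
  // Linux-only, requires root; the group name is illustrative.
  package main

  import (
      "fmt"
      "os"
  )

  const group = "/sys/fs/cgroup/freezer/snapshot-demo"

  func setState(state string) error {
      // Writing FROZEN/THAWED to freezer.state drives the whole group into
      // or out of the quiescent state described in the kernel documentation.
      return os.WriteFile(group+"/freezer.state", []byte(state), 0644)
  }

  func main() {
      if err := setState("FROZEN"); err != nil {
          fmt.Fprintln(os.Stderr, "freeze:", err)
          os.Exit(1)
      }
      // ... walk /proc for every frozen task and record its state here ...
      if err := setState("THAWED"); err != nil {
          fmt.Fprintln(os.Stderr, "thaw:", err)
          os.Exit(1)
      }
  }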
It seems that features similar to ns and cgroups should be the first part/base of a checkpoint/restore implementation. Of course, the serialization part related to fork/libc, as you mention, could be another pillar.
In general, I think that to implement snapshotting we need to:
1. freeze the set of threads (or make them COW, e.g. for memory changes);
2. enumerate the threads;
3. enumerate related objects/states (e.g. file descriptors/pipes/etc.);
4. enumerate virtual memory areas and the "shared resources" between threads;
5. enumerate the network stack/socket states (a bit different beast);
6. dump everything.
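Just to make the shape of the data concrete, a hypothetical snapshot layout covering these steps could look like the following; all type and field names are made up for illustration and do not correspond to any existing Genode or CRIU API.

  // snapshot.go - hypothetical data layout covering the steps above; all type
  // and field names are made up for illustration and do not correspond to any
  // existing Genode or CRIU API.
  package snapshot

  type ThreadState struct {
      ID        uint64 // thread id, preserved verbatim for restore
      Registers []byte // serialized CPU context
  }

  type FileState struct {
      FD     int    // numerical descriptor value, must be identical after restore
      Path   string // file, pipe or socket endpoint
      Offset int64
      Flags  int
  }

  type MemoryArea struct {
      Start, Size uint64
      Prot        int
      SharedWith  []uint64 // ids of other threads/processes mapping the same area
      Data        []byte   // page contents (or a COW reference)
  }

  type SocketState struct {
      Proto  string // e.g. "tcp"
      Local  string
      Remote string
      RecvQ  []byte // in-flight data that must be replayed on restore
      SendQ  []byte
  }

  // Snapshot bundles everything that must be re-created with the same
  // numerical identifiers and the same sharing/parent-child relations.
  type Snapshot struct {
      Threads []ThreadState
      Files   []FileState
      Memory  []MemoryArea
      Sockets []SocketState
  }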
For restore we not only need to create objects with the same numerical id values (even the same memory layout); we also need an API to force every object into the same content/state with the related security/ns, and to force (restore) the "sharing" and parent/child relations, if any, of objects between different threads/processes/sessions/etc.
Related to this topic: we also need to be able to bring some drivers into the same state, because during checkpoint/restore we assume that external connections are in a known state (e.g. imagine a video driver and an application that draws on the screen: the content of the video memory is part of the application's state, yet it is not stored in the application). This is probably related to the restartable-drivers feature (and the related fault-tolerance questions).
Note: by the way, one of the key problems of the CRIU patch at the moment is the inability to restore a graphical screen for X/etc. We can restore the related sockets, but the parts of the protocol that need to be replayed are not known at the time you request the checkpoint. I think there is nobody available now who knows the real X protocol details necessary for these operations... but this is a different story, not directly related to the Genode questions.
[1] https://criu.org/Main_Page
[2] https://criu.org/Checkpoint/Restore
[3] https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt
Sincerely, Alexander