Roadmap 2022

Johannes Schlatow johannes.schlatow at genode-labs.com
Thu Jan 6 13:51:12 CET 2022


Hi Alexander,

thanks for bringing up the discussion. I'm thrilled by the idea of
hosting containerised apps on Genode. Apparently, I haven't dug into
this topic as deeply as you have, so please excuse my naive view on it.
Maybe you can clarify where I'm missing some details.

I tried to read up on the container topic and had my own thoughts and
ideas on how containers could end up on Genode.
As far as I understood, a container is basically a filesystem image
with some configuration of how to set things up. The container runtime
will read the configuration and prepare everything depending on the
target system before it launches the process defined by the container.
After that, the started container is merely a standard process that has
been encapsulated with namespaces, cgroups, and other isolation
mechanisms. The process performs syscalls just like a non-containerised
process would.
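To make this concrete: the essence of such a runtime fits in a few
lines. Here is a minimal Go sketch (Linux-only, needs root, error
handling trimmed; the entrypoint path merely stands in for whatever
the image configuration would specify):

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // Entrypoint and arguments would come from the image config.
        cmd := exec.Command("/bin/sh")
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        cmd.SysProcAttr = &syscall.SysProcAttr{
            // Fresh mount, PID, UTS, and network namespaces isolate
            // the process from the host.
            Cloneflags: syscall.CLONE_NEWNS | syscall.CLONE_NEWPID |
                syscall.CLONE_NEWUTS | syscall.CLONE_NEWNET,
        }
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }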

By the way, I found [1] particularly helpful for reading up on the
topic and recommend this to anyone who is keen on following this
discussion.

[1]
https://developers.redhat.com/blog/2018/02/22/container-terminology-practical-introduction

Thus, when thinking about running a container on Genode, I noticed we
have most ingredients already in stock since a Genode component is a
sandboxed process with its resource quota and local namespace.

Regarding file-system virtualisation, we have the VFS and can even
host a shared VFS in a dedicated server component. I'm not sure about a
copy-on-write feature, though.
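To illustrate what I mean by the copy-on-write feature: the rule such a
VFS plugin would have to implement is roughly the following (a
hypothetical Go sketch, where "upper" and "lower" stand in for the
writable layer and the read-only image layer):

    package overlay

    import (
        "io"
        "os"
        "path/filepath"
    )

    // openForWrite implements the copy-up rule: if the file exists
    // only in the read-only lower layer, copy it into the writable
    // upper layer first, then open the upper copy for writing.
    func openForWrite(upper, lower, name string) (*os.File, error) {
        up := filepath.Join(upper, name)
        if _, err := os.Stat(up); os.IsNotExist(err) {
            src, err := os.Open(filepath.Join(lower, name))
            if err != nil {
                return nil, err // present in neither layer
            }
            defer src.Close()
            if err := os.MkdirAll(filepath.Dir(up), 0755); err != nil {
                return nil, err
            }
            dst, err := os.Create(up) // the actual copy-up
            if err != nil {
                return nil, err
            }
            if _, err := io.Copy(dst, src); err != nil {
                dst.Close()
                return nil, err
            }
            dst.Close()
        }
        return os.OpenFile(up, os.O_RDWR, 0644)
    }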

From my (current) point of view, enabling containerised workloads on
Genode probably requires three ingredients:

1. Implementing additional VFS plugins for mounting container images,
   overlays, and copy-on-write (COW) functionality (see the sketch
   above).
2. Adding missing plugins for special file nodes in devfs, sysfs or
   procfs. This highly depends on what the particular container process
   expects, though.
3. Implementing a container runtime for Genode that sets up a sub-init
   to launch the container process with the appropriate VFS and helper
   components according to the container configuration.

Regarding 3., I'm uncertain whether this is best approached from
scratch or by porting an existing runtime such as runc or crun. The
downside of the latter approach is that it requires us to provide all
the Linux management interfaces such as cgroups, namespaces, etc. and
map these to Genode sub-init configurations. Parsing the container
configuration and applying the appropriate actions directly seems more
natural to me at the moment.
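To sketch what I mean by parsing the configuration directly (purely
illustrative; the emitted XML is a simplified stand-in for a Genode
sub-init configuration, not the actual schema):

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    // The subset of an OCI config.json relevant here.
    type ociConfig struct {
        Process struct {
            Args []string `json:"args"`
            Env  []string `json:"env"`
        } `json:"process"`
    }

    func main() {
        f, err := os.Open("config.json")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        var cfg ociConfig
        if err := json.NewDecoder(f).Decode(&cfg); err != nil {
            panic(err)
        }
        if len(cfg.Process.Args) == 0 {
            panic("config.json: empty process.args")
        }

        // Emit a <start> node for the container process; a real
        // runtime would also wire up the VFS and helper components.
        fmt.Printf("<start name=%q caps=\"300\">\n", cfg.Process.Args[0])
        fmt.Printf("  <binary name=%q/>\n", cfg.Process.Args[0])
        fmt.Println("  <config> <!-- VFS with image, overlay, and cow plugins --> </config>")
        fmt.Println("</start>")
    }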

@Alexander: What do you think are the major roadblocks for running a
first container image on Genode?

Again, please excuse my naive view on these matters. I feel I have
merely climbed Mount Stupid when it comes to containers.

Cheers
Johannes

On Wed, 5 Jan 2022 19:23:19 +0000
Alexander Tormasov via users <users at lists.genode.org> wrote:

>  Hi Norman,
> thanks for the answer. Some thoughts below.
> 
> > 
> > it is interesting to learn more about the context of your work with
> > Go.
> > 
> > 
> A couple of years ago I had a project to implement Docker support on
> a new microkernel OS. As a starting point, I needed to try this on
> top of Genode (because the project owners did not have the OS sources
> available for me at that time). Initially I thought I would need to
> integrate partitioning with containers, i.e., have a single container
> per OS partition. In the end, I only needed to support a set of
> containers inside a single OS partition, using the Linux emulation
> layer provided by the OS. Later I found that the main problem was not
> fixing the kernel, drivers, etc. - the problem was that all Docker
> support is implemented in Go. So I needed to port a couple of million
> LOC written in Go (i.e., the Docker support), starting with the Go
> runtime itself (another 1.2M LOC of Go and C, which touches ALL
> available syscalls/services/etc. of the underlying OS and requires a
> good understanding of the small differences between OS APIs). I got
> this work half-done for Genode and then switched back to the main OS
> (where I later finished everything - the port of the runtime and the
> port of Docker inside a single partition).
> 
> Now I have returned from that old project and want to finish the
> undone work for Genode, as a testbed for the initial idea of
> integrating Docker and OS partitions 1 <-> 1, probably using the libc
> port. I do not have formal customers for it, just my own curiosity.
> 
> > You said that you are not a Go programmer yourself. But do you
> > happen to have users of your Go runtime to get their feedback?
> > 
> 
> About users - I am not sure; I published my patches only recently and
> do not have any feedback yet. Go is actively used by developers, so I
> hope it will be easy to bring some application software to Genode
> (e.g. different handling of HTTP stuff). Anyway, the current lack of
> customers will not stop me from the second part of my research.
> 
> I have to compile and run the Docker support code inside Genode - a
> couple of million lines of Go that heavily use the OS system API,
> including POSIX and its dialects.
> 
> So, I am considering making "go build" run natively inside Genode
> inside QEMU. The first step was to have TCP support integrated with
> Go - done. Next will be native (non-cross) gccgo support.
> 
> >> Like namespace-based isolation (read: the ability to have the same
> >> names/IDs/etc. in different domains, for objects and anything
> >> provided by Genode to user apps, together with additional related
> >> APIs). At least for app snapshotting, migration, and persistency
> >> this is a must. They are not strictly necessary for containers
> >> themselves - some platforms are supported without it, as well as
> >> without a dedicated layered FS (unions and the like, such as
> >> aufs/btrfs/zfs/etc. - though it is good to have).
> > 
> > I think the two aspects OS-level virtualization and
> > snapshotting/persistency should best be looked at separately.
> > 
> > Regarding OS-level virtualization, Genode's protection domains
> > already provide the benefit of being light-weight - like namespaces
> > when compared to virtual machines - while providing much stronger
> > isolation. Each Genode component has its private capability space
> > after all with no sharing by default. Hence, OS-level
> > virtualization on Genode comes down to hosting two regular Genode
> > subsystems side by side.
> 
> General note. 
> Initially, when we created container-based OS virtualisation at
> SWsoft/Virtuozzo/Parallels (we called it a "virtual environment" in
> 2000), we assumed three main pillars (bearing in mind that we wanted
> to use it as a base for hosting in a hostile environment, with open,
> unlimited access from the Internet to the containers):
> 
> 1. namespace virtualisation, not only to isolate resources but to be
> able to have the same PID and related resources in different
> containers (for Unix, think at least of emulating an init process
> with the pre-defined PID 1)
> 
> 2. file system virtualisation to allow COW and transparent sharing of
> the same files (e.g. the Apache executable among hundreds of
> container instances) to preserve kernel memory and object space (as
> opposed to VMs, where you cannot efficiently share files and data
> structures between instances) - the key to the high scalability and
> performance of containers, and for Docker also the key to its
> "encapsulation of changes" paradigm. Sharing through a single kernel
> instance is a broad paradigm - it allows optimising kernel structure
> allocation, resource sharing, a single instance of the memory
> allocator, etc.
> 
> 3. ALL resource limits on a per-container basis (we called these
> "user beancounters"), which prevent any attempt to mount a DoS attack
> from one container against another or against the host (see the toy
> sketch below).
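> Roughly, this accounting idea is as simple as the following (a toy Go
> sketch, all names hypothetical):
> 
>     package ubc
> 
>     import (
>         "fmt"
>         "sync"
>     )
> 
>     type Container struct {
>         name        string
>         mu          sync.Mutex
>         used, limit map[string]int64
>     }
> 
>     // Charge debits an allocation against the owning container's
>     // limit, so one container cannot starve another or the host.
>     func (c *Container) Charge(resource string, amount int64) error {
>         c.mu.Lock()
>         defer c.mu.Unlock()
>         if c.used[resource]+amount > c.limit[resource] {
>             return fmt.Errorf("%s: %s limit exceeded", c.name, resource)
>         }
>         c.used[resource] += amount
>         return nil
>     }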
> 
> Every container should initially be like a remotely accessible,
> complete instance of Linux with root access and an init process, but
> without the ability to load its own device drivers. We implemented
> this first for Linux, later for FreeBSD/Solaris (partially) and
> Windows (based on a hot-patching technique and their Terminal
> Server), and considered Mach/macOS Darwin (just experiments). For
> Linux and Windows it was a commercial-grade implementation that is
> still used by millions of customers.
> 
> Now all these features (maybe except the file systems, though
> zfs/btrfs/overlayfs/etc. offer something similar) have become part of
> most commercial OSes on the mass market. IMHO they could have been
> implemented cheaply from the very beginning of OS kernel development
> - everything was in place, except the understanding of why this is
> necessary outside of simple testing environments for developers.
> Maybe it is also time for Genode to think in this direction?
> 
> Returning to Genode.
> 
> The reason for the existence of namespaces (ns) is not only
> isolation; it is a bit broader. One aspect is the ability to disallow
> the "manual construction" of object
> IDs/handles/capabilities/references/etc. to access something that
> should not be visible at all.
> 
> For example, in an ns-isolated container I should not be able to
> send a signal to an arbitrary process by name (PID in our case), even
> if it exists in the kernel. Or, vice versa, to use certain
> pre-defined process IDs to do something (e.g. Unix likes to attach
> orphans to PID 1 and later tries to enumerate them; this has to be
> emulated somehow when porting user-level software - for Linux Docker
> this is important).
> 
> In the case of Genode, I can probably create and keep a capability
> (with a data pointer inside) and perform operations with it, provided
> I store it somewhere. If this capability were virtualised, we would
> have an additional level of control over it (by creating pre-defined
> caps and explicit limits, even if it is intentionally shared during
> an initialisation procedure that could be part of legacy software
> being ported to Genode).
> 
> For better understanding, a use case: imagine that you want to port
> an application that uses a third-party library that initialises some
> exotic file descriptors. A good example is Docker itself - when you
> exec a process inside a Docker container, you typically do not want
> it to inherit the descriptors opened by your main process, including
> stdin/stdout (typically this is achieved via the CLOEXEC flag - but
> consider its absence; you may simply not know the descriptors exist).
> Technically this is code in your application, yet it was initialised
> by a third-party library linked to it, and you have no easy way to
> control it.
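> (For illustration, in Go the difference is a single flag at open
> time - a tiny fragment, needs the "syscall" import:)
> 
>     // With O_CLOEXEC the descriptor is closed automatically on exec;
>     // without it, a descriptor opened deep inside a third-party
>     // library silently leaks into every child process.
>     fd, err := syscall.Open("/tmp/private",
>         syscall.O_RDWR|syscall.O_CLOEXEC, 0600)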
> 
> An ns implementation has simple rules to maintain group isolation,
> and it is not considered unnecessary even in the Linux kernel with
> its own capability set. I think that namespaces are a convenient way
> to handle legacy-related questions, and they would be worth having at
> the Genode level, where you already have wrappers around the native
> low-level kernel calls.
> 
> And for snapshotting (see the comment below) this is a must - I need
> to re-create all objects with the same IDs, even if objects with
> those IDs already exist in other threads/sessions/processes, because
> the IDs could be stored in "numerical form" inside user thread
> memory.
> 
> As for file systems like overlayfs - I am not sure. I assume it is
> possible to port some known FS to Genode, but it is not a
> first-priority task (Windows Docker does not have one).
> 
> As for resource counting and limits - I have not tackled this topic
> at all for Genode.
> 
> > 
> > The snapshotting/persistency topic is not yet covered. But I see a
> > rather clear path towards it, at least for applications based on
> > Genode's libc. In fact, the libc already has the ability to
> > replicate the state of its application as part of the fork
> > mechanism. Right now, this mechanism is only used internally. But
> > it could be taken as the basis for, e.g., serializing the
> > application state into snapshot file. Vice versa, similar to how a
> > forked process obtains its state from the forking process, the libc
> > could support the ability to import a snapshot file at startup. All
> > this can be implemented in the libc without changing Genode's base
> > framework.
> > 
> > That being said, there is an elephant in the room, namely how POSIX
> > threads fit into the picture. How can the state of a multi-threaded
> > application be serialized in a consistent way? That would be an
> > interesting topic to research.
> 
> I think we can follow the ideas developed for the CRIU patch for
> Linux [1]; there is no need to invent something overly complex. It
> can freeze a running container (or an individual application) and
> checkpoint its state to disk. The saved data can be used to restore
> the application and run it exactly as it was at the time of the
> freeze. Using this functionality, application or container live
> migration, snapshots, remote debugging, and many other things become
> possible.
> 
> In short, they utilise existing Linux kernel syscalls like ptrace and
> add a very small set of missing ones to enumerate process-related
> objects [2]. This does not mean that you need to have ptrace - it is
> just used as a kind of auxiliary interface to obtain information
> about processes; it could be implemented in different ways.
> 
> To stop (freeze) a set of related processes (a tree), even with
> POSIX, they use a feature (which can be considered part of ns
> virtualisation) known as cgroups [3]: The freezer allows the
> checkpoint code to obtain a consistent image of the tasks by
> attempting to force the tasks in a cgroup into a quiescent state.
> Once the tasks are quiescent, another task can walk /proc or invoke a
> kernel interface to gather information about the quiesced tasks.
> Checkpointed tasks can be restarted later, should a recoverable error
> occur. This also allows the checkpointed tasks to be migrated between
> nodes in a cluster by copying the gathered information to another
> node and restarting the tasks there.
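> (For illustration, driving the v1 freezer is just a write to the
> cgroup filesystem - a Go fragment, assuming a group named "ct1" and
> the "os" import:)
> 
>     // Writing FROZEN quiesces every task in the group; writing
>     // THAWED resumes them after the dump is taken.
>     err := os.WriteFile("/sys/fs/cgroup/freezer/ct1/freezer.state",
>         []byte("FROZEN"), 0644)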
> 
> It seems that features similar to ns and cgroups should be the first
> part/base of a checkpoint/restore implementation. Of course, the
> serialisation part related to fork/libc that you mention could also
> be another pillar.
> 
> In general, I think that to implement snapshotting we need to
> 1. freeze the set of threads (or make them COW, e.g. for memory
>    changes)
> 2. enumerate the threads
> 3. enumerate related objects/states (e.g. file descriptors/pipes/etc.)
> 4. enumerate virtual memory areas and the related "shared resources"
>    between threads
> 5. enumerate network stack/socket states (a slightly different beast)
> 6. dump everything
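> As a rough sketch, the dump of a single process would need to capture
> something like this (hypothetical Go types, for illustration only):
> 
>     // One snapshot record per process; restore walks these records
>     // and re-creates every object with its original ID and state.
>     type ProcessImage struct {
>         Threads  []ThreadState // registers and stack per thread
>         Files    []FdState     // descriptor number, path, offset
>         Mappings []VmaState    // start, length, protection, backing
>         Sockets  []SockState   // protocol state plus buffered data
>     }
> 
>     type ThreadState struct{ ID int; PC, SP uint64 }
>     type FdState struct{ Fd int; Path string; Offset int64; Flags int }
>     type VmaState struct{ Start, Len uint64; Prot int; Backing string }
>     type SockState struct{ Proto string; State int; RxBuf, TxBuf []byte }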
> 
> For restore, we not only need to create objects with the same
> numerical ID values (and even the same memory layout); we also need
> an API to force every object into the same content/state with the
> related security/ns settings, and to force (restore) the "sharing"
> and parent/child relations of objects, if any, between different
> threads/processes/sessions/etc.
> 
> Related to this topic: we also need to be able to bring some drivers
> to the same state, because during checkpoint/restore we assume that
> external connections are in a known state (e.g. imagine a video
> driver and an application that draws on the screen: the content of
> the video memory is part of the application's state, while it is not
> stored in the application). This is probably related to the
> restartable-drivers feature (and the related fault-tolerance
> questions).
> 
> Note: by the way, one of the key problems of the CRIU patch at the
> moment is the inability to restore the graphical screen for X/etc. We
> can restore the related sockets, but the protocol exchanges that
> would need to be replayed are not known when you order the
> checkpoint. I think there is nobody available now who knows the X
> protocol details necessary for these operations... but that is a
> different story, not directly related to the Genode questions.
> 
> [1] https://criu.org/Main_Page
> [2] https://criu.org/Checkpoint/Restore
> [3]
> https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt
> 
> Sincerely,
> 	Alexander



