Restoring child with checkpointed state

Sun Dec 11 13:01:07 CET 2016

Hello Norman,

> What you observe here is the ELF loading of the child's binary. As part
> of the 'Child' object, the so-called '_process' member is constructed.
> You can find the corresponding code at
> 'base/src/lib/base/child_process.cc'. The code parses the ELF executable
> and loads the program segments, specifically the read-only text segment
> and the read-writable data/bss segment. For the latter, a RAM dataspace
> is allocated and filled with the content of the ELF binary's data. In
> your case, when resuming, this procedure is wrong. After all, you want
> to supply the checkpointed data to the new child, not the initial data
> provided by the ELF binary.
>
> Fortunately, I encountered the same problem when implementing fork for
> noux. I solved it by letting the 'Child_process' constructor accept an
> invalid dataspace capability as ELF argument. This has two effects:
> First, the ELF loading is skipped (obviously - there is no ELF to load).
> And second, the creation of the initial thread is skipped as well.
>
> In short, by supplying an invalid dataspace capability as binary for the
> new child, you avoid all those unwanted operations. The new child will
> not start at 'Component::construct'. You will have to manually create
> and start the threads of the new child via the PD and CPU session
> interfaces.

Thank you for the hint. I will try out your approach.

> The approach looks good. I presume that you encounter base-foc-specific
> peculiarities of the thread-creation procedure. I would try to follow
> the code in 'base-foc/src/core/platform_thread.cc' to see what the
> interaction of core with the kernel looks like. The order of operations
> might be important.
>
> One remaining problem may be that - even though you may be able to
> restore most parts of the thread state - the kernel-internal state cannot
> be captured. E.g., think of a thread that was blocking in the kernel via
> 'l4_ipc_reply_and_wait' when checkpointed. When resumed, the new thread
> can naturally not be in this blocking state because the kernel's state
> is not part of the checkpointed state. The new thread would possibly
> start its execution at the instruction pointer of the syscall and issue
> the system call again, but I am not sure what really happens in practice.

Is there a way to avoid this situation? Could I postpone the checkpoint
until the entrypoint thread has finished the intercepted RPC function
call, and then advance the ip of the child's thread to the next
instruction?
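
To make the question more concrete, the postponing I have in mind could
be modelled roughly as below. This is only a standalone sketch: the type
'Deferred_checkpoint' and its members are hypothetical names made up for
illustration, and it uses neither the Genode API nor any kernel
interface. It only shows the ordering (a request that arrives during an
RPC is served after the RPC function has returned) and does not cover
the adjustment of the ip.

```cpp
#include <cassert>

// Hypothetical model, not a Genode type: a checkpoint request that
// arrives while an RPC function is running is deferred until the
// RPC function has returned.
struct Deferred_checkpoint
{
    bool     requested = false;  // a checkpoint is pending
    bool     in_rpc    = false;  // an RPC function is currently running
    unsigned taken     = 0;      // number of checkpoints taken so far

    // Ask for a checkpoint; taken immediately only if no RPC is running.
    void request()
    {
        requested = true;
        if (!in_rpc)
            take();
    }

    // Run one RPC function to completion, then serve a pending request.
    template <typename FUNC>
    void dispatch(FUNC &&rpc)
    {
        in_rpc = true;
        rpc();
        in_rpc = false;
        if (requested)
            take();
    }

  private:

    // Placeholder for the real checkpoint logic.
    void take()
    {
        ++taken;
        requested = false;
    }
};
```

With this ordering, the checkpoint would never observe a half-finished
RPC function. Whether this is enough to avoid the blocking-syscall
problem you describe is exactly what I am unsure about.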

> I think that you don't need the LOG-session quirk if you follow my
> suggestion to skip the ELF loading for the restored component
> altogether. Could you give it a try?

You are right, the LOG-session quirk seems a bit clumsy. I prefer your
idea of skipping the ELF loading and the automatic thread creation,
because it gives me control over creating and starting the threads
from the stored ip and sp.
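
As a rough sketch of what I mean by starting the threads from the stored
ip and sp: the code below is a standalone model, not Genode code.
'Stored_thread', 'Mock_cpu_session' and 'restore_threads' are
hypothetical names, and the mock merely records what would be started
instead of talking to a real CPU session.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical checkpoint record for one thread; not a Genode type.
struct Stored_thread
{
    std::string    name;
    std::uintptr_t ip;  // instruction pointer at checkpoint time
    std::uintptr_t sp;  // stack pointer at checkpoint time
};

// Hypothetical stand-in for the CPU-session role. It only records
// which threads would be created and where they would start.
struct Mock_cpu_session
{
    struct Started { std::string name; std::uintptr_t ip, sp; };

    std::vector<Started> started;

    void create_and_start(std::string const &name,
                          std::uintptr_t ip, std::uintptr_t sp)
    {
        started.push_back({name, ip, sp});
    }
};

// Recreate every checkpointed thread at its stored ip and sp and
// return the number of threads that were started.
inline std::size_t restore_threads(Mock_cpu_session &cpu,
                                   std::vector<Stored_thread> const &threads)
{
    for (auto const &t : threads) {
        // A zero ip or sp would indicate a broken checkpoint record.
        assert(t.ip != 0 && t.sp != 0);
        cpu.create_and_start(t.name, t.ip, t.sp);
    }
    return threads.size();
}
```

In the real implementation, the mock would be replaced by the PD and CPU
session interfaces you mentioned, and the order of operations on
base-foc may well matter.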

Best regards,
Denis