Restoring child with checkpointed state

Stefan Kalkowski stefan.kalkowski at ...1...
Wed Mar 29 14:05:22 CEST 2017


Hello Denis,

On 03/27/2017 04:14 PM, Denis Huber wrote:
> Dear Genode community,
> 
> Preliminary: We implemented a Checkpoint/Restore mechanism on the basis of 
> Genode/Fiasco.OC (Thanks to the great help of you all). We store the 
> state of the target component by monitoring its RPC function calls which 
> go through the parent component (= our Checkpoint/Restore component). 
> The capability space is indirectly checkpointed through the capability map.
> The restoring of the state of the target is done by restoring the RPC 
> objects used by the target component (e.g. PD session, dataspaces, 
> region maps, etc.). The capabilities of the restored objects have to be 
> also restored in the capability space (kernel) and in the capability map 
> (userspace).
> 
> For restoring the target component Norman suggested the usage of the 
> Genode::Child constructor with an invalid ROM dataspace capability which 
> does not trigger the bootstrap mechanism. Thus, we have the full control 
> of inserting the capabilities of the restored RPC objects into the 
> capability space/map.
> 
> Our problem is the following: We restore the RPC objects and insert them 
> into the capability map and then in the capability space. From the 
> kernel point of view these capabilities are all "IPC Gates". 
> Unfortunately, there was also an IRQ kernel object created by the 
> bootstrap mechanism. The following table shows the kernel debugger 
> output of the capability space of the freshly bootstrapped target component:
> 
> 000204 :0016e* Gate   0015f* Gate   00158* Gate   00152* Gate
> 000208 :00154* Gate   0017e* Gate   0017f* Gate   00179* Gate
> 00020c :00180* Gate   00188* Gate          --            --
> 000210 :       --            --     0018a* Gate   0018c* Gate
> 000214 :0018e* Gate   00196* Gate   00145* Gate   00144* IRQ
> 000218 :00198* Gate          --            --            --
> 00021c :       --     0019c* Gate          --            --
> 
> At address 000217 you can see the IRQ kernel object. What does this 
> object do, how can we store/monitor it, and how can it be restored? 
> Where can we find the source code which creates this object in Genode's 
> bootstrap code?
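
For reading the table above: the following is only a minimal sketch, assuming
that each row label is the hexadecimal index of the row's first slot and that
the four columns are consecutive slots, each printed either as "--" (empty) or
as "<id> <type>". Under that assumption the IRQ entry in row 000214 sits in the
fourth column, i.e., at index 0x214 + 3 = 0x217. The names 'Slot' and
'parse_row' are hypothetical helpers, not part of Genode or the kernel
debugger.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// One occupied slot found in a dump row (hypothetical helper type).
struct Slot {
    std::uint32_t index;   // capability-space index of the slot
    std::string   object;  // kernel object id as printed, e.g. "00144*"
    std::string   type;    // object type as printed, e.g. "Gate" or "IRQ"
};

// Parse one dump row such as
//   "000214 :0018e* Gate   00196* Gate   00145* Gate   00144* IRQ"
// Assumption: the label before ':' is the hex index of the first column.
std::vector<Slot> parse_row(const std::string &row)
{
    std::vector<Slot> slots;
    const auto colon = row.find(':');
    if (colon == std::string::npos)
        return slots;

    const auto base = static_cast<std::uint32_t>(
        std::stoul(row.substr(0, colon), nullptr, 16));

    std::istringstream in(row.substr(colon + 1));
    std::string tok;
    for (std::uint32_t col = 0; in >> tok; ++col) {
        if (tok == "--")          // empty slot, occupies one column
            continue;
        std::string type;
        if (!(in >> type))        // malformed row: id without type
            break;
        slots.push_back({base + col, tok, type});
    }
    return slots;
}
```

Filtering the parsed slots for every type other than "Gate" would list the
entries that are not covered by restoring RPC objects alone.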

The IRQ kernel object you refer to is used by the "signal_handler"
thread to block for signals of core's corresponding service. It is a
base-foc specific internal core RPC object[1] that is used by the signal
handler[2] and the related capability gets returned by the call to
'alloc_signal_source()' provided by the PD session[3].
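
To illustrate what "block for signals" means here, the following is a toy
model in plain C++ that uses no Genode or Fiasco.OC API at all. The class
'Signal_source_model' and its methods are hypothetical names; the sketch only
assumes that the IRQ object behaves like a counting semaphore on which the
signal-handler thread waits and which the other side triggers.

```cpp
#include <condition_variable>
#include <mutex>

// Toy model of a semaphore-like signal source (hypothetical, not Genode code).
class Signal_source_model
{
    std::mutex              _mutex;
    std::condition_variable _cond;
    unsigned                _pending = 0;  // triggered but not yet consumed

public:
    // Side that delivers a signal: record it and wake one waiter.
    void submit()
    {
        {
            std::lock_guard<std::mutex> guard(_mutex);
            ++_pending;
        }
        _cond.notify_one();
    }

    // Signal-handler side: block until a signal is pending, consume it,
    // and return how many signals are still pending afterwards.
    unsigned wait_for_signal()
    {
        std::unique_lock<std::mutex> lock(_mutex);
        _cond.wait(lock, [this] { return _pending > 0; });
        return --_pending;
    }

    // Number of signals currently pending.
    unsigned pending()
    {
        std::lock_guard<std::mutex> guard(_mutex);
        return _pending;
    }
};
```

In this model, the pending count and the fact that a thread is currently
blocked in 'wait_for_signal()' are state that lives outside the waiting
thread's own registers and memory, so a checkpoint would have to account for
it separately.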

I have to admit, I did not follow your current implementation approach
in depth. Therefore, I do not know exactly how to handle this specific
signal handler thread and its semaphore-like IRQ object, but maybe the
references already help you further.

Regards
Stefan

[1] repos/base-foc/src/core/signal_source_component.cc
[2] repos/base-foc/src/lib/base/signal_source_client.cc
[3] repos/base/src/core/include/pd_session_component.h
> 
> 
> Best regards,
> Denis
> 
> On 11.12.2016 13:01, Denis Huber wrote:
>> Hello Norman,
>>
>>> What you observe here is the ELF loading of the child's binary. As part
>>> of the 'Child' object, the so-called '_process' member is constructed.
>>> You can find the corresponding code at
>>> 'base/src/lib/base/child_process.cc'. The code parses the ELF executable
>>> and loads the program segments, specifically the read-only text segment
>>> and the read-writable data/bss segment. For the latter, a RAM dataspace
>>> is allocated and filled with the content of the ELF binary's data. In
>>> your case, when resuming, this procedure is wrong. After all, you want
>>> to supply the checkpointed data to the new child, not the initial data
>>> provided by the ELF binary.
>>>
>>> Fortunately, I encountered the same problem when implementing fork for
>>> noux. I solved it by letting the 'Child_process' constructor accept an
>>> invalid dataspace capability as ELF argument. This has two effects:
>>> First, the ELF loading is skipped (obviously - there is no ELF to load).
>>> And second the creation of the initial thread is skipped as well.
>>>
>>> In short, by supplying an invalid dataspace capability as binary for the
>>> new child, you avoid all those unwanted operations. The new child will
>>> not start at 'Component::construct'. You will have to manually create
>>> and start the threads of the new child via the PD and CPU session
>>> interfaces.
>>
>> Thank you for the hint. I will try out your approach.
>>
>>> The approach looks good. I presume that you encounter base-foc-specific
>>> peculiarities of the thread-creation procedure. I would try to follow
>>> the code in 'base-foc/src/core/platform_thread.cc' to see what the
>>> interaction of core with the kernel looks like. The order of operations
>>> might be important.
>>>
>>> One remaining problem may be that - even though you may be able to
>>> restore most of the thread state - the kernel-internal state cannot
>>> be captured. E.g., think of a thread that was blocking in the kernel via
>>> 'l4_ipc_reply_and_wait' when checkpointed. When resumed, the new thread
>>> can naturally not be in this blocking state because the kernel's state
>>> is not part of the checkpointed state. The new thread would possibly
>>> start its execution at the instruction pointer of the syscall and issue
>>> the system call again, but I am not sure what really happens in practice.
>>
>> Is there a way to avoid this situation? Can I postpone the checkpoint by
>> letting the entrypoint thread finish the intercepted RPC function call,
>> then increment the ip of child's thread to the next command?
>>
>>> I think that you don't need the LOG-session quirk if you follow my
>>> suggestion to skip the ELF loading for the restored component
>>> altogether. Could you give it a try?
>>
>> You are right, the LOG-session quirk seems a bit clumsy. I like your
>> idea of skipping the ELF loading and automated creation of CPU threads
>> more, because it gives me the control to create and start the threads
>> from the stored ip and sp.
>>
>>
>> Best regards,
>> Denis
>>
> 

-- 
Stefan Kalkowski
Genode Labs

https://github.com/skalk · http://genode.org/



