Hello Genodians
While searching for a solution to the problem of misbehaving components, I stumbled upon `test-fault_detection`, which gave me the idea to let init monitor the PD/CPU sessions of its children and report the error.
The idea is that a monitor component can then decide what to do about the failure.
I have drafted what I envision in [1]. It is not yet complete, as I'm not sure how to safely remove the signal handlers when monitoring is disabled by a config change.
If you think this feature would be useful, I'd create an issue where the implementation details could be discussed.
On the other hand, maybe someone has another idea of how similar functionality can be achieved.
[1] https://github.com/trimpim/genode/tree/sandbox-fault_detection
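To make the split between "init reports" and "a monitor decides" concrete, the following is a minimal sketch of the deciding side in plain C++. It does not use any Genode API and is not taken from the branch in [1]. The names `Fault`, `Action`, `Child_state`, and `decide`, as well as the restart-limit policy itself, are assumptions chosen only for illustration.

```cpp
#include <string>

// Hypothetical fault kinds, loosely following the PD/CPU split named above.
enum class Fault { NONE, PD_FAULT, CPU_EXCEPTION };

// Hypothetical reactions a monitor component could choose from.
enum class Action { IGNORE, RESTART, STOP };

// Hypothetical per-child information, as a monitor might read it from a report.
struct Child_state
{
    std::string name;      // name of the child as started by init
    Fault       fault;     // most recently reported fault, if any
    unsigned    restarts;  // how often the child was already restarted
};

// Example policy (an assumption, not part of the proposal): restart a faulted
// child until a limit is reached, then give up and stop it.
Action decide(Child_state const &child, unsigned max_restarts)
{
    if (child.fault == Fault::NONE)
        return Action::IGNORE;

    if (child.restarts < max_restarts)
        return Action::RESTART;

    return Action::STOP;
}
```

Keeping the policy in a separate function like this mirrors the idea that init only reports the failure, while the reaction stays with whoever consumes the report.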
Hello Pirmin,
On Tue, Jul 05, 2022 at 15:54:19 CEST, Duss Pirmin wrote:
> While searching for a solution to the problem of misbehaving components, I stumbled upon `test-fault_detection`, which gave me the idea to let init monitor the PD/CPU sessions of its children and report the error.
> The idea is that a monitor component can then decide what to do about the failure.
This is an interesting idea indeed, but I'm not quite sure whether Init is the right component for monitoring such errors. Currently, Init already monitors component liveness via the "heartbeat" feature. Your current implementation augments this with info about CPU exceptions, but any further details about the exceptions, or even which thread faulted, are still missing. I'd expect developers to instantly demand more information from the monitoring feature, which would bloat Init more and more.
> I have drafted what I envision in [1]. It is not yet complete, as I'm not sure how to safely remove the signal handlers when monitoring is disabled by a config change.
> If you think this feature would be useful, I'd create an issue where the implementation details could be discussed.
> On the other hand, maybe someone has another idea of how similar functionality can be achieved.
Please go ahead and open the issue, and maybe leave open whether Init (while a natural first attempt) is ultimately the right place to implement the feature. In past offline discussions, we identified POSIX components as the most valuable target for monitoring. We envisioned extending the LibC runtime with a local exception handler for the component, which could optionally reveal more details about the fault and even provide a stack backtrace. Back then, I always thought about logging the information; after your proposal, I'm convinced that using a Report session is much more appropriate.
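As a rough illustration of why a report suits this better than a log line, here is a minimal sketch in plain C++ that renders fault details, including a backtrace, as structured XML that a monitor could parse. It uses no Genode or libc-runtime API. The `Fault_info` type, the `fault_report_xml` function, and the node and attribute names are hypothetical and do not reflect an existing report format.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical fault details a local exception handler might collect.
struct Fault_info
{
    std::string                 component;  // label of the faulting component
    std::string                 thread;     // name of the faulting thread
    std::string                 type;       // e.g. "page-fault" (assumed naming)
    std::uintptr_t              ip;         // instruction pointer at the fault
    std::vector<std::uintptr_t> backtrace;  // return addresses, innermost first
};

// Render the fault as XML. Attribute values are not escaped here, so the
// sketch assumes names free of XML special characters.
std::string fault_report_xml(Fault_info const &f)
{
    std::ostringstream out;
    out << std::hex;  // print all addresses in hexadecimal

    out << "<fault component=\"" << f.component
        << "\" thread=\""        << f.thread
        << "\" type=\""          << f.type
        << "\" ip=\"0x"          << f.ip << "\">\n";

    for (std::uintptr_t addr : f.backtrace)
        out << "  <frame ip=\"0x" << addr << "\"/>\n";

    out << "</fault>\n";
    return out.str();
}
```

Unlike free-form log output, a structure like this can grow new attributes without breaking consumers that only look at the ones they know.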
Regards