Dear Genodians,
I noticed a behavior that I would like to understand. Let's assume we have the following parent-child relationship of a sub-system, omitting some ancestors for simplicity:
```
.
└── init
    ├── fs
    │   ├── init
    │   │   └── rump
    │   ├── fs_mgr
    │   └── report_rom
    ├── fs_client1
    └── fs_client2
```
In this example, the config for the first `init` is written via the depot query/deploy mechanism. Thus, `fs`, `fs_client1`, and `fs_client2` are pkg archives. The `runtime` file of the pkg archive `fs` contains a management component `fs_mgr` that generates a config for the nested `init`, which in turn starts a `vfs` server with a `rump` plugin. For context, we use `fs_mgr` to start `fsck` or `mkfs` depending on its configuration, but that's not relevant right now.
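To make this more concrete, the config generated by `fs_mgr` looks roughly like the following sketch (the file-system type, quotas, and policy are placeholders here, not our actual values):
```
<config>
  <parent-provides>
    <service name="ROM"/>
    <service name="PD"/>
    <service name="CPU"/>
    <service name="LOG"/>
    <service name="Timer"/>
    <service name="Block"/>
  </parent-provides>

  <!-- vfs server that provides a File_system session backed by the rump plugin -->
  <start name="rump" caps="200">
    <binary name="vfs"/>
    <resource name="RAM" quantum="16M"/>
    <provides> <service name="File_system"/> </provides>
    <config>
      <vfs> <rump fs="ext2fs" ram="8M"/> </vfs>
      <default-policy root="/" writeable="yes"/>
    </config>
    <route> <any-service> <parent/> </any-service> </route>
  </start>
</config>
```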
Let's also assume the routing is valid so that `fs_client1` and `fs_client2` are up and running, and they both consume the `File_system` session provided by `rump`.
Now to my observation:
When I restart `fs` by incrementing its version attribute in the deploy file, `fs`, `fs_client1`, and `fs_client2` all get restarted. That is what I would expect.
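For illustration, the relevant snippet of the deploy config looks roughly like this (the pkg path is just a placeholder):
```
<!-- before -->
<start name="fs" pkg="our_depot/pkg/fs/2023-01-20"/>

<!-- after: incrementing the version attribute prompts init to restart 'fs' -->
<start name="fs" pkg="our_depot/pkg/fs/2023-01-20" version="2"/>
```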
However, the behavior is different when I instrument the `fs_mgr` to restart `rump`. Most of the time, both `fs_client1` and `fs_client2` don't seem to notice.
Is this behavior by design? Is `init` designed to behave differently regarding client/server dependency when siblings get restarted vs. when a niece is restarted?
Is there a way to instrument `fs_mgr` to write an `init` config, so that `fs_client1` and `fs_client2` get restarted?
Granted, I observed this in a more complex scenario. I once had the following errors when restarting `rump` on `arm_v8a/base-hw`:
```
Kernel: init -> init -> fs_client1 -> ep: cannot send to unknown recipient
...
Kernel: Cpu 0 error: re-entered lock. Kernel exception?!
```
Would it make sense for me to create a simplified scenario of it to dig into this behavior further?
Kind regards, Sid
Hello Sid,
On Wed, Jan 25, 2023 at 16:23:34 CET, Sid Hussmann wrote:
```
└── init
    ├── fs
    │   ├── init
    │   │   └── rump
    │   ├── fs_mgr
    │   └── report_rom
    ├── fs_client1
    └── fs_client2
```
Now to my observation:

When I restart `fs` by incrementing its version attribute in the deploy file, `fs`, `fs_client1`, and `fs_client2` all get restarted. That is what I would expect.

However, the behavior is different when I instrument the `fs_mgr` to restart `rump`. Most of the time, both `fs_client1` and `fs_client2` don't seem to notice.

Is this behavior by design? Is `init` designed to behave differently regarding client/server dependency when siblings get restarted vs. when a niece is restarted?
The observed behavior is as intended. "init" is ruled by its configuration only. So, if a version update of a component's start node or the change of a routing policy directs a restart, it happens. On the other hand, "init" is not ruled by its children; in fact, it doesn't even care about grand-children restarting, as in your example. This design adheres to the principle of least surprise/astonishment and supports the expectation of a developer/integrator that init children are governed by the given configuration only and never magically disappear or do other funny stuff.
Is there a way to instrument `fs_mgr` to write an `init` config, so that `fs_client1` and `fs_client2` get restarted?
The explanation above reflects the strictness of "init" as a tool, but does not mean your system design cannot utilize it for your desired purpose. Let us just rephrase
instrument `fs_mgr` to write an `init` config
in your sentence above to
implement fs_mgr to report a state
In this light, your domain logic (implemented in a component beside "init") is then in a position to incorporate the fs_mgr report in its decision to change the init configuration and restart fs_client1/2 appropriately. Beyond dispute, this component is quite powerful and also affected by information originating in descendants of the controlled "init". But your design requires a component in charge of this purpose, and Genode enables you to implement it with minimal complexity by just monitoring some ROMs and reporting an updated init/deploy config. In a way, the Sculpt manager is such a component too, but it implements the behavior we desired for Sculpt.
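As a rough sketch of the idea (all element and attribute names below are made up for illustration, not an existing format): fs_mgr could publish a small state report, and the managing component, subscribed to it via report_rom, could react to a new generation by re-generating the deploy config with bumped version attributes for the clients.
```
<!-- hypothetical state report published by fs_mgr -->
<fs_state generation="42" state="ready"/>

<!-- deploy snippet re-generated by the manager in reaction to generation 42
     (pkg paths omitted) -->
<start name="fs_client1" pkg="..." version="42"/>
<start name="fs_client2" pkg="..." version="42"/>
```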
Granted, I observed this in a more complex scenario. I once had the following errors when restarting `rump` on `arm_v8a/base-hw`:
```
Kernel: init -> init -> fs_client1 -> ep: cannot send to unknown recipient
...
Kernel: Cpu 0 error: re-entered lock. Kernel exception?!
```
Would it make sense for me to create a simplified scenario of it to dig into this behavior further?
I'd like to ask you to repeat your tests with our current master branch as we addressed some issues in base-hw that could be related to this.
Best regards
Hi Christian
On 1/27/23 13:22, Christian Helmuth wrote:
Is this behavior by design? Is `init` designed to behave differently regarding client/server dependency when siblings get restarted vs. when a niece is restarted?
The observed behavior is as intended. "init" is ruled by its configuration only. So, if a version update of a component's start node or the change of a routing policy directs a restart, it happens. On the other hand, "init" is not ruled by its children; in fact, it doesn't even care about grand-children restarting, as in your example. This design adheres to the principle of least surprise/astonishment and supports the expectation of a developer/integrator that init children are governed by the given configuration only and never magically disappear or do other funny stuff.
Thank you for taking the time to dive into this. After reading your answer, I notice that I did not state my motivation precisely enough. Here are my needs: I would like the scenario to recover if a component such as `init -> fs -> init -> rump` is restarted due to a malfunction or an external condition such as a factory-reset flag provided to `init -> fs -> fs_mgr` via a ROM session. In the case of the factory-reset flag, `fs_mgr` would first run `mkfs` and then start `rump` again.
One way to get back to the desired state is that `fs_client1` and `fs_client2` also get restarted. We covered that.
Another approach would be that the two clients can handle the "outage" of the `File_system` session. I know this is generally implementation-specific. For example, when consuming a ROM session, a client can catch the exception and try to re-establish the session. The two clients use the `vfs` plugin mechanism to access the `File_system` session provided by `rump`. During a restart of `rump`, the clients sometimes page-fault and sometimes "live on" while still responding to heartbeats but otherwise seem to malfunction. Is the `vfs` plugin mechanism designed to handle these outages?
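For reference, the clients access the session via the vfs `<fs>` plugin, roughly like this (the label and mount point are placeholders, not our actual config):
```
<config>
  <vfs>
    <!-- mount the File_system session provided by rump under /data -->
    <dir name="data"> <fs label="rump"/> </dir>
  </vfs>
</config>
```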
Is there a way to instrument `fs_mgr` to write an `init` config, so that `fs_client1` and `fs_client2` get restarted?
The explanation above reflects the strictness of "init" as a tool, but does not mean your system design cannot utilize it for your desired purpose. Let us just rephrase
instrument `fs_mgr` to write an `init` config
in your sentence above to
implement fs_mgr to report a state
In this light, your domain logic (implemented in a component beside "init") is then in a position to incorporate the fs_mgr report in its decision to change the init configuration and restart fs_client1/2 appropriately. Beyond dispute, this component is quite powerful and also affected by information originating in descendants of the controlled "init". But your design requires a component in charge of this purpose, and Genode enables you to implement it with minimal complexity by just monitoring some ROMs and reporting an updated init/deploy config. In a way, the Sculpt manager is such a component too, but it implements the behavior we desired for Sculpt.
Thank you for your explanation! It is good to know how the Sculpt manager controls itself and the rest of the system. In my case of a factory reset, the solution is more robust when combined with a reboot.
I'd like to ask you to repeat your tests with our current master branch as we addressed some issues in base-hw that could be related to this.
Please see my answer to Martin.
Best regards
Cheers, Sid
Hi Sid,
There is a pretty recent commit series on genodelabs/master for the HW IPC mechanism that might be of interest to you:
1151706243 hw: rename functions of Ipc_node class signature
fd3c70ec5b hw: mark threads as dead in case of ipc violations
fc690f1c47 hw: re-work the ipc node's internal state machine
AFAIK, fc690f1c47 fixes at least two bugs with IPC on component exit. I hope that helps you.
Cheers, Martin
Hi Martin,
Thank you very much for the list of commits! As we are still dealing with a driver issue with the Genode 22.11 release [1], I cherry-picked these commits to our fork based on the 22.08 release.
I'm not sure how much value this has for you, given that my tests are based on 22.08 (with the commits you mentioned), but in case you are curious, here are my findings.
After running my scenario multiple times, the system does not behave the same way in each iteration. There are two different behaviors that I noticed:
1. The two clients crash, which for the overall system is good, as a heartbeat monitor can recover the system into the desired state again:
```
no RM attachment (READ pf_addr=0x100004 pf_ip=0x10e3e194 from pager_object: pd='init -> init -> fs_client1' thread='ep')
Warning: page fault, pager_object: pd='init -> init -> fs_client1' thread='ep' ip=0x10e3e194 fault-addr=0x100004 type=no-page
[init -> init] Error: A fault in the pd of child 'fs_client1' was detected
Kernel: IPC await request: bad state, will block
Warning: page fault, pager_object: pd='init -> init -> fs_client2' thread='pthread.0' ip=0x6f4b0 fault-addr=0x403befd0 type=no-page
[init -> init -> fs -> init] Error: Uncaught exception of type 'Genode::Id_space<Genode::Parent::Client>::Unknown_id'
[init -> init -> fs -> init] Warning: abort called - thread: ep
```
2. `rump` (short for a `vfs` server with the `rump` plugin) restarts while the rest of the system does not print any log messages. In this case, we cannot recover via the heartbeat monitor (see the config sketch below), as there is no change in the `init` state report. Further, the clients don't seem to function correctly; e.g., one of them is a TCP server that no longer responds to network traffic. Could it be that the `vfs` plugin somehow can't handle the interruption of the `File_system` session?
```
[init -> init -> fs -> init -> rump] rump: /genode: file system not clean; please fsck(8)
```
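For context, the heartbeat monitoring I refer to is init's standard mechanism, roughly configured like this (the rate and report delay are just example values):
```
<config>
  <!-- init queries heartbeat responses from its children once per second -->
  <heartbeat rate_ms="1000"/>

  <!-- publish the init state report; unresponsive children show up there
       with a skipped_heartbeats attribute, which our monitor reacts to -->
  <report delay_ms="500"/>

  <!-- ... start nodes of fs, fs_client1, and fs_client2 ... -->
</config>
```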
I'm not sure if this information is of value to you, especially since my scenario is based on Genode 22.08. I will test this again once we have the 22.11 or the 23.02 release integrated.
[1] https://lists.genode.org/pipermail/users/2023-January/008356.html
Cheers, Sid
Hi Sid,
Let me share the outcome of our offline discussion with the mailing list:
As far as I understand it, while Genode's init has the feature of restarting direct service clients when their session disappears, this doesn't apply in your scenario because the server is wrapped in an additional sub-init.
In such a case, you have to take care of restarting your clients manually. A client, AFAIK, deliberately doesn't consider the case that the outside world terminates its session. So, it seems natural to me that you run into unpredictable behavior if you don't have some kind of manager that kills the client before terminating its session.
I hope this is of help?
Cheers, Martin
Hello,
On Sun, Feb 05, 2023 at 13:42:06 CET, Martin Stein wrote:
Let me share the outcome of our offline discussion with the mailing list:
As far as I understand it, while Genode's init has the feature of restarting direct service clients when their session disappears, this doesn't apply in your scenario because the server is wrapped in an additional sub-init.
In such a case, you have to take care of restarting your clients manually. A client, AFAIK, deliberately doesn't consider the case that the outside world terminates its session. So, it seems natural to me that you run into unpredictable behavior if you don't have some kind of manager that kills the client before terminating its session.
Thanks for this wrap-up, Martin; it perfectly reflects my stance on the matter too.
Regarding the following question...
For example, when consuming a ROM session, a client can catch the exception and try to re-establish the session. The two clients use the `vfs` plugin mechanism to access the `File_system` session provided by `rump`. During a restart of `rump`, the clients sometimes page-fault and sometimes "live on" while still responding to heartbeats but otherwise seem to malfunction. Is the `vfs` plugin mechanism designed to handle these outages?
We refrained from implementing "probing" or "automatic retry" in many places where it is hard to nail down a sensible default policy. I expect the VFS plugin lacks your desired recovery feature for this reason.
Regards
Thank you, Martin and Christian, for the time and care you put into your explanations.
I now better understand how to design that part of our system in a deterministic and reliable way.
Kind regards, Sid