The Genode book suggests that an RPC caller can protect itself from blocking in a stalled server by creating a watchdog thread to monitor the progress of the call and cancel it if it takes too long.
Is there a robust/canonical example of using cancel_blocking in this way?
My experiment with this (Genode 17.08, x86_32) seems to work as expected--but, only with the OKL4 kernel!? With nova, hw, and seL4, the cancel_blocking() method executes but seemingly to no effect: the thread continues to wait on the (contrived) very slow RPC call, which eventually completes.
Suggestions?
// Steve Harp
Hi Steve,
On 12.10.2017 00:26, Steven Harp wrote:
> The Genode book suggests that an RPC caller can protect itself from blocking in a stalled server by creating a watchdog thread to monitor the progress of the call and cancel it if it takes too long.
> Is there a robust/canonical example of using cancel_blocking in this way?
I am afraid that the book misled you towards an outdated direction. The cancel-blocking mechanism was introduced very early at a time when we routinely designed inter-component interfaces that were blocking at the server side. At that time, L4 kernels did not support any means of asynchronous notifications, thereby luring us into this direction. Later, we realized this mistake and successively redesigned the interfaces [1] to use a combination of synchronous RPCs that immediately return and asynchronous notifications for blocking at the client side. We announced this transition in May last year [2] and finished it in May this year.
[1] https://genode.org/documentation/release-notes/13.02#Timer_interface_turned_into_asynchronous_mode_of_operation
[2] https://genode.org/documentation/release-notes/16.05#The_great_API_renovatio...
For modern components, the cancel-blocking mechanism is no longer used. We still keep it around to uphold compatibility but I hope to eventually remove it from the API in the not-too-distant future.
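To illustrate the pattern described above (a synchronous call that returns immediately, combined with an asynchronous notification that the client waits for on its own side), here is a minimal sketch in plain standard C++. It does not use the Genode API: 'Signal', 'Slow_server', and 'request_with_deadline' are hypothetical names, and a thread plus a condition variable merely stand in for the server and the kernel's notification mechanism. Because the waiting happens locally in the client, the sketch lets the client bound its own wait without cancelling anything inside the server.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a signal receiver owned by the client.
class Signal
{
	std::mutex              _mutex;
	std::condition_variable _cond;
	bool                    _pending = false;

public:

	// Called by the server side; never blocks on the client.
	void submit()
	{
		{ std::lock_guard<std::mutex> guard(_mutex); _pending = true; }
		_cond.notify_one();
	}

	// Client-side wait; returns false if no signal arrived within 'timeout'.
	bool wait_for(std::chrono::milliseconds timeout)
	{
		std::unique_lock<std::mutex> lock(_mutex);
		bool const got = _cond.wait_for(lock, timeout,
		                                [this] { return _pending; });
		_pending = false;
		return got;
	}
};

// Hypothetical server: the request returns at once, completion is signalled later.
class Slow_server
{
	std::thread _worker;

public:

	// The "RPC": starts the work and returns without waiting for it.
	// (In this sketch, a second request first waits for the previous worker.)
	void request(std::chrono::milliseconds work, Signal &done)
	{
		if (_worker.joinable()) _worker.join();
		_worker = std::thread([work, &done] {
			std::this_thread::sleep_for(work);
			done.submit();
		});
	}

	// Joining here only keeps the teardown of the sketch clean.
	~Slow_server() { if (_worker.joinable()) _worker.join(); }
};

// Client: issue the request, then block locally until signal or deadline.
bool request_with_deadline(std::chrono::milliseconds work,
                           std::chrono::milliseconds deadline)
{
	Signal      done;    // declared first, so it outlives the server's worker
	Slow_server server;
	server.request(work, done);
	return done.wait_for(deadline);
}
```

In this sketch, the client never hands its flow of control to the server for longer than the short 'request' call, so a stalled server shows up as a missing signal rather than as a blocked thread.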
> My experiment with this (Genode 17.08, x86_32) seems to work as expected--but, only with the OKL4 kernel!? With nova, hw, and seL4, the cancel_blocking() method executes but seemingly to no effect: the thread continues to wait on the (contrived) very slow RPC call, which eventually completes.
> Suggestions?
When a client calls a server, it ultimately yields the flow of control to the server until the server replies. Because a misbehaving server may never reply, e.g., because of a bug, the client could get stuck at that point. There is no countermeasure for this situation. We found that intuitively tempting countermeasures like IPC timeouts or the cancel-blocking mechanism are bug-prone and lead to nondeterministic system behavior. A client unconditionally expects that the server replies to an RPC request. From a client's perspective, a server called via RPC is similar to a regular third-party library. When calling a library function, one can never be sure that the function will eventually return. It could get stuck in the library. Therefore, we recommend as best practice implementing complex (bug-prone) software as mere clients, not servers. Please consider Section 3.2.4. "Client-server relationship" of the book for a succinct characterization of the client and server roles within Genode.
The canonical example of this best practice is the window manager, which is a composition of the low-complexity 'wm' component (that acts as a server) and the potentially high-complexity (and more bug-prone) layouter and window decorator components. The latter two components are mere clients of the 'wm' server. Another good example is the way the (trusted) report_rom server decouples the producers and consumers of state information. Both the producer ('Report' session client) and consumer ('ROM' session client) are clients of the report_rom server. They both trust the report_rom server but they don't need to trust each other, nor does the report_rom server need to trust any of them.
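The decoupling role of such a trusted intermediary can be sketched in a few lines of plain C++. This is not the report_rom implementation: 'Report_store' and its methods are hypothetical names, and a map in one address space stands in for the two session interfaces. The point is only that both operations return immediately, so neither the producer nor the consumer ever waits on the other.

```cpp
#include <map>
#include <string>

// Hypothetical low-complexity intermediary that keeps the latest report per label.
class Report_store
{
	std::map<std::string, std::string> _reports;

public:

	// Producer side ("Report" role): store the new state and return at once.
	void report(std::string const &label, std::string const &content)
	{
		_reports[label] = content;
	}

	// Consumer side ("ROM" role): return the current state, or an empty
	// string if nothing has been reported under this label yet.
	std::string rom(std::string const &label) const
	{
		std::map<std::string, std::string>::const_iterator it = _reports.find(label);
		return it == _reports.end() ? std::string() : it->second;
	}
};
```

A producer that stops reporting only leaves stale data behind, and a consumer that stops reading has no effect on the producer at all; both depend solely on the small store in the middle.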
Note that throughout Genode, there are still several places where we don't fully adhere to this practice yet. E.g., NIC drivers (like the highly complex wifi_drv) still act as servers. But we will ultimately change this so that NIC drivers become clients of the low-complexity nic_router component.
When following this route, there is no need for the cancel-blocking mechanism. Your observation that the cancel-blocking mechanism works for RPCs on OKL4 is just an artifact from the past.
Sorry that the book guided you in the wrong direction. Could you please point me to the particular part so that I can revise it?
Cheers
Norman
Norman:
Thank you for the very thoughtful explanation. I found no useful examples of cancel_blocking in core Genode, so perhaps it is safe enough to remove, at least from the documentation; or, emit a warning in the implementations.
The final sentence of Section 4.7.6 "Enslaving services" is the one that suggested my experiment (17.05 edition). Possibly I took this out of context. Overall the Foundations book is of excellent quality, and sets a standard for systems of this type.
Arranging component relationships so that client and server correspond to a natural asymmetry of trustworthiness is sometimes straightforward, but sometimes ambiguous. E.g. should one trust calls to a log service? What if the log service gets upgraded to log to a network host that falls under control of an attacker? The attacker exploits a vulnerability and owns the logger; some critical component then halts the next time it issues a logging call. Yes, you can e.g. redesign the logger as a client--I've done this, but it adds to the complexity of other components.
In some cases, RPC might not be the most natural communications solution. Is asynchronous message-passing (using only signals and shared memory) feasible in Genode? Maybe something similar to "vchan" in Xen/Qubes. Perhaps this exists?
// Steve
Hi Steve,
thank you for the nice comment about the book!
> Arranging component relationships so that client and server correspond to a natural asymmetry of trustworthiness is sometimes straightforward, but sometimes ambiguous. E.g. should one trust calls to a log service?
Whatever the answer to this question might be, it should not be the concern of the log client. The log client considers the log-session interface as a contract. Since it got the session handed out by its ultimately trusted parent, the client is not in a position to question it anyway. If the log server misbehaves, the client is not responsible - the parent is.
Consequently, the answer to the question comes down to a judgment of risk by the integrator of the system scenario, not the implementor of the log client.
> What if the log service gets upgraded to log to a network host that falls under control of an attacker? The attacker exploits a vulnerability and owns the logger; some critical component then halts the next time it issues a logging call. Yes, you can e.g. redesign the logger as a client--I've done this, but it adds to the complexity of other components.
To counter this risk, one may insert a trusted component between the log client and the network-facing "log streamer" component. E.g., by directing the log messages via fs_log to a ram_fs component, the log client only needs to trust those two low-complexity components. The log streamer (which we assume to be easily compromised) would access the log via a read-only file-system session from the ram_fs. It depends on the ram_fs but the ram_fs does not depend on the log streamer. So here, the ram_fs acts as a firewall between the log client and the network.
Btw, in practice, the log-over-the-network scenario raises further questions. In particular, how to handle the case where the log data fails to get out of the system? Should the system continue to operate without capturing any trace of its behavior? Maybe it is preferable to immediately stop, reboot, or fall back into a special fail-secure mode? If the log client implemented defensive measures to deal with an unresponsive log server, the client's implementation would implicitly take a policy decision. But by making the liveness of the log streamer a responsibility of the common parent of both the log client and log streamer, the parent is naturally in the position to take an explicit and more educated policy decision. It is always good to have clear-cut responsibilities.
> In some cases, RPC might not be the most natural communications solution. Is asynchronous message-passing (using only signals and shared memory) feasible in Genode? Maybe something similar to "vchan" in Xen/Qubes. Perhaps this exists?
It exists in the form of the so-called "packet stream". For example, the NIC session interface involves synchronous RPCs at session-creation time but all network traffic flows through shared memory and signals. Section 3.6. "Inter-component communication" explains the different inter-component communication patterns. The flavor you mentioned is described in Section 3.6.6.
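As a rough illustration of the shared-memory part of such a scheme, the following is a minimal single-producer/single-consumer ring buffer in plain standard C++. It is not Genode's packet-stream API: 'Packet_queue', 'submit', and 'get' are hypothetical names, it models only one direction of transfer, and it omits the signals that would tell the other side that data or free space has become available.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical one-directional queue that could live in a shared buffer.
// Holds at most N-1 packets; exactly one producer and one consumer assumed.
template <typename T, std::size_t N>
class Packet_queue
{
	std::array<T, N>         _slots {};
	std::atomic<std::size_t> _head {0};   // next slot to read
	std::atomic<std::size_t> _tail {0};   // next slot to write

public:

	// Producer: never blocks; returns false if the queue is full.
	bool submit(T const &packet)
	{
		std::size_t const tail = _tail.load(std::memory_order_relaxed);
		std::size_t const next = (tail + 1) % N;
		if (next == _head.load(std::memory_order_acquire))
			return false;
		_slots[tail] = packet;
		_tail.store(next, std::memory_order_release);
		return true;
	}

	// Consumer: never blocks; returns false if the queue is empty.
	bool get(T &out)
	{
		std::size_t const head = _head.load(std::memory_order_relaxed);
		if (head == _tail.load(std::memory_order_acquire))
			return false;
		out = _slots[head];
		_head.store((head + 1) % N, std::memory_order_release);
		return true;
	}
};
```

Neither side ever waits inside the other: a full or empty queue is reported to the caller, which can then decide to wait for a notification on its own terms.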
Cheers
Norman