Hi Steve,
On 12.10.2017 00:26, Steven Harp wrote:
> The Genode book suggests that an RPC caller can protect itself from blocking in a stalled server by creating a watchdog thread to monitor the process of the call, and cancel it if it takes too long.
> Is there a robust/canonical example of using cancel_blocking in this way?
I am afraid that the book pointed you in an outdated direction. The cancel-blocking mechanism was introduced very early, at a time when we routinely designed inter-component interfaces that blocked at the server side. Back then, L4 kernels did not support any means of asynchronous notification, which lured us into this direction. Later, we realized this mistake and successively redesigned the interfaces [1] to use a combination of synchronous RPCs that return immediately and asynchronous notifications for blocking at the client side. We announced this transition in May last year [2] and finished it in May this year.
[1] https://genode.org/documentation/release-notes/13.02#Timer_interface_turned_into_asynchronous_mode_of_operation
[2] https://genode.org/documentation/release-notes/16.05#The_great_API_renovatio...
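To illustrate the combination of an immediately returning call and a notification that the client waits for on its own side, here is a minimal sketch in standard C++. It is not the Genode API: `Notification`, `Slow_server`, and `call_with_bounded_wait` are hypothetical names, a worker thread stands in for the server component, and the bounded wait is merely an illustrative choice that becomes possible once the blocking happens in the client.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <optional>
#include <thread>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical stand-in for an asynchronous notification (not the Genode API)
class Notification
{
    std::mutex              _mutex;
    std::condition_variable _cond;
    bool                    _pending = false;

  public:

    // Server side: mark the notification as pending, never waits for the client
    void submit()
    {
        { std::lock_guard<std::mutex> guard(_mutex); _pending = true; }
        _cond.notify_one();
    }

    // Client side: block locally until notified or until 'timeout' expires.
    // Returns true if the notification arrived in time.
    bool wait_for(Milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(_mutex);
        bool const arrived = _cond.wait_for(lock, timeout, [this] { return _pending; });
        _pending = false;
        return arrived;
    }
};

// Hypothetical server: 'request' only registers the work and returns
class Slow_server
{
    std::thread        _worker;
    std::mutex         _mutex;
    std::optional<int> _result;

  public:

    ~Slow_server() { if (_worker.joinable()) _worker.join(); }

    // Synchronous call that returns without waiting for the work to finish
    void request(int input, Milliseconds work_time, std::shared_ptr<Notification> done)
    {
        assert(!_worker.joinable());  // this sketch handles one request per server
        _worker = std::thread([this, input, work_time, done] {
            std::this_thread::sleep_for(work_time);  // simulated work
            { std::lock_guard<std::mutex> guard(_mutex); _result = 2*input; }
            done->submit();
        });
    }

    // Synchronous call that returns the result if it is already available
    std::optional<int> result()
    {
        std::lock_guard<std::mutex> guard(_mutex);
        return _result;
    }
};

// Client: issue the request, then wait for the notification on the client side.
// Returns nothing if the notification did not arrive within 'patience'.
std::optional<int> call_with_bounded_wait(Slow_server &server, int input,
                                          Milliseconds work_time, Milliseconds patience)
{
    auto done = std::make_shared<Notification>();
    server.request(input, work_time, done);
    if (!done->wait_for(patience))
        return std::nullopt;
    return server.result();
}
```

In this sketch, the client never hands its flow of control to the server for longer than the short `request` and `result` calls. All waiting happens in `wait_for`, which belongs to the client.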
For modern components, the cancel-blocking mechanism is no longer used. We still keep it around to uphold compatibility but I hope to eventually remove it from the API in the not-too-distant future.
> My experiment with this (Genode 17.08, x86_32) seems to work as expected--but, only with the OKL4 kernel!? With nova, hw, and seL4, the cancel_blocking() method executes but seemingly to no effect: the thread continues to wait on the (contrived) very slow RPC call, which eventually completes.
> Suggestions?
When a client calls a server, it ultimately yields the flow of control to the server until the server replies. Because a misbehaving server may never reply, e.g., because of a bug, the client could get stuck at that point. There is no countermeasure for this situation. We found that intuitively tempting countermeasures, like IPC timeouts or the cancel-blocking mechanism, are bug-prone and lead to nondeterministic system behavior. A client unconditionally expects that the server replies to an RPC request. From a client's perspective, a server called via RPC functions is similar to a regular third-party library. When calling a library function, one can never be sure that the function will eventually return. It could get stuck in the library. Therefore, we recommend as best practice to implement complex (bug-prone) software as mere clients, not servers. Please consider Section 3.2.4 "Client-server relationship" of the book for a succinct characterization of the client and server roles within Genode.
The canonical example of this best practice is the window manager, which is a composition of the low-complexity 'wm' component (that acts as a server) and the potentially high-complexity (and more bug-prone) layouter and window decorator components. The latter two components are mere clients of the 'wm' server. Another good example is the way the (trusted) report_rom server decouples the producers and consumers of state information. Both the producer ('Report' session client) and consumer ('ROM' session client) are clients of the report_rom server. They both trust the report_rom server but they don't need to trust each other, nor does the report_rom server need to trust any of them.
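The decoupling role of report_rom can be sketched in a few lines of standard C++. This is a simplified, single-threaded illustration, not the real report_rom implementation or session interface: `Report_rom_sketch`, `report`, `rom`, and `update_pending` are hypothetical names, and the `update_pending` flag merely stands in for an asynchronous "state changed" notification.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, single-threaded stand-in for a report_rom-like server
class Report_rom_sketch
{
    struct Entry
    {
        std::string content;
        bool        update_pending = false;
    };

    std::map<std::string, Entry> _entries;

  public:

    // Producer side ('Report' client): store the new state and return at once.
    // No consumer code runs here, so a stuck consumer cannot stall the producer.
    void report(std::string const &label, std::string const &content)
    {
        assert(!label.empty());  // labels must be non-empty in this sketch
        Entry &entry = _entries[label];
        entry.content        = content;
        entry.update_pending = true;
    }

    // Consumer side ('ROM' client): fetch the latest state and clear the flag
    std::string rom(std::string const &label)
    {
        Entry &entry = _entries[label];
        entry.update_pending = false;
        return entry.content;
    }

    // Consumer side: stands in for the asynchronous "state changed" notification
    bool update_pending(std::string const &label) const
    {
        auto const it = _entries.find(label);
        return it != _entries.end() && it->second.update_pending;
    }
};
```

Both parties only ever call short functions of the server. The server stores the latest state and never calls into or waits for either of them, so a misbehaving producer or consumer affects only itself.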
Note that throughout Genode, there are still several places where we don't fully adhere to this practice yet. For example, NIC drivers (like the highly complex wifi_drv) still act as servers. But we will ultimately change this so that NIC drivers become clients of the low-complexity nic_router component.
When following this route, there is no need for the cancel-blocking mechanism. Your observation that the cancel-blocking mechanism works for RPCs on OKL4 is just an artifact from the past.
Sorry that the book guided you in the wrong direction. Could you please point me to the particular part so that I can revise it?
Cheers
Norman