Hi Norman
Thanks for your explanation.
We think the only proper fix would be a change in the file system session to allow the write operation to report back the number of bytes written. Is there a specific reason to keep the write 'fire-and-forget', other than simplicity?
I'm afraid that your assessment is not correct. There is indeed a "specific reason" behind the design. Let me explain.
Intuitively, letting the write operation return the number of written bytes would be a no-brainer. This is what is suggested by the POSIX write function after all. In practice, however, this approach implies two hard problems:
To know the number of bytes written, the caller has to wait for the acknowledgement of the write operation. This synchronization point adds the end-to-end latency of the write operation to each individual write request. This is particularly bad when issuing sequences of write operations. Effectively, having to wait for the acknowledgement removes the opportunity to batch file-system requests. In a component-based system where we need to consider a chain of file-system servers, this problem is amplified.
In contrast, our design facilitates the hiding of write latency using the principal approach of pipelining. The contract of the write operation is simple: When the client was able to successfully enqueue the write request into the file-system session's packet stream, the operation is successful. The client can immediately resume execution regardless of the latency of the write operation.
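To put rough numbers on the difference, here is a toy cost model, not a measurement. The functions and the figures (100 writes, a 1000 µs round trip, 10 µs to enqueue one request) are made up for illustration, and the model ignores server throughput limits and the bounded size of the packet stream.

```python
def synchronous_total(n_writes, round_trip_us):
    """Each write waits for its acknowledgement before the next one is issued."""
    return n_writes * round_trip_us


def pipelined_total(n_writes, enqueue_us, round_trip_us):
    """Writes are enqueued back to back; only one round trip stays visible."""
    return n_writes * enqueue_us + round_trip_us


# Hypothetical figures, chosen only to show the shape of the effect.
N_WRITES = 100
ROUND_TRIP_US = 1000
ENQUEUE_US = 10

sync_time = synchronous_total(N_WRITES, ROUND_TRIP_US)
pipe_time = pipelined_total(N_WRITES, ENQUEUE_US, ROUND_TRIP_US)
print(sync_time, pipe_time)
```

With a chain of file-system servers, the round-trip term grows with every hop, so the gap between the two variants widens accordingly.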
Very true, but this applies also to the read operation where such behavior cannot be avoided.
Assuming that we reflected the number of written bytes to the client, what would a client do with this information? There are two likely answers.
(a) The client ignores this information. This is what happens in the real world for most users of the POSIX write call. Partial writes are generally not anticipated by application software because they don't happen on commodity systems. In this case, data would be lost.
Anecdotally, we have repeatedly encountered this problem with ported software using previous versions of our VFS/libc, which happened to reflect partial writes to the applications at that time.
True, but Linux will not perform the non-blocking write to a full pipe at all and set errno to EAGAIN.
(b) The client would respond by issuing another write operation with the remaining content. In a scenario where a write operation can only be done partially (e.g., pipe is full like in your example), this approach would ultimately result in a busy loop.
There are good reasons to implement an application with non-blocking write and use select. The most common is not complicating the code with thread synchronization.
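For reference, the behaviour on a commodity POSIX system can be reproduced with a few lines of Python. This is only a sketch of the pattern under discussion (non-blocking write plus select), run against an ordinary kernel pipe, not against the Genode VFS; the helper names fill, drain and is_writable are made up for this example.

```python
import errno
import os
import select


def fill(fd, chunk=b"x" * 4096):
    """Write until the non-blocking descriptor refuses; return (bytes, errno)."""
    accepted = 0
    while True:
        try:
            accepted += os.write(fd, chunk)
        except BlockingIOError as err:
            return accepted, err.errno


def drain(fd):
    """Read everything currently buffered behind the non-blocking descriptor."""
    try:
        while os.read(fd, 65536):
            pass
    except BlockingIOError:
        pass  # nothing left to read right now


def is_writable(fd):
    """Poll with a zero timeout: does select report fd as ready for writing?"""
    _, writable, _ = select.select([], [fd], [], 0)
    return fd in writable


read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)
os.set_blocking(write_fd, False)

accepted, refusal = fill(write_fd)           # stops once the pipe is full
writable_when_full = is_writable(write_fd)   # select tells the writer to wait
drain(read_fd)                               # a reader consumes the data
writable_after_read = is_writable(write_fd)  # now the writer may continue

os.close(read_fd)
os.close(write_fd)
print(accepted, errno.errorcode[refusal], writable_when_full, writable_after_read)
```

The point of the pattern is that the writer neither blocks nor spins: the refused write tells it to stop, and select tells it when trying again makes sense.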
In short, our design tries to leverage async I/O for hiding the latency of write operations, and it yields the desired behavior of blocking instead of busy-looping when no progress can be made.
As far as I understand, this leads to a blocking write when all buffers are full even when the application intended a non-blocking write. I'm not convinced this is a good solution for best compatibility when porting POSIX/libc applications.
Also, when the write fails for another reason, such as a lack of disk space, the application can't detect that problem and data may be lost.
We are still trying to introduce a named pipe for communication between libc components and pure Genode components. Our VFS FIFO pipe now works with multiple threads in the same process, but only when it's hosted in the VFS of that same component. As soon as we host it in a separate VFS component, it stalls randomly.
My interpretation of the scenario:
In general, multiple operations can be enqueued to a single file-system session. The VFS server processes the submitted operations strictly in order. The file-system session is a serialization point.
The VFS pipe plugin introduces a dependency of write operations from read operations. If the pipe is full, a write operation has to stall until a reader has consumed data from the pipe.
The pipe buffer is bounded.
Your client uses a single file system session to submit both read and write requests to the VFS server.
What happens:
The pipe buffer is saturated by previous write operations.
The client issues a write operation that exceeds the remaining capacity of the pipe buffer. Consequently, the write request stays in the packet stream to be picked up by the VFS server the next time data can be consumed. Each time the VFS server observes I/O, it tries to resume the write operation. This is done piece by piece until the entire request is completely processed. In your case, the write would stall until the pipe buffer has gained some new room.
As file operations are processed strictly in order, the partially processed write operation clogs up the file-system session's packet stream. This is because the file-system session is a serialization point.
The client submits a read operation to the file-system session. Even though the operation got enqueued, the VFS server never looks at it because it is still concerned with the not-yet-completed write operation.
A deadlock occurs because the read operation - which is second in the queue - would be required for the progress of the write operation (first in the queue).
What can you do about it?
The interlocking of inter-dependent read and write operations in one data channel must be avoided. In a multi-component scenario, the reader and the writer are separate components, each having a distinct file-system session. So this situation does not occur.
For your single-component scenario, you may consider using two file-system sessions, one for the reading end and one for the writing end of the pipe. Both sessions would be routed to the same VFS server.
<vfs>
  ...
  <dir name="reader"> <fs label="pipe"/> </dir>
  <dir name="writer"> <fs label="pipe"/> </dir>
</vfs>
This way, read and write operations cannot interlock.
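The difference between one shared session and two separate sessions can be illustrated with a toy model. It is a deliberately simplified sketch, not the actual VFS server code: the Pipe class, the progress and serve functions, and the sizes (an 8-byte pipe, 12-byte requests) are all invented for this example. The only properties taken from the description above are that each session is processed strictly in order and that the pipe buffer is bounded.

```python
from collections import deque


class Pipe:
    """Bounded buffer; only the fill level matters for this model."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.level = 0


def progress(pipe, op):
    """Advance one operation as far as the pipe allows; return bytes moved."""
    kind, remaining = op
    if kind == "write":
        moved = min(remaining, pipe.capacity - pipe.level)
        pipe.level += moved
    else:  # "read"
        moved = min(remaining, pipe.level)
        pipe.level -= moved
    op[1] -= moved
    return moved


def serve(sessions, pipe):
    """Serve every session strictly in order; True if all operations complete."""
    while True:
        moved_any = False
        for queue in sessions:
            # Only the head of a session's queue is ever looked at.
            while queue:
                if progress(pipe, queue[0]):
                    moved_any = True
                if queue[0][1] > 0:
                    break  # head is stalled and blocks everything behind it
                queue.popleft()
        if not moved_any:
            # No operation could advance: either all done or stuck for good.
            return all(not queue for queue in sessions)


# One session carrying both operations: the read is stuck behind the write.
single = [deque([["write", 12], ["read", 12]])]
single_ok = serve(single, Pipe(capacity=8))

# Two sessions: the read is served independently and makes room for the write.
split = [deque([["write", 12]]), deque([["read", 12]])]
split_ok = serve(split, Pipe(capacity=8))

print(single_ok, split_ok)
```

In the single-session case the write fills the pipe and stalls with 4 bytes left, and the read behind it is never considered. With two sessions, the server alternates between both queue heads, so the read drains the pipe and the write can finish.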
Thanks, this solves our problem.
This leads to a permanent blocking of the vfs component as no read operation can alleviate the full buffer of the fifo pipe.
The statement puzzles me because the VFS server must never block. Are you sure that the server is blocking, not merely stalling a single session? Please connect an unrelated component to the VFS server to see whether it remains responsive or not. If it does not, that would be a bug (of the VFS server, the VFS library, or one of the used plugins).
You are of course correct, the VFS component doesn't block but retries the same write later without success. A read from another session works just fine and lets the write succeed.
Bests Stefan