Hello Genodians
We are still trying to introduce a named pipe for communication between libc components and pure Genode components. Our VFS FIFO pipe now works with multiple threads in the same process, but only when it is hosted in the VFS of that same component. As soon as we host it in a separate VFS component, it stalls randomly.
Our analysis shows that the reason for this lies in the implementation of the file-system session interface: the write operation doesn't report back the number of bytes written; instead, the receiving VFS component retries the write operation. This leads to a permanent blocking of the vfs component as no read operation can alleviate the full buffer of the fifo pipe.
We think the only proper fix would be a change in the file system session to allow the write operation to report back the number of bytes written. Is there a specific reason to keep the write 'fire-and-forget', other than simplicity?
Best regards
Stefan
Hi Stefan,
We think the only proper fix would be a change in the file system session to allow the write operation to report back the number of bytes written. Is there a specific reason to keep the write 'fire-and-forget', other than simplicity?
I'm afraid that your assessment is not correct. There is indeed a "specific reason" behind the design. Let me explain.
Intuitively, letting the write operation return the number of written bytes would be a no-brainer. This is what is suggested by the POSIX write function after all. In practice, however, this approach implies two hard problems:
1. To know the number of bytes written, the caller has to wait for the acknowledgement of the write operation. This synchronization point adds the end-to-end latency of the write operation to each individual write request. This is particularly bad when issuing sequences of write operations. Effectively, the need for waiting for the acknowledgement removes the opportunity of batching file-system requests. In a component-based system where we need to consider a chain of file-system servers, this problem is amplified.
In contrast, our design facilitates the hiding of write latency using the principal approach of pipelining. The contract of the write operation is simple: When the client was able to successfully enqueue the write request into the file-system session's packet stream, the operation is successful. The client can immediately resume execution regardless of the latency of the write operation.
2. Assuming that we reflected the number of written bytes to the client, what would a client do with this information? There are two likely answers.
(a) The client ignores this information. This is what happens in the real world for most users of the POSIX write call. Partial writes are generally not anticipated by application software because they don't happen on commodity systems. In this case, data would be lost.
Anecdotally, we have repeatedly encountered this problem with ported software using previous versions of our VFS/libc, which happened to reflect partial writes to the applications at that time.
(b) The client would respond by issuing another write operation with the remaining content. In a scenario where a write operation can only be done partially (e.g., pipe is full like in your example), this approach would ultimately result in a busy loop.
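To make (a) and (b) concrete, here is a minimal POSIX sketch in C (assuming Linux pipe semantics, not Genode code; the chunk size and retry limit are arbitrary):

  /* sketch: two ways applications commonly mishandle partial writes */

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      int fds[2];
      char chunk[4096];
      memset(chunk, 'x', sizeof(chunk));

      if (pipe(fds))
          return 1;

      /* make the write end non-blocking so a full pipe becomes observable */
      fcntl(fds[1], F_SETFL, fcntl(fds[1], F_GETFL) | O_NONBLOCK);

      /* fill the pipe until the kernel refuses further data */
      while (write(fds[1], chunk, sizeof(chunk)) > 0) { }

      /* (a) the return value is ignored: a partial or failed write
       *     silently loses data */
      write(fds[1], chunk, sizeof(chunk));

      /* (b) retrying the remainder without waiting for the reader
       *     degenerates into a busy loop as long as the pipe stays full */
      ssize_t done = 0;
      long attempts = 0;
      while (done < (ssize_t)sizeof(chunk) && attempts < 1000000) {
          ssize_t n = write(fds[1], chunk + done, sizeof(chunk) - done);
          if (n > 0)
              done += n;
          attempts++;   /* nobody reads, so this merely spins */
      }
      printf("gave up after %ld attempts, %zd bytes written\n", attempts, done);
      return 0;
  }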
In short, our design tries to leverage asynchronous I/O for hiding the latency of write operations, and it yields the desired behavior of blocking instead of busy-looping when no progress can be made.
We are still trying to introduce a named pipe for communication between libc components and pure Genode components. Our VFS FIFO pipe now works with multiple threads in the same process, but only when it is hosted in the VFS of that same component. As soon as we host it in a separate VFS component, it stalls randomly.
My interpretation of the scenario:
- In general, multiple operations can be enqueued to a single file-system session. The VFS server processes the submitted operations strictly in order. The file-system session is a serialization point.
- The VFS pipe plugin introduces a dependency of write operations on read operations. If the pipe is full, a write operation has to stall until a reader has consumed data from the pipe.
- The pipe buffer is bounded.
- Your client uses a single file system session to submit both read and write requests to the VFS server.
What happens:
The pipe buffer is saturated by previous write operations.
The client issues a write operation that exceeds the remaining capacity of the pipe buffer. Consequently, the write request stays in the packet stream to be picked up by the VFS server the next time data can be consumed. Each time the VFS server observes I/O, it tries to resume the write operation. This is done piece by piece until the entire request is completely processed. In your case, the write would stall until the pipe buffer has gained some new room.
As file operations are processed strictly in order, the partially processed write operation clogs up the file-system session's packet stream. This is because the file-system session is a serialization point.
The client submits a read operation to the file-system session. Even though the operation got enqueued, the VFS server never looks at it because it is still concerned with the not-yet-completed write operation.
A deadlock occurs because the read operation - which is second in the queue - would be required for the progress of the write operation (first in the queue).
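As a rough single-process POSIX analogy (a sketch only, it does not model the packet-stream protocol itself): when the read that would create room is ordered behind a write that needs that room, nothing ever moves.

  /* sketch: a single thread that must both fill and drain the same pipe
   * deadlocks once a write exceeds the remaining capacity, because the
   * read that would create room comes "second in the queue" */

  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      int fds[2];
      static char buf[65536];   /* matches the typical Linux pipe capacity */
      memset(buf, 'x', sizeof(buf));

      if (pipe(fds))
          return 1;

      write(fds[1], buf, sizeof(buf));  /* saturate the pipe buffer     */
      write(fds[1], buf, sizeof(buf));  /* blocks forever: pipe is full */
      read(fds[0], buf, sizeof(buf));   /* never reached, cannot help   */

      return 0;
  }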
What can you do about it?
The interlocking of inter-dependent read and write operations in one data channel must be avoided. In a multi-component scenario, the reader and the writer are separate components, each with a distinct file-system session. So this situation does not occur.
For your single-component scenario, you may consider using two file-system sessions, one for the reading end and one for the writing end of the pipe. Both sessions would be routed to the same VFS server.
  <vfs>
    ...
    <dir name="reader"> <fs label="pipe"/> </dir>
    <dir name="writer"> <fs label="pipe"/> </dir>
  </vfs>
This way, read and write operations cannot interlock.
This leads to a permanent blocking of the vfs component as no read operation can alleviate the full buffer of the fifo pipe.
The statement puzzles me because the VFS server must never block. Are you sure that the server is blocking, not merely stalling a single session? Please connect an unrelated component to the VFS server to see whether it remains responsive or not. The latter case would be a bug (of the VFS server, the VFS library, or one of the used plugins).
Regards Norman
Hi Norman
Thanks for your explanation.
We think the only proper fix would be a change in the file system session to allow the write operation to report back the number of bytes written. Is there a specific reason to keep the write 'fire-and-forget', other than simplicity?
I'm afraid that your assessment is not correct. There is indeed a "specific reason" behind the design. Let me explain.
Intuitively, letting the write operation return the number of written bytes would be a no-brainer. This is what is suggested by the POSIX write function after all. In practice, however, this approach implies two hard problems:
To know the number of bytes written, the caller has to wait for the acknowledgement of the write operation. This synchronization point adds the end-to-end latency of the write operation to each individual write request. This is particularly bad when issuing sequences of write operations. Effectively, the need for waiting for the acknowledgement removes the opportunity of batching file-system requests. In a component-based system where we need to consider a chain of file-system servers, this problem is amplified.
In contrast, our design facilitates the hiding of write latency using the principal approach of pipelining. The contract of the write operation is simple: When the client was able to successfully enqueue the write request into the file-system session's packet stream, the operation is successful. The client can immediately resume execution regardless of the latency of the write operation.
Very true, but this also applies to the read operation, where such behavior cannot be avoided.
Assuming that we reflected the number of written bytes to the client, what would a client do with this information? There are two likely answers.
(a) The client ignores this information. This is what happens in the real world for most users of the POSIX write call. Partial writes are generally not anticipated by application software because they don't happen on commodity systems. In this case, data would be lost.
Anecdotally, we have repeatedly encountered this problem with ported software using previous versions of our VFS/libc, which happened to reflect partial writes to the applications at that time.
True, but Linux will not perform a non-blocking write to a full pipe at all and will instead set errno to EAGAIN.
(b) The client would respond by issuing another write operation with the remaining content. In a scenario where a write operation can only be done partially (e.g., pipe is full like in your example), this approach would ultimately result in a busy loop.
There are good reasons to implement an application with non-blocking writes and select. The most common is to avoid complicating the code with thread synchronization.
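For illustration, a minimal sketch of that pattern in plain C (generic POSIX, not Genode-specific): the write end is non-blocking, EAGAIN signals a full pipe (as mentioned above), and select() lets a single thread wait for writability instead of spinning or synchronizing with other threads.

  #include <errno.h>
  #include <sys/select.h>
  #include <unistd.h>

  /* write 'len' bytes to fd (assumed to be in O_NONBLOCK mode),
   * waiting via select() whenever the pipe is full */
  static int write_all_nonblocking(int fd, char const *buf, size_t len)
  {
      size_t done = 0;
      while (done < len) {
          ssize_t n = write(fd, buf + done, len - done);
          if (n > 0) {
              done += (size_t)n;
              continue;
          }
          if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
              fd_set wfds;
              FD_ZERO(&wfds);
              FD_SET(fd, &wfds);
              /* sleep until the reader has consumed data and made room */
              if (select(fd + 1, NULL, &wfds, NULL, NULL) < 0 && errno != EINTR)
                  return -1;
              continue;
          }
          return -1;   /* genuine I/O error (or unexpected zero-length write) */
      }
      return 0;
  }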
In short, our design tries to leverage asynchronous I/O for hiding the latency of write operations, and it yields the desired behavior of blocking instead of busy-looping when no progress can be made.
As far as I understand, this leads to a blocking write when all buffers are full, even when the application intended a non-blocking write. I'm not convinced this is a good solution for achieving the best compatibility when porting POSIX/libc applications.
Also, when the write fails for another reason, such as a lack of disk space, the application can't detect that problem and data may be lost.
We are still trying to introduce a named pipe for communication between libc components and pure Genode components. Our VFS FIFO pipe now works with multiple threads in the same process, but only when it is hosted in the VFS of that same component. As soon as we host it in a separate VFS component, it stalls randomly.
My interpretation of the scenario:
In general, multiple operations can be enqueued to a single file-system session. The VFS server processes the submitted operations strictly in order. The file-system session is a serialization point.
The VFS pipe plugin introduces a dependency of write operations on read operations. If the pipe is full, a write operation has to stall until a reader has consumed data from the pipe.
The pipe buffer is bounded.
Your client uses a single file system session to submit both read and write requests to the VFS server.
What happens:
The pipe buffer is saturated by previous write operations.
The client issues a write operation that exceeds the remaining capacity of the pipe buffer. Consequently, the write request stays in the packet stream to be picked up by the VFS server the next time data can be consumed. Each time the VFS server observes I/O, it tries to resume the write operation. This is done piece by piece until the entire request is completely processed. In your case, the write would stall until the pipe buffer has gained some new room.
As file operations are processed strictly in order, the partially processed write operation clogs up the file-system session's packet stream. This is because the file-system session is a serialization point.
The client submits a read operation to the file-system session. Even though the operation got enqueued, the VFS server never looks at it because it is still concerned with the not-yet-completed write operation.
A deadlock occurs because the read operation - which is second in the queue - would be required for the progress of the write operation (first in the queue).
What can you do about it?
The interlocking of inter-dependent read and write operations in one data channel must be avoided. In a multi-component scenario, the reader and the writer are separate components, each with a distinct file-system session. So this situation does not occur.
For your single-component scenario, you may consider using two file-system sessions, one for the reading end and one for the writing end of the pipe. Both sessions would be routed to the same VFS server.
  <vfs>
    ...
    <dir name="reader"> <fs label="pipe"/> </dir>
    <dir name="writer"> <fs label="pipe"/> </dir>
  </vfs>
This way, read and write operations cannot interlock.
Thanks, this solves our problem.
This leads to a permanent blocking of the vfs component as no read operation can alleviate the full buffer of the fifo pipe.
The statement puzzles me because the VFS server must never block. Are you sure that the server is blocking, not merely stalling a single session? Please connect an unrelated component to the VFS server to see whether it remains responsive or not. The latter case would be a bug (of the VFS server, the VFS library, or one of the used plugins).
You are of course correct, the VFS component doesn't block but retries the same write later without success. A read from another session works just fine and lets the write succeed.
Best regards
Stefan
Hi Stefan,
There are good reasons to implement an application with non-blocking writes and select. The most common is to avoid complicating the code with thread synchronization.
In short, our design tries to leverage asynchronous I/O for hiding the latency of write operations, and it yields the desired behavior of blocking instead of busy-looping when no progress can be made.
As far as I understand, this leads to a blocking write when all buffers are full, even when the application intended a non-blocking write. I'm not convinced this is a good solution for achieving the best compatibility when porting POSIX/libc applications.
Please don't mistake my statement as an argument against non-blocking writes. I'm with you.
My explanation referred to the file-system session level because you questioned the design of the file-system session.
At the libc level, there is of course the distinction between blocking and non-blocking writes (see [1]) in place.
[1] https://github.com/genodelabs/genode/blob/master/repos/libports/src/lib/libc...
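For reference, an application opts into the non-blocking behavior through the standard POSIX interface, e.g. by switching the descriptor's flags (a generic sketch, not an excerpt of the Genode libc):

  #include <fcntl.h>

  /* put an already open descriptor into non-blocking mode, so that a
   * write to a full pipe returns -1 with errno set to EAGAIN instead
   * of blocking */
  int make_nonblocking(int fd)
  {
      int flags = fcntl(fd, F_GETFL, 0);
      if (flags < 0)
          return -1;
      return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
  }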
Also, when the write fails for another reason, such as a lack of disk space, the application can't detect that problem and data may be lost.
This would be modeled as an I/O error, which can indeed be propagated via the acknowledgement of a write operation today. To avoid data loss under such circumstances, the condition of a full disk could be propagated as an I/O error in advance of the catastrophic condition, when reaching a certain disk-usage threshold. The file system could asynchronously report the error via a write-acknowledgement while still successfully completing the pending requests.
Currently, such situations would trigger a diagnostic message only. However, we consider flagging the corresponding libc FD when observing this condition. This way, a subsequent attempt to use the file descriptor would return an error. That said, this mechanism is not implemented as of now. But the design holds.
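Independently of how the VFS will eventually propagate the condition, a portable application should already check not only the return value of write() but also that of fsync() (and close()), because errors like ENOSPC may surface only at those later points. A generic POSIX sketch, unrelated to the Genode-specific mechanism discussed above:

  #include <unistd.h>

  /* returns 0 on success, -1 if any step reported an error (errno is set) */
  static int write_and_sync(int fd, char const *buf, size_t len)
  {
      if (write(fd, buf, len) != (ssize_t)len)
          return -1;          /* short write or immediate error       */
      if (fsync(fd) < 0)
          return -1;          /* deferred errors, e.g. ENOSPC or EIO  */
      return 0;
  }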
You are of course correct, the VFS component doesn't block but retries the same write later without success. A read from another session works just fine and lets the write succeed.
That's good. Thank you for reporting back.
Regards Norman