Hi Alexander,
My questions are the following:
- Which ways to improve NIC performance can you suggest?
I suspect that the NIC performance is dominated by a systematic issue that cannot be addressed by micro-optimizations. We have recently started to look closely into the network performance and have already gathered some interesting insights:
* We conducted a benchmark of the raw driver throughput without the NIC-session interface or a TCP/IP stack. We found the iPXE driver to perform quite well. If you would like to base your experiments on our tests, the following commit might be of interest:
https://github.com/genodelabs/genode/commit/b5d3be9c859e24cd331e94b001a3d0e0...
* Separating the test program from the driver via the NIC session interface had a negligible effect on the throughput. From this experiment, we gather that micro-optimizing the NIC session mechanism will probably not yield a significant improvement.
The problem must lie somewhere else, coming into effect at a higher level. So Alexander Böttcher ported netperf to Genode in order to systematically compare application-level performance of Genode against Linux.
https://github.com/genodelabs/genode/commit/86e428cd64770270811821757730c84c...
By comparing the benchmark results across different kernels, we observed that the multi-processor version of Fiasco.OC performed worse than the uni-processor version. We identified an issue in the kernel that may play a role here:
https://github.com/genodelabs/genode/issues/712
In short, the system spends a lot of time in the idle thread while executing a CPU-bound workload, which is a contradiction: under a CPU-bound workload, the idle thread should never be selected by the scheduler. For your tests, you may check how much time was spent in the idle thread via the Fiasco.OC kernel debugger (use 'lp', select the idle thread). As long as we see the idle thread being scheduled, we should not look too much into micro-optimizing individual code paths.
Another systematic issue might arise from the TCP protocol, i.e., from the size of TCP receive windows and retransmission buffers. These effects are nicely illustrated by our recent realization of how to boost the performance of lwIP on Genode:
https://github.com/genodelabs/genode/commit/3a884bd87306ced84eae68f1ff3258b4...
This was really an eye opener for us, prompting us to focus more on possible protocol-parametrization issues than on profiling code paths.
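To give a concrete idea of what protocol parametrization means here: the relevant knobs are ordinary lwIP compile-time options in lwipopts.h. The values below are merely illustrative (they are not the values picked in the commit above), but they show which parameters govern the receive window and the amount of unacknowledged data in flight:

  /* lwipopts.h excerpt - illustrative values only */

  /* maximum segment size, matching a typical Ethernet MTU */
  #define TCP_MSS          1460

  /* receive window: how much data the peer may send before it has to
     wait for an ACK - a small window throttles throughput on fast links */
  #define TCP_WND          (32 * TCP_MSS)

  /* send buffer: how much unacknowledged data we keep in flight */
  #define TCP_SND_BUF      (32 * TCP_MSS)

  /* the segment queue must be able to hold TCP_SND_BUF worth of segments */
  #define TCP_SND_QUEUELEN (4 * (TCP_SND_BUF) / (TCP_MSS))

  /* packet-buffer pool large enough to back the bigger windows */
  #define PBUF_POOL_SIZE   96

Larger windows trade memory for throughput, so the buffer pools have to grow along with them.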
- Why does the NIC RX ack the packets? Since packet order should be handled by TCP, can we split packet acking and submitting the "current" packet into separate threads?
The acknowledgement is needed to tell the source (the driver) that the buffer of the network packet can be freed and used for another incoming packet. For understanding the packet-stream interface, please have a look at:
http://genode.org/documentation/release-notes/9.11#Packet-stream_interface
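To make the role of the acknowledgement a bit more tangible, here is a small self-contained C++ model of the idea (deliberately not the actual Genode packet-stream API, just the ownership handover it implements): a buffer belongs to the source until it is submitted, to the sink until it is acknowledged, and only an acknowledged buffer can be reused for the next packet. Without the ack path, the source would eventually run out of buffers, regardless of what TCP does with packet ordering.

  #include <array>
  #include <cstddef>
  #include <cstdio>
  #include <queue>

  /* toy model of the packet-stream idea: a fixed pool of packet buffers,
     a submit queue (source -> sink) and an ack queue (sink -> source) */
  struct Packet_stream_model
  {
      static constexpr std::size_t NUM_BUFFERS = 4;

      std::array<std::array<char, 1500>, NUM_BUFFERS> buffers { };

      std::queue<std::size_t> free_slots;   /* owned by the source     */
      std::queue<std::size_t> submit_queue; /* in flight to the sink   */
      std::queue<std::size_t> ack_queue;    /* returned to the source  */

      Packet_stream_model()
      {
          for (std::size_t i = 0; i < NUM_BUFFERS; i++)
              free_slots.push(i);
      }

      /* source side: claim a buffer and hand it to the sink */
      bool submit()
      {
          /* recycle buffers that the sink has acknowledged */
          while (!ack_queue.empty()) {
              free_slots.push(ack_queue.front());
              ack_queue.pop();
          }
          if (free_slots.empty())
              return false; /* without acks, the source stalls here */

          std::size_t slot = free_slots.front();
          free_slots.pop();
          submit_queue.push(slot);
          return true;
      }

      /* sink side: consume one packet and acknowledge its buffer */
      void process_one()
      {
          if (submit_queue.empty())
              return;
          std::size_t slot = submit_queue.front();
          submit_queue.pop();
          /* ... a real sink would inspect buffers[slot] here ... */
          ack_queue.push(slot); /* the ack returns ownership of the buffer */
      }
  };

  int main()
  {
      Packet_stream_model stream;

      for (int i = 0; i < 12; i++) {

          if (!stream.submit())
              std::printf("source stalled: all buffers in flight, no acks\n");

          /* a sink that keeps up only half of the time */
          if (i % 2 == 0)
              stream.process_one();
      }
  }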
Regarding the question about introducing more threads for the handling of the packet-stream protocol, we are currently moving in exactly the opposite direction: we try to reduce the number of threads involved in the NIC handling. For example, we moved DDE-Linux to a single-threaded mode of operation. By doing that, we reduce the number of context switches and alleviate the need for synchronization. Because of the good experience we had with DDE-Linux, we plan to change other components that operate on packet streams (such as nic_bridge and part_blk) in a similar way, modelling them as state machines instead of using multiple blocking threads.
In short, we think that there is not much to gain (in terms of performance) from distributing I/O-bound work across multiple CPU threads.
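To illustrate what we mean by modelling such a component as a state machine (again only a plain C++ sketch, not the actual code of DDE-Linux or nic_bridge): instead of one blocking thread per packet-stream direction, a single loop waits for events and dispatches them to handlers, so every state transition runs to completion before the next one starts and no locks are required.

  #include <cstdio>
  #include <functional>
  #include <map>
  #include <queue>

  /* the kinds of events the component reacts to, e.g., signals raised
     at the packet-stream interface */
  enum class Event { PACKET_AVAIL, ACK_AVAIL, LINK_STATE };

  /* single-threaded dispatcher: one loop, one handler per event type,
     no blocking threads and hence no locks around the component state */
  struct Single_threaded_component
  {
      std::map<Event, std::function<void ()>> handlers;
      std::queue<Event>                       pending; /* stand-in for the signal queue */

      void announce(Event e) { pending.push(e); }

      void run()
      {
          while (!pending.empty()) {
              Event e = pending.front();
              pending.pop();
              auto it = handlers.find(e);
              if (it != handlers.end())
                  it->second(); /* state transition runs to completion */
          }
      }
  };

  int main()
  {
      Single_threaded_component component;

      component.handlers[Event::PACKET_AVAIL] = [] {
          std::printf("handle incoming packets, then acknowledge them\n");
      };
      component.handlers[Event::ACK_AVAIL] = [] {
          std::printf("release acknowledged packets back to the source\n");
      };

      component.announce(Event::PACKET_AVAIL);
      component.announce(Event::ACK_AVAIL);
      component.run();
  }

The run() loop here simply drains a local queue; in a real component, the equivalent loop would block on signal delivery instead of terminating when the queue is empty.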
- What is the suggested "best" way to debug IPC and profile applications?
We used to use the Fiasco.OC trace buffer for that, in particular the IPC-tracing feature. But admittedly, it is quite difficult to map the low-level information gathered by the kernel to the high-level application view. E.g., the kernel stores the kernel-internal names of kernel objects, which are unrelated to how those objects are named in userland. So looking at the trace buffer is confusing. Not to speak of interpreting RPC protocols - that is hardly feasible for a workload of even modest complexity.
Fortunately, I can report that there is work under way to equip Genode with a tracing infrastructure, which will allow us to gather traces of RPCs (including the actual RPC function names), contended locks, and signals. The new facility is currently being developed by Josef Söntgen (cnuke at GitHub). We plan to include it in Genode 13.08.
Cheers Norman