Hi, Genode hackers!
I've been debugging network performance under Genode. I am using Genode on x86 with the dde_ipxe nic driver.
So far, I have figured out the following:
1. Memcpy and malloc calls in the nic driver take a negligible amount of time. I'm using the x86 rdtsc instruction to measure tick counts, and a memcpy usually amounts to 5-20 ticks.
2. Most of the time is spent in the rx_handler, namely in the _alloc.submit() call (260 ticks).
Inside the submit() routine (in os/include/nic/component.h), around 80 percent of the time is spent in the "_rx.source()->submit_packet()" call. The submit() routine first checks for packets acknowledged by the client and then calls submit_packet().
Tx_thread::entry() gets packets from the client, sends them to the nic, and acknowledges them.
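For reference, this is roughly how I take the measurements (a simplified sketch; dst, src, and size are placeholders for the actual arguments of the call under test):

  /* read the CPU's time-stamp counter */
  static inline unsigned long long rdtsc()
  {
      unsigned int lo, hi;
      asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
      return ((unsigned long long)hi << 32) | lo;
  }

  /* measure a single call, e.g., the memcpy of a received frame */
  unsigned long long start = rdtsc();
  Genode::memcpy(dst, src, size);
  unsigned long long ticks = rdtsc() - start;
  PLOG("memcpy took %lu ticks", (unsigned long)ticks);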
My questions are the following:
1. Which ways to improve nic performance can you suggest?
2. Why does the nic rx acknowledge the packets? Since packet ordering should be handled by TCP, could we split acknowledging packets and submitting the "current" packet into separate threads?
3. What is the suggested "best" way to debug IPC and profile applications?
Thank you for your attention. Have fun!
Hi Alexander,
My questions are the following:
- Which ways to improve nic performance can you suggest?
I suspect that the NIC performance is dominated by a systematic issue that is not addressable by micro-optimizations. Recently, we have started to look closely into the network performance and gathered some interesting insights already:
* We conducted a benchmark of the raw driver throughput without the NIC-session interface or a TCP/IP stack. We found the ipxe driver to perform quite well. If you would like to base your experiments on our tests, the following commit might be of interest:
https://github.com/genodelabs/genode/commit/b5d3be9c859e24cd331e94b001a3d0e0...
* Separating the test program from the driver via the NIC session interface had a negligible effect on the throughput. From this experiment, we gather that micro-optimizing the NIC session mechanism will probably not yield a significant improvement.
The problem must lie somewhere else, coming into effect at a higher level. So Alexander Böttcher ported netperf to Genode in order to systematically compare application-level performance of Genode against Linux.
https://github.com/genodelabs/genode/commit/86e428cd64770270811821757730c84c...
By comparing the benchmark results across different kernels, we observed that the multi-processor version of Fiasco.OC suffered compared to the uni-processor version. We identified an issue in the kernel that may play a role here:
https://github.com/genodelabs/genode/issues/712
In short, a system executing a CPU-bound workload spends a lot of time in the idle thread, which is a contradiction: on CPU-bound workloads, the idle thread should never be selected by the scheduler. For your tests, you may check how much time was spent in the idle thread via the Fiasco.OC kernel debugger (use 'lp', select the idle thread). As long as we see the idle thread being scheduled, we should not look too much into micro-optimizing any code paths.
Another systematic issue might arise from the TCP protocol, i.e., from the size of TCP receive windows and retransmission buffers. These effects are nicely illustrated by our recent realization of how to boost the performance of lwIP on Genode:
https://github.com/genodelabs/genode/commit/3a884bd87306ced84eae68f1ff3258b4...
This was really an eye opener for us, prompting us to focus more on possible protocol-parametrization issues than on profiling code paths.
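To give an idea of what "protocol parametrization" means here: in lwIP, these are the usual lwipopts.h knobs. The values below are only illustrative, not the ones from the referenced commit:

  /* lwipopts.h - illustrative values only */
  #define TCP_MSS          1460
  #define TCP_WND          (32 * TCP_MSS)              /* receive window */
  #define TCP_SND_BUF      (32 * TCP_MSS)              /* send buffer    */
  #define TCP_SND_QUEUELEN (4 * TCP_SND_BUF / TCP_MSS) /* queued pbufs   */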
- Why does the nic rx ack the packets? since packet order should be
handled by TCP, can we split packet acking and submitting the "current" packet into separate threads?
The acknowledgement is needed to tell the source (the driver) that the buffer of the network packet can be freed and used for another incoming packet. For understanding the packet-stream interface, please have a look at:
http://genode.org/documentation/release-notes/9.11#Packet-stream_interface
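Roughly, the driver-side (source) cycle looks like this - a simplified sketch, not the literal code of component.h, with allocation-error handling omitted:

  /* release buffers of packets that the client has already acknowledged */
  while (_rx.source()->ack_avail())
      _rx.source()->release_packet(_rx.source()->get_acked_packet());

  /* allocate a fresh buffer, copy the received frame into it, submit it */
  Genode::Packet_descriptor p = _rx.source()->alloc_packet(size);
  Genode::memcpy(_rx.source()->packet_content(p), data, size);
  _rx.source()->submit_packet(p);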
Regarding the question about introducing more threads for the handling of the packet-stream protocol, we are currently moving in exactly the opposite direction: we are trying to reduce the number of threads involved in NIC handling. For example, we moved DDE-Linux to a single-threaded mode of operation. By doing that, we reduce the number of context switches and alleviate the need for synchronization. Because of the good experience we had with DDE-Linux, we plan to change other components that operate on packet streams (such as nic_bridge and part_blk) in a similar way, modelling them as state machines instead of using multiple blocking threads.
In short, we think that there is not much to gain (in terms of performance) from distributing I/O bounded work to multiple CPU threads.
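To illustrate the direction: instead of blocking in one thread per packet stream, a single loop can dispatch all packet-stream events as a state machine. This is only an illustrative sketch (handle_packet() is a placeholder that would process and acknowledge the packet):

  for (;;) {
      sig_rec.wait_for_signal(); /* one signal may stand for several events */

      /* drain all pending events before blocking again */
      while (sink->packet_avail() && source->ready_to_submit())
          handle_packet(sink->get_packet());

      while (source->ack_avail())
          source->release_packet(source->get_acked_packet());
  }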
- What is the suggested "best" way to debug IPC and profile applications?
We used to use the Fiasco.OC trace buffer for that, in particular the IPC-tracing feature. But admittedly, it is quite difficult to map the low-level information gathered by the kernel to the high-level application view. For example, the kernel stores the kernel-internal names of kernel objects, which are unrelated to how the kernel objects are named in user land. So looking at the trace buffer is confusing, not to speak of interpreting RPC protocols - that is hardly feasible for a workload of modest complexity.
Fortunately, I can report that there is work under way to equip Genode with a tracing infrastructure, which will allow us to gather traces about RPCs (including the actual RPC function names), contended locks, and signals. The new facility is currently developed by Josef Söntgen (cnuke at GitHub). We plan to include it in Genode 13.08.
Cheers Norman
2013/5/18 Norman Feske <norman.feske@...1...>
Hi Alexander,
Hello Norman!
Thank you for your comprehensive email, which answered many questions I had about Genode's current state!
In short, a system executing a CPU-bound workload spends a lot of time in the idle thread, which is a contradiction: on CPU-bound workloads, the idle thread should never be selected by the scheduler. For your tests, you may check how much time was spent in the idle thread via the Fiasco.OC kernel debugger (use 'lp', select the idle thread). As long as we see the idle thread being scheduled, we should not look too much into micro-optimizing any code paths.
Do you have a proposed solution? For example, modifying the Fiasco.OC scheduler to be completely timer-driven and preemptive?
In short, we think that there is not much to gain (in terms of performance) from distributing I/O bounded work to multiple CPU threads.
This is very nice! I didn't quite like the packet_stream interface from the start because, in traditional monolithic kernels, we are used to static, allocate-once memory blocks. But even the current Genode block I/O implementation does not show severe performance degradation, so debugging and profiling are the best way to judge implementations.
Fortunately, I can report that there is work under way to equip Genode with a tracing infrastructure, which will allow us to gather traces about RPCs (including the actual RPC function names), contended locks, and signals. The new facility is currently developed by Josef Söntgen (cnuke at GitHub). We plan to include it in Genode 13.08.
This sounds very promising. I tried porting gprof, but as I didn't take the time to properly study how it is implemented, I did not finish it. Maybe we'll try doing it as a Summer School task at Ksys Labs. My investigations left me with the following conclusions:
1. You need to enable the FreeBSD libc gprof code (an easy task, a couple of makefiles and header symlinks).
2. You need to implement the "profil" and "mcount" functions.
I think we can rely on the POSIX emulation layer for writing the gmon.out file. So, we could probably use gprof for some apps. Does this sound like a nice idea?
While searching for more info, today I found the following link about gprof's implementation: http://sourceware.org/binutils/docs/gprof/Implementation.html
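From that document, the core of profil() boils down to a histogram update driven by periodic program-counter samples. Roughly, and under my reading of the BSD profil(2) semantics (types and names are illustrative, not a final interface):

  static unsigned short *prof_buf;    /* histogram buffer passed to profil() */
  static unsigned long   prof_size;   /* buffer size in bytes                */
  static unsigned long   prof_offset; /* lowest profiled address             */
  static unsigned long   prof_scale;  /* 16-bit fixed-point fraction         */

  /* called from a periodic timeout with the interrupted program counter */
  static void prof_sample(unsigned long pc)
  {
      if (!prof_buf || pc < prof_offset)
          return;

      /* scale 0x10000 maps each text byte to one 16-bit counter */
      unsigned long idx = ((pc - prof_offset) * prof_scale) >> 16;

      if (idx < prof_size / sizeof(prof_buf[0]))
          prof_buf[idx]++;
  }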