NIC driver performance

Norman Feske norman.feske at ...1...
Sat May 18 15:04:01 CEST 2013


Hi Alexander,

> My questions are the following:
> 1. Which ways to improve nic performance can you suggest?

I suspect that the NIC performance is dominated by a systematic issue
that is not addressable by micro-optimizations. Recently, we have
started to look closely at network performance and have already
gathered some interesting insights:

* We conducted a benchmark of the raw driver throughput without the
  NIC-session interface or a TCP/IP stack. We found the ipxe driver
  to perform quite well. If you would like to base your experiments
  on our tests, the following commit might be of interest:


https://github.com/genodelabs/genode/commit/b5d3be9c859e24cd331e94b001a3d0e044c5c56b

* Separating the test program from the driver via the NIC-session
  interface had a negligible effect on the throughput. From this
  experiment, we gather that micro-optimizing the NIC-session
  mechanism will probably not yield a significant improvement.

The problem must lie elsewhere, coming into effect at a higher level.
So Alexander Böttcher ported netperf to Genode to systematically
compare the application-level performance of Genode against Linux.


https://github.com/genodelabs/genode/commit/86e428cd64770270811821757730c84c42f820e7

By running the benchmark on different kernels, we observed that the
multi-processor version of Fiasco.OC suffers compared to the
uni-processor version. We identified a kernel issue that may play a
role here:

  https://github.com/genodelabs/genode/issues/712

In short, a system executing a CPU-bound workload spends a lot of
time in the idle thread, which is a contradiction: on CPU-bound
workloads, the idle thread should never be selected by the scheduler.
For your tests, you may check how much time was spent in the idle
thread via the Fiasco.OC kernel debugger (use 'lp' and select the
idle thread). As long as we see the idle thread being scheduled, we
should not invest too much effort in micro-optimizing any code paths.

Another systematic issue might arise from the TCP protocol, i.e., from
the size of TCP receive windows and retransmission buffers. These
effects are nicely illustrated by our recent realization of how to boost
the performance of lwIP on Genode:


https://github.com/genodelabs/genode/commit/3a884bd87306ced84eae68f1ff3258b40888f58e

This was a real eye-opener for us, prompting us to focus more on
possible protocol-parametrization issues than on profiling code paths.
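
As a back-of-the-envelope illustration of why the window size matters:
TCP throughput is bounded by the window size divided by the round-trip
time, so a 16-KiB receive window over a 1-ms RTT caps a connection at
roughly 16 MiB/s, no matter how fast the link is. The following
lwipopts.h fragment sketches the kind of knobs meant here. The option
names are standard lwIP configuration macros, but the values are
merely illustrative, not the ones chosen in the commit above:

  /* lwipopts.h - illustrative values, not the ones from the commit */

  /* maximum segment size */
  #define TCP_MSS          1460

  /* receive window: bounds throughput to TCP_WND / RTT */
  #define TCP_WND          (32 * TCP_MSS)

  /* send buffer: bounds the amount of in-flight unacknowledged data */
  #define TCP_SND_BUF      (32 * TCP_MSS)

  /* send queue must accommodate TCP_SND_BUF worth of segments */
  #define TCP_SND_QUEUELEN (4 * TCP_SND_BUF / TCP_MSS)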

> 2. Why does the nic rx ack the packets? since packet order should be
> handled by TCP, can we split packet acking and submitting the "current"
> packet into separate threads?

The acknowledgement tells the source (the driver) that the buffer of
the network packet can be freed and reused for another incoming
packet. For understanding the packet-stream interface, please have a
look at:

  http://genode.org/documentation/release-notes/9.11#Packet-stream_interface
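
To make the role of the acknowledgement concrete, here is a minimal
sketch of a client-side receive loop. The calls follow the
packet-stream API referenced above, but session setup, signal
handling, and error handling are omitted, and 'handle_ethernet_frame'
is a hypothetical application function:

  #include <nic_session/connection.h>

  /* hypothetical application function for processing a frame */
  void handle_ethernet_frame(char const *data, Genode::size_t size);

  /* consume packets received by the NIC driver */
  void receive_loop(Nic::Connection &nic)
  {
    for (;;) {

      /* obtain the next packet submitted by the driver */
      Genode::Packet_descriptor packet = nic.rx()->get_packet();

      /* process the payload */
      handle_ethernet_frame(nic.rx()->packet_content(packet),
                            packet.size());

      /* acknowledge the packet, i.e., hand the buffer back to
         the driver so it can be reused for incoming packets */
      nic.rx()->acknowledge_packet(packet);
    }
  }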

Regarding the question about introducing more threads for handling
the packet-stream protocol, we are currently moving in exactly the
opposite direction: we try to reduce the number of threads involved
in NIC handling. For example, we moved DDE-Linux to a single-threaded
mode of operation. By doing so, we reduce the number of context
switches and alleviate the need for synchronization. Because of the
good experience we had with DDE-Linux, we plan to change other
components that operate on packet streams (such as nic_bridge and
part_blk) in a similar way, modelling them as state machines instead
of using multiple blocking threads. The sketch below illustrates the
idea.
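
As an illustrative sketch (not the actual DDE-Linux code), such a
component processes both directions of a packet stream from a single
thread by using the non-blocking queue operations of the
packet-stream interface, so no locks are needed for the session
state. The signal wiring that triggers the two handlers is omitted:

  /* handle both packet-stream directions from one thread */
  struct Nic_handler
  {
    Nic::Connection &_nic;

    /* invoked when the driver has submitted new packets */
    void _handle_rx()
    {
      while (_nic.rx()->packet_avail() && _nic.rx()->ready_to_ack()) {
        Genode::Packet_descriptor packet = _nic.rx()->get_packet();
        /* ...process payload... */
        _nic.rx()->acknowledge_packet(packet);
      }
    }

    /* invoked when the driver acknowledged outgoing packets */
    void _handle_tx_ack()
    {
      while (_nic.tx()->ack_avail())
        _nic.tx()->release_packet(_nic.tx()->get_acked_packet());
    }
  };

Because each handler runs to completion before the next event is
dispatched, the component behaves as a state machine rather than a
set of blocking threads.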

In short, we think that there is not much to gain (in terms of
performance) from distributing I/O-bound work over multiple CPU
threads.

> 3. What is the suggested "best" way to debug IPC and profile applications?

We used to use the Fiasco.OC trace buffer for that, in particular the
IPC-tracing feature. But admittedly, it is quite difficult to map the
low-level information gathered by the kernel to the high-level
application view. E.g., the kernel stores the kernel-internal names
of kernel objects, which are unrelated to how those objects are named
in userland. So looking at the trace buffer is confusing, not to
mention interpreting RPC protocols, which is hardly feasible for a
workload of even modest complexity.

Fortunately, I can report that there is work under way to equip Genode
with a tracing infrastructure, which will allow us to gather traces
about RPCs (including the actual RPC function names), contended locks,
and signals. The new facility is currently developed by Josef Söntgen
(cnuke at GitHub). We plan to include it in Genode 13.08.

Cheers
Norman

-- 
Dr.-Ing. Norman Feske
Genode Labs

http://www.genode-labs.com · http://genode.org

Genode Labs GmbH · Amtsgericht Dresden · HRB 28424 · Sitz Dresden
Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth



