Hi,
While investigating the cause for the performance discrepancy of Virtualbox on NOVA vs hw_x86_64[_muen] [1], I have determined, with help from Alexander, that Virtualbox thread priorities [2] are not applied on base-hw. Even though priorities are specified when constructing the Cpu_connection [3], they have no effect since no quota is subsequently assigned/transferred. According to the 14.11 release notes [4], threads without quota have no guarantee to be scheduled and their priority is ignored. I believe this is the main cause for the difference in execution speed of Virtualbox on NOVA and base-hw.
I was wondering if you could confirm my analysis of the issue and provide some pointers on how to best achieve the proper application of Virtualbox thread priorities on base-hw.
Regards, Adrian
[1] - Even in recompiler-only mode the guest execution on NOVA is considerably faster than on hw_x86_64 or hw_x86_64_muen
[2] - https://www.virtualbox.org/browser/vbox/trunk/include/iprt/thread.h#L90
[3] - https://github.com/genodelabs/genode/blob/master/repos/ports/src/virtualbox/...
[4] - http://genode.org/documentation/release-notes/14.11#Trading_CPU_time_between...
Hi Adrian,
On 02.07.2015 12:02, Adrian-Ken Rueegsegger wrote:
Hi,
While investigating the cause for the performance discrepancy of Virtualbox on NOVA vs hw_x86_64[_muen] [1], I have determined, with help from Alexander, that Virtualbox thread priorities [2] are not applied on base-hw. Even though priorities are specified when constructing the Cpu_connection [3], they have no effect since no quota is subsequently assigned/transferred. According to the 14.11 release notes [4], threads without quota have no guarantee to be scheduled and their priority is ignored. I believe this is the main cause for the difference in execution speed of Virtualbox on NOVA and base-hw.
First of all, be aware that in the 15.05 release the integration of the base-hw scheduling model has changed a bit [1]. Let's say you configure a component with something like '<resource name="CPU" quantum="30"/>'. Before 15.05, these 30% of the parent quota were merely transferred to the env()->cpu_session() of the new component without any effect until you explicitly say "thread X shall receive Y percent of this quota" inside the component.
Since 15.05, all quota of a CPU session is automatically and permanently distributed to the threads of the session. So as long as there is only the main thread in our example component, this thread receives 30% of the parent quota. After starting two additional threads, every thread would receive 10% and so on.
To influence this distribution, there are the so-called thread weights. By default, every thread gets the same weight 10. But if you create a thread with a higher weight, it receives relatively more quota from its session. So, in the above example, if there is the main thread (default weight 10) and an additional thread with weight 90, the main thread receives 3% of the parent quota while the other thread receives 27%.
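Just to illustrate the arithmetic behind that example (a plain standalone C++ snippet, not the Genode API; the numbers are the ones from the paragraph above):

  #include <cstdio>

  int main()
  {
      unsigned const session_percent = 30;         /* <resource name="CPU" quantum="30"/> */
      unsigned const weights[]       = { 10, 90 }; /* main thread, additional thread */
      unsigned const total_weight    = 10 + 90;

      /* each thread's share of the parent quota is proportional to its weight */
      for (unsigned w : weights)
          std::printf("thread with weight %u gets %u%% of the parent quota\n",
                      w, session_percent * w / total_weight);
      return 0;
  }

This prints 3% and 27%, matching the example above.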
However, you're right in general. CPU quota is our way to restrict the time that a priority is allowed to get effective during a scheduling super-period. So, if a prioritized thread has quota 0, the priority is never applied by the scheduler.
I was wondering if you could confirm my analysis of the issue and provide some pointers on how to best achieve the proper application of Virtualbox thread priorities on base-hw.
AFAIK, in the virtualbox component, we create a CPU session for each priority level. To get these priorities active, the following would have to be done:
* Configure the vbox component to receive some quota from init (add <resource name="CPU" quantum="${PERCENTAGE}"/> as subnode to the vbox start node)
* Set the reference account of every new CPU connection in vbox to the env session by doing my_cpu_connection.ref_account(Genode::env()->cpu_session()). This enables you to transfer quota from the reference account to your CPU connection.
* Let env()->cpu_session() transfer its quota to the new CPU connections. This is a bit tricky. When transferring, you state the percentage of the source-session quota scaled up to Cpu_session::QUOTA_LIMIT. So, if you state (Cpu_session::QUOTA_LIMIT / 2), this means 50% of the source-session quota while (Cpu_session::QUOTA_LIMIT / 3) means 33% and so on. Furthermore, the value refers to the quota that is left at the source session, not its initial quota. Thus, if you want to distribute quota evenly to 3 other sessions A, B, C, start with transferring 33% to A, then transfer 50% to B, and then transfer 100% to C. For an example see Init::Child::Resources::transfer_cpu_quota in os/include/init/child.h. It takes the percentages that refer to the initial env quota from the init config and translates them to percentages that refer to the latest env quota. (A small code sketch of this and the previous step follows after this list.)
* Optional: Specify individual weights for the vbox threads when constructing them. If you roughly know which thread needs more time for its task and which one needs less, specifying individual weights may optimize the overall performance. It goes without saying that this also depends on how you've distributed quota in the previous steps. Thread weights influence only the session-local distribution.
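For illustration, a rough sketch of steps 2 and 3 (untested and written from memory, so the exact signatures of ref_account()/transfer_quota() as well as the labels and percentages are assumptions):

  #include <base/env.h>
  #include <cpu_session/connection.h>

  static void assign_cpu_quota()
  {
      using namespace Genode;

      /* CPU connections for two priority levels (labels are placeholders) */
      static Cpu_connection cpu_hi("vbox-prio-high");
      static Cpu_connection cpu_lo("vbox-prio-low");

      /* step 2: make the env CPU session the reference account of each connection */
      cpu_hi.ref_account(env()->cpu_session_cap());
      cpu_lo.ref_account(env()->cpu_session_cap());

      /* step 3: hand 40% of the env quota to cpu_hi ... */
      env()->cpu_session()->transfer_quota(cpu_hi.cap(),
                                           Cpu_session::QUOTA_LIMIT * 4 / 10);

      /* ... and 100% of what is then left (i.e., the remaining 60%) to cpu_lo */
      env()->cpu_session()->transfer_quota(cpu_lo.cap(),
                                           Cpu_session::QUOTA_LIMIT);
  }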
Now, let's say you have 4 threads. Thread A needs low latencies but has little to do when it gets scheduled. Thread B also needs low latencies but normally has a bigger workload than thread A. Thread C does not need low latencies but should nevertheless have the guarantee to be scheduled frequently. In addition, thread C normally has the biggest workload. Thread D needs no guarantees about how frequently or with which latency it gets scheduled. So, an exemplary configuration would be the following:
CPU connection 1: 40% of env quota, higher prio
  thread A: weight 1 (10% of env quota)
  thread B: weight 3 (30% of env quota)

CPU connection 2: 60% of env quota, lower prio
  thread C: default weight (60% of env quota)

CPU connection 3: 0% of env quota, prio doesn't matter
  thread D: default weight (0% of env quota)
Thread D could also be created at the env session if you distribute all env quota. Furthermore, if you want to do this distribution on a more reasonable basis, you can use Cpu_session::quota(). It tells you how long a super period is in microseconds and how much of that time your session owns.
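Used roughly like this (a sketch only; the field names of the returned struct are from memory and may differ):

  #include <base/env.h>
  #include <base/printf.h>
  #include <cpu_session/cpu_session.h>

  static void print_own_cpu_share()
  {
      using namespace Genode;

      /* field names assumed: super_period_us and us */
      Cpu_session::Quota const q = env()->cpu_session()->quota();

      PLOG("super period %lu us, of which this session owns %lu us",
           (unsigned long)q.super_period_us, (unsigned long)q.us);
  }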
Be aware that there might be a problem with CPU quotas on hw_x86_64 currently. The test os/cpu_quota.run doesn't succeed on hw_x86_64 (it states that it is successful, but that is a bug because err in check_counter in cpu_quota.run is 1 and not 0.01). There might be a problem with the timer or the quota calculation (64 bit). I'll have a look at this ASAP. However, the test finishes and shows a correct CPU-utilization stat over all counter threads. The source code might also serve as a showcase for how CPU quotas (should) work.
If you have further questions, don't hesitate to ask ;)
Cheers, Martin
[1] http://genode.org/documentation/release-notes/15.05#Dynamic_thread_weights
Hello Martin,
Thank you for your insightful comments.
On 07/02/2015 04:48 PM, Martin Stein wrote:
Hi Adrian,
[...]
- Configure the vbox component to receive some quota from init (add <resource name="CPU" quantum="${PERCENTAGE}"/> as subnode to the vbox start node)
- Set the reference account of every new CPU connection in vbox to the env session by doing my_cpu_connection.ref_account(Genode::env()->cpu_session()). This enables you to transfer quota from the reference account to your CPU connection.
- Let env()->cpu_session() transfer its quota to the new CPU connections. This is a bit tricky. When transferring, you state the percentage of the source-session quota scaled up to Cpu_session::QUOTA_LIMIT. So, if you state (Cpu_session::QUOTA_LIMIT / 2), this means 50% of the source-session quota while (Cpu_session::QUOTA_LIMIT / 3) means 33% and so on. Furthermore, the value refers to the quota that is left at the source session, not its initial quota. Thus, if you want to distribute quota evenly to 3 other sessions A, B, C, start with transferring 33% to A, then transfer 50% to B, and then transfer 100% to C. For an example see Init::Child::Resources::transfer_cpu_quota in os/include/init/child.h. It takes the percentages that refer to the initial env quota from the init config and translates them to percentages that refer to the latest env quota.
I have implemented the above-mentioned steps, see [1], and the boot time of a custom Linux (buildroot) from the bootloader menu to the prompt has been halved from ~1 min 03 seconds down to ~33 seconds. However, it seems that tweaking/playing around with the quota values [2] has no further effect on the execution time. Changing the base-hw super period to 100ms and the timeslice to 1ms in [3] reduces the boot time to ~19 seconds. I am not quite sure about the exact reason for the speedup but I presume it is due to the fact that the super period is much shorter and thus the quota of each thread is refilled more frequently.
[...]
Be aware that there might be a problem with CPU quotas on hw_x86_64 currently. The test os/cpu_quota.run doesn't succeed on hw_x86_64 (it states that it is successful, but that is a bug because err in check_counter in cpu_quota.run is 1 and not 0.01). There might be a problem with the timer or the quota calculation (64 bit). I'll have a look at this ASAP. However, the test finishes and shows a correct CPU-utilization stat over all counter threads. The source code might also serve as a showcase for how CPU quotas (should) work.
I gave the cpu_quota scenario including your changes regarding #1616 [4] a try on hw_x86_64 but it seems that it does not complete. Should the test pass successfully with your latest changes?
As this execution time is still a lot slower than e.g. NOVA/Virtualbox, which boots the same system in about 7 seconds, there still seem to be lingering issue(s) with regard to the base-hw scheduling. I would be glad if you could investigate this problem.
Thanks and regards, Adrian
[1] - https://github.com/codelabs-ch/genode/blob/hw_x86_64_muen-vbox/repos/ports/s...
[2] - https://github.com/codelabs-ch/genode/commit/1ffa792da3fdf9850ebf4252f34a4dc...
[3] - https://github.com/codelabs-ch/genode/blob/hw_x86_64_muen-vbox/repos/base-hw...
[4] - https://github.com/m-stein/genode/commits/1616_cpu_quota_run_some_issues
Hi Adrian,
On 06.07.2015 15:07, Adrian-Ken Rueegsegger wrote:
I have implemented the above-mentioned steps, see [1], and the boot time of a custom Linux (buildroot) from the bootloader menu to the prompt has been halved from ~1 min 03 seconds down to ~33 seconds.
Great to hear that the configuration already had a positive effect :)
However it seems that tweaking/playing around with the quota values [2] has no further effect on the execution time.
There is a bug [1] in the current master branch that prevents the initial quota configuration on the construction of a thread. Thus, threads merely receive quota when another thread at the session gets created or destructed. Could you please apply the commit in the referenced issue and try again.
Changing the base-hw super period to 100ms and the timeslice to 1ms in [3] reduces the boot time to ~19 seconds. I am not quite sure about the exact reason for the speedup but I presume it is due to the fact that the super period is much shorter and thus the quota of each thread is refilled more frequently.
I've talked with Alexander about that. Our main ideas regarding this:
* The shorter super period is mainly a benefit for threads with quota. It makes it less probable that quota threads leave part of their quota unused during a super period and, as you mentioned, quota gets refilled more often. Thus, non-quota threads more often have only the unassigned quota left for their purposes. The shorter time slice lowers the latency during non-quota execution but also the throughput during this mode. In fact, finding the best super period and time slice is an open problem because we don't have much "real world" data regarding the base-hw scheduling so far.
* For comparison with NOVA, it might be a good idea to assign 100% of the CPU quota, because then, priorities are absolute also on base-hw. Later, you may switch the best-effort parts of the scenario back to a non-quota mode and see if the performance benefits from it.
* For a better debugging basis it might be useful to get the CPU-utilization stats of the scenario. This can be achieved in the kernel quite simply and without distorting the results. You can find a basic implementation and demonstration on my branch [2] (all .* commits). The output is triggered via Kernel::print_char(0), which can be called from anywhere in the userland, but you may also print it e.g. on the console-input IRQ through THREAD_QUOTA_STATS_IRQ in the last commit (a tiny trigger sketch follows after this list). The printed "spent" values are the timer ticks that a thread has used on its own behalf or spent helping another thread. The printed "used" values are the timer ticks during which the thread was really executed (on its own or through a helper).
* Besides vbox, the Genode device drivers (especially the timer) should also receive a reasonable amount of quota. On X86 the timer is also pretty time-intensive. It should be able to update its state at least 19 times during a super period. In my qemu tests with cpu_quota.run, 10% of the super period (5 ms per update) were definitely enough for the timer, but I assume it also works with less quota.
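Regarding the third point, the trigger is just a single kernel call, roughly like this (the header path is an assumption and may differ on my branch):

  #include <kernel/interface.h>   /* assumed location of Kernel::print_char() */

  /* call this anywhere in the userland, e.g. from a debug hook */
  static void dump_cpu_quota_stats()
  {
      Kernel::print_char(0);   /* argument 0 triggers the per-thread quota stats */
  }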
I gave the cpu_quota scenario including your changes regarding #1616 [4] a try on hw_x86_64 but it seems that it does not complete. Should the test pass successfully with your latest changes?
I think the problem is the timer. On ARM, the timer is configured once to an X-second timeout and then sleeps until this timeout is over. On X86, however, the timer needs the guarantee to be scheduled frequently also during the timeout, as mentioned above. Thus, the test simply takes too long although the measured results are good in principle. I'm currently working on this and will keep you up to date.
As this execution time is still a lot slower than e.g. NOVA/Virtualbox, which boots the same system in about 7 seconds, there still seem to be lingering issue(s) with regard to the base-hw scheduling. I would be glad if you could investigate this problem.
Following my remarks above, I think that we can still raise the performance through modifications at the application level. It would also be helpful to see your CPU-utilization stats if the scheduling remains a problem.
Cheers, Martin
[1] https://github.com/genodelabs/genode/issues/1620
[2] https://github.com/m-stein/genode/tree/hw_cpu_quota_stats
Hi Adrian,
On 07.07.2015 15:26, Martin Stein wrote:
I think the problem is the timer. On ARM, the timer is configured once to an X-second timeout and then sleeps until this timeout is over. On X86, however, the timer needs the guarantee to be scheduled frequently also during the timeout, as mentioned above. Thus, the test simply takes too long although the measured results are good in principle. I'm currently working on this and will keep you up to date.
I've pushed a commit [1] that solves the problems with the cpu_quota test on X86.
Cheers, Martin
Hello Martin,
On 07/07/2015 05:27 PM, Martin Stein wrote:
Hi Adrian,
[...]
I've pushed a commit [1] that solves the problems with the cpu_quota test on X86.
I just gave the current staging branch (4b7b440...) a try and while the cpu_quota scenario completes successfully on QEMU, it seems to be stuck after the test_[fast|midl|slow] results are logged. It appears that the "Conclusion" part of the script [1] is never run.
Regards, Adrian
[1] - https://github.com/genodelabs/genode/blob/staging/repos/os/run/cpu_quota.run...
On 07/08/2015 06:07 PM, Adrian-Ken Rueegsegger wrote:
I just gave the current staging branch (4b7b440...) a try and while the cpu_quota scenario completes successfully on QEMU, it seems to be stuck after the test_[fast|midl|slow] results are logged. It appears that the "Conclusion" part of the script [1] is never run.
I forgot to mention that the cpu_quota run script gets stuck when executed on real hardware (Lenovo T430s).
Regards, Adrian
Hi Adrian,
On 08.07.2015 18:07, Adrian-Ken Rueegsegger wrote:
I just gave the current staging branch (4b7b440...) a try and while the cpu_quota scenario completes successfully on QEMU, it seems to be stuck after the test_[fast|midl|slow] results are logged. It appears that the "Conclusion" part of the script [1] is never run.
Unfortunately, I was not able to reproduce this behavior with 4b7b440. My attempt to run cpu_quota on hw_x86_64 fully succeeded. Although I don't expect this to be the problem, which versions of TCL, expect, and qemu are you using? (I'm using qemu 2.0.0, expect 5.45 and TCL 8.6)
Cheers, Martin
Hello Martin,
On 07/10/2015 01:02 PM, Martin Stein wrote:
Hi Adrian,
On 08.07.2015 18:07, Adrian-Ken Rueegsegger wrote:
I just gave the current staging branch (4b7b440...) a try and while the cpu_quota scenario completes successfully on QEMU, it seems to be stuck after the test_[fast|midl|slow] results are logged. It appears that the "Conclusion" part of the script [1] is never run.
Unfortunately, I was not able to reproduce this behavior with 4b7b440. My attempt to run cpu_quota on hw_x86_64 fully succeeded. Although I don't expect this to be the problem, which versions of TCL, expect, and qemu are you using? (I'm using qemu 2.0.0, expect 5.45 and TCL 8.6)
Sorry for the inaccurate wording which I tried to clarify in the follow-up email. It works for me on QEMU but the cpu_quota run script does not complete if I run it on real hardware (Lenovo T430s). There I see that the results are logged and "done" messages appear but the analysis of the results is never performed so I do not see a "Test succeeded" nor "Test failed" message.
Regards, Adrian
Hey Adrian,
On 10.07.2015 14:37, Adrian-Ken Rueegsegger wrote:
Sorry for the inaccurate wording which I tried to clarify in the follow-up email. It works for me on QEMU but the cpu_quota run script does not complete if I run it on real hardware (Lenovo T430s). There I see that the results are logged and "done" messages appear but the analysis of the results is never performed so I do not see a "Test succeeded" nor "Test failed" message.
Ah ok. Have you propagated the serial output to the run tool somehow? If not, the script is not able to detect whether the test succeeded or even when to start the detection. The run tool comes with several methods that can be used to achieve this, but one can also add a new method to capture the output. The existing methods are in tool/run/log and can be included via --include log/<METHOD> in your <BUILDIR>/etc/build.conf. You must also set the parameters of the chosen method in there. For example, via a direct serial connection:
RUN_OPT += \
  --include log/serial \
  --log-serial-cmd 'picocom -b 115200 /dev/<SERIALDEV>'
Or via ethernet:
RUN_OPT += \
  --include log/serial \
  --log-serial-cmd 'socat -s - tcp:<HOST>:<PORT>' \
  --log-serial-filter 'stdbuf -o0 tr "\200-\377" "."'
The filter in the second example is used because in this method sometimes bad chars (\200...\377) get generated that could crash the capture tool.
After configuring your build.conf accordingly, the script should stay active, listening on the declared interface, detect the "done" messages and then continue with the conclusion part.
Cheers, Martin
Hello Martin,
On 07/10/2015 03:44 PM, Martin Stein wrote:
Hey Adrian,
On 10.07.2015 14:37, Adrian-Ken Rueegsegger wrote:
Sorry for the inaccurate wording which I tried to clarify in the follow-up email. It works for me on QEMU but the cpu_quota run script does not complete if I run it on real hardware (Lenovo T430s). There I see that the results are logged and "done" messages appear but the analysis of the results is never performed so I do not see a "Test succeeded" nor "Test failed" message.
Ah ok. Have you propagated the serial output to the run tool somehow? [...]
Sorry, I wrongly assumed that the evaluation of the results from the log was done by the test binary on the host, which is obviously not the case. However, after wiring up the log output to the script using AMT, I get the following results:
[init -> init_2 -> test_fast] quota super period 1000000
[init -> init_2 -> test_fast] quota 599975
[init -> init_2 -> test_fast] counter A 2264346538
[init -> init_2 -> test_fast] counter B 3664974879
[init -> init_2 -> test_fast] done
[init -> init_2 -> test_midl] quota super period 1000000
[init -> init_2 -> test_midl] quota 199981
[init -> init_2 -> test_midl] counter A 2867295315
[init -> init_2 -> test_midl] counter B 306787595
[init -> init_2 -> test_midl] done
[init -> init_1 -> test_slow] quota super period 1000000
[init -> init_1 -> test_slow] quota 49987
[init -> init_1 -> test_slow] counter A 1458851167
[init -> init_1 -> test_slow] counter B 1609324261
[init -> init_1 -> test_slow] done
Bad: Slow counter A received 11.986 % CPU (goal 4.75 %)
Bad: Slow counter B received 13.222 % CPU (goal 5.25 %)
Bad: Middle counter A received 23.557 % CPU (goal 9.625 %)
Bad: Middle counter B received 2.521 % CPU (goal 15.375 %)
Bad: Fast counter A received 18.604 % CPU (goal 22.625 %)
Bad: Fast counter B received 30.111 % CPU (goal 42.375 %)
Test failed because of 6 errors
I ran the scenario for hw_x86_64 multiple times and the results are consistent with only very minor variance for Middle counter B (2.521% vs. 2.496%) and Fast counter B (30.111% vs 30.136%). The Genode version used is 4b7b440... and the hardware a Lenovo T430s. In QEMU the same scenario completes successfully:
[init -> init_2 -> test_fast] quota super period 1000000
[init -> init_2 -> test_fast] quota 599975
[init -> init_2 -> test_fast] counter A 32250924
[init -> init_2 -> test_fast] counter B 61848263
[init -> init_2 -> test_fast] done
[init -> init_2 -> test_midl] quota super period 1000000
[init -> init_2 -> test_midl] quota 199981
[init -> init_2 -> test_midl] counter A 13401918
[init -> init_2 -> test_midl] counter B 22814749
[init -> init_2 -> test_midl] done
[init -> init_1 -> test_slow] quota super period 1000000
[init -> init_1 -> test_slow] quota 49987
[init -> init_1 -> test_slow] counter A 6808621
[init -> init_1 -> test_slow] counter B 7609956
[init -> init_1 -> test_slow] done
Good: Slow counter A received 4.704 % CPU (goal 4.75 %)
Good: Slow counter B received 5.258 % CPU (goal 5.25 %)
Good: Middle counter A received 9.26 % CPU (goal 9.625 %)
Good: Middle counter B received 15.763 % CPU (goal 15.375 %)
Good: Fast counter A received 22.283 % CPU (goal 22.625 %)
Good: Fast counter B received 42.732 % CPU (goal 42.375 %)
Test succeeded
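As a side note, the "received" percentages appear to be each counter's share of the sum of all six counters; at least the QEMU numbers above reproduce exactly with that formula (a quick standalone check, not part of the run script):

  #include <cstdio>

  int main()
  {
      /* counter values from the QEMU log above */
      unsigned long long const counters[] = {
          6808621ULL,  7609956ULL,    /* slow   A, B */
          13401918ULL, 22814749ULL,   /* middle A, B */
          32250924ULL, 61848263ULL }; /* fast   A, B */

      unsigned long long total = 0;
      for (unsigned long long c : counters)
          total += c;

      /* prints 4.704, 5.258, 9.260, 15.763, 22.283, 42.732 */
      for (unsigned long long c : counters)
          std::printf("received %.3f %% CPU\n", 100.0 * c / total);
      return 0;
  }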
Do you have an idea why the effective numbers on hardware would be so far off the mark from the assigned quota numbers?
Regards, Adrian
Hi Adrian,
On 13.07.2015 12:28, Adrian-Ken Rueegsegger wrote:
Do you have an idea why the effective numbers on hardware would be so far off the mark from the assigned quota numbers?
When looking at your raw counter values, I assume that this is caused by overflows. Some of your hardware values are near (uint32_t)~0 while your Qemu values are all relatively small. I've addressed this issue recently in [1]. Commit [2] there introduces 64-bit counter values, but maybe it's also a good idea to check for overflows. The commits are on genodelabs/staging but not on genodelabs/master yet. Could you please give staging a try?
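Just to illustrate the suspected effect (standalone snippet, not Genode code): once a 32-bit counter wraps around, the reported value looks much smaller than the work that was actually done:

  #include <cstdint>
  #include <cstdio>

  int main()
  {
      uint32_t counter = 4000000000u;   /* close to the 32-bit limit */
      counter += 600000000u;            /* wraps instead of reaching 4.6e9 */

      std::printf("counter after wrap: %u\n", counter);   /* prints 305032704 */
      return 0;
  }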
Btw.: From tomorrow till the 1st of August I'm on vacation and won't read my mails.
Cheers, Martin
[1] https://github.com/genodelabs/genode/issues/1628
[2] https://github.com/m-stein/genode/commit/bf7855d259950339b1e8411c58161d779ec...
Hello Martin,
On 07/13/2015 12:47 PM, Martin Stein wrote:
When looking at your raw counter values, I assume that this is caused by overflows. Some of your hardware values are near (uint32_t)~0 while your Qemu values are all relatively small. I've addressed this issue recently in [1]. Commit [2] there introduces 64-bit counter values, but maybe it's also a good idea to check for overflows. The commits are on genodelabs/staging but not on genodelabs/master yet. Could you please give staging a try?
I cherry-picked the above-mentioned commit (d9ec7fa...) and the test now passes successfully:
[init -> init_2 -> test_fast] quota super period 1000000
[init -> init_2 -> test_fast] quota 599975
[init -> init_2 -> test_fast] counter A 8587417160
[init -> init_2 -> test_fast] counter B 16330577199
[init -> init_2 -> test_fast] done
[init -> init_2 -> test_midl] quota super period 1000000
[init -> init_2 -> test_midl] quota 199981
[init -> init_2 -> test_midl] counter A 3729701872
[init -> init_2 -> test_midl] counter B 5928486521
[init -> init_2 -> test_midl] done
[init -> init_1 -> test_slow] quota super period 1000000
[init -> init_1 -> test_slow] quota 49987
[init -> init_1 -> test_slow] counter A 1947120741
[init -> init_1 -> test_slow] counter B 2045329877
[init -> init_1 -> test_slow] done
Good: Slow counter A received 5.048 % CPU (goal 4.75 %)
Good: Slow counter B received 5.303 % CPU (goal 5.25 %)
Good: Middle counter A received 9.67 % CPU (goal 9.625 %)
Good: Middle counter B received 15.371 % CPU (goal 15.375 %)
Good: Fast counter A received 22.265 % CPU (goal 22.625 %)
Good: Fast counter B received 42.342 % CPU (goal 42.375 %)
Test succeeded
Thanks for all the help.
Regards, Adrian
Hello,
While spending some days out of the office, I kept thinking about your performance issue. Could it possibly be related to the spinlock that we use to protect the lock metadata? See [1] and [2] for the lock implementation.
I remember when Alex first introduced the use of multiple priorities within a single component (which was actually VirtualBox), we stumbled over the issue that a high-priority thread would eventually keep spinning on a contended spinlock. The spinlock holder, however, was never scheduled anymore because it had a lower priority than the lock applicant. Alex addressed this issue with a NOVA-specific spinlock that maintains priority inheritance. See [3].
[1] https://github.com/genodelabs/genode/blob/master/repos/base/include/base/can...
[2] https://github.com/genodelabs/genode/blob/master/repos/base/src/base/lock/lo...
[3] https://github.com/genodelabs/genode/blob/master/repos/base-nova/src/base/lo...
It would be interesting to see the spinlock-holding times of the lock's spinlock (by taking rdtsc values when acquiring and releasing the spinlock) in your version of VirtualBox on base-hw/Muen. One would expect the hold times to be just a few cycles. If they are orders of magnitude higher, we'd need to consider a priority-inheriting spinlock for base-hw.
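A possible way to instrument this could look as follows (just a sketch; where exactly to hook it into the base-hw lock code and which reporting threshold to use are up to you):

  #include <base/printf.h>

  /* read the x86 time-stamp counter */
  static inline unsigned long long rdtsc()
  {
      unsigned lo, hi;
      asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
      return ((unsigned long long)hi << 32) | lo;
  }

  /* measure how long the given critical section holds the lock's spinlock */
  template <typename FN>
  static void timed_spinlock_section(FN const &critical_section)
  {
      unsigned long long const t0 = rdtsc();
      critical_section();               /* work done while the spinlock is held */
      unsigned long long const t1 = rdtsc();

      if (t1 - t0 > 10000)              /* report suspiciously long holds (cycles) */
          Genode::printf("long spinlock hold: %llu cycles\n", t1 - t0);
  }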
Cheers Norman
Hello Martin,
On 07/07/2015 03:26 PM, Martin Stein wrote:
There is a bug [1] in the current master branch that prevents the initial quota configuration on the construction of a thread. Thus, threads merely receive quota when another thread at the session gets created or destructed. Could you please apply the commit in the referenced issue and try again.
I rebased our hw_x86_64_muen-vbox working branch(es) on staging (b84d81) which includes the fix. The measured time is unchanged at approximately 33-34 seconds from bootloader to the login prompt using the default super period and time slice values. With adjusted super period the boot time also stays about the same as without the fix.
[...]
Following my remarks above, I think that we can still raise the performance through modifications at the application level. It would also be helpful to see your CPU-utilization stats if the scheduling remains a problem.
Thank you for the explanations. I will work through your suggestions and collect some usage statistics for our scenario.
Regards, Adrian
Hello Martin,
On 07/07/2015 03:26 PM, Martin Stein wrote:
- For comparison with NOVA, it might be a good idea to assign 100% of the CPU quota, because then, priorities are absolute also on base-hw. Later, you may switch the best-effort parts of the scenario back to a non-quota mode and see if the performance benefits from it.
I have adjusted the virtualbox.run scenario accordingly and split the CPU quota between the components.
- For a better debugging basis it might be useful to get the CPU-utilization stats of the scenario. This can be achieved in the kernel quite simply and without distorting the results. You can find a basic implementation and demonstration on my branch [2] (all .* commits). The output is triggered via Kernel::print_char(0), which can be called from anywhere in the userland, but you may also print it e.g. on the console-input IRQ through THREAD_QUOTA_STATS_IRQ in the last commit. The printed "spent" values are the timer ticks that a thread has used on its own behalf or spent helping another thread. The printed "used" values are the timer ticks during which the thread was really executed (on its own or through a helper).
Based on your cpu quota stats commits I made some minor adjustments so the output is in CSV format [1], which makes it more suitable for processing by other tools. I wired the stats print to the keyboard IRQ.
The first set of numbers for the top 10 threads was collected when pressing a key in the guest VM bootloader. So they basically cover the base-hw/Virtualbox init and start:
pd_label             | label        | spent      | % of total
---------------------+--------------+------------+-----------
init -> virtualbox   | EMT          | 5283917631 |   39.8206%
init -> virtualbox   | virtualbox   | 4629194169 |   34.8865%
init -> rtc_drv      | rtc_ep       | 1439510978 |   10.8484%
init -> rtc_drv      | rtc_drv      |  698105922 |    5.2611%
init -> fb_drv       | vesa_ep      |  221176830 |    1.6668%
init                 | init         |  150973892 |    1.1378%
init -> timer        | timer_drv_ep |  142735932 |    1.0757%
init -> ps2_drv      | ps2_drv_ep   |  132941808 |    1.0019%
init -> ps2_drv      | ps2_drv      |  125508496 |    0.9459%
init -> platform_drv | platform_drv |  107468792 |    0.8099%
The second set was collected once the Linux system has booted to the command prompt:
pd_label           | label             | spent       | % of total
-------------------+-------------------+-------------+-----------
init -> virtualbox | EMT               | 97132063560 |   89.9964%
init -> virtualbox | virtualbox        |  4631019033 |    4.2908%
init -> timer      | timer_drv_ep      |  2135808255 |    1.9789%
init -> rtc_drv    | rtc_ep            |  1439510978 |    1.3338%
init -> rtc_drv    | rtc_drv           |   698105922 |    0.6468%
init -> virtualbox | Timer             |   313374324 |    0.2904%
init -> fb_drv     | vesa_ep           |   238903418 |    0.2214%
init -> virtualbox | alarm-timer       |   207690920 |    0.1924%
init -> timer      | timeout_scheduler |   199926872 |    0.1852%
core               | pager_activation  |   155622972 |    0.1442%
So the overwhelming majority of CPU time is spent by the Virtualbox EMT thread, which looks sensible. I assume that the time spent in hwaccel mode is all accounted to the EMT thread since it is the one that invokes Vm_session::run. I do not know the implementation details of rtc_drv but I did not expect it to make the top 10.
A sidenote: The difference between quota used and spent (total used - total spent) is 7029216 ticks. This value is identical for each measurement.
Following my remarks above, I think that we can still raise the performance through modifications at the application level. It would also be helpful to see your CPU-utilization stats if the scheduling remains a problem.
I experimented a bit with the CPU quotas in the virtualbox.run scenario by giving Vbox 70/80/90% and dividing the remaining percentage among the other components. The execution was only slightly faster with more CPU time, but the time differences were quite small, in the range of one second (compared to the total time of 33-35 seconds).
Do you see anything suspicious with these numbers or a potential for further optimization?
Regards, Adrian
[1] - https://github.com/codelabs-ch/genode/tree/hw_x86_64_muen-vbox-cpu_quota_sta...
Hello Adrian,
* Adrian-Ken Rueegsegger <ken@...309...> [2015-07-09 15:59:11 +0200]:
The first set of numbers for the top 10 threads was collected when pressing a key in the guest VM bootloader. So they basically cover the base-hw/Virtualbox init and start:
pd_label             | label        | spent      | % of total
---------------------+--------------+------------+-----------
init -> virtualbox   | EMT          | 5283917631 |   39.8206%
init -> virtualbox   | virtualbox   | 4629194169 |   34.8865%
init -> rtc_drv      | rtc_ep       | 1439510978 |   10.8484%
init -> rtc_drv      | rtc_drv      |  698105922 |    5.2611%
init -> fb_drv       | vesa_ep      |  221176830 |    1.6668%
init                 | init         |  150973892 |    1.1378%
init -> timer        | timer_drv_ep |  142735932 |    1.0757%
init -> ps2_drv      | ps2_drv_ep   |  132941808 |    1.0019%
init -> ps2_drv      | ps2_drv      |  125508496 |    0.9459%
init -> platform_drv | platform_drv |  107468792 |    0.8099%
The second set was collected once the Linux system has booted to the command prompt:
pd_label           | label             | spent       | % of total
-------------------+-------------------+-------------+-----------
init -> virtualbox | EMT               | 97132063560 |   89.9964%
init -> virtualbox | virtualbox        |  4631019033 |    4.2908%
init -> timer      | timer_drv_ep      |  2135808255 |    1.9789%
init -> rtc_drv    | rtc_ep            |  1439510978 |    1.3338%
init -> rtc_drv    | rtc_drv           |   698105922 |    0.6468%
init -> virtualbox | Timer             |   313374324 |    0.2904%
init -> fb_drv     | vesa_ep           |   238903418 |    0.2214%
init -> virtualbox | alarm-timer       |   207690920 |    0.1924%
init -> timer      | timeout_scheduler |   199926872 |    0.1852%
core               | pager_activation  |   155622972 |    0.1442%
Since the stats are absolute it would be best to reset them to zero before collecting them a second time. In this case the rtc_drv should not be in the top range anymore because it should not be used by vbox while it is running but only once when it starts.
I do not know the implementation details of rtc_drv but I did not expect it to make the top 10.
I did not expect this either. Although it is indeed somewhat expensive (i.e., it should be avoided to read it several times a second), I cannot imagine why it takes about 10% of the total time when starting vbox. It basically reads the time from the CMOS, which merely involves an IPC to core to read the I/O port. It would be good to know if the number of calls of the rtc_drv differs between base-hw and base-nova.
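To compare the call counts between base-hw and base-nova, a crude counter dropped into the driver's read path would probably already do, e.g. (sketch only; the function name and placement are assumptions):

  #include <base/printf.h>

  /* place this in the rtc_drv code path that reads the CMOS clock */
  static void count_rtc_read()
  {
      static unsigned long calls = 0;
      if (++calls % 10 == 0)
          Genode::printf("rtc_drv: time read %lu times\n", calls);
  }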
Regards Josef
Hello Josef,
On 07/10/2015 02:08 PM, Josef Söntgen wrote:
Since the stats are absolute it would be best to reset them to zero before collecting them a second time. In this case the rtc_drv should not be in the top range anymore because it should not be used by vbox while it is running but only once when it starts.
I collected another round of numbers that are non-cumulative for hw_x86_64 (not running on Muen). As before, the first measurement was done in the bootloader of the guest VM:
pd_label             | label        | spent    | % of total
---------------------+--------------+----------+-----------
init -> virtualbox   | EMT          | 98001213 |  37.45505%
init -> virtualbox   | virtualbox   | 83688021 |  31.98470%
init -> rtc_drv      | rtc_ep       | 36408986 |  13.91514%
init -> rtc_drv      | rtc_drv      | 23748123 |   9.07629%
init                 | init         |  4032147 |   1.54105%
init -> acpi_drv     | acpi_drv     |  2543088 |   0.97194%
init -> platform_drv | platform_drv |  2430257 |   0.92882%
init -> ps2_drv      | ps2_drv_ep   |  1830875 |   0.69974%
init -> timer        | timer_drv_ep |  1654987 |   0.63252%
init -> virtualbox   | VMPwrUp      |  1586461 |   0.60633%
And here the second measurement once the Linux login prompt is visible:
pd_label           | label             | spent      | % of total
-------------------+-------------------+------------+-----------
init -> virtualbox | EMT               | 1521043562 |  98.25379%
init -> timer      | timer_drv_ep      |   19607895 |   1.26660%
init -> virtualbox | Timer             |    2819414 |   0.18212%
init -> timer      | timeout_scheduler |    1971086 |   0.12732%
init -> virtualbox | alarm-timer       |    1298777 |   0.08390%
init -> virtualbox | ATA-1             |     525509 |   0.03395%
core               | pager_activation  |     522312 |   0.03374%
init -> fb_drv     | vesa_ep           |     181979 |   0.01176%
core               | entrypoint        |      61660 |   0.00398%
init -> virtualbox | OhciFramer        |      15831 |   0.00102%
So, looking at these numbers, the EMT thread is even more dominant and the rtc_drv is indeed gone: its spent/used figures of the second measurement are actually zero.
I do not know the implementation details of rtc_drv but I did not expect it to make the top 10.
I did not expect this either. Although it is indeed somewhat expensive (i.e., it should be avoided to read it several times a second), I cannot imagine why it takes about 10% of the total time when starting vbox. It basically reads the time from the CMOS, which merely involves an IPC to core to read the I/O port. It would be good to know if the number of calls of the rtc_drv differs between base-hw and base-nova.
Compared to the other work that is going on during startup, I would have guessed that the rtc_drv had very little to do... I am not quite sure whether this could contribute to the timing inaccuracies that we observe on hw_x86_64_muen. There, we have to specify the kernel parameters notsc and lpj with a fixed value so that Linux boots reliably.
Regards, Adrian