Hi, I want to benchmark the execution of a function running in the secure world of the TZ_VMM scenario in the i.MX53 QSB.
I have added a syscall to Linux which allows me to trigger a world switch from a user program running in Linux. In this program I have a function which allocates a buffer and processes it (each buffer position is changed in some way). This same function is coded inside TZ_VMM.
This is what I'm testing:
1. Inside my user program in Linux I use gettimeofday before and after the execution of the function in order to get the elapsed milliseconds. This is my NW test.
2. Inside my user program in Linux I use gettimeofday to get the start time, then I execute the syscall which in turn does a world switch. The function is executed inside the SW and control returns to the user program in Linux. After this I call gettimeofday again to get the elapsed milliseconds.
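As a minimal sketch of what I'm doing (process_buffer and tz_process_buffer are just placeholder names for my local implementation of the transformation and for the wrapper around the added syscall):

#include <stdio.h>
#include <sys/time.h>

/* hypothetical wrapper around the added Linux syscall that triggers the
 * SMC world switch and lets TZ_VMM process the buffer in the secure world */
extern long tz_process_buffer(unsigned *buf, unsigned long size);

/* the same transformation, implemented in the normal-world program */
extern void process_buffer(unsigned *buf, unsigned long size);

static long elapsed_ms(struct timeval *t0, struct timeval *t1)
{
    return (t1->tv_sec  - t0->tv_sec)  * 1000
         + (t1->tv_usec - t0->tv_usec) / 1000;
}

void benchmark(unsigned *buf, unsigned long size)
{
    struct timeval t0, t1;

    /* test 1: process the buffer in the normal world */
    gettimeofday(&t0, NULL);
    process_buffer(buf, size);
    gettimeofday(&t1, NULL);
    printf("NW: %ld ms\n", elapsed_ms(&t0, &t1));

    /* test 2: world switch, processing in the secure world, switch back */
    gettimeofday(&t0, NULL);
    tz_process_buffer(buf, size);
    gettimeofday(&t1, NULL);
    printf("SW: %ld ms\n", elapsed_ms(&t0, &t1));
}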
The problem is that test 1 is giving me about 90 ms of real time execution, but test 2 gives me about 40 ms.
I suspect it might be a problem with Linux virtualization in the TZ_VMM example, which may be causing a drift in Linux's clock once it loses control to the SW. What I mean is: when there isn't a syscall triggering the SMC, Linux can keep time just fine, but once control is handed over to the secure world, the clock inside Linux becomes inconsistent and doesn't advance while the secure world is executing. Is this right?
Since I really need to benchmark a scenario similar to this I think that the best alternative is to offload the time functionality to Genode (SW). I create another syscall which is responsible for starting a timer inside Genode, then I call the SMC syscall which processes the buffer in the SW, then I call the time syscall again and check the difference. When I want to benchmark the NW function I follow the same steps as before. Will this work as intended?
I'm thinking that this alternative may suffer from the same problem as before if Genode's time clock becomes inconsistent whenever Linux is being executed in NW.
Do you know any other way to benchmark a world switch + processing + world switch scenario? Is there any timer I can execute inside TZ_VMM?
Thanks in advance, Tiago
Hello Tiago,
On 06/21/2016 03:30 PM, Tiago Brito wrote:
Hi, I want to benchmark the execution of a function running in the secure world of the TZ_VMM scenario in the i.MX53 QSB.
I have added a syscall to Linux which allows me to trigger a world switch from a user program running in Linux. In this program I have a function which allocates a buffer and processes it (each buffer position is changed in some way). This same function is coded inside TZ_VMM.
This is what I'm testing:
1. Inside my user program in Linux I use gettimeofday before and after the execution of the function in order to get the elapsed milliseconds. This is my NW test.
2. Inside my user program in Linux I use gettimeofday to get the start time, then I execute the syscall which in turn does a world switch. The function is executed inside the SW and control returns to the user program in Linux. After this I call gettimeofday again to get the elapsed milliseconds.
The problem is that test 1 is giving me about 90 ms of real time execution, but test 2 gives me about 40 ms.
Well, I do not know how big your buffer is or how computation-intensive the operation is, but in general it is not unreasonable that a computation-intensive task completes faster in the secure world than in the normal world, given our experimental TrustZone VMM/hypervisor. The reason is that the secure world immediately receives any secure IRQ, e.g., during the normal-world buffer processing, which can cause additional, potentially expensive world switches. In contrast, while the secure world is executing, it is not "disturbed" by normal-world IRQs, which means: no additional world switches. Nevertheless, that alone probably does not explain the mighty gap of 50 ms.
I suspect it might be a problem with Linux virtualization in the TZ_VMM example, which may be causing a drift in Linux's clock once it loses control to the SW. What I mean is: when there isn't a syscall triggering the SMC, Linux can keep time just fine, but once control is handed over to the secure world, the clock inside Linux becomes inconsistent and doesn't advance while the secure world is executing. Is this right?
That is exactly right. As I've described above, Linux won't receive any IRQs as long as the secure world is executing.
Since I really need to benchmark a scenario similar to this I think that the best alternative is to offload the time functionality to Genode (SW). I create another syscall which is responsible for starting a timer inside Genode, then I call the SMC syscall which processes the buffer in the SW, then I call the time syscall again and check the difference. When I want to benchmark the NW function I follow the same steps as before. Will this work as intended?
It sounds quite expensive, but should work in general.
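Just as a rough sketch (the handler name is hypothetical, and how you dispatch the new syscall within the VMM is up to you), the Genode-side part could use the regular timer session like this:

#include <timer_session/connection.h>

static Timer::Connection timer;

/*
 * Hypothetical handler, called by the VMM whenever the added
 * "read time" syscall traps into the secure world. It returns the
 * milliseconds elapsed since the timer session was created, which is
 * enough to compute the difference between two calls.
 */
unsigned long handle_time_syscall()
{
    return timer.elapsed_ms();
}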
I'm thinking that this alternative may suffer from the same problem as before if Genode's time clock becomes inconsistent whenever Linux is being executed in NW.
No, Genode's timer service will work consistently, because its secure IRQ is prioritized higher than the normal-world IRQs of Linux.
Do you know any other way to benchmark a world switch + processing + world switch scenario? Is there any timer I can execute inside TZ_VMM?
Well, in theory, if you need a bounded IRQ latency in the normal world, you have to guarantee that the normal world is executed regularly. Therefore, you would need to turn your synchronous secure-world call into an asynchronous one. Right now, the normal world is not executed until the call returns. In the asynchronous case, the "SMC" call would return immediately, and for the response the VMM would instead inject an IRQ into the normal world. Moreover, the normal world's execution context must not be prioritized lower than the secure-world component that does the buffer processing. However, this way you would turn the whole scenario into a fundamentally different execution model with a lot of implications regarding security and liveness. For example, the VMM cannot count on the consistency of the shared memory because the normal world may execute in parallel, and a higher priority of the VM can lead to starvation of secure components.
To sum it up, if it's "just" for the measurements, I would not change the fundamental setup if I were in your position.
Regards Stefan
Hi Tiago,
I'm thinking that this alternative may suffer from the same problem as before if Genode's time clock becomes inconsistent whenever Linux is being executed in NW.
Do you know any other way to benchmark a world switch + processing + world switch scenario? Is there any timer I can execute inside TZ_VMM?
have you considered the use of a performance counter for measuring low-level code paths? For reference, you may take a look at the 'timestamp' function for ARM:
https://github.com/genodelabs/genode/blob/master/repos/os/include/spec/arm_v...
Compared to the other time sources, the counter is precise while having very little overhead. The exact meaning of the counter value may depend on the platform. E.g., on the Raspberry Pi where I used it, the counter increases every 64 clock cycles.
As far as I know, the feature must be explicitly enabled by adding the following line to your <build-dir>/etc/specs.conf:
SPECS += perf_counter
Be aware that further (TZ configuration) steps may be required to expose the counter to the normal world.
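As a rough sketch of how a measurement inside the secure-world component could look (assuming the 'timestamp' function from the header referenced above, and a placeholder process_buffer function for the code path under test):

#include <trace/timestamp.h>
#include <base/printf.h>

/* placeholder for the code path you want to benchmark */
extern void process_buffer(unsigned *oldp, unsigned *newp, unsigned size);

void measure(unsigned *oldp, unsigned *newp, unsigned size)
{
    Genode::Trace::Timestamp const start = Genode::Trace::timestamp();

    process_buffer(oldp, newp, size);

    Genode::Trace::Timestamp const end = Genode::Trace::timestamp();

    /* the unit of the counter is platform-specific (e.g., one increment per
     * 64 clock cycles on the Raspberry Pi), so treat it as a relative measure */
    PINF("processing took %u timestamp ticks", (unsigned)(end - start));
}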
Cheers Norman
Thanks for the replies, they were helpful!
I wasn't using the -O3 optimization flag for both the code running in the NW and in the SW. Now I am, and the times are pretty similar between the NW and the SW execution for the example I was testing.
Now I'm testing another example and I'm getting some interesting results. The code below represents an image transformation: I go through every position in an array of integers and compute the new array values as a slight modification of the old values:
// start timer here
for (i = 0; i < size; i++) {
    color = oldp[i];
    alpha = (color >> 24) & 0xff;
    red   = (color >> 16) & 0xff;
    green = (color >>  8) & 0xff;
    blue  =  color        & 0xff;
    lum   = (int)(red * 0.299 + green * 0.587 + blue * 0.114);
    newp[i] = (alpha << 24) | (lum << 16) | (lum << 8) | lum;
}
// end timer here
// check timer diff and print result
I'm testing this same exact code on both the Secure and Nonsecure domains. In the NW I'm getting about 155 ms of execution time, which for that buffer and transformation seems ok. On the other hand, the SW is giving me about 610 ms of execution time.
I can't seem to find a reasonable explanation for this time difference, since the code running in both scenarios is exactly the same. The secure code is running inside the TZ_VMM example.
Do you have an idea what might be happening here?
Thanks in advance, Tiago
Hello Tiago,
On Thu, Jun 23, 2016 at 02:40:10PM +0100, Tiago Brito wrote:
// start timer here
for (i = 0; i < size; i++) {
    color = oldp[i];
    alpha = (color >> 24) & 0xff;
    red   = (color >> 16) & 0xff;
    green = (color >>  8) & 0xff;
    blue  =  color        & 0xff;
    lum   = (int)(red * 0.299 + green * 0.587 + blue * 0.114);
    newp[i] = (alpha << 24) | (lum << 16) | (lum << 8) | lum;
}
// end timer here
// check timer diff and print result
I'm testing this same exact code on both the Secure and Nonsecure domains. In the NW I'm getting about 155 ms of execution time, which for that buffer and transformation seems ok. On the other hand, the SW is giving me about 610 ms of execution time.
I can't seem to find a reasonable explanation for this time difference, since the code running in both scenarios is exactly the same. The secure code is running inside the TZ_VMM example.
Did you check that the generated binary code is similar? Did you try to measure only the run time of the for-loop in both worlds?
Regards, Christian
I did not check if the binary code is similar, but I did measure just the for-loop in both worlds and the times are those I described previously.
On the other hand, this code, which I used as test code before (previous messages in this post) does have a similar execution time in both worlds:
void bench(int n)
{
    int buf[1024];
    int i, j, k, r = 0;

    for (i = 0; i < 1024; i++) {
        buf[i] = 0;
    }

    for (j = 0; j < n; j++)
        for (i = 0; i < 1024; i++)
            for (k = 0; k < 1024; k++)
                buf[i] = buf[i] + j + k;

    for (i = 0; i < 1024; i++) {
        r += buf[i];
    }

    PINF("Ended Bench %d - %d", (int)buf[0], r);
}
I tested this with n = 100000 and it showed an execution time of about 500 ms in both worlds.
Hello Tiago,
On Thu, Jun 23, 2016 at 04:32:18PM +0100, Tiago Brito wrote:
I did not check if the binary code is similar, but I did measure just the for-loop in both worlds and the times are those I described previously.
You really should compare the binary code, as the example that is slower in the SW uses floating-point arithmetic, unless I'm mistaken. If the code is similar and the execution time still differs that much, there may be an issue with the FPU handling in the SW.
Greets, Christian
Hi again, I'm using Linaro's arm-linux-gnueabihf 5.3 toolchain to compile the application running on top of the NW Linux, but I want to try compiling it with the same compiler version used for Genode, since the use of different compilers is probably interfering with my benchmark measurements (it's the only variable).
I tried using the GCC from Genode's tool chain, but I'm getting several "No such file or directory" errors for the stdio.h and string.h includes. My application also uses sockets, so the corresponding include files are needed and may cause similar errors during compilation.
Do you have any suggestion on how to solve this and get the same basic for-loop to show similar performance results in both the Normal and Secure World execution contexts?
Thanks, Tiago
Hi, so after comparing the binaries I realized that my NW application is using hard-float instructions, unlike its SW counterpart.
I changed my NW application's toolchain to one which supports soft float in order to check if my measurements are consistent. The execution-time gap between the two for-loops decreased significantly, but there is still a 100 ms gap between the two execution times.
Before the NW was measuring 155 ms and the SW was measuring 610 ms. Now the NW is measuring 500 ms whilst the SW is measuring the same 610 ms as before.
My theory is that Genode's scheduling might be delaying the SW execution. I say this because I added a print statement to the resume function of the thread scheduler, and it prints several times while the SW for-loop is executing.
What I want to ask is, is my theory plausible? Would the SW scheduler delay the execution by 100 ms? It seems a bit too much time... What can I do to shorten this time gap between both executions?
Thanks in advance, Tiago
Hi Tiago,
On 06/28/2016 11:55 AM, Tiago Brito wrote:
Hi, so after comparing the binaries I realized that my NW application is using hard-float instructions, unlike its SW counterpart.
I changed my NW application's toolchain to one which supports soft float in order to check if my measurements are consistent. The execution-time gap between the two for-loops decreased significantly, but there is still a 100 ms gap between the two execution times.
Before the NW was measuring 155 ms and the SW was measuring 610 ms. Now the NW is measuring 500 ms whilst the SW is measuring the same 610 ms as before.
My theory is that Genode's scheduling might be delaying the SW execution. I say this because I added a print statement to the resume function of the thread scheduler, and it prints several times while the SW for-loop is executing.
What I want to ask is, is my theory plausible? Would the SW scheduler delay the execution by 100 ms? It seems a bit too much time... What can I do to shorten this time gap between both executions?
By default, if you do not configure any CPU quota or priority, the kernel will schedule round-robin. As long as the "normal" world is not stopped during the calculation within your "secure" Genode component, the two will be executed side by side. But I wonder why it is not stopped - did you change the execution model?
If you want to ensure that your specific calculation routine is always executed when it is runnable, you have to add a:
<resource name="CPU" quantum="100"/>
in its start node within the XML configuration of the init component. This will give 100% of the CPU quota to your component. But be aware that you can easily starve other components this way, as long as your component never blocks.
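For illustration, a minimal sketch of such a start node - the component name "tz_vmm" and the RAM quantum are just placeholders for your actual configuration:

<start name="tz_vmm">
    <resource name="RAM" quantum="16M"/>
    <resource name="CPU" quantum="100"/>
</start>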
Regards Stefan