"Core" thread doesn't become active in kernel initialization


Bob Stewart

1 Jul 2015

9:42 p.m.

Hi,

I'm bringing up Genode (15.05) on a TI AM437x SoC and it is failing kernel initialization. The initialization appears to go OK until the call to init_kernel_mp_primary(). Once the Core_thread singleton is created, the last printf output that appears is one I inserted after the call to _become_active() in the constructor. A printf inserted into the singleton method after the Core_thread creation never shows up on the console. The initialization never completes.

The AM437x is a Cortex-A9MP platform with a single core. It is very similar to the Panda board as far as memory mapping goes.

Any hints on how to debug this situation further would be very helpful.

Thanks, Bob Stewart


Stefan Kalkowski

2 Jul 2015

10:47 a.m.

Hi Bob,


Hmm, that sounds really strange. The only thing that is done between leaving the "Core_thread" constructor and landing within the "Core_thread::singleton" function again is the call of "__cxa_guard_release", which marks the static object as being constructed successfully. Of course, you can add debug output there too. You can apply patch [1] to your branch, then compile core, and search for the guard variable of the Core_thread object like this:

genode-arm-nm bin/core | sort | c++filt | grep Core_thread::singleton

Take the address of the guard variable displayed, and replace the address "0xdeadbeef" within the patch with the address of the guard variable you've found. Thereby you can verify whether __cxa_guard_release is still executed successfully.

Maybe the kernel provokes a page fault (due to some caching issues or whatever)? To detect a kernel page fault you can use the kernel lock, which has no real use on a single-core system: it is used only to synchronize different cores. Therefore, if you detect that the kernel lock is still locked, you know that the kernel was entered although you were already in kernel mode just before. Patch [2] uses this knowledge. It extends the kernel lock to print some exception information when the kernel enters its lock twice.
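
The idea behind the kernel-lock check can be illustrated with a minimal sketch. This is not the pastebin patch itself (its contents are not reproduced in this thread): Debug_lock and its members are hypothetical names, std::printf stands in for the kernel's own output routine, and a plain flag is enough only because a single-core system has no concurrent lockers.

```cpp
#include <cassert>
#include <cstdio>

/* Hypothetical single-core debug lock, not Genode's kernel lock. */
struct Debug_lock
{
    bool     held           = false; /* true while the kernel is inside */
    unsigned nested_entries = 0;     /* entries seen while already held */

    void lock()
    {
        if (held) {
            /* entered again without leaving: an exception (for example
               a page fault) must have been taken in kernel mode */
            ++nested_entries;
            std::printf("kernel lock taken twice (%u)\n", nested_entries);
        }
        held = true;
    }

    void unlock() { held = false; }
};
```

If the message never appears, the kernel was never re-entered while it held the lock, which argues against a fault in kernel mode.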

Regards Stefan

[1] http://pastebin.com/C7gSVG6e
[2] http://pastebin.com/qLkuZr0f

_______________________________________________
genode-main mailing list
genode-main@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/genode-main

-- Stefan Kalkowski Genode Labs http://www.genode-labs.com/ · http://genode.org/

Bob Stewart

8:28 p.m.

Thank you for the suggestions, Stefan. I applied your patches. The __cxa_guard_release appears to be working fine as I get the three debug lines output. The lock() method you included in the second patch is never called so an exception is not being thrown. I'll move on to Martin's suggestions and read more of the TRM for the 437x.

Thanks, Bob


Martin Stein

11:46 a.m.

Hi Bob,

On 01.07.2015 21:42, Bob Stewart wrote:

> Once the Core_thread singleton is created the last printf output that appears is one I inserted after the call to _become_active() in the constructor. A printf inserted into the singleton method after the Core_thread creation never shows up on the console. The initialization never completes.

This might be caused by stack corruption. I would check whether the stack pointer (insert [1] after _become_active()) is inside the kernel_stack array (constraints via [2]). If the hardware CPU ID of your AM437x CPU is not 0, this might be a problem, as this value is used to index CPU-local data like the kernel stack. Another problem might be the CXA guard stuff that Stefan mentioned. In addition to Stefan's suggestions, you can try to avoid CXA guards by replacing [3] with [4]. This way, you can easily validate the assumption before digging deeper into it.

Cheers, Martin

[1] int i = 0;
    PINF("SP %p", &i);

[2] /usr/local/genode-gcc/bin/genode-arm-nm bin/core | grep "kernel_stack$"
    grep -r "KERNEL_STACK_SIZE =" base-hw/

[3] Thread & Core_thread::singleton()
    {
        static Core_thread ct;
        return ct;
    }

[4] #include <unmanaged_singleton.h>

    Thread & Core_thread::singleton()
    {
        return *unmanaged_singleton<Core_thread>();
    }

Bob Stewart

9:36 p.m.

Thank you for the suggestions, Martin.

Stack corruption does not appear to be the issue:

(a) Your PINF in [1] yields a run-time error -- "SP <warning: unsupported format string argument>p". (Not sure why that would be.)
(b) Replacing %p with 0x%x and applying the appropriate cast results in PINF showing "SP 0x810893f0".
(c) The kernel_stack$ symbol is set at "81079440 B kernel_stack".
(d) KERNEL_STACK_SIZE = 64 * 1024.

So the stack pointer is appropriately near the top of the stack, assuming it's growing from top to bottom.
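
The check described in this message can be written down as a small calculation. This is only a sketch, assuming the stack of the single CPU occupies the first KERNEL_STACK_SIZE bytes of kernel_stack and grows downwards; inside_stack and the constant names are made up for illustration, while the numbers are the ones reported above.

```cpp
#include <cassert>
#include <cstdint>

/* True if sp lies within [base, base + size]. */
constexpr bool inside_stack(std::uint32_t sp, std::uint32_t base,
                            std::uint32_t size)
{
    return sp >= base && sp <= base + size;
}

constexpr std::uint32_t kernel_stack_base = 0x81079440; /* from genode-arm-nm   */
constexpr std::uint32_t kernel_stack_size = 64 * 1024;  /* KERNEL_STACK_SIZE    */
constexpr std::uint32_t observed_sp       = 0x810893f0; /* from the PINF output */

/* distance between the top of the stack and the observed stack pointer */
constexpr std::uint32_t bytes_used =
    kernel_stack_base + kernel_stack_size - observed_sp;

static_assert(inside_stack(observed_sp, kernel_stack_base, kernel_stack_size),
              "stack pointer lies outside the kernel stack");
```

With these numbers the top of the stack is 0x81089440 and the observed address sits 0x50 bytes below it.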

Your suggestion [4] failed to compile with the following error output:

    /Work/Genode/genode-15.05/repos/base/src/base/include/unmanaged_singleton.h: In instantiation of ‘T* unmanaged_singleton(ARGS ...) [with T = Kernel::Core_thread; int ALIGNMENT = 4; ARGS = {}]’:
    /Work/Genode/genode-15.05/repos/base-hw/src/core/kernel/thread.cc:805:45: required from here
    /Work/Genode/genode-15.05/repos/base-hw/src/core/kernel/thread.cc:771:1: error: ‘Kernel::Core_thread::Core_thread()’ is private
     Core_thread::Core_thread()
     ^
    In file included from /Work/Genode/genode-15.05/repos/base-hw/src/core/kernel/thread.cc:17:0:
    /Work/Genode/genode-15.05/repos/base/src/base/include/unmanaged_singleton.h:69:3: error: within this context
       new (&object_space) T(args...);
       ^

Based on Stefan's suggestion, however, the CXA guards appear to be working correctly.

I'll look into your thought about the cpu_id, once I understand its purpose and use.

I should also note that the thread.cc I'm using contains two additional methods and associated call case statements I ported from my modified kernel from an AM335x implementation. Those core-based kernel calls are necessary in both the 4371 and 335x to allow writing to control subsystem registers in these platforms. I don't believe these changes have anything to do with the issue, but it is a difference.

Thanks, Bob

Martin Stein

3 Jul 2015

12:06 p.m.

Hi Bob,

On 02.07.2015 21:36, Bob Stewart wrote:

> (a) Your PINF in [1] yields a run-time error -- "SP <warning: unsupported format string argument>p". (Not sure why that would be.)

That is indeed strange. I can't reproduce this output with the current master branch and the supported platforms. Instead of the warning case, the switch(cmd.type) statement in [1] should end up in 'case Format_command::PTR'. If you want to dig deeper into that, I would use the _out_string method inside the switch statement to find out what's going on.

> (b) Replacing %p with 0x%x and applying the appropriate cast results in PINF showing "SP 0x810893f0". (c) The kernel_stack$ symbol is set at "81079440 B kernel_stack". (d) KERNEL_STACK_SIZE = 64 * 1024. So the stack pointer is appropriately near the top of the stack, assuming it's growing from top to bottom.

That's right. However, the stack data may still be corrupted by some code that uses a broken pointer. That would be hard to debug. A debugger that supports watchpoints would be helpful, but as you would likely have used single stepping in that case, I assume that you don't have such an interface. However, before investigating further, I would check whether the return addresses are correct at the very end of Core_thread::Core_thread() and of __cxa_guard_release by using __builtin_return_address(0).
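
A minimal sketch of such a check, using the GCC/Clang builtin __builtin_return_address(0). It uses std::printf in place of Genode's PINF so that it compiles standalone, and probe is a hypothetical function standing in for the end of the constructor or of __cxa_guard_release.

```cpp
#include <cassert>
#include <cstdio>

/* noinline keeps a real call, so the function has its own return address */
__attribute__((noinline)) void *probe()
{
    /* address of the instruction the function will return to */
    void *ret = __builtin_return_address(0);
    std::printf("return address %p\n", ret);
    return ret;
}
```

The printed address can then be compared against the sorted symbol list produced by genode-arm-nm to see which function it falls into.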

> Your suggestion [4] failed to compile with the following error output:
>
>     /Work/Genode/genode-15.05/repos/base/src/base/include/unmanaged_singleton.h: In instantiation of ‘T* unmanaged_singleton(ARGS ...) [with T = Kernel::Core_thread; int ALIGNMENT = 4; ARGS = {}]’:
>     /Work/Genode/genode-15.05/repos/base-hw/src/core/kernel/thread.cc:805:45: required from here
>     /Work/Genode/genode-15.05/repos/base-hw/src/core/kernel/thread.cc:771:1: error: ‘Kernel::Core_thread::Core_thread()’ is private

Oh sorry, I didn't consider that the constructor is private. But after making it public in [2], the problem is solved and the patch also works at runtime.

> I'll look into your thought about the cpu_id, once I understand its purpose and use.

This was a misconception of mine. I thought that the almost complete SMP support for Cortex-A9 had already made its way to master. But as this is not the case, Cpu::executing_id() always returns 0, independent of the hardware. Additionally, if the CPU ID were incorrect, the stack pointer would be broken as well, because the initialization would have chosen the wrong item of the kernel_stack array.

> I should also note that the thread.cc I'm using contains two additional methods and associated call case statements I ported from my modified kernel from an AM335x implementation. Those core-based kernel calls are necessary in both the 4371 and 335x to allow writing to control subsystem registers in these platforms. I don't believe these changes have anything to do with the issue, but it is a difference.

As long as they do not introduce additional data members to Kernel::Thread, this should indeed make no difference, especially at that early stage where such methods are not called yet. However, just to be sure, you may also remove these modifications for now.

Cheers, Martin

[1] base/src/base/console/console.cc - void Console::vprintf
[2] base-hw/src/core/include/kernel/thread.h - class Kernel::Core_thread

Bob Stewart

1:52 p.m.

Thanks for the quick reply Martin.

Regarding possible issues with CXA guards, I did change the constructor of Core_thread to be public and applied your [4] modification. That did allow the kernel initialization to complete and I got the "kernel initialized" message on the console. Now I need to study the guard code to see what is going on.

I'll take a look at Console::vprintf to see if I can find the reason I get that run-time error on the %p format variable.

My additions to thread.cc simply write a passed value to a register pointer which is also a passed argument. But I'll take them out if I can't solve the current issue.

My only method of debugging is via printf. After I'm through with this issue I think I'll take a look at CoreSight and see what's involved in creating a debugger using its trace facilities.

Thanks, Bob


Martin Stein

3:29 p.m.

Hi Bob,

On 03.07.2015 13:52, Bob Stewart wrote:

> Regarding possible issues with CXA guards, I did change the constructor of Core_thread to be public and applied your [4] modification. That did allow the kernel initialization to complete and I got the "kernel initialized" message on the console.

Nice! :)

> Now I need to study the guard code to see what is going on.

Regarding the CXA guards: The reason why we introduced the unmanaged_singleton was that we already had trouble with the atomic operations in the cmpxchg that is used by the guards. It was on the Raspberry Pi, where atomic ops do not work until the caches are enabled. Thus we replaced every static variable in an early-called function by an unmanaged_singleton. What made me rule out this explanation in your case is that the problem definitely occurs after the MMU and thereby the caches have been enabled (Cpu::init_virt_kernel in init_kernel_mp).
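
To illustrate the difference, here is a minimal sketch of a guard-free singleton helper. It is not Genode's unmanaged_singleton.h; it only follows the shape visible in the compiler output earlier in this thread (placement new into object_space), and unmanaged_singleton_sketch and Counter are hypothetical names. The statics involved are constant-initialized, so the compiler has no need to emit __cxa_guard_acquire/__cxa_guard_release calls for them.

```cpp
#include <cassert>
#include <new>

template <typename T, typename... ARGS>
T *unmanaged_singleton_sketch(ARGS... args)
{
    /* constant-initialized statics: no dynamic initializer, hence no guard */
    alignas(T) static unsigned char object_space[sizeof(T)];
    static T *object = nullptr;

    /* construct on first call; not thread-safe, and never destructed */
    if (!object)
        object = new (&object_space) T(args...);

    return object;
}

/* hypothetical example type */
struct Counter
{
    int value;
    explicit Counter(int v) : value(v) { }
};
```

A function-local "static Core_thread ct;" has a non-trivial constructor, so the compiler wraps its construction in guard calls; the helper above replaces that with a plain pointer test.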

Did you change something regarding this, e.g. switched off caches in [1]? Can you please post the output of [2]. It may also help if you enable the hw_info test by doing [3] and post the output. If the hw_info test gets stuck at any point before it prints "--- End ---" you can uncomment the respective register read in [4] (some registers are not accessible on all platforms).

Another idea: The atomic ops in ARM are realized through a so-called "exclusive" state that a CPU can set. This state can be cleared via assembly op "clrex". Maybe your exclusive state is erroneously set when entering the kernel and clearing it right before [5] solves the problem.

Cheers, Martin

[1] Arm::Sctlr::init_common in base-hw/src/core/include/spec/arm/cpu_support.h

[2] PINF("Sctlr %x", Genode::Arm::Sctlr::read()); during init_kernel_mp_primary

[3] cp base-hw/lib/mk/platform_panda/test-hw_info.mk base-hw/lib/mk/platform_am437x/

[4] base-hw/src/test/hw_info/spec/arm_v7/info.cc

[5] "ldr r0, =_bss_start" in base-hw/src/core/spec/arm/kernel/crt0.s

Bob Stewart

6 Jul 2015

2:22 p.m.

Hi Martin,

I've made no changes to cpu_support.h; it's from main in 15.05. You'll see below that the C and M bits are correctly set in the SCTLR register. I'm currently seeing inconsistent behaviour depending on where I have print statements in the kernel initialization code:

1. kernel.cc code snippet (around line 142 in base-hw/src/core/kernel/kernel.cc):

    /* enable timer interrupt */
    unsigned const cpu = Cpu::executing_id();
    pic()->unmask(Timer::interrupt_id(cpu), cpu);
    PDBG("Starting kernel-6");

    /* do further initialization only as primary CPU */
    if (Cpu::primary_id() != cpu) { return; }
    init_kernel_mp_primary();
    PDBG("Starting kernel-7");

2. Output produced:

    Starting kernel ...

    void init_kernel_mp(): Starting kernel-6
    Sctlr c5387d
    kernel initialized
    void init_kernel_mp(): Starting kernel-7

3. Commenting out the "PDBG("Starting kernel-7");" line results in the output:

    Starting kernel ...

    void init_kernel_mp(): Starting kernel-6

4. Additionally commenting out the "PDBG("Starting kernel-6");" line results in:

    Starting kernel ...
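
The Sctlr value in the output above can be decoded bit by bit. A small sketch, using the ARMv7-A SCTLR bit positions (M = bit 0, C = bit 2, I = bit 12); the helper bit_set and the constant names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

/* True if the given bit of reg is set. */
constexpr bool bit_set(std::uint32_t reg, unsigned bit)
{
    return ((reg >> bit) & 1u) != 0;
}

constexpr std::uint32_t sctlr = 0xc5387d; /* value printed by PINF */

constexpr bool mmu_enabled    = bit_set(sctlr, 0);  /* M: MMU enable         */
constexpr bool dcache_enabled = bit_set(sctlr, 2);  /* C: data/unified cache */
constexpr bool icache_enabled = bit_set(sctlr, 12); /* I: instruction cache  */
```

All three flags evaluate to true for 0xc5387d, consistent with the statement that the C and M bits are set.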

My theory is that this behaviour is TLB entry attribute related, wherein an incorrect entry is encountered on an MMU table walk.

I'm going through the Cortex-A9 and Arm_v7 architecture documentation to try to understand what's going on. According to section 1.5 in the Cortex-A9 MPCore TRM, related to Private Memory for an A9 core, the global and peripheral control registers must be accessed through memory mapped transfers to the Cortex-A9 MPCore private memory region. The memory regions used for these registers must be marked as Device or Strongly-ordered. The translation table code in short_translation_table.h has no code to allow a page table entry to be set up with either of these attributes. What is it I am missing?

Thanks, Bob


Christian Helmuth

2:44 p.m.

Hi Bob,

just a tiny side note from me: When debugging issues that seem to interfere with the binary layout, I add log messages like the following

    /* global vars to prevent optimization */
    int dbg_on  = true;
    int dbg_off = false;

    if (dbg_off) PDBG("message 1");
    if (dbg_on)  PDBG("message 2");

and also use CC_OLEVEL=-O0 or CC_OLEVEL=-Og.
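
A standalone version of this pattern might look as follows. It is only a sketch: std::printf stands in for Genode's PDBG, and run_with_debug_output is a hypothetical function. Because the flags are non-const globals with external linkage, the compiler normally cannot fold the conditions away, so both calls stay in the binary and flipping a flag changes only a data value.

```cpp
#include <cassert>
#include <cstdio>

/* global vars to prevent optimization */
int dbg_on  = true;
int dbg_off = false;

/* Returns the number of messages that were actually printed. */
int run_with_debug_output()
{
    int printed = 0;
    if (dbg_off) { std::printf("message 1\n"); ++printed; }
    if (dbg_on)  { std::printf("message 2\n"); ++printed; }
    return printed;
}
```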

Greets

unknown＠example.com

3:12 p.m.

Thanks Christian. I'll use that.

Bob


Bob Stewart

7 Jul 2015

3:32 p.m.

Hi Martin,

By pure luck I got the kernel to initialize and the printf test to run to completion.

The issue appears to be with the member function "trustzone_hypervisor_call" in the PL310 struct which is in the board.h file for the platform. My board.h file is just a direct copy of the one from the panda directory.

I arrived at this conclusion when I took Christian's suggestion to turn off compiler optimization. After a make clean, I ran "make CC_OLEVEL=-O0 run/printf", which surprisingly failed in platform_support.cc, as the following build output shows:

    ASSEMBLE spec/arm_v7/mode_transition.o
    COMPILE spec/rico/platform_support.o
    In file included from /Work/Genode/genode-15.05/repos/base-hw/src/core/spec/rico/platform_support.cc:16:0:
    /Work/Genode/genode-15.05/repos/base-hw/src/core/include/spec/rico/board.h: In static member function ‘static void Genode::Board::Pl310::trustzone_hypervisor_call(Genode::addr_t, Genode::addr_t)’:
    /Work/Genode/genode-15.05/repos/base-hw/src/core/include/spec/rico/board.h:44:4: error: fp cannot be used in asm here
       }
       ^
    make[3]: *** [spec/rico/platform_support.o] Error 1

Just to see if I could get through a build, I commented out the code in "trustzone_hypervisor_call". The link for core failed with:

    Program core/core
    COMPILE kernel/test.o
    LINK core
    /Work/Genode/Builds/15.05/rico/var/libcache/core/core.lib.a(thread.o): In function `Kernel::Core_thread::Core_thread()':
    /Work/Genode/genode-15.05/repos/base-hw/src/core/kernel/thread.cc:799: undefined reference to `Kernel::Cpu_priority::max'
    /Work/Genode/Builds/15.05/rico/var/libcache/core/core.lib.a(platform_thread.o): In function `Genode::Platform_thread::Platform_thread(char const*, Genode::Native_utcb*)':
    /Work/Genode/genode-15.05/repos/base-hw/src/core/platform_thread.cc:103: undefined reference to `Kernel::Cpu_priority::max'
    collect2: error: ld returned 1 exit status
    make[3]: *** [core] Error 1
    make[2]: *** [core.prg] Error 2
    make[1]: *** [gen_deps_and_build_targets] Error 2
    make[1]: Leaving directory `/Work/Genode/Builds/15.05/rico'

At which point I abandoned trying to build with that O flag setting.

I kept the code commented out in "trustzone_hypervisor_call", did a clean, and built the printf test without the O flag setting. The uImage ran correctly on the target hardware:

    Starting kernel ...

    void init_kernel_mp(): Starting kernel-6
    SP = 8108942c
    Sctlr c5387d
    kernel initialized
    Genode 15.05-77-g01f22d4 <local changes>
    int main(): --- create local services ---
    int main(): --- start init ---
    int main(): transferred 506 MB to init
    int main(): --- init created, waiting for exit condition ---
    [init] Could not open ROM session for module "ld.lib.so"
    [init -> test-printf] -1 = -1 = -1
    [init] virtual void Genode::Child_policy::exit(int): child "test-printf" exited with exit value 0

I'm not planning to use trustzone support currently but I'll take a look at the assembly code and read the trustzone support in the Cortex-A9 TRM. The AM437x is at revision R2P10 of the Cortex-A9.

Does the Panda board come up in Secure mode?

Thanks, Bob

On 07/06/2015 08:22 AM, Bob Stewart wrote:

...

Hi Martin,

I've made no changes to cpu_support.h, it's from main in 15.05. You'll see below that the C and M bits are correctly set in the SCTRL register. I'm currently seeing inconsistent behaviour depending on where I have print statements in the kernel initialization code:

kernel.cc code snippet (around line 142 in

base-hw/src/core/kernel/kernel.cc): /* enable timer interrupt */ unsigned const cpu = Cpu::executing_id(); pic()->unmask(Timer::interrupt_id(cpu), cpu); PDBG("Starting kernel-6"); /* do further initialization only as primary CPU */ if (Cpu::primary_id() != cpu) { return; } init_kernel_mp_primary(); PDBG("Starting kernel-7");

Output produced:

Starting kernel ...

void init_kernel_mp(): Starting kernel-6
Sctlr c5387d
kernel initialized
void init_kernel_mp(): Starting kernel-7

Commenting out the "PDBG("Starting kernel-7");" line results in the output:

Starting kernel ...
void init_kernel_mp(): Starting kernel-6

Additionally commenting out the "PDBG("Starting kernel-6");" line results in:

Starting kernel ...
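The statement that the C and M bits are set can be checked directly against the printed value "Sctlr c5387d". The following is a minimal sketch, assuming the value is hexadecimal (it is printed with %x) and using the ARMv7-A SCTLR bit positions; the names Sctlr_bit and sctlr_bit_set are illustrative and not part of Genode.

```cpp
#include <cstdint>

// Bit positions of selected SCTLR fields in the ARMv7-A architecture.
// The enum and helper names are hypothetical, chosen for this sketch.
enum Sctlr_bit : unsigned {
    SCTLR_M = 0,   // MMU enable
    SCTLR_A = 1,   // alignment check enable
    SCTLR_C = 2,   // data and unified cache enable
    SCTLR_Z = 11,  // branch prediction enable
    SCTLR_I = 12,  // instruction cache enable
    SCTLR_V = 13   // high exception vectors
};

// Returns true if the given bit is set in the register value.
constexpr bool sctlr_bit_set(std::uint32_t sctlr, Sctlr_bit bit)
{
    return ((sctlr >> bit) & 1u) != 0;
}

// Value taken from the "Sctlr c5387d" line in the console output.
constexpr std::uint32_t reported_sctlr = 0xc5387d;

static_assert( sctlr_bit_set(reported_sctlr, SCTLR_M), "MMU enabled");
static_assert( sctlr_bit_set(reported_sctlr, SCTLR_C), "data cache enabled");
static_assert( sctlr_bit_set(reported_sctlr, SCTLR_I), "instruction cache enabled");
static_assert(!sctlr_bit_set(reported_sctlr, SCTLR_A), "alignment checking disabled");
```

The static assertions hold for 0xc5387d, so the MMU and both level-1 caches are reported as enabled at the point where the value was read.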

My theory is that this behaviour is related to TLB entry attributes, wherein an incorrect entry is encountered on an MMU table walk.

I'm going through the Cortex-A9 and Arm_v7 architecture documentation to try to understand what's going on. According to section 1.5 in the Cortex-A9 MPCore TRM, related to Private Memory for an A9 core, the global and peripheral control registers must be accessed through memory-mapped transfers to the Cortex-A9 MPCore private memory region. The memory regions used for these registers must be marked as Device or Strongly-ordered. The translation table code in short_translation_table.h has no code to allow a page table entry to be set up with either of these attributes. What is it I am missing?
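For reference, the memory types mentioned here map onto the TEX, C, and B bits of an ARMv7 short-descriptor entry. The sketch below assumes TEX remap is disabled (SCTLR.TRE = 0) and uses made-up names (Mem_attr, section_attr_bits, small_page_attr_bits); it does not reproduce the interface of Genode's short_translation_table.h.

```cpp
#include <cstdint>

// Memory-type attributes of an ARMv7 short-descriptor entry,
// valid when TEX remap is disabled. All names are hypothetical.
struct Mem_attr { std::uint32_t tex, c, b; };

constexpr Mem_attr STRONGLY_ORDERED { 0, 0, 0 };
constexpr Mem_attr SHAREABLE_DEVICE { 0, 0, 1 };
constexpr Mem_attr NORMAL_WB_WA     { 1, 1, 1 };  // write-back, write-allocate

// Section descriptor: B = bit 2, C = bit 3, TEX = bits [14:12].
constexpr std::uint32_t section_attr_bits(Mem_attr a)
{
    return (a.tex << 12) | (a.c << 3) | (a.b << 2);
}

// Small-page descriptor: B = bit 2, C = bit 3, TEX = bits [8:6].
constexpr std::uint32_t small_page_attr_bits(Mem_attr a)
{
    return (a.tex << 6) | (a.c << 3) | (a.b << 2);
}

static_assert(section_attr_bits(STRONGLY_ORDERED) == 0x0,    "all attribute bits clear");
static_assert(section_attr_bits(SHAREABLE_DEVICE) == 0x4,    "only B set");
static_assert(section_attr_bits(NORMAL_WB_WA)     == 0x100c, "TEX[0], C and B set");
```

Comparing these bit patterns with the values a translation-table implementation actually writes for device mappings is one way to confirm whether the private memory region ends up as Device or Strongly-ordered.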

Thanks, Bob

On 07/03/2015 09:29 AM, Martin Stein wrote:

...
Hi Bob,

On 03.07.2015 13:52, Bob Stewart wrote:

...
Regarding possible issues with CXA guards, I did change the constructor of Core_thread to be public and applied your [4] modification. That did allow the kernel initialization to complete and I got the "kernel initialized" message on the console.

Nice! :)

...
Now I need to study the guard code to see what is going on.

Regarding the CXA guards: The reason why we introduced the unmanaged_singleton was that we had already run into trouble with the atomic operations in the cmpxchg that is used by the guards. It was on the Raspberry Pi, where atomic ops do not work until the caches are enabled. Thus we replaced every static variable in an early-called function with an unmanaged_singleton. What made me rule out this explanation in your case is that the problem definitely occurs after the MMU, and thereby the caches, have been enabled (Cpu::init_virt_kernel in init_kernel_mp).
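The idea behind a guard-free singleton can be sketched as follows. This is not Genode's unmanaged_singleton implementation; the helper name unguarded_singleton is hypothetical, and the sketch assumes that only one CPU executes the function during early initialization.

```cpp
#include <new>
#include <utility>

// Hypothetical helper illustrating a singleton without CXA guards.
// Not Genode's actual unmanaged_singleton.
template <typename T, typename... ARGS>
T *unguarded_singleton(ARGS &&... args)
{
    // Both statics are constant-initialized, so the compiler needs no
    // __cxa_guard_acquire/__cxa_guard_release calls (and therefore no
    // atomic operations) to set them up.
    alignas(T) static unsigned char storage[sizeof(T)];
    static T *object = nullptr;

    // Plain, non-atomic check: only safe while a single CPU runs this code.
    // Note that each distinct combination of T and ARGS gets its own object.
    if (!object)
        object = new (storage) T(std::forward<ARGS>(args)...);

    return object;
}
```

The object is constructed on first use via placement new and is never destructed, which is usually acceptable for kernel objects that live as long as the system runs.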

Did you change something regarding this, e.g. switched off caches in [1]? Can you please post the output of [2]. It may also help if you enable the hw_info test by doing [3] and post the output. If the hw_info test gets stuck at any point before it prints "--- End ---" you can uncomment the respective register read in [4] (some registers are not accessible on all platforms).

Another idea: The atomic ops in ARM are realized through a so-called "exclusive" state that a CPU can set. This state can be cleared via assembly op "clrex". Maybe your exclusive state is erroneously set when entering the kernel and clearing it right before [5] solves the problem.

Cheers, Martin

[1] Arm::Sctlr::init_common in base-hw/src/core/include/spec/arm/cpu_support.h

[2] PINF("Sctlr %x", Genode::Arm::Sctlr::read()); during init_kernel_mp_primary

[3] cp base-hw/lib/mk/platform_panda/test-hw_info.mk base-hw/lib/mk/platform_am437x/

[4] base-hw/src/test/hw_info/spec/arm_v7/info.cc

[5] "ldr r0, =_bss_start" in base-hw/src/core/spec/arm/kernel/crt0.s

_______________________________________________ genode-main mailing list genode-main@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/genode-main

Martin Stein

4:06 p.m.

Hi Bob,

On 07.07.2015 15:32, Bob Stewart wrote:

...

Hi Martin, By pure luck I got the kernel to initialize and the printf test to run to completion.

Cool that you managed to find the problem :)

...

The issue appears to be with the member function "trustzone_hypervisor_call" in the PL310 struct which is in the board.h file for the platform. My board.h file is just a direct copy of the one from the panda directory. ... I kept the code in "trustzone_hypervisor_call" commented out, did a clean, and built the printf test without the O flag setting. ... I'm not planning to use trustzone support currently but I'll take a look at the assembly code and read the trustzone support in the Cortex-A9 TRM. The AM437x is at revision R2P10 of the Cortex-A9.

Does the Panda board come up in Secure mode?

The hypervisor calls you mentioned are not related to Trustzone support. The Panda board comes with a fixed secure-hypervisor firmware. The user OS (here Genode) is started in non-secure mode and AFAIK there is no way to change this. The hypervisor calls are needed in our L2-cache driver because some parts of the PL310-controller are configured to be Trustzone-secured on Panda and the hypervisor provides an interface to access them from the non-secure world.

Admittedly, I don't think that it is a good idea to copy stuff from the Panda-specific sources. If you want to copy, better use PBXA9, as its port has a less sophisticated feature set. For example, you have now enabled the L2 cache and maybe also FPU stuff, neither of which is necessary and both of which get in the way of a basic port. That said, it might also be pure luck that your current implementation works. Maybe your PL310 needs a proper replacement for the ops that were achieved by the hypercalls.

Cheers, Martin

unknown＠example.com

5:20 p.m.

Thanks for the help, Martin. I'm digging through the L2 cache controller section in the TRM now.

Bob

Sent from my android device.

-----Original Message----- From: Martin Stein <martin.stein@...1...> To: Genode OS Framework Mailing List genode-main@lists.sourceforge.net Sent: Tue, 07 Jul 2015 10:06 AM Subject: Re: "Core" thread doesn't become active in kernel initialization

...

unknown＠example.com

8:22 p.m.

Just to close the loop on this saga: the reason this SoC would not come up using the same board.h settings as the Pandaboard is a difference in L2 cache way size. The Pandaboard sets the way size to 32 KB in the PL310 L2 cache controller's Auxiliary Control register. The 437x has 256 KB of cache and 16 ways, giving 16 KB per way. The number of ways and the way size really don't need to be set, as there are pins on the chip that provide the correct settings as defaults. I'll use this modified Pandaboard board.h file for the 437x platform, as the other Auxiliary Control settings are desirable for it.
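The way-size arithmetic above can be written out as a small sketch. The field positions used here (associativity in bit 16, way size in bits [19:17]) are recalled from the PL310 TRM and should be checked against it; the function names are hypothetical and do not come from Genode's board.h.

```cpp
#include <cstdint>

// Way size in KB for a given total cache size and associativity.
constexpr unsigned way_size_kb(unsigned cache_kb, unsigned ways)
{
    return cache_kb / ways;
}

// Way-size field of the PL310 Auxiliary Control register, bits [19:17],
// as recalled from the TRM: 0b001 = 16 KB, 0b010 = 32 KB, 0b011 = 64 KB,
// 0b100 = 128 KB, 0b101 = 256 KB, 0b110 = 512 KB.
// This sketch returns 0 for sizes it does not know.
constexpr std::uint32_t way_size_field(unsigned way_kb)
{
    return way_kb ==  16 ? 1u
         : way_kb ==  32 ? 2u
         : way_kb ==  64 ? 3u
         : way_kb == 128 ? 4u
         : way_kb == 256 ? 5u
         : way_kb == 512 ? 6u
         : 0u;
}

// Geometry bits of the register: associativity is bit 16 (1 = 16-way).
constexpr std::uint32_t aux_geometry_bits(unsigned cache_kb, unsigned ways)
{
    return (way_size_field(way_size_kb(cache_kb, ways)) << 17)
         | ((ways == 16 ? 1u : 0u) << 16);
}

static_assert(way_size_kb(256, 16) == 16, "AM437x: 256 KB / 16 ways = 16 KB per way");
static_assert(way_size_field(16) != way_size_field(32), "16 KB and 32 KB ways encode differently");
```

Under these assumptions, a 256 KB, 16-way cache yields geometry bits 0x30000, whereas a 32 KB way size with 16 ways would yield 0x50000, which is the kind of mismatch described in the message.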

Bob

Sent from my android device.


Bob Stewart

3 Jul 3 Jul

3:43 p.m.

Martin, I believe this issue is really related to translation table entry attributes. After I cleaned up some of my printfs, I got the exception Stefan suggested I include in the lock() method. The DFSR and IFSR both indicate an MMU fault:

Starting kernel ...

void init_kernel_mp(): Starting kernel-6
SP 0x810893f8
kernel initialized
void init_kernel_mp(): Starting kernel-7
An exception was raised during kernel execution!
DFSR=0x5 IFSR=0x5 DFAR=0x1

Previously, when bringing up the AM335x on Genode, I found that there were issues with that platform and the settings in arm/short_translation_table.h. So first I'll go back through the TEX, B, and C attribute values, set them to exactly what the Arm_v7 spec says they should be, and see what happens.

This is a holiday weekend here, so I'll be slow getting back.

Bob

p.s. Just saw your reply to the previous message... I'll go through your 5 steps also.

On 07/03/2015 07:52 AM, Bob Stewart wrote:

...

Thanks for the quick reply Martin.

Regarding possible issues with CXA guards, I did change the constructor of Core_thread to be public and applied your [4] modification. That did allow the kernel initialization to complete and I got the "kernel initialized" message on the console. Now I need to study the guard code to see what is going on.

I'll take a look at Console::vprintf to see if I can find the reason I get that run-time error on the %p format variable.

My additions to thread.cc simply write a passed value to a register pointer which is also a passed argument. But I'll take them out if I can't solve the current issue.

My only method of debugging is via printf. After I'm through with this issue I think I'll take a look at CoreSight and see what's involved in creating a debugger using its trace facilities.

Thanks, Bob

On 07/03/2015 06:06 AM, Martin Stein wrote:

...
Hi Bob,

On 02.07.2015 21:36, Bob Stewart wrote:

...
(a) Your PINF in [1] yields a run-time error -- "SP <warning: unsupported format string argument>p". (Not sure why that would be.)

That is indeed strange. I can't reproduce this output with the current master branch and the supported platforms. Instead of the warning case, the switch(cmd.type) statement in [1] should end up in 'case Format_command::PTR'. If you want to dig deeper into that, I would use the _out_string method inside the switch statement to find out what's going on.

...
(b) Replacing %p with 0x%x and applying the appropriate cast results in PINF showing "SP 0x810893f0". (c) The kernel_stack$ symbol is set at "81079440 B kernel_stack". (d) KERNEL_STACK_SIZE = 64 * 1024. So the stack pointer is appropriately near the top of the stack, assuming it's growing from top to bottom.

That's right. However, the stack data may still be corrupted by some code that uses a broken pointer. That would be hard to debug. A debugger that supports watchpoints would be helpful, but as you likely would have used single stepping in this case, I assume that you don't have such an interface. However, before investigating that further, I would check whether the return pointers are correct at the very end of Core_thread::Core_thread() and of __cxa_guard_release by using __builtin_return_address(0).
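The stack-pointer check quoted above can be reproduced with a few lines of arithmetic. The constants come from the quoted message (symbol address 0x81079440, size 64 * 1024, SP 0x810893f0); the sketch assumes a single stack starting at the symbol address, and the helper names are hypothetical.

```cpp
#include <cstdint>

// Values quoted in the message: "81079440 B kernel_stack" and 64 * 1024.
constexpr std::uint32_t KERNEL_STACK_BASE = 0x81079440;
constexpr std::uint32_t KERNEL_STACK_SIZE = 64 * 1024;
constexpr std::uint32_t KERNEL_STACK_TOP  = KERNEL_STACK_BASE + KERNEL_STACK_SIZE;

// True if the stack pointer lies within the stack (full-descending stack).
constexpr bool sp_in_stack(std::uint32_t sp)
{
    return sp > KERNEL_STACK_BASE && sp <= KERNEL_STACK_TOP;
}

// Number of bytes currently in use, counted down from the top of the stack.
constexpr std::uint32_t bytes_in_use(std::uint32_t sp)
{
    return KERNEL_STACK_TOP - sp;
}

static_assert(KERNEL_STACK_TOP == 0x81089440,    "top of the first stack");
static_assert(sp_in_stack(0x810893f0),           "reported SP is inside the stack");
static_assert(bytes_in_use(0x810893f0) == 0x50,  "80 bytes below the top");
```

The reported SP is 80 bytes below the top of the stack, which matches the expectation for a stack that grows downwards and has seen only a few calls.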

...
Your suggestion [4] failed to compile with the following error output:

/Work/Genode/genode-15.05/repos/base/src/base/include/unmanaged_singleton.h: In instantiation of ‘T* unmanaged_singleton(ARGS ...) [with T = Kernel::Core_thread; int ALIGNMENT = 4; ARGS = {}]’:
/Work/Genode/genode-15.05/repos/base-hw/src/core/kernel/thread.cc:805:45: required from here
/Work/Genode/genode-15.05/repos/base-hw/src/core/kernel/thread.cc:771:1: error: ‘Kernel::Core_thread::Core_thread()’ is private

Oh sorry, I didn't consider that the constructor is private. But after making it public in [2] the problem is solved and the patch works also at runtime.

...
I'll look into your thought about the cpu_id, once I understand its purpose and use.

This was a misconception of mine. I thought that the almost complete SMP support for Cortex A9 had already made its way to master. But as this is not the case, Cpu::executing_id() always returns 0, independent of the hardware. Additionally, if the CPU ID were incorrect, the stack pointer would be broken as well, because the initialization would have chosen the wrong item of the kernel_stack array.

...
I should also note that the thread.cc I'm using contains two additional methods and associated call case statements I ported from my modified kernel from an AM335x implementation. Those core-based kernel calls are necessary in both the 4371 and 335x to allow writing to control subsystem registers in these platforms. I don't believe these changes have anything to do with the issue, but it is a difference.

As long as they do not introduce additional data members to Kernel::Thread, this should indeed make no difference, especially at that early stage where such methods are not called yet. However, just to be sure, you may also remove these modifications for now.

Cheers, Martin

[1] base/src/base/console/console.cc - void Console::vprintf

[2] base-hw/src/core/include/kernel/thread.h - class Kernel::Core_thread

