Hi all,
In my last message I wrote:
"Now I'm starting again from start but with different knowledge level."
and now I have some preliminary results and another set of issues/questions.
After making run/log work on a single core I had two possible directions to take: device drivers or enabling SMP. I started with SMP, as for the device drivers I need advice about the build system (later in this email).
For SMP on the RPI I implemented a new timer driver, as the current one was not capable of supporting multiple independent per-core timers, and a new PIC driver for timer and inter-processor interrupts.
The current code is in the same (rewritten) branch as before at [1], in case you want to take a look. The run/smp test is almost working on all cores, with the exception of thread cleanup (which will be another question later in this email).
1. Register sets.
On the multicore Raspberry Pi there are sets of registers in which each register has the same structure and each is "assigned" to a different core. Currently I replicated the register definitions for each core [2], but that is obviously not a good solution. Additionally, it forces ugly code when the cpu_id is held in a variable, as in [3].
Is there some support in Genode for such structures? If not, do you see how it could be implemented? Some wrapper for Register? I fail to see how a dynamic 'cpu_id' value could be nicely mixed with register offsets passed as template arguments.
Ideally, I think, it would be possible to define something like this:
struct CoreTIrqCtl : RegisterSet<0x40, 32, COUNT, OFFSET>
and use it like this:
write<CoreTIrqCtl>(CPU_ID, 0);
Any thoughts?
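While writing this I noticed that Genode's util/register_set.h seems to already provide a Register_array, which looks close to this idea. A minimal sketch, assuming I read the template parameters (offset, access width, number of items, item width) and the write signature correctly:

  /* sketch only: names and the 0x40 offset follow my example above */
  #include <util/mmio.h>

  struct Local_intc : Genode::Mmio
  {
      /* four consecutive 32-bit registers, one per core, at 0x40..0x4c */
      struct Core_timer_irq_ctl : Register_array<0x40, 32, 4, 32> { };

      Local_intc(Genode::addr_t base) : Genode::Mmio(base) { }

      /* the dynamic cpu_id becomes the index argument of the write */
      void core_timer_irq_ctl(unsigned cpu_id, Genode::uint32_t value) {
          write<Core_timer_irq_ctl>(value, cpu_id); }
  };

If Register_array really works this way, it would cover my CoreTIrqCtl example above.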
2. Interrupt handling
I have doubts about interrupt processing. My knowledge in this area is limited, so I may be totally wrong, but based on what I read for the RPI I think that:
a. One IRQ can have more than one interrupt flag marking an exception. During a single entry into 'Thread::exception(Cpu&)' and later into 'Cpu_job::_interrupt(unsigned const)', all pending interrupts should be handled in some loop; without that, some interrupts may stay unhandled. I don't see such a loop in this code: one request is taken from the Pic driver, handled, and that's all (see the sketch after this list).
b. I don't know where clearing an interrupt should be handled. I would have thought that 'Pic::finish_request()' would be the place for this, but there are many empty implementations. Should the code in 'Pic::take_request(...)' also clear the exception condition? My doubt comes from the feeling that if I do this too early, "something" can happen before the handling of the interrupt is finished.
If 'finish_request()' is the proper place to mark an interrupt as handled, then it would be useful to pass it the interrupt number earlier returned from 'take_request'. Otherwise the Pic would need separate per-core storage to track the exception currently being handled on a given core. Am I right? Does it work properly in [4]?
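To make (a) and (b) concrete, here is a rough sketch of what I have in mind. The take_request/finish_request pair mirrors the base-hw Pic interface as I understand it; everything else is hypothetical:

  enum { NR_OF_CPUS = 4 };

  struct Pic
  {
      /* per-core bookkeeping of the irq returned by take_request */
      unsigned _taken_irq[NR_OF_CPUS];

      bool take_request(unsigned &irq); /* fetch one pending irq, remember it  */
      void finish_request();            /* ack the remembered irq of this core */
  };

  void handle_one(unsigned irq);        /* hypothetical handler */

  /* instead of taking a single request, drain all pending flags in a loop */
  void interrupt_entry(Pic &pic)
  {
      unsigned irq;
      while (pic.take_request(irq)) {   /* (a) loop until nothing is pending   */
          handle_one(irq);
          pic.finish_request();         /* (b) ack only after handling is done */
      }
  }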
3. run/smp test hangs during thread destruction
Like I wrote earlier, the run/smp test is almost working on multiple cores. It hangs somewhere in [5] while destructing a thread running on a non-zero core (destroying the first one does not cause problems).
I don't have any concrete questions here, only these:
a. Does this test work properly on Arndale?
b. Do you have any thoughts on what could go wrong? Maybe I haven't implemented something important for SMP yet, beyond the timer and IPIs?
4. Drivers reused for different configurations
Currently, to support the RPI3, I created a configuration named 'rpi3bplus'. How should I proceed to reuse a driver written for rpi? For example, the framebuffer driver has 'REQUIRES = rpi' in its target.mk [6], which causes it not to compile for my 'rpi3bplus'. It would be nice to have 'REQUIRES = rpi|rpi3bplus', but I don't think that is possible. Can you suggest something?
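One idea I could imagine (I don't know whether this is the intended way) is a shared spec value that both board specs announce, so that target.mk only has to require that shared value. A hypothetical sketch, with 'rpi_fb' being a made-up spec name:

  # mk/spec/rpi.mk (sketch): the existing spec additionally announces
  # the shared value
  SPECS += rpi_fb

  # mk/spec/rpi3bplus.mk (sketch): my new spec announces the same value
  SPECS += rpi_fb

  # the framebuffer driver's target.mk then requires only the shared value
  REQUIRES = rpi_fb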
Tomasz Gajewski
[1] https://github.com/tomga/genode/tree/rpi3bplus
[2] https://github.com/tomga/genode/blob/rpi3bplus/repos/base-hw/src/core/spec/r...
[3] https://github.com/tomga/genode/blob/rpi3bplus/repos/base-hw/src/core/spec/r...
[4] https://github.com/tomga/genode/blob/rpi3bplus/repos/base-hw/src/lib/hw/spec...
[5] https://github.com/tomga/genode/blob/rpi3bplus/repos/base/src/test/smp/main....
[6] https://github.com/tomga/genode/blob/rpi3bplus/repos/os/src/drivers/framebuf...
Hi again,
- run/smp test hangs during thread destruction
Like I wrote earlier, the run/smp test is almost working on multiple cores. It hangs somewhere in [5] while destructing a thread running on a non-zero core (destroying the first one does not cause problems).
I don't have any concrete questions here, only these:
a. Does this test work properly on Arndale?
b. Do you have any thoughts on what could go wrong? Maybe I haven't implemented something important for SMP yet, beyond the timer and IPIs?
The situation with this test has changed after I rebased my branch onto current staging. It no longer stops at thread destruction in the Affinity test like before, so probably something was fixed since the 18.11 release.
Currently it hangs a little further:
... init -> test-smp] Affinity: Round 09: A A A A
[init -> test-smp] Affinity: CPU: 00 01 02 03
[init -> test-smp] Affinity: Round 10: A A A A
[init -> test-smp] Affinity: --- test finished ---
[init -> test-smp] TLB: --- test started ---
[init -> test-smp] TLB: thread started on CPU 1
[init -> test-smp] TLB: thread started on CPU 2
[init -> test-smp] TLB: thread started on CPU 3
[init -> test-smp] TLB: all threads are up and running...
[init -> test-smp] TLB: ram dataspace destroyed, all will fault...
no RM attachment (READ pf_addr=0xc00c pf_ip=0x1000d2c from pager_object: pd='init -> test-smp' thread='tlb_thread')
Warning: page fault, pager_object: pd='init -> test-smp' thread='tlb_thread' ip=0x1000d2c fault-addr=0xc00c type=no-page
Warning: core -> pager_ep: cannot submit unknown signal context
[init -> test-sm
Can you tell whether this reported fault is something that currently happens on tested configurations on staging, or is it specific to my RPI work?
I can investigate it, but if it's something generic it will be much harder for me than for you, and you will probably get it working quicker.
Tomasz Gajewski
Hi Tomasz,
first of all: congratulations on getting that far with the RPI3.
On Mon, Feb 18, 2019 at 11:21:10PM +0100, Tomasz Gajewski wrote:
Hi again,
- run/smp test hangs during thread destruction
Like I wrote earlier, the run/smp test is almost working on multiple cores. It hangs somewhere in [5] while destructing a thread running on a non-zero core (destroying the first one does not cause problems).
I don't have any concrete questions here, only these:
a. Does this test work properly on Arndale?
b. Do you have any thoughts on what could go wrong? Maybe I haven't implemented something important for SMP yet, beyond the timer and IPIs?
The situation with this test has changed after I rebased my branch onto current staging. It no longer stops at thread destruction in the Affinity test like before, so probably something was fixed since the 18.11 release.
Currently it hangs a little further:
... init -> test-smp] Affinity: Round 09: A A A A
[init -> test-smp] Affinity: CPU: 00 01 02 03
[init -> test-smp] Affinity: Round 10: A A A A
[init -> test-smp] Affinity: --- test finished ---
[init -> test-smp] TLB: --- test started ---
[init -> test-smp] TLB: thread started on CPU 1
[init -> test-smp] TLB: thread started on CPU 2
[init -> test-smp] TLB: thread started on CPU 3
[init -> test-smp] TLB: all threads are up and running...
[init -> test-smp] TLB: ram dataspace destroyed, all will fault...
no RM attachment (READ pf_addr=0xc00c pf_ip=0x1000d2c from pager_object: pd='init -> test-smp' thread='tlb_thread')
Warning: page fault, pager_object: pd='init -> test-smp' thread='tlb_thread' ip=0x1000d2c fault-addr=0xc00c type=no-page
Warning: core -> pager_ep: cannot submit unknown signal context
[init -> test-sm
The page fault of this tlb_thread is perfectly fine! It is exactly what is being tested here. Actually, you should see three faulting tlb_threads. The test forks a thread on each CPU apart from the first one and lets them work infinitely on a shared page, which thereby ends up in the TLB of the corresponding CPU. Because we do not start anything else on the other CPUs, there is little probability that the TLB entry gets evicted. Then the first CPU unmaps the page. If cross-CPU TLB shootdown is implemented correctly for the platform, you should notice a fault on all the other CPUs.
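In other words, the kernel's unmap path has to do roughly the following (a sketch with illustrative names only, not base-hw's actual code):

  using addr_t = unsigned long;
  using size_t = unsigned long;
  enum { NR_OF_CPUS = 4 };

  /* illustrative declarations, the real base-hw interface differs */
  unsigned current_cpu();
  void remove_translation(addr_t virt, size_t size);  /* page-table update */
  void invalidate_local_tlb(addr_t virt, size_t size);
  void send_remote_tlb_flush_ipi(unsigned cpu);

  void unmap(addr_t virt, size_t size)
  {
      remove_translation(virt, size);
      invalidate_local_tlb(virt, size);

      /* cross-cpu shootdown: without it, the other cores keep stale TLB
         entries and their tlb_threads would never fault */
      for (unsigned cpu = 0; cpu < NR_OF_CPUS; cpu++)
          if (cpu != current_cpu())
              send_remote_tlb_flush_ipi(cpu);
  }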
The question is why your messages end here, or did you just send a snippet?
Regards
Stefan
Stefan Kalkowski stefan.kalkowski@genode-labs.com writes:
Hi Tomasz,
first of all: congratulations on getting that far with the RPI3.
Thanks.
On Mon, Feb 18, 2019 at 11:21:10PM +0100, Tomasz Gajewski wrote:
Hi again,
- run/smp test hangs during thread destruction
Like I wrote earlier, the run/smp test is almost working on multiple cores. It hangs somewhere in [5] while destructing a thread running on a non-zero core (destroying the first one does not cause problems).
I don't have any concrete questions here, only these:
a. Does this test work properly on Arndale?
b. Do you have any thoughts on what could go wrong? Maybe I haven't implemented something important for SMP yet, beyond the timer and IPIs?
The situation with this test has changed after I rebased my branch onto current staging. It no longer stops at thread destruction in the Affinity test like before, so probably something was fixed since the 18.11 release.
Currently it hangs a little further:
... init -> test-smp] Affinity: Round 09: A A A A
[init -> test-smp] Affinity: CPU: 00 01 02 03
[init -> test-smp] Affinity: Round 10: A A A A
[init -> test-smp] Affinity: --- test finished ---
[init -> test-smp] TLB: --- test started ---
[init -> test-smp] TLB: thread started on CPU 1
[init -> test-smp] TLB: thread started on CPU 2
[init -> test-smp] TLB: thread started on CPU 3
[init -> test-smp] TLB: all threads are up and running...
[init -> test-smp] TLB: ram dataspace destroyed, all will fault...
no RM attachment (READ pf_addr=0xc00c pf_ip=0x1000d2c from pager_object: pd='init -> test-smp' thread='tlb_thread')
Warning: page fault, pager_object: pd='init -> test-smp' thread='tlb_thread' ip=0x1000d2c fault-addr=0xc00c type=no-page
Warning: core -> pager_ep: cannot submit unknown signal context
[init -> test-sm
The page fault of this tlb_thread is perfectly fine! It is exactly what is being tested here. Actually, you should see three faulting tlb_threads. The test forks a thread on each CPU apart from the first one and lets them work infinitely on a shared page, which thereby ends up in the TLB of the corresponding CPU. Because we do not start anything else on the other CPUs, there is little probability that the TLB entry gets evicted. Then the first CPU unmaps the page. If cross-CPU TLB shootdown is implemented correctly for the platform, you should notice a fault on all the other CPUs.
The question is why your messages end here, or did you just send a snippet?
No, unfortunately that is all I get. From the beginning I have had problems with everything hanging after faults. I would expect some handler to be invoked that would give me some debug information, but I don't get any.
I have a feeling that some exceptions may be routed to hyp mode, but I haven't put much effort into confirming that. Is there any base-hw code for some other architecture that installs exception handlers in hyp mode that I could check/use?
Generally, from the beginning I had problems with hangs, and I had no UART driver at all; that's why I use macros to dump to memory. Unfortunately this "framework" is also new and has caused some trouble. I think I get similar hangs when I generate an MMU exception in kernel code, which is why I think such exceptions may be routed "somewhere else" (hyp?). Unless you can help with some wild guess, I'll have to read more ARM reference documentation and check the control register values.
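As a first step I will probably check which mode the kernel actually executes in. A minimal sketch, assuming the port runs in AArch32 (in AArch64 reading CurrentEL would be the equivalent):

  /* read the CPSR and extract the mode bits M[4:0] */
  unsigned current_mode()
  {
      unsigned cpsr;
      asm volatile ("mrs %0, cpsr" : "=r" (cpsr));
      return cpsr & 0x1f;  /* 0x13 = SVC, 0x1a = HYP, 0x1f = SYS */
  }

If this reports HYP where I expect SVC, that would confirm my routing suspicion.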
Tomasz Gajewski
Tomasz Gajewski tomga@wp.pl writes:
[init -> test-smp] TLB: thread started on CPU 1
[init -> test-smp] TLB: thread started on CPU 2
[init -> test-smp] TLB: thread started on CPU 3
[init -> test-smp] TLB: all threads are up and running...
[init -> test-smp] TLB: ram dataspace destroyed, all will fault...
no RM attachment (READ pf_addr=0xc00c pf_ip=0x1000d2c from pager_object: pd='init -> test-smp' thread='tlb_thread')
Warning: page fault, pager_object: pd='init -> test-smp' thread='tlb_thread' ip=0x1000d2c fault-addr=0xc00c type=no-page
Warning: core -> pager_ep: cannot submit unknown signal context
[init -> test-sm
The page fault of this tlb_thread is perfectly fine! It is exactly what is being tested here. Actually, you should see three faulting tlb_threads. The test forks a thread on each CPU apart from the first one and lets them work infinitely on a shared page, which thereby ends up in the TLB of the corresponding CPU. Because we do not start anything else on the other CPUs, there is little probability that the TLB entry gets evicted. Then the first CPU unmaps the page. If cross-CPU TLB shootdown is implemented correctly for the platform, you should notice a fault on all the other CPUs.
The question is why your messages end here, or did you just send a snippet?
No, unfortunately that is all I get. From the beginning I have had problems with everything hanging after faults. I would expect some handler to be invoked that would give me some debug information, but I don't get any.
I have a feeling that some exceptions may be routed to hyp mode, but I haven't put much effort into confirming that. Is there any base-hw code for some other architecture that installs exception handlers in hyp mode that I could check/use?
It seems the puzzle is somewhat solved.
With the help of Norman's helper for running depot scripts, I could run more tests. Most of them were passing, but test-timer was not, and I remembered that I had left timer calibration for later and forgot about it. Now, after setting a proper value for TICS_PER_US, both tests that were not working for me (test-timer and smp) are passing.
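For reference, the whole calibration boils down to the tick/microsecond conversion factor. A toy sketch; the 19.2 MHz crystal figure is my assumption for the BCM2837, the real clock may differ:

  /* assumed: the timer counts at the 19.2 MHz crystal rate, ~19 ticks/us */
  enum { TICS_PER_US = 19 };

  unsigned long us_to_ticks(unsigned long us)    { return us * TICS_PER_US; }
  unsigned long ticks_to_us(unsigned long ticks) { return ticks / TICS_PER_US; }

  /* with a wrong factor, programmed timeouts fire far too early or too
     late, which is enough to make test-timer and run/smp misbehave */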
Nevertheless, I'm afraid that this "fix" just hides some problem, because with the wrong TICS_PER_US value and some larger values in test-timer I was receiving:
[init -> test ->Kernel: Cpu 0 error: re-entered lock. Kernel exception?!
For now the "solution" satisfies me but probably I'll get back to it when I have a JTAG debugger. Without it it is hard to diagnose this.
Tomasz Gajewski
Hi Tomasz,
[init -> test ->Kernel: Cpu 0 error: re-entered lock. Kernel exception?!
For now the "solution" satisfies me but probably I'll get back to it when I have a JTAG debugger. Without it it is hard to diagnose this.
Stefan is away right now but I think that the following commit is related to this issue. It is part of the just-released Genode 19.02:
https://github.com/genodelabs/genode/commit/2cf4e5a6de176bc0c9fe6da46f06193a...
Cheers
Norman
Norman Feske norman.feske@genode-labs.com writes:
Hi Tomasz,
[init -> test ->Kernel: Cpu 0 error: re-entered lock. Kernel exception?!
For now the "solution" satisfies me but probably I'll get back to it when I have a JTAG debugger. Without it it is hard to diagnose this.
Stefan is away right now but I think that the following commit is related to this issue. It is part of the just-released Genode 19.02:
https://github.com/genodelabs/genode/commit/2cf4e5a6de176bc0c9fe6da46f06193a...
Thank you for this information. I have this commit on my branch, as it is based on staging, but I didn't know that it could be a fix for this problem, so I hadn't tested it until now.
After the first test I was going to write that, even after "decalibrating" the timer frequency (reverting my "fix"), the problem no longer appears, but I tried again and unfortunately it failed. Out of 8 attempts I had 2 failures and 6 successes.
Definitely something on staging helped, probably this commit. But the issue still exists, and I'm leaving it on my todo list for later.
Tomasz Gajewski