Hello Tomasz,
thank you very much for sharing your experimentation efforts and insights and for starting the discussion. Please, see my comments inline below.
On Mon, Jan 28, 2019 at 06:31:33PM +0100, Tomasz Gajewski wrote:
Hi,
the last weeks of attempts gave me some small progress with the code, a better understanding of some aspects of the ARM (AArch32) architecture, and some knowledge about how the low-level hw kernel is implemented.
I have some thoughts and questions about how some issues should/could be implemented that I'd like to discuss.
Firstly, I think that in my ignorance I took a wrong path by experimenting with the Raspberry Pi 3B+. It has a Cortex A53 (ARMv8-A) processor, but as I want to target AArch32 (and initially without SMP) I thought that modifying the current rpi target was the best choice. Now I think that I should have started with the newest ARM processor supported by hw (Cortex A15, if I understand correctly) and I would not have been stopped by the issues I had. Or maybe not...
Indeed, a fork of the most recent processor supported in hw might be the better starting point.
When I last wrote about my progress, I couldn't get past the following code:
  sctlr = Cpu::Sctlr::read();
  Cpu::Sctlr::C::set(sctlr, 1); // enable data cache
  Cpu::Sctlr::I::set(sctlr, 1); // enable instruction cache
  Cpu::Sctlr::M::set(sctlr, 1); // enable mmu
  Cpu::Sctlr::write(sctlr);
After reading and making different attempts I checked that enabling instruction cache does not cause problems, enabling data cache or mmu individually does allow to go further but enabling both does not. But I thought that for initial experiments I can live without data cache. But soon I found out that I was wrong.
After many attempts I found the place where it halted. It was deep inside Genode::log in ... assembly in atomic.h in cmpxchg. And reading gave me another unpleasant surprise: it will not work if I don't have the data cache enabled... So I had to go back.
Yes, we are using ldrex/strex especially to support multi-processor systems. Those instructions use cache-coherency mechanisms of the processors when working on "cacheable" memory. If you did not turn on caches and the snoop-control-unit (SMP bit in Armv7), the cmpxchg won't work properly.
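To illustrate the kind of lock that depends on these instructions, here is a minimal sketch of a compare-and-exchange spin lock in portable C++. It is not the code from atomic.h, and `Spin_lock` is a hypothetical name. On Armv7, compilers typically translate `compare_exchange_strong` into an ldrex/strex loop.

```cpp
#include <atomic>

/* Hypothetical illustration, not Genode's atomic.h */
class Spin_lock
{
    private:

        std::atomic<int> _state { 0 };   /* 0 = unlocked, 1 = locked */

    public:

        /* try once, return true if the lock was taken */
        bool try_lock()
        {
            int expected = 0;
            return _state.compare_exchange_strong(expected, 1,
                                                  std::memory_order_acquire);
        }

        /* spin until the compare-and-exchange succeeds */
        void lock() { while (!try_lock()) { } }

        void unlock() { _state.store(0, std::memory_order_release); }
};
```

If the exchange never succeeds, `lock()` spins forever, which matches the symptom of a halt inside cmpxchg.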
After two weeks of poking around finally I found a workaround that allowed me to go further. I've changed Page_flags for all types from CACHED to UNCACHED and it allowed to pass through enabling mmu with data cache enabled (of course without real caching) and through spinlock in atomic.h.
The next thing was to get the UART working. Unfortunately, version 3 has a different UART device enabled by default (it has two), and it required a different driver, which I implemented.
Seeing "kernel initialized" passed through serial connection was a real pleasure after analyzing logs in memory for two weeks.
So I thought that it is a good moment to write my thoughts and questions.
For writing traces to memory I've created a simple utility (set of macros in assembly and C++) to write simple debug values to a buffer in memory. I used it to diagnose what is going on before serial connection is working.
It is specified by a buffer address and size. At the beginning a pointer to current position in buffer is stored. A 1-1 mapping is inserted into TLB for this region and I had to make changes in virtual memory layout of core. Currently addresses are hardcoded but I can polish it to a state where it could be enabled/disabled with some build option or some defines in one place in the code if you'd like to use this. I pushed current state of my work [1] (without any cleanup yet). Utility macros are in:
- repos/base/include/base/memtrace.h
- repos/base-hw/src/bootstrap/spec/arm/crt0.s
Would you be interested in adding something like this to Genode? I would create issue and propose some version.
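The core of the idea can be sketched in C++ roughly as follows. This is a simplified illustration, not the code from memtrace.h: the class name `Mem_trace` is hypothetical, and it stores a word index in the first buffer word instead of a pointer.

```cpp
#include <cstddef>
#include <cstdint>

/* Hypothetical sketch, not the code from memtrace.h */
class Mem_trace
{
    private:

        /* volatile so that the stores are not optimized away */
        std::uint32_t volatile * const _base;  /* word 0 holds next index */
        std::size_t              const _words; /* buffer size in words    */

    public:

        /* assumes words >= 1, the first word is used as header */
        Mem_trace(std::uint32_t volatile *base, std::size_t words)
        : _base(base), _words(words) { _base[0] = 1; }

        /* append one value, silently drop it when the buffer is full */
        void trace(std::uint32_t value)
        {
            std::uint32_t const pos = _base[0];
            if (pos >= _words) return;
            _base[pos] = value;
            _base[0]   = pos + 1;
        }
};
```

Because the write position lives inside the buffer itself, a memory dump of the region is enough to see how far the code got.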
Very cool that you could help yourself that way! In general, we already have a tracing mechanism in Genode, where tracing points are collected in memory, and might get aggregated even from many different components for debug or optimization reasons. Anyway, that mechanism is based on core's functionality and therefore only appropriate for components running on top of core/kernel.
Another option used in NOVA/Genode is to write all kernel messages into a memory buffer and export it to the userland. Thereby, you can aggregate messages in headless systems or systems without AMT or serial line. I think this way might be appropriate for base-hw too. Anyway, I would not introduce new debug macros, but just add a simple serial driver, which isn't using a special device but a portion of memory for printing. Then one can switch the UART to that model if there is none, or if it is troublesome. What do you think about that approach?
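A minimal sketch of what such a memory-backed serial driver could look like. The class name `Memory_uart` and its interface are assumptions for illustration, not existing Genode code:

```cpp
#include <cstddef>

/* Hypothetical sketch, name and interface are assumptions */
class Memory_uart
{
    private:

        char * const      _buf;
        std::size_t const _size;      /* must be greater than zero */
        std::size_t       _pos { 0 };

    public:

        Memory_uart(char *buf, std::size_t size) : _buf(buf), _size(size) { }

        /* same role as a UART's put_char, wraps around when full */
        void put_char(char c)
        {
            _buf[_pos] = c;
            _pos = (_pos + 1) % _size;
        }

        std::size_t pos() const { return _pos; }
};
```

The wrap-around keeps the most recent output, at the price of overwriting the oldest messages.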
I have to admit that usually I can use a JTAG debugger on most platforms to tackle such early initialization problems. On the other hand I'm quite cautious in adding pure debug feature in the "microkernel" ;-).
I have some general doubts about how some issues could be resolved. I started experimenting with an idea to make it possible to have a Sculpt for Raspberry Pi. And I thought it could be created one in such a way that single binary could run on any Pi. I knew that they are all ARM devices and even though some are 64bit they can work in AArch32 so I wanted to treat all version as just 32bit ARM boards.
I knew that they differ in list of devices so I thought that some kind of configuration would have to be passed during booting process so proper device drivers could be loaded and with proper configuration. I checked that a solution for this problem (using one binary for different ARM boards) used by at least Linux and U-boot is device tree. My plan was (and still is unless something better is proposed) to try to experiment with incorporating support for device tree to a platform driver - if I correctly understand it is a place where similar functionality is implemented for x86_64.
Browsing through bootstrap and core code in base-hw for past weeks made me realise that to support one binary for different Pi devices which have different generations of processors other problems would have to be resolved.
If I understand correctly, currently when building base-hw for a specific ARM board, all configuration is provided during the build process. It is performed by:
a. specifying target processor version (-m for g++)
b. specifying list of files to compile and include to binary that contain implementations of functions that differ between different processor generations and to enable some functionality (e.g. for virtualization support for ARM)
c. constants in code - to provide proper memory ranges, MMIO addresses, etc. for device drivers
Don't get me wrong, striving for the stars is welcome, but I think the goal of having a generic kernel image for all RPIs or even all ARM devices is a bit too ambitious right now. Although Linux ARM developers have been working on it for many years, they aren't there yet. Look at current RPI distro images like Raspbian, they deliver at least two different kernels _and_ multiple device trees. The bootloader code has to decide which one is loaded. Currently, there is no generic convention followed by all SoC/board vendors, bootloaders etc. We have to be pragmatic here. Given the current state of e.g. the Raspberry Pi universe, where you have to decide what needs to be loaded, I do not see in general the advantage of differentiating between configuration data and some small, statically linked image.
Having all this information during the build allows to:
- optimize code by inlining even processor specific code
- minimize the resulting binary (maybe not so important) and therefore the memory footprint on the running system (more important)
On the other hand, Wikipedia (here [2]) currently mentions 15 models of Raspberry Pi which:
- differ in processor generations (ARM11, Cortex-A7, Cortex-A53) - that breaks point (a)
- have different devices (with/without ethernet, wireless, etc.) - that breaks point (b)
- have different runtime configurations that can be changed by configuring the firmware (e.g. the partitioning of RAM between graphics and operating system can be changed using a configuration option) - that contradicts (c)
Surely, those are too many dimensions to provide different fully fledged system images. Also, we do not want too many platform differentiations in the codebase. Again, I vote for pragmatism and to target the different platforms step-by-step and not all-in-one. Currently, I would think that having different targets for the different architectures (Armv6, Armv7, Armv8) is a good starting point. It is a natural boundary, and we can compile all components to target the correct one. Then the question is whether there is a mechanism in the Broadcom SoC to identify the correct model at runtime, e.g. some identification registers. If possible, we can of course hide the model differences in the platform/bootstrap code within one image. I would omit different runtime configurations of the firmware for the moment and just support the default one.
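One register that exists on all of these cores is the Arm Main ID Register (MIDR), readable in AArch32 via `mrc p15, 0, <Rt>, c0, c0, 0`. The following sketch decodes its fields from a plain value. Note that it only identifies the CPU core, not the board model or its peripherals, so it does not answer the question about the SoC. The `Midr` and `Core` types are hypothetical names for illustration.

```cpp
#include <cstdint>

/* Fields of the Arm Main ID Register (hypothetical helper type) */
struct Midr
{
    std::uint32_t value;

    constexpr unsigned implementer() const { return (value >> 24) & 0xff;  }
    constexpr unsigned part()        const { return (value >>  4) & 0xfff; }
    constexpr unsigned revision()    const { return  value        & 0xf;   }
};

enum class Core { ARM1176, CORTEX_A7, CORTEX_A53, UNKNOWN };

/* map the primary part number to the cores used on Raspberry Pi boards */
constexpr Core core_from_midr(Midr midr)
{
    return midr.part() == 0xb76 ? Core::ARM1176
         : midr.part() == 0xc07 ? Core::CORTEX_A7
         : midr.part() == 0xd03 ? Core::CORTEX_A53
         :                        Core::UNKNOWN;
}
```

Telling apart boards that share the same core would need an additional source of information, which this sketch does not cover.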
When thinking about supporting different Pi devices (and more generally other ARM boards) I think that there are two areas to consider:
A. support for runtime/startup configuration of devices that will have drivers running in different processes - this is a part that I knew about from the beginning of my experimentation. I still think that implementing support for device tree (or some alternative) is a proper solution for this part (what would be a method of passing configuration to device drivers is an open question)
B. support for passing configuration that is required by bootstrap and core. Here drivers are selected with C++ code e.g.:
  //using Serial = Genode::Pl011_uart;
  using Serial = Genode::Mini_uart;

Selecting the UART driver is an interesting example, as the RPI3B+ has two UART devices, and which one is available is decided by the firmware configuration (the mini UART is available by default and the PL011 UART is used internally for communication with Bluetooth, but this can be changed with configuration before starting the operating system).
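For comparison, a choice made at startup instead of build time could look roughly like the sketch below. All types here are hypothetical stand-ins: the two driver structs only record their output in a string, whereas real drivers would write to MMIO registers.

```cpp
#include <string>

/* Stand-ins for the real drivers (hypothetical, no MMIO access) */
struct Pl011_uart { std::string out; void put_char(char c) { out += c; } };
struct Mini_uart  { std::string out; void put_char(char c) { out += c; } };

enum class Uart_type { PL011, MINI };

/* Hypothetical wrapper that picks the driver at startup */
class Serial
{
    private:

        Uart_type  _type;
        Pl011_uart _pl011 { };
        Mini_uart  _mini  { };

    public:

        explicit Serial(Uart_type type) : _type(type) { }

        /* forward each character to the driver selected at construction */
        void put_char(char c)
        {
            if (_type == Uart_type::PL011) _pl011.put_char(c);
            else                           _mini.put_char(c);
        }

        Pl011_uart const &pl011() const { return _pl011; }
        Mini_uart  const &mini()  const { return _mini;  }
};
```

The cost is visible even in this toy version: both drivers end up in the binary and every character goes through a branch, which is exactly what the build-time alias avoids.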
Now questions:
- Do you (Genode) plan to have or would like to have such configurable support for different ARM devices?
I only state my point of view, which is not necessarily the way it will go, but anyway: nowadays, configuration starts above core/kernel. We should keep this for the time being to limit the risk of over-engineering in this minimal, critical code base. I can imagine that we weaken this claim for the bootstrap component in the future, because that code is critical in the boot stage only, but then gets thrown away anyway.
Above core, we need to know:
* What devices exist
* What resources do they need
* Are there additional configuration aspects needed by the driver, beyond the resources, e.g.: operation mode of the PHY interface for ethernet devices ...
* How to power/clock the device at runtime, if dynamic power consumption is a topic
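As a rough sketch, this information could be captured per device in a structure like the following. The type names, field names, and the lookup helper are hypothetical and only meant to make the list concrete; they do not describe an existing Genode interface.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Mmio_range { std::uint64_t base; std::uint64_t size; };

/* Hypothetical per-device description */
struct Device
{
    std::string                        name;
    std::vector<Mmio_range>            mmio;       /* register windows      */
    std::vector<unsigned>              irqs;       /* interrupt numbers     */
    std::map<std::string, std::string> properties; /* e.g. PHY mode         */
    std::vector<std::string>           clocks;     /* power/clock resources */
};

/* look up a device by name, returns nullptr if the platform has none */
Device const *find_device(std::vector<Device> const &devices,
                          std::string const &name)
{
    for (Device const &d : devices)
        if (d.name == name) return &d;
    return nullptr;
}
```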
- Do you think that creating a generic platform driver for ARM (to support A) that works using information provided by device tree support is generally a good idea? What alternatives do you consider better?
All functionality described above is realized using open firmware / flattened device tree in Linux and Co. I agree that it is appealing to be able to access all the platform data for free. But this is only half the truth for the following reasons:
# Device tree support was a moving target within the last years with lots of changes in the implementation and in the trees themselves, and it is not foreseeable that this will change
# There are reams of device-specific attributes. Please, have a look at Documentation/devicetree in the Linux kernel. There are over 3000 files. The whole "of_" bureaucracy in the Linux kernel shows its complexity.
# Different device trees and their "generic" language should not suggest that we will have one platform driver to rule them all. There are tons of "platform devices" in the Linux kernel necessary to provide the resources that toplevel devices need. That means for each SoC you still need a specific platform driver that supports all the multiplexing units for pins, power, clocking etc.
# Device trees also describe a lot of relations between devices. For instance, a device might get its interrupts from the interrupt controller or from a GPIO controller. All routes to some kind of multiplexer device are part of the tree. It is much easier to apply this configuration to a monolithic component containing all multiplexer devices than to split it across multiple components.
To sum it up. I'm not sure whether it is the right way to go to support device trees by Genode's ARM platform driver, but I do not want to exclude it completely. Maybe I'm wrong and the complexity of the core functionality is over-estimated by me. Surely we need something comparable at some point. Maybe, it would be good to start it as an experiment for one specific SoC.
- Do you see any way to implement support for B?
Surely, but at least here at Genode Labs it is not highly prioritized right now. Currently, we try to limit the platform support and care to the NXP i.MX* platforms. Here, we plan to support i.MX8 as soon as possible, which has Cortex A53 too.
I'd very much like to receive some comments about it.
Thank you very much for sharing your thoughts.
Best regards
Stefan
-- Tomasz Gajewski
[1] https://github.com/tomga/genode/tree/rpi3bplus
[2] https://en.wikipedia.org/wiki/Raspberry_Pi#Specifications