Hello
In preparation for a certain task, I want to check the PCI resources of my platform (an IVI platform with an Intel ATOM). For this purpose I built Genode-on-OKL4 with only a minimal driver set and the test-pci application. Running this image in qemu looks good, but on the IVI platform the init process fails with a page fault before or while starting the PCI driver, which is the first entry in the config file. The error message is:
no RM attachment (READ pf_addr=6000 pf_ip=2001286 from 01)
I have no clue what this message is trying to tell me. The given IP points to the function Genode::strncpy(). I also wonder why the system wants to read from virtual address 0x6000, because all modules are allocated beginning at virtual address 0x02000000. Checking init's page table with OKL4's KDB on qemu shows a number of allocations below that address:
00000000 [00140027]: tree=f0140000
00001000 [001f5067]: phys=001f5000 pg=f0140004 4KB rwx (RWX) user WB
00003000 [001f7067]: phys=001f7000 pg=f014000c 4KB rwx (RWX) user WB
00004000 [001f8067]: phys=001f8000 pg=f0140010 4KB rwx (RWX) user WB
00005000 [001df025]: phys=001df000 pg=f0140014 4KB r~x (R~X) user WB
00006000 [00275067]: phys=00275000 pg=f0140018 4KB rwx (RWX) user WB
00007000 [00276067]: phys=00276000 pg=f014001c 4KB rwx (RWX) user WB
00008000 [00277067]: phys=00277000 pg=f0140020 4KB rwx (RWX) user WB
00009000 [00278067]: phys=00278000 pg=f0140024 4KB rwx (RWX) user WB
0000a000 [00368067]: phys=00368000 pg=f0140028 4KB rwx (RWX) user WB
0000b000 [00369067]: phys=00369000 pg=f014002c 4KB rwx (RWX) user WB
0000c000 [0036a067]: phys=0036a000 pg=f0140030 4KB rwx (RWX) user WB
0000d000 [0036b067]: phys=0036b000 pg=f0140034 4KB rwx (RWX) user WB
0000e000 [0037b067]: phys=0037b000 pg=f0140038 4KB rwx (RWX) user WB
00012000 [003fa067]: phys=003fa000 pg=f0140048 4KB rwx (RWX) user WB
00016000 [00852067]: phys=00852000 pg=f0140058 4KB rwx (RWX) user WB
0004a000 [00336067]: phys=00336000 pg=f0140128 4KB rwx (RWX) user WB
00066000 [00370067]: phys=00370000 pg=f0140198 4KB rwx (RWX) user WB
On the IVI platform, this area looks as follows at the time of the page fault:
00000000 [00141027]: tree=f0141000
00001000 [001f5067]: phys=001f5000 pg=f0141004 4KB rwx (RWX) user WB
00005000 [001df025]: phys=001df000 pg=f0141014 4KB r~x (R~X) user WB
I'd like to get some hints on where to look in the code to find the cause of the problem. Since I cannot debug the platform, I will probably have to add more trace messages to get additional information about what is going on.
Regards
Frank
Hi,
On Thu, Jul 30, 2009 at 06:14:03PM +0200, Frank Kaiser wrote:
As a preparation of a certain task I want to check the PCI resources of my platform (IVI platform with Intel ATOM). For this purpose I built Genode-on-OKL4, only consisting of a minimum driver set and the test-pci application. Running this image in qemu looks good, but on the IVI platform the init process fails with a page fault before or when starting the PCI driver which is the first entry in the config file. The error message is:
no RM attachment (READ pf_addr=6000 pf_ip=2001286 from 01)
I have no clue what this message is trying to tell me.
The message means that init accessed a virtual address at which no dataspace is attached, which usually indicates a bug involving an invalid pointer.
The given IP points to the function Genode::strncpy(). I also wonder why the system wants to read from virtual address 0x6000, because all modules are allocated beginning at virtual address 0x02000000.
On Genode, the core service RM (region manager) manages the address spaces of processes. When init creates a new RAM dataspace and attaches it to its virtual address space, RM looks up an unused region that fits the dataspace.
Checking init's pagetable with OKL4's KDB on qemu shows a number of allocations below:
00000000 [00140027]: tree=f0140000
[...]
00066000 [00370067]: phys=00370000 pg=f0140198 4KB rwx (RWX) user WB
Looks fine and normal to me ;-)
On the IVI platform this area at the time of the page fault looks:
00000000 [00141027]: tree=f0141000
00001000 [001f5067]: phys=001f5000 pg=f0141004 4KB rwx (RWX) user WB
00005000 [001df025]: phys=001df000 pg=f0141014 4KB r~x (R~X) user WB
I'd like to get some hints where to look into the code for finding the cause of the problem. Since I cannot debug the platform, I probably have to add more trace messages to get additional information about what is going on.
I have no idea what happened, but the files you should have a look at are:
base-okl4/src/core/rm_session_support.cc (set verbose_unmap)
base/src/core/rm_session_component.cc (set verbose and verbose_page_faults)
Good luck
Hello Frank,
I think you have hit an issue with the handling of boot modules on OKL4. In contrast to running on Qemu, on real hardware the padding space between boot modules is not cleared on startup, so the actual data may be followed by garbage. This is particularly annoying for the config file. We pass the locally mapped config file directly to our XML parser, which expects a null termination. Without the initial clearing of memory, however, there may be no such termination, so the XML parser continues parsing until it hits the following (unmapped) page. The next release will fix the problem by allowing a length limit to be specified to the XML parser. For now, you can use the short-term fix of manually appending a zero character to your config file.
I would be grateful to know if I'm guessing right and if this quick fix works for you.
Regards
Norman
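A minimal host-side sketch of the short-term workaround mentioned above, assuming the config file is patched on the build machine before the boot image is assembled. The function name `append_nul_byte` and the choice of doing this in C++ are illustrative assumptions, not part of Genode's tooling:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: appends a single zero byte to the file at 'path'.
// Returns true if the byte was written and flushed successfully.
bool append_nul_byte(const std::string &path)
{
    // Open in binary append mode so the existing content stays untouched.
    std::ofstream out(path, std::ios::binary | std::ios::app);
    if (!out)
        return false;

    out.put('\0');
    return static_cast<bool>(out.flush());
}
```

Calling `append_nul_byte("config")` once before building the image would add one trailing zero byte. The file is then no longer strictly valid XML, which is why this can only serve as a stopgap.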
Genode-main mailing list Genode-main@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/genode-main
Hello, Norman
Your guess is right: the page fault occurs while parsing the config file. The trigger is the method Xml_node::content(), which tries to copy the process's filename from the config file, but the root cause is a nasty bug in the function Genode::strncpy(), which is used to obtain the filename. In the function's first line, Genode::strlen() is used to determine the length of the source string. In the given case, where the source is a tagged item of the config file without null termination, strlen() runs through memory until it happens to find a null character. In my opinion, Genode::strncpy() must not read the source string beyond the given size argument. Your suggestion of appending a null character to the config file (by the way: how is this to be done without corrupting the XML syntax?) cures a symptom, but does not solve the root cause.
I tried to fix Genode::strncpy() myself. Since there is no function Genode::strnlen(), I made the following change:
size_t i = 0;
while (i < size)
{
    if (src[i] == 0)
    {
        size = i;
        break;
    }
    ++i;
}
Interestingly, this seems to trigger another problem. Now I get the following two errors on all platforms:
virtual Genode::Session_capability Genode::Core_parent::session(const char*, const char*): service_name="RM" arg="ram_quota=4K" not handled
virtual Genode::Session_capability Genode::Core_parent::session(const char*, const char*): service_name="PD" arg="ram_quota=4K" not handled
Could it be that there are already some workarounds for the buggy Genode::strncpy() that no longer work once the function is fixed?
Frank
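The overrun described in this message comes from measuring the source with an unbounded length scan. A minimal sketch of a bounded alternative, comparable to POSIX strnlen(), is shown here. The name `bounded_length` is hypothetical and not part of Genode's API:

```cpp
#include <cstddef>

// Hypothetical helper, comparable to POSIX strnlen().
// Returns the number of characters before the first '\0', but never
// inspects more than max_len bytes of src.
std::size_t bounded_length(const char *src, std::size_t max_len)
{
    std::size_t len = 0;
    while (len < max_len && src[len] != '\0')
        ++len;
    return len;
}
```

For a four-byte buffer holding `init` without a terminator, `bounded_length(buf, 4)` stops after four bytes, whereas an unbounded scan would keep reading into whatever follows the buffer.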
-----Original Message-----
From: Norman Feske [mailto:norman.feske@...1...]
Sent: Sunday, August 02, 2009 5:38 PM
To: Genode OS Framework Mailing List
Subject: Re: Problem with 'test-pci'
[...]
Hi Frank,
thanks for your investigation. We have also hit this issue on real hardware (hence my initial guess), and it will be fixed in the upcoming release. Until then, I hope you are fine with the interim solution of appending the zero termination manually. Of course, the appended null character does not comply with the XML syntax. It's just a workaround.
Regards
Norman
Hi, Norman
I prefer to fix the root cause. However, my earlier attempt did not work, because it does not take into account that the function writes a '\0' at the end of the destination string (something the standard C library function doesn't do), for which the calculated size value has to be adjusted. The final fix of Genode::strncpy() is:
size_t i = 0;
for (; i < (size - 1); ++i) // last char will be set to \0 anyway
{
    if (src[i] == 0)
    {
        size = i + 1; // leave room for \0 char
        break;
    }
}
Frank
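A self-contained sketch of how the fragment above could sit inside a complete copy function. Everything around the loop is an assumption: the name `bounded_strncpy`, the guard for `size == 0` (which avoids the unsigned wrap-around of `size - 1`), and the final memcpy plus termination step are illustrative and not taken from the Genode sources:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical wrapper around the size-clamping loop from the message above.
// Copies at most size - 1 characters and always terminates dst.
char *bounded_strncpy(char *dst, const char *src, std::size_t size)
{
    // Assumed guard: with size == 0 there is no room for the terminator.
    if (!dst || !src || size == 0)
        return dst;

    // Step 1: clamp size so that it covers the string plus its terminator,
    // without reading src beyond index size - 2.
    std::size_t i = 0;
    for (; i < (size - 1); ++i) {
        if (src[i] == 0) {
            size = i + 1; // leave room for the '\0' character
            break;
        }
    }

    // Step 2: copy the payload and terminate the destination explicitly.
    std::memcpy(dst, src, size - 1);
    dst[size - 1] = 0;
    return dst;
}
```

With a four-byte source `test` that lacks a terminator and `size = 4`, the sketch copies `tes` and writes the terminator into the fourth byte of `dst`, so the source is never read past its third byte.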
-----Original Message-----
From: Norman Feske [mailto:norman.feske@...1...]
Sent: Monday, August 03, 2009 3:08 PM
To: Genode OS Framework Mailing List
Subject: Re: Problem with 'test-pci'
[...]
Hi Frank,
Frank Kaiser wrote:
I prefer to fix the root cause. However my attempt outlined below did not work, since it does not take into account that the function writes a ‘\0’ at the end of the destination string (something the standard C library function doesn’t do), for which the calculated /size/ value has to be adjusted. The final fix of /Genode::strncpy()/ is:
Indeed. The libc version gives no direct indication of whether the string was truncated. To find out, you would need to check whether dst[size - 1] == 0, relying on the zero padding (which our version does not implement). We therefore decided to ensure that the result of our function is always a properly terminated string.
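A small sketch of the libc behaviour referred to here, using the standard `std::strncpy`. The wrapper name `libc_copy_fits` is a hypothetical label for the check on `dst[size - 1]`:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical wrapper: copies with the standard strncpy and reports whether
// the whole source string, including its terminator, fitted into dst.
bool libc_copy_fits(char *dst, const char *src, std::size_t size)
{
    if (size == 0)
        return false;

    // strncpy pads the remainder of dst with zeros if src is shorter than
    // size, and writes no terminator at all if src has size or more chars.
    std::strncpy(dst, src, size);

    // Thanks to the zero padding, the last byte is 0 exactly when src fitted.
    return dst[size - 1] == 0;
}
```

Copying `abcdef` into a four-byte buffer leaves `abcd` without any terminator, so the caller has to detect the truncation and repair the buffer before treating it as a string.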
The strncpy function is only one part of a bigger problem, which is the reason why we deferred the fix until now. The root of the problem is that the end of a dataspace acquired from core's ROM service cannot be expected to be padded with zeros. In the corner case of a data module with the exact size of 4096 bytes, there is no padding at all. However, we mistakenly passed the local address of the mapped dataspace directly to the Xml_node constructor for parsing the config file. The constructor, however, expected a null-terminated string. Our fix introduces a further constructor argument for specifying the maximum length of the string. Respecting this boundary properly required code changes in the XML parser, the tokenizer, and some string functions (e.g., ascii_to_ulong). The strncpy function is the final element in the chain of troublemakers ;-)
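To illustrate what respecting such a boundary can look like in a string function, here is a minimal sketch of a length-limited decimal parser. The name `bounded_ascii_to_ulong` and its signature are hypothetical and do not reflect the actual Genode interface of ascii_to_ulong; overflow handling is left out for brevity:

```cpp
#include <cstddef>

// Hypothetical length-limited parser: reads decimal digits from s, but never
// looks at more than max_len bytes. Stores the parsed value in result and
// returns the number of characters consumed.
std::size_t bounded_ascii_to_ulong(const char *s, std::size_t max_len,
                                   unsigned long &result)
{
    unsigned long value = 0;
    std::size_t i = 0;

    // The length check comes first, so s[i] is only read within bounds.
    for (; i < max_len && s[i] >= '0' && s[i] <= '9'; ++i)
        value = value * 10 + static_cast<unsigned long>(s[i] - '0');

    result = value;
    return i;
}
```

For a four-byte buffer holding `4096` directly at the end of a mapped page, the loop stops at the length limit instead of probing the next, possibly unmapped, byte.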
From the implementation viewpoint, the strncpy function actually complied with the function interface and was not buggy. The interface expects a string as argument 'src', which is, by definition, null-terminated. The size argument is normally used to specify the boundary of the 'dst' buffer, which worked correctly. The problem is that strncpy is called with an invalid 'src' argument and the implicit assumption that the function will not touch memory beyond 'src + size - 1'. However, we need these semantics for our particular dataspace-parsing use case.
I have checked the complete fix into our SVN. For strncpy, I went for a single loop rather than two loops (checking the size and memcpy), and I think that the resulting code is more obvious. In the process, I also extended the documentation with regard to the differences between our implementation and the libc version.
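A minimal sketch of what a single-loop variant might look like. This is an illustration under stated assumptions, not the code that was checked into SVN; the name `single_loop_strncpy` and the guard for `size == 0` are hypothetical:

```cpp
#include <cstddef>

// Hypothetical single-loop copy: the bounds check, the search for the
// terminator, and the copy happen in one pass. dst is always terminated,
// and src is never read at or beyond index size - 1.
char *single_loop_strncpy(char *dst, const char *src, std::size_t size)
{
    // Assumed guard: with size == 0 there is no room for the terminator.
    if (!dst || !src || size == 0)
        return dst;

    std::size_t i = 0;
    // 'i + 1 < size' reserves the last byte of dst for the terminator and
    // is evaluated before src[i] is read.
    for (; i + 1 < size && src[i] != 0; ++i)
        dst[i] = src[i];

    dst[i] = 0;
    return dst;
}
```

Because the length check short-circuits the read of `src[i]`, the function can be pointed at the tail of a dataspace that has no trailing zero, as long as `size` does not exceed the remaining mapped bytes plus one.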
Best regards
Norman