Dear Genodians
I'm lost and would appreciate some inspiration on how to further debug a problem in a scenario that is similar to `run/depot_query`.
My scenario seems to work fine on x86_64, but on 32-bit ARM (base-linux) the `depot_query` component segfaults. Well, it doesn't always segfault, but it seems to depend on the `depot.tar`, the `deploy.config`, or both. However, the `deploy.config` and `depot.tar` don't need to change much: only the depot user or the package version may vary.
The segfault occurs in `Depot_query::Main::_query_blueprint(...)` when accessing the `node` parameter that the `File_content` class instantiated before invoking the lambda.
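For context, the code in question follows roughly this pattern - a sketch only, assuming the `File_content` utility from `os/vfs.h`; `pkg_dir` and the limit value are placeholders, not the literal depot_query code:

```
/* sketch of the failing pattern, not the literal depot_query code */
File_content const runtime(_heap, pkg_dir, "runtime",
                           File_content::Limit{16*1024});

runtime.xml([&] (Xml_node node) {

	/* accessing 'node' here is what segfaults */
	node.for_each_sub_node([&] (Xml_node sub) { /* ... */ });
});
```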
I also need to mention that the segfault reproducibly occurs at the same "to be deployed" runtime (pkg/chroot) - at least for a given `depot.tar` / `deploy.config`. Other, much larger runtime files (~9k) have been processed successfully prior to that.
I'm aware that this is a rather vague description and therefore modified `run/depot_query.run` so that it uses my `deploy.config` and `depot.tar`. But no luck, I can't reproduce the problem this way...
Does anyone have an idea what could possibly lead to a "corrupt" instance of `Xml_node`? Or how to test hypotheses like a stack overflow?
Thanks, Roman
Hi Roman,
On 26.03.20 15:08, Roman Iten wrote:
> I'm lost and would appreciate some inspiration on how to further debug a problem in a scenario that is similar to `run/depot_query`.
can you please share which Genode version you are using? In particular, I want to ensure that you are using commit [1]. (It would not explain the different behavior between x86_64 and ARM, though.)
[1] https://github.com/genodelabs/genode/commit/f85ec313de2cb723b1ca004866f03163...
For investigating issues like this, I usually start by looking at the page-fault address. What does 'dmesg' tell you? If the address lies within the stack area, the issue might be a stack overflow. If the address is very small, it would hint at a de-referenced null pointer. What is the code around the faulting instruction pointer doing (using 'objdump -lSd' on the faulting binary and searching for the instruction pointer)?
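To probe the stack-overflow hypothesis directly, you could log the remaining stack headroom right before the crashing access. A rough sketch from memory, assuming 'Thread::mystack()' from base/thread.h:

```
/* rough probe of the remaining stack headroom (sketch, from memory) */
#include <base/thread.h>
#include <base/log.h>

static void log_stack_headroom()
{
	Genode::Thread::Stack_info const info = Genode::Thread::mystack();

	int probe; /* address of a local approximates the stack pointer */

	/* the stack grows downwards, so the headroom is SP minus stack base */
	Genode::log("stack headroom: ",
	            (Genode::addr_t)&probe - info.base, " bytes");
}
```

If the logged headroom shrinks towards zero shortly before the fault, a larger stack size would be worth a try.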
Since you are using base-linux, have you tried obtaining a backtrace via the GNU debugger? You may find the steps given at [2] useful.
[2] https://genodians.org/ssumpf/2019-04-26-java-gdb
I vaguely remember that you have tweaked the tool chain for base-linux on ARM using hard fp. Does the problem also occur with the original tool chain? I'm asking just to rule out tricky tool-chain-related technicalities.
Is the problem deterministic? If yes, have you tried the same scenario (same binaries) on another ARM kernel, e.g., base-hw on pbxa9? By cross-correlating different kernels, you can see whether the problem is specific to Linux or generally applies to 32-bit ARM.
You mention that the problem occurs with one particular deploy.config but not with another. So you may try gradually turning the one (bad) version into the other (good) and see when it stops breaking (bisecting the issue). Similarly, you could try reducing the "bad" configuration as far as possible while the problem persists, eventually reaching a minimal test case.
Good luck! Norman
Hi Norman,
Thanks for the hints!
> can you please share which Genode version you are using? In particular, I want to ensure that you are using commit [1]. (It would not explain the different behavior between x86_64 and ARM, though.)
The Genode version I use is based on sculpt-20.02 which includes the commit you mentioned.
It turns out that the problem is related to the runtime file of pkg/chroot:
```
<runtime ram="1M" caps="100" binary="chroot">
  <requires> <file_system/> </requires>
  <provides> <file_system/> </provides>
  <config/>
  <content> <rom label="ld.lib.so"/> <rom label="chroot"/> </content>
</runtime>
```
There's no segfault if I either remove the empty `<config/>` node completely or replace it with `<config></config>`.
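To rule out a parsing difference between the two spellings, one could feed the self-closing variant to `Xml_node` directly - a minimal sketch outside the actual component, assuming the plain `Xml_node` API:

```
/* minimal sketch for probing how the empty node is parsed */
char const *self_closing = "<runtime><config/></runtime>";

Genode::Xml_node const runtime(self_closing, Genode::strlen(self_closing));
Genode::Xml_node const config = runtime.sub_node("config");

Genode::log("size of <config/> node: ", config.size());
```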
To me it looks like [1] refers to a similar issue in a different context.
[1] https://lists.genode.org/pipermail/users/2019-June/006781.html
Cheers, Roman
Hi Roman,
> There's no segfault if I either remove the empty `<config/>` node completely or replace it with `<config></config>`.
> To me it looks like [1] refers to a similar issue in a different context.
> [1] https://lists.genode.org/pipermail/users/2019-June/006781.html
intuitively, this looks related, indeed. But given the code, I'm unable to immediately spot the same pattern. '_apply_blueprint' does not parse the <config> node after all. It merely copies the enclosing <runtime> node as is (via 'Xml_node::with_raw_node').
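For reference, the copy happens roughly like this (sketched from memory; 'with_raw_node' hands out the start pointer and byte length of the node):

```
/* sketch of the copy-as-is step in '_apply_blueprint' (from memory) */
runtime.with_raw_node([&] (char const *start, Genode::size_t length) {

	/* append the whole <runtime> node verbatim, <config> included */
	xml.append(start, length);
});
```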
I just tried to reproduce the problem by executing the depot_query.run script (modified to deploy chroot) for NOVA on x86_32 and base-hw on pbxa9 (ARM) but I could not trigger it.
Do you have an example scenario that I could use for reproducing the problem at hand? I'd very much appreciate that.
You left a few of my questions unanswered. In particular,
* Does the problem occur on any kernel/architecture combination other than base-linux on 32-bit ARM?
* Does it occur when using the original tool chain?
Cheers, Norman
Hello Norman,
>> There's no segfault if I either remove the empty `<config/>` node completely or replace it with `<config></config>`.
> intuitively, this looks related, indeed. But given the code, I'm unable to immediately spot the same pattern. '_apply_blueprint' does not parse the <config> node after all. It merely copies the enclosing <runtime> node as is (via 'Xml_node::with_raw_node').
My conclusion was indeed premature. Using another depot user with different archive versions, the segfault happens at a different package - regardless of the `<config>` node in pkg/chroot. At least in this particular case it *seems* to be related to the size of the runtime file: if the file happens to be between 817 and 832 bytes, it fails. There's no problem if the file is bigger or smaller (tested with +/- ~5 bytes).
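This size dependence would fit a read past the end of an unterminated buffer: such a read only faults if the buffer happens to end right at an unmapped page. A self-contained illustration of the effect (plain POSIX, not the actual Genode code; the concrete sizes are made up):

```
/* illustration of the suspected bug class: a read past the end of an
 * unterminated buffer faults only if the buffer ends at an unmapped page */
#include <sys/mman.h>
#include <cstddef>
#include <cstring>
#include <cstdio>

int main()
{
	size_t const page = 4096, file_size = 824; /* within the 817..832 range */

	/* map two pages and revoke access to the second one (guard page) */
	char *base = (char *)mmap(nullptr, 2*page, PROT_READ | PROT_WRITE,
	                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	mprotect(base + page, page, PROT_NONE);

	/* place the unterminated content so that it ends at the guard page */
	char *content = base + page - file_size;
	memset(content, 'x', file_size);

	/* any scan beyond 'file_size' bytes crosses into the guard page;
	 * with different surroundings the same bug would go unnoticed */
	printf("%zu\n", strlen(content)); /* segfaults here */
}
```

Whether such an out-of-bounds read crashes depends entirely on what happens to be mapped behind the buffer, which could explain why only this narrow size range triggers it.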
> * Does the problem occur on any kernel/architecture combination other than base-linux on 32-bit ARM?
It does not occur on base-linux/x86_64. Because the scenario is rather complex, it cannot easily be run on other kernel/architecture combinations.
> * Does it occur when using the original tool chain?
For the ARM target and base-linux, where the error happens, I cannot use the original tool chain. For x86_64 and base-linux I use the original tool chain.
I tried run/depot_query with my depot.tar on base-hw on 32-bit ARM (zynq/qemu). It executes successfully.
So, as far as I can tell, the problem doesn't occur when using the original tool chain.
> Do you have an example scenario that I could use for reproducing the problem at hand? I'd very much appreciate that.
Unfortunately not. I'm aware that under these circumstances your hands are pretty much tied. I'll keep you posted if I have more concrete information, either about the source of the problem or about how to reproduce it in a "generally available setting".
Thanks for your support so far!
Cheers, Roman
Hi,
>> Do you have an example scenario that I could use for reproducing the problem at hand? I'd very much appreciate that.
> Unfortunately not. I'm aware that under these circumstances your hands are pretty much tied. I'll keep you posted if I have more concrete information, either about the source of the problem or about how to reproduce it in a "generally available setting".
I'm quite confident that I finally found the problem. I created an issue [1] and pushed a commit as an illustration of a possible fix.
Cheers, Roman
Hi,
On Mon, Apr 13, 2020 at 20:09:52 CEST, Roman Iten wrote:
> I'm quite confident that I finally found the problem. I created an issue [1] and pushed a commit as an illustration of a possible fix.
That's great! For further discussion I'll comment on GitHub.
Many thanks for your efforts!