Dear Genodians
After updating to genode 24.04 we found an intermittent problem involving the depot_download runscript in very specific circumstances.
The problem manifests itself by fetchurl getting stuck for more than 10 seconds downloading a signature file, then aborting with the message 'Error: Operation was aborted by an application callback (code=0)'. Next fetchurl sucessfully downloads the archive file but then signature verification fails because no signature was downloaded.
Some debugging revealed that in the error case the DNS response is forwarded by the nic_router a few milliseconds after the request but is not read by the fetchurl application until immediately before the abortion message, e.g. more than 10 seconds later. Further debugging is difficult since this Heisebug disappears when more output slows down execution. For illustration I've added an excerpt from the log below.
The problem occurs infrequently and only under very specific circumstances: * using kernel base-hw running on qemu * on our gitlab CI/CD runner inside docker * on two distinct physical servers at our hoster
We have unsucessfully tried to reproduce the problem as follows: * on the same physical machines manually started inside and outside docker * on other multiple physical machines using an identical gitlab CI/CD runner setup
The problem occurs with both lxip and lwip network stack. The problem seems to disappear when we run the network stack (lxip/lwip) inside the fetchurl process instead of a separate vfs.
We think we hit a very peculiar corner case of signalling between the network stack running in a separate vfs and the libc socket plugin.
We would greatly appriciate any insight of what might be amiss or any suggestion how we could proceed with debugging.
Best regards Stefan
[2024-07-03 09:17:20.687416] [init -> report_rom] report 'depot_download -> state' [2024-07-03 09:17:20.688199] [init -> report_rom] <state progress="yes"> [2024-07-03 09:17:20.689929] [init -> report_rom] <archive path="gapstage/pkg/http_client_server_demo/2023-11-09" state="download" total="0" now="0"/> [2024-07-03 09:17:20.690572] [init -> report_rom] </state> [2024-07-03 09:17:20.775999] [init -> depot_download -> report_rom-fetchurl] report 'dynamic -> fetchurl -> progress' [2024-07-03 09:17:20.776670] [init -> depot_download -> report_rom-fetchurl] <progress> [2024-07-03 09:17:20.778610] [init -> depot_download -> report_rom-fetchurl] <fetch url="http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-09.tar.xz.sig" total="0.0" now="0.0"/> [2024-07-03 09:17:20.780845] [init -> depot_download -> report_rom-fetchurl] <fetch url="http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-09.tar.xz" total="0.0" now="0.0"/> [2024-07-03 09:17:20.781492] [init -> depot_download -> report_rom-fetchurl] </progress> [2024-07-03 09:17:20.782465] [init -> report_rom] report 'depot_download -> state' [2024-07-03 09:17:20.783111] [init -> report_rom] <state progress="yes"> [2024-07-03 09:17:20.784645] [init -> report_rom] <archive path="gapstage/pkg/http_client_server_demo/2023-11-09" state="download" total="0" now="0"/> [2024-07-03 09:17:20.784974] [init -> report_rom] </state> [2024-07-03 09:17:20.786612] [init -> depot_download -> dynamic -> fetchurl] fetch http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-0... [2024-07-03 09:17:20.802449] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=9 name= [2024-07-03 09:17:20.818176] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::write() >>>>> src.length=11 name= [2024-07-03 09:17:20.819180] [init -> depot_download -> tcpip] Warning: | [2024-07-03 09:17:20.819871] [init -> depot_download -> tcpip] Warning: | 10.0.2.3:53 [2024-07-03 09:17:20.822032] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=9 name= [2024-07-03 09:17:20.824025] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::write() >>>>> src.length=40 name= [2024-07-03 09:17:20.824679] [init -> depot_download -> tcpip] Warning: | [2024-07-03 09:17:20.826310] [init -> depot_download -> tcpip] Warning: | CB...........dev.depot.gapfruit.com..... [2024-07-03 09:17:20.829221] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:20.830848] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:21.065728] [init -> depot_download -> report_rom-fetchurl] report 'dynamic -> fetchurl -> progress' [2024-07-03 09:17:21.066344] [init -> depot_download -> report_rom-fetchurl] <progress> [2024-07-03 09:17:21.068594] [init -> depot_download -> report_rom-fetchurl] <fetch url="http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-09.tar.xz.sig" total="0.0" now="0.0"/> [2024-07-03 09:17:21.070546] [init -> depot_download -> report_rom-fetchurl] <fetch url="http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-09.tar.xz" total="0.0" now="0.0"/> [2024-07-03 09:17:21.071200] [init -> depot_download -> report_rom-fetchurl] </progress> ... (many more reports of zero progress) [2024-07-03 09:17:33.015603] [init -> depot_download -> report_rom-fetchurl] report 'dynamic -> fetchurl -> progress' [2024-07-03 09:17:33.016567] [init -> depot_download -> report_rom-fetchurl] <progress> [2024-07-03 09:17:33.018731] [init -> depot_download -> report_rom-fetchurl] <fetch url="http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-09.tar.xz.sig" total="0.0" now="0.0" timeout="true"/> [2024-07-03 09:17:33.020665] [init -> depot_download -> report_rom-fetchurl] <fetch url="http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-09.tar.xz" total="0.0" now="0.0"/> [2024-07-03 09:17:33.021626] [init -> depot_download -> report_rom-fetchurl] </progress> [2024-07-03 09:17:33.066395] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=12 name= [2024-07-03 09:17:33.067928] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=56 name= [2024-07-03 09:17:33.072760] [init -> depot_download -> dynamic -> fetchurl] Error: Operation was aborted by an application callback (code=0), failed to fetch http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-0... [2024-07-03 09:17:33.074300] [init -> depot_download -> dynamic -> fetchurl] fetch http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-0... [2024-07-03 09:17:33.080683] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=9 name= [2024-07-03 09:17:33.094847] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::write() >>>>> src.length=11 name= [2024-07-03 09:17:33.095631] [init -> depot_download -> tcpip] Warning: | [2024-07-03 09:17:33.096409] [init -> depot_download -> tcpip] Warning: | 10.0.2.3:53 [2024-07-03 09:17:33.098564] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=9 name= [2024-07-03 09:17:33.100202] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::write() >>>>> src.length=40 name= [2024-07-03 09:17:33.100989] [init -> depot_download -> tcpip] Warning: | [2024-07-03 09:17:33.102030] [init -> depot_download -> tcpip] Warning: | 2i...........dev.depot.gapfruit.com..... [2024-07-03 09:17:33.104395] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.106800] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=12 name= [2024-07-03 09:17:33.108505] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=56 name= [2024-07-03 09:17:33.118989] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::write() >>>>> src.length=16 name= [2024-07-03 09:17:33.119804] [init -> depot_download -> tcpip] Warning: | [2024-07-03 09:17:33.120599] [init -> depot_download -> tcpip] Warning: | 83.166.133.92:80 [2024-07-03 09:17:33.122640] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.124381] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=15 name= [2024-07-03 09:17:33.170773] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.173216] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=9 name= [2024-07-03 09:17:33.175279] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=15 name= [2024-07-03 09:17:33.177831] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=15 name= [2024-07-03 09:17:33.180810] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=15 name= [2024-07-03 09:17:33.181887] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::write() >>>>> src.length=143 name= [2024-07-03 09:17:33.182025] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.182234] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.218590] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.220428] [init -> depot_download -> tcpip] Warning: Lxip_vfs_file_handle::read() >>>>> dst.length=1690 name= [2024-07-03 09:17:33.222224] [init -> depot_download -> report_rom-fetchurl] report 'dynamic -> fetchurl -> progress' [2024-07-03 09:17:33.222838] [init -> depot_download -> report_rom-fetchurl] <progress> [2024-07-03 09:17:33.225334] [init -> depot_download -> report_rom-fetchurl] <fetch url="http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-09.tar.xz.sig" total="0.0" now="0.0" timeout="true"/> [2024-07-03 09:17:33.227218] [init -> depot_download -> report_rom-fetchurl] <fetch url="http://dev.depot.gapfruit.com/gapstage/pkg/http_client_server_demo/2023-11-09.tar.xz" total="1412.0" now="1412.0"/> [2024-07-03 09:17:33.228017] [init -> depot_download -> report_rom-fetchurl] </progress> [2024-07-03 09:17:33.235666] [init -> report_rom] report 'depot_download -> state' [2024-07-03 09:17:33.236203] [init -> report_rom] <state progress="yes"> [2024-07-03 09:17:33.237418] [init -> report_rom] <archive path="gapstage/pkg/http_client_server_demo/2023-11-09" state="fetched"/> [2024-07-03 09:17:33.238035] [init -> report_rom] </state> [2024-07-03 09:17:33.238657] [init -> report_rom] report 'depot_download -> state' [2024-07-03 09:17:33.239417] [init -> report_rom] <state progress="yes"> [2024-07-03 09:17:33.240663] [init -> report_rom] <archive path="gapstage/pkg/http_client_server_demo/2023-11-09" state="verify"/> [2024-07-03 09:17:33.240973] [init -> report_rom] </state> [2024-07-03 09:17:33.250585] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.252277] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.289341] [init -> depot_download -> dynamic -> verify] /public/gapstage/pkg/http_client_server_demo/2023-11-09.tar.xz: bad signature [2024-07-03 09:17:33.291209] [init -> depot_download -> manager] Error: signature check failed for 'gapstage/pkg/http_client_server_demo/2023-11-09' (bad signature) [2024-07-03 09:17:33.292338] [init -> report_rom] report 'depot_download -> state' [2024-07-03 09:17:33.293402] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.294471] [init -> depot_download -> tcpip] Warning: Lxip_factory::socket_progress() >>>>> [2024-07-03 09:17:33.295748] [init -> report_rom] <state progress="yes"> [2024-07-03 09:17:33.296838] [init -> report_rom] <archive path="gapstage/pkg/http_client_serveError: Test execution timed out while waiting for ".*path="gapstage/pkg/http_client_server_demo/2023-11-09" state="done".*.*path="genodelabs/pkg/wm/2019-03-17" state="done".*.*path="genodelabs/index/19.03" state="failed".*" make: *** [Makefile:449: run/depot_download-gapfruit-internal_init] Error 254 r_demo/2023-11-09" state="corrupted"/> [2024-07-03 09:17:33.297432] [init -> report_rom] </state> [2024-07-03 09:17:33.311050] [init -> depot_download -> manager] installation complete. [2024-07-03 09:17:33.313447] [init -> report_rom] report 'depot_download -> state' [2024-07-03 09:17:33.313475] [init -> report_rom] <state> [2024-07-03 09:17:33.313716] [init -> report_rom] <archive path="gapstage/pkg/http_client_server_demo/2023-11-09" state="failed"/> [2024-07-03 09:17:33.314796] [init -> report_rom] <archive path="genodelabs/pkg/wm/2019-03-17" state="done"/> [2024-07-03 09:17:33.315881] [init -> report_rom] <index path="genodelabs/index/19.03" state="failed"/> [2024-07-03 09:17:33.316150] [init -> report_rom] </state> [2024-07-03 09:17:33.318112] [init -> depot_download -> manager] installation complete.
Howdy, I only have a naive/labor-intensive idea to contribute:
De : Stefan Thöni stefan.thoeni@gapfruit.com
Dear Genodians
After updating to genode 24.04 we found an intermittent problem involving the depot_download runscript in very specific circumstances.
Is there both a known-bad version (24.04) and a known-good one ? Like 24.02, or chronologically closer.
If so, then the 'git bissect' (?) command should pinpoint the culprit commit, though that's going to take time.. (and what with the need to interact with docker etc)
Cédric
Hi Cédric
Thank you for your input. Meanwhile, we have tested several things. First we crafted a runscript that would trigger the error at roughly 50% on these circumstances:
using kernel base-hw running on qemu on our gitlab CI/CD runner inside docker on two distinct physical servers at our hoster We went back to the Genode 23.11 release, on which the same runscript and circumstances would produce the error at roughly 10% of the runs.
We also noticed the error happening on more complex scenarios on other architectures, namely to download boot images.
Currently, we assume the problem to be within the libc or within the File_system session.
The following lxip issues/discussions could be relevant. However, it seems that our error is stack independent.
https://github.com/genodelabs/genode/issues/5104#issuecomment-1915295169 https://github.com/genodelabs/genode/issues/4810
Best regards Stefan
---- Message d'origine ----
De : Stefan Thöni stefan.thoeni@gapfruit.com
Thank you for your input. Meanwhile, we have tested several things. First we crafted a runscript that would trigger the error at roughly 50% on these circumstances: (..) We went back to the Genode 23.11 release, on which the same runscript and circumstances would produce the error at roughly 10% of the runs.
Indeed if the 'git bissect' strategy just tweaks the probability of the bug occuring, it sounds like going further down this route might just be a heavy-work-low-reward endeavour.
As I recall, the issue is about fetchurl sometimes failing to download a file, returning an error code, with some suspected interplay with a vfs server, right ? I wonder if one could temporarily tweak the 'fetchurl' source code (or 'vfs' code) so that it drops into debugger/monitor, the very moment the bug occurs ?
eg if( http error code such and such ) drop_to_debugger();
That would work kind of like "core dumps" on Unices of old when they stumbled on an assert(), but much nicer, one could immediately inspect the state of the component, understand how it got in trouble. That's a technique I sometimes used back in BeOS days.
Or...
Revisiting the tracing and Heisenbug issue, it might be possible to find forms of tracing that are _so_ lightweight that they do not fret the "Heisenbug", i.e. they do not make the bug disappear into hiding when tracing is activated. For the "media kit" part of my project I published a sort of 'ring buffer' tracing class that just memcpy()s the log string into a tiny "ring" buffer, so it has really minimum impact. It's a bit tricky to use, one has to "freeze" the program once the bug occurs, so as to be able to inspect the ringbuffer *before* the log lines are clobbered by new logs. But, if the bug is indeed reproducible under such conditions, even once in ten runs (10%), it might allow collecting valuable information.
I'll be following this if you keep us posted anyhow... I sometimes see "packet operation=1 failed" errors from fetchurl when downloading Falkon etc packages, with the latest Genode, but depot_download reswpawns it and fetchurl works on the second try, so it's a low priority item in my case.
PS. my procedural 2 cents: if you file a ticket it will be easier to re-visit the history, make suggestions etc, compared to e-mails... Plus, this appears to be an upstream bug, and I believe the Genode team likes to keep abreast of upstream bugs ^^
Cedric