Dear Genodians,
With the growing adoption of Goa and our direction of using it more often in our workflow, we extensively create, upload, download, extract, and deploy PKGs on our devices. Inevitably, the depot has filled up our storage, most notably on our arm_v8a-based gateway, which comes with only 16 GB of eMMC. We need a strategy to clearly define what can be removed from the device and a way to do it accordingly. I have sketched out a plan for implementing this feature and would like your opinion on it. Also, would you be interested in taking this upstream?
I want to create a new component, 'depot_autoremove', that would be executed by the 'depot_download_manager' after the 'extract' step as an optional stage. The 'depot_autoremove' component will consume a config, the installation config, to identify a set of packages to keep installed. It will remove any other packages and orphan dependencies not part of that set.
That would be the first iteration for this component.
You may have already thought about this problem and might have an idea of how to solve it.
1. Is there already an existing solution to this problem that I have overlooked?
2. Is there an alternative approach that you would be more in favour of?
3. What is your opinion on this approach?
Best, Alice Domage
Hi Alice,
thanks for bringing up this topic. Coincidentally, I've been (vaguely) planning to work on this topic as well. On the phone version of Sculpt, I'd like to give the user an easy way to uninstall software (including the implicitly installed dependencies).
I want to create a new component, 'depot_autoremove', that would be executed by the 'depot_download_manager' after the 'extract' step as an optional stage. The 'depot_autoremove' component will consume a config, the installation config, to identify a set of packages to keep installed. It will remove any other packages and orphan dependencies not part of that set.
I think that the removal of depot content is not related to the 'depot_download_manager'. It can be considered as an independent problem. Since the removal of depot content does not involve any networking and does not process any (potentially dangerous) data (like extracting archives), we can get away with a simple component that merely scans the depot and removes content using plain file operations. So we don't need any fine-grained sandboxing as done by the depot-download subsystem.
Let me share my thoughts on what I would expect from a depot-uninstall component. In general, it would perform two steps.
1. It would remove a selection of depot packages according to its configuration. With depot package, I only refer to pkg/<version>/ directories.
2. It would garbage-collect all depot content that is no longer referenced by any pkg present in the depot. I guess this is what you had in mind with the 'depot_autoremove' naming.
The pkg-removal step raises the question of how to specify the set of packages to remove. You suggested specifying a list of pkgs to keep and discard everything not featured in this list. In other situations, like when using Sculpt with a large disk and preserving the ability to roll the system back to any previous version, it would be more appropriate to explicitly select the packages to remove - letting the user take interactive decisions.
I think the <config> of the depot-uninstall tool could accommodate both situations quite well. E.g.,
Remove one specific version of a pkg:
<config> <remove user="cnuke" pkg="pdf_view" version="2022-02-22"/> ... </config>
Remove all versions of a pkg:
<config> <remove user="cnuke" pkg="pdf_view"/> ... </config>
Remove all pkgs of a specified depot user:
<config> <remove user="cnuke"/> ... </config>
Remove all pkgs except for an explicit selection of packages to keep:
<config> <remove-all> <keep user="cnuke" pkg="pdf_view"/> </remove-all> ... </config>
The <keep> node could also give the freedom to select a particular version or a whole user.
Would that configuration interface satisfy your needs?
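For illustration, the matching semantics behind such a <config> could be sketched in plain C++, with hypothetical structs standing in for the parsed <remove>/<keep> nodes (Genode's Xml_node parsing is left out, and all names here are made up for the sketch):

```cpp
#include <string>
#include <vector>

struct Rule {
    std::string user;     /* an empty attribute acts as a wildcard */
    std::string pkg;
    std::string version;

    bool matches(std::string const &u, std::string const &p,
                 std::string const &v) const
    {
        return (user.empty()    || user    == u)
            && (pkg.empty()     || pkg     == p)
            && (version.empty() || version == v);
    }
};

struct Config {
    bool              remove_all = false;  /* <remove-all> present   */
    std::vector<Rule> remove;              /* <remove .../> nodes    */
    std::vector<Rule> keep;                /* <keep .../> sub nodes  */

    /* decide whether pkg <user>/pkg/<pkg>/<version> should go away */
    bool remove_pkg(std::string const &u, std::string const &p,
                    std::string const &v) const
    {
        if (remove_all) {
            for (Rule const &r : keep)
                if (r.matches(u, p, v))
                    return false;          /* explicitly kept */
            return true;
        }
        for (Rule const &r : remove)
            if (r.matches(u, p, v))
                return true;               /* explicitly removed */
        return false;
    }
};
```

Leaving an attribute out of a <remove> or <keep> node then naturally widens its scope, which covers all four example configurations above with one rule type.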
The second (garbage-collection) step would collect the content of all 'pkg/<version>/archives' files found in the depot - the remaining "working set" of pkg dependencies, so to speak. With the working set determined, it would traverse all src/, bin/, and raw/ directories, check whether the respective directory is part of the working set, and remove it otherwise.
For the directory traversal and file operations, it may be useful to take the implementation of the depot_query and fs_tool components as inspiration.
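As a rough plain-C++ sketch of the working-set collection step, using std::filesystem instead of Genode's file-system session, and assuming the usual depot layout depot/<user>/pkg/<name>/<version>/archives:

```cpp
#include <filesystem>
#include <fstream>
#include <set>
#include <string>

namespace fs = std::filesystem;

/* collect every path listed in any <user>/pkg/<name>/<version>/archives file */
std::set<std::string> working_set(fs::path const &depot)
{
    std::set<std::string> refs;
    for (auto const &user : fs::directory_iterator(depot)) {
        fs::path const pkg_dir = user.path() / "pkg";
        if (!fs::is_directory(pkg_dir))
            continue;
        for (auto const &entry : fs::recursive_directory_iterator(pkg_dir))
            if (entry.path().filename() == "archives") {
                std::ifstream file(entry.path());
                for (std::string line; std::getline(file, line); )
                    if (!line.empty())
                        refs.insert(line);   /* duplicates collapse in the set */
            }
    }
    return refs;
}
```

The std::set already performs the de-duplication, so the result is directly the working set to compare each bin/ and raw/ sub directory against.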
There is one open question, though: A pkg archive file can refer to other pkgs, which are implicitly installed. It would be nice to include such implicitly installed pkgs in the garbage collection. In order to do so, we would need to slightly enhance the depot-download mechanism to annotate how each pkg entered the depot. E.g., for all pkgs explicitly specified in the <installation>, we could add an empty file 'selected' inside the pkg. All pkgs without such an annotation would then be included in the garbage collection.
The feature would be a very welcome addition.
Do you think that the rough plan above is sensible?
Cheers Norman
Dear Norman,
Thank you for sharing your plan. It clarifies the big picture for me. I agree, it does not have to be bound to the 'depot_download_manager'. My motivation was to ensure that a depot clean-up task would not interfere with others. For us, this would likely be part of an automated process. We can keep it out of the 'depot_download_manager' picture.
Would that configuration interface satisfy your needs?
Yes! The proposed configuration scheme fits our needs and yours, I assume, for using it interactively in Sculpt.
For the directory traversal and file operations, it may be useful to take the implementation of the depot_query and fs_tool components as inspiration.
Thanks for pointing those out. It will be helpful!
There is one open question, though: A pkg archive file can refer to other pkgs, which are implicitly installed.
The way I envision the implementation is as follows:
1. It creates a graph representing the depot state by traversing it. The graph is implemented with a dictionary. Each node uses a 'Depot::Archive::Path' as key and a list of neighbouring 'Depot::Archive::Path' dependencies as value. Graph nodes can be of any archive type.
2. First, it goes through the packages. As you said, it registers dependencies. It also creates nodes for any dependency archives, pointing back to the 'pkgs' that reference them. Thus, this creates loops in the graph between dependencies.
3. It iterates over its config and performs the required actions.
4. When a package is deleted, it traverses the neighbour dependency list, colours the dependencies for deletion, and removes the package reference. If a node has an empty list of neighbours, it can be deleted safely, as it is not in use any more.
It would be nice to include such implicitly installed pkgs in the garbage collection.
When a package depends on another package, it will be coloured for deletion as any other dependency.
However, there is a pitfall. If a package has another 'pkg' in its dependencies, it is unclear if it is here because it is present in the 'archives' list or because it is a dependency itself.
This can be solved by comparing the node neighbours list with the 'pkg/<name>/archives'. If it matches, the current 'pkg' node can be coloured for deletion. Otherwise, it means that this 'pkg' is also a dependency of another 'pkg'. Thus it is not coloured for deletion.
This way, I believe there is no need for persistent annotation of 'pkg' dependencies by the 'depot_download_manager'. I am concerned about the performance of such an algorithm and would have to finish a first implementation to be certain. As the dictionary is implemented with an AVL tree, it should perform in a reasonable time.
Do you think that the rough plan above is sensible?
It looks good to me. I will proceed in this direction.
Cheers, Alice
Hi Alice,
The way I envision the implementation is as follows:
- It creates a graph representing the depot state by traversing it. The graph is implemented with a dictionary. Each node uses as a key a 'Depot::Archive::Path' and as a value a list of 'Depot::Archive::Path' that are dependencies neighbours. Graph nodes can be of any archive type.
- First, it goes through the packages. As you said, it registers dependencies. It also creates nodes for any dependencies archive pointing to their referenced 'pkgs'. Thus, this creates loops in the graph between dependencies.
- It iterates over its config and performs the required actions.
- When a package is deleted, it traverses the neighbour dependencies list. Colours them for deletion, and remove the package reference. If a node has an empty list of neighbours, it can be deleted safely, as it isn't in use any more.
Maybe it is beneficial to break down the problem even further. In fact, depot archive types do not arbitrarily depend on one another. Specifically, binary archives cannot depend on each other. Also raw archives have no dependencies. Src archives can only depend on api archives but not on other src archives. Also api archives cannot have dependencies. For this current discussion, I'd leave out src and api archives anyway.
The only case where a dependency tree of multiple levels is formed are pkg archives depending on other pkg archives. With this observation, I would only look at pkg archives at first. Scanning the depot for the list of pkg archives should be quick enough. For each pkg, I would ask: "should this pkg be removed?". The answer is given by the <config>. To implement this step, there is no need to build an internal data structure.
Then, after having removed pkg archives, I'd read the content of all remaining 'archives' files present in the depot, putting each line into a dictionary (removing duplicates that way). Now we know all archives that are still required.
With this list (dictionary) gathered, we can again go through the depot. For each bin or raw archive, we'd look whether it is featured in our list or not. If not, we can remove the sub directory. For each pkg archive, we look if it is either featured in our list or if it is tagged as manually installed by the user. If neither is the case, we can remove it as well, and remember that we should do another iteration of garbage collection (now with the pkg removed, further removals may become possible).
By breaking the problem down this way, there is no need to build a graph as internal data structure.
Transitive dependencies are handled by iterating the whole process as long as progress happens.
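The iterate-until-no-progress scheme could look roughly like this in plain C++ (std::filesystem standing in for Genode's VFS; the 'selected' marker file and all function names are assumptions of this sketch, not existing Genode code):

```cpp
#include <filesystem>
#include <fstream>
#include <set>
#include <string>

namespace fs = std::filesystem;

/* gather every path referenced by the 'archives' files still present */
static std::set<std::string> referenced(fs::path const &depot)
{
    std::set<std::string> refs;
    for (auto const &e : fs::recursive_directory_iterator(depot))
        if (e.path().filename() == "archives") {
            std::ifstream f(e.path());
            for (std::string line; std::getline(f, line); )
                if (!line.empty())
                    refs.insert(line);
        }
    return refs;
}

/* one sweep over raw, bin, and pkg archives, returns true on progress */
static bool sweep(fs::path const &depot)
{
    std::set<std::string> const refs = referenced(depot);
    bool progress = false;

    /* dir holds <name>/<version> leaves; pkgs marked 'selected' survive */
    auto sweep_leaves = [&] (fs::path const &dir, bool is_pkg) {
        if (!fs::is_directory(dir))
            return;
        for (auto const &name : fs::directory_iterator(dir)) {
            if (!fs::is_directory(name.path()))
                continue;
            for (auto const &version : fs::directory_iterator(name.path())) {
                std::string const rel =
                    fs::relative(version.path(), depot).generic_string();
                bool keep = refs.count(rel) > 0;
                if (is_pkg && fs::exists(version.path() / "selected"))
                    keep = true;   /* installed by explicit user intent */
                if (!keep) {
                    fs::remove_all(version.path());
                    progress = true;
                }
            }
        }
    };

    for (auto const &user : fs::directory_iterator(depot)) {
        sweep_leaves(user.path() / "raw", false);
        sweep_leaves(user.path() / "pkg", true);

        /* bin archives have an extra <arch> level */
        if (fs::is_directory(user.path() / "bin"))
            for (auto const &arch : fs::directory_iterator(user.path() / "bin"))
                sweep_leaves(arch.path(), false);
    }
    return progress;
}

/* iterate until no orphaned content remains */
void garbage_collect(fs::path const &depot)
{
    while (sweep(depot)) { }
}
```

Each sweep re-reads the remaining 'archives' files, so once an unselected, unreferenced pkg is removed, the next iteration can collect whatever only that pkg was referencing - the fixpoint handles transitive dependencies without any graph structure.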
When a package depends on another package, it will be coloured for deletion as any other dependency.
But what if a pkg was manually installed by the user (let's say "blue_backdrop") and also happens to be a dependency of another dependent pkg (like "blue_backdrop_with_logo") installed separately?
In this case, I would expect to keep the "blue_backdrop" when uninstalling only the dependent pkg "blue_backdrop_with_logo". If the "blue_backdrop" had been installed as a mere side effect of installing "blue_backdrop_with_logo", I would expect to have it automatically removed along with "blue_backdrop_with_logo".
To take this decision, I think we have to preserve the information of how each pkg entered the depot. Hence, my suggestion to explicitly mark the pkg archives that entered the depot by user intent.
This way, I believe there is no need for persistent annotation of 'pkg' dependencies by the 'depot_download_manager'. I am concerned about the performance of such an algorithm and would have to finish a first implementation to be certain. As the dictionary is implemented with an AVL tree, it should perform in a reasonable time.
I would not be too concerned about performance at this point. The most costly step (apart from the deletion of files) is probably the gathering of the content of all 'archives' files found in the depot. To get a feeling of what to expect, you may issue the following command in your genode directory (with the depot you are currently working with):
genode$ cat depot/*/pkg/*/*/archives
Cheers Norman
Dear Norman,
Maybe it is beneficial to break down the problem even further. In fact, depot archive types do not arbitrarily depend on one another. Specifically, binary archives cannot depend on each other. Also raw archives have no dependencies. Src archives can only depend on api archives but not on other src archives. Also api archives cannot have dependencies. For this current discussion, I'd leave out src and api archives anyway.
The only case where a dependency tree of multiple levels is formed are pkg archives depending on other pkg archives. With this observation, I would only look at pkg archives at first. Scanning the depot for the list of pkg archives should be quick enough. For each pkg, I would ask: "should this pkg be removed?". The answer is given by the <config>. To implement this step, there is no need to build an internal data structure.
Then, after having removed pkg archives, I'd read the content of all remaining 'archives' files present in the depot, putting each line into a dictionary (removing duplicates that way). Now we know all archives that are still required.
Sorry, I was not very clear. I agree, at first we only traverse archives of type PKG to collect 'archives' dependency files.
With this list (dictionary) gathered, we can again go through the depot. For each bin or raw archive, we'd look whether it is featured in our list or not. If not, we can remove the sub directory. For each pkg archive, we look if it is either featured in our list or if it is tagged as manually installed by the user. If neither is the case, we can remove it as well, and remember that we should do another iteration of garbage collection (now with the pkg removed, further removals may become possible).
There is no need to create a complete implementation of a graph data structure. As you describe with the dictionary, I have something similar in mind to collect archive dependencies. I have named the top-level class that holds the dictionary "graph". I can rename it if this is confusing.
The dictionary would be used to associate an archive path with a list of the PKG archives it is referenced by. Thus, archives with no references after PKG deletion are identified, while archives referenced by a deleted PKG but still referenced by any other PKG(s) can be kept.
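A minimal sketch of such a reference dictionary in plain C++ (std::map standing in for the AVL-backed dictionary; all names are hypothetical):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Path = std::string;

struct Ref_dictionary {
    /* archive path -> set of pkgs whose 'archives' file lists it */
    std::map<Path, std::set<Path>> refs;

    void add(Path const &pkg, std::vector<Path> const &archives)
    {
        for (Path const &a : archives)
            refs[a].insert(pkg);
    }

    /* drop a pkg, return the archives that just became orphaned */
    std::vector<Path> remove_pkg(Path const &pkg)
    {
        std::vector<Path> orphans;
        for (auto &entry : refs) {
            entry.second.erase(pkg);
            if (entry.second.empty())
                orphans.push_back(entry.first);
        }
        for (Path const &o : orphans)
            refs.erase(o);   /* erase after the loop, not while iterating */
        return orphans;
    }
};
```

Removing a pkg returns the archives that lost their last reference; when one of those is itself a pkg, the caller would recurse on it to cover transitive dependencies.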
But what if a pkg was manually installed by the user (lets say "blue_backdrop") and also happens to be a dependency of another dependent pkg (like "blue_backdrop_with_logo") installed separately?
In this case, I would expect to keep the "blue_backdrop" when uninstalling only the dependent pkg "blue_backdrop_with_logo". If the "blue_backdrop" had been installed as a mere side effect of installing "blue_backdrop_with_logo", I would expect to have it automatically removed along with "blue_backdrop_with_logo".
To take this decision, I think we have to preserve the information of how each pkg entered the depot. Hence, my suggestion to explicitly mark the pkg archives that entered the depot by user intent.
You are correct. I missed that. Thank you for explaining in detail!
I have a first implementation on the 'depot_remove' [1] branch. It can be improved or changed. Please note that this is a partial implementation; there are some TODO comments. I also commented the code as much as possible for clarity.
[1] https://github.com/a-dmg/genode/tree/depot_remove
Points that remain to be addressed:
- Identify BIN archives, and provide an 'arch' attribute in the configuration for this purpose.
- Perform the PKG deletion in place to remove PKG references in the dictionary.
- Collect orphan archives referenced by no PKG. Make that last step optional? As it requires traversing the depot for all other archive types, I am asking myself whether this is necessary.
- The configuration does not implement all config nodes as discussed, only the '<remove/>' node for now.
You might be interested in the 'depot.h' file. I would suggest reading it from bottom to top. You can use the 'depot_remove' runscript, which has debug logs describing what's happening.
Let me know what you think about it, whether you would like it simplified, and whether you have further suggestions.
I hope this is digestible enough for a pleasant review. Thank you very much for your time.
Cheers,
Alice
Hello Alice,
I have a first implementation on the 'depot_remove' [1] branch. It can be improved or changed. Please note that this is a partial implementation; there are some TODO comments. I also commented the code as much as possible for clarity.
thanks a lot for sharing!
As I'm short of time until the end of this month, I only had a cursory glance. I may miss parts of the picture but the solution looks more complex than I thought. For example, I'm unable to quickly assess if cyclic dependencies (two bad pkgs that refer to each other in their archives files) may pose a risk.
In my perception, the complexity comes from the approach of building up an internal representation (introducing notions of graph, vertex, edges, neighbor along the way) instead of working with the plain file system directly.
- Collect orphan archives referenced by no PKG. Make that last step optional? As it requires traversing the depot for all other archive types, I am asking myself whether this is necessary.
That's what I had in mind in the first place - operating like a garbage collector. If we find that we ultimately need to traverse the depot anyway to implement this, I wonder what is gained by building up a cached internal representation of the depot structure beforehand. I foresee that we'd end up at a much more straight-forward solution by simply traversing the depot, and iterating this process until no further work can be done (no orphaned content remains in the depot), like I described in my previous posting.
If you don't find the idea worth pursuing, can you share why? Or would you give implementing it a try to see which version makes you more comfortable in terms of simplicity?
Cheers Norman
Dear Norman,
I hope this e-mail finds you well. Please excuse the long delay on that matter.
I have proceeded with the implementation following your guidelines. I came up with the following solution [1].
We have yet to discuss reporting. At first, it was not a needed feature. However, it is very useful for our management component, so I have included it in this proposal.
It is handier to proceed with the code review on GitHub. I will open an issue [1] and mention it on my topic branch.
I hope it is okay with you to move the conversation to GitHub.
[1] https://github.com/genodelabs/genode/issues/4866
Cheers, Alice
Hi Alice,
On 2023-05-09 18:28, Alice Domage wrote:
[...] It is handier to proceed with the code review on GitHub. I will open an issue [1] and mention it on my topic branch.
I hope it is okay with you to move the conversation to GitHub.
it's very good to see the continuation of this line of work. Your move of the discussion to GitHub is sensible. Thank you for sticking to the topic and for opening the issue.
With the upcoming Genode release in sight, I'm currently quite overloaded. Please bear with me if the review of your work takes a while.
Cheers Norman
Hi Alice,
AFAIK, this doesn't exist so far. I personally would really appreciate having such a tool at hand but don't have the time to dive into it currently. However, I suspect that this kind of tool is part of a completely new abstraction layer (software management) on top of the current rather manual/low-level depot management (and not a mere addition to the latter) because I see other tasks connected to it, like detecting and applying updates of packages in the "installed list" and managing the "installed list".
Cheers, Martin
On 08.03.23 11:27, Alice Domage wrote:
[...]
Genode users mailing list users@lists.genode.org https://lists.genode.org/listinfo/users
Dear Martin,
Thank you for sharing your thoughts!
If I understand correctly, your thoughts are aligned with Norman's suggestions.
So this would be a standalone component. It can be started manually for performing specific depot clean-up tasks. Eventually, a higher-level management component could also use it.
Cheers, Alice
On 3/10/23 10:25, Martin Stein wrote:
[...]
Hi Alice,
I discovered Norman's mail right after sending my response :) But yes, I'd also say that our ideas are compatible. Your plans sound very reasonable to me and I appreciate that you're willing to get your hands on it! Don't hesitate to ask if you run into uncertainties!
Cheers, Martin
On 10.03.23 17:44, Alice Domage wrote:
[...]