[depot_autoremove] Removing outdated PKGs from the depot
Norman Feske
norman.feske at genode-labs.com
Mon Mar 13 17:57:25 CET 2023
Hi Alice,
> The way I envision the implementation is as follows:
>
> 1. It creates a graph representing the depot state by traversing it. The
> graph is implemented with a dictionary.
> Each node uses a 'Depot::Archive::Path' as its key and, as its value, a list
> of 'Depot::Archive::Path' entries that are its dependency neighbours.
> Graph nodes can be of any archive type.
>
> 2. First, it goes through the packages. As you said, it registers
> dependencies. It also creates nodes for any dependency
> archives, pointing back to the 'pkgs' that reference them. Thus, this creates loops in
> the graph between dependencies.
>
> 3. It iterates over its config and performs the required actions.
>
> 4. When a package is deleted, it traverses the neighbouring-dependencies
> list, colours them for deletion, and removes the package reference.
> If a node ends up with an empty list of neighbours, it can be deleted safely, as
> it isn't in use any more.
Maybe it is beneficial to break down the problem even further. In fact,
depot archive types do not arbitrarily depend on one another.
Specifically, binary archives cannot depend on each other, and raw
archives have no dependencies. Src archives can depend only on api
archives but not on other src archives, and api archives cannot have
dependencies themselves. For the current discussion, I'd leave out src
and api archives anyway.
The only case where a dependency tree of multiple levels is formed is
pkg archives depending on other pkg archives. With this observation, I
would only look at pkg archives at first. Scanning the depot for the
list of pkg archives should be quick enough. For each pkg, I would ask:
"should this pkg be removed?". The answer is given by the <config>. To
implement this step, there is no need to build an internal data structure.
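
Just to illustrate that first pass, here is a rough host-side C++ sketch
(not the actual component). It assumes that the <config> has already been
turned into a set of depot-relative pkg paths to remove, and that pkg
archives reside at depot/<user>/pkg/<name>/<version>:

  /* sketch only: remove each pkg version that the <config> marked for removal */
  #include <filesystem>
  #include <set>
  #include <string>
  #include <vector>

  namespace fs = std::filesystem;

  void remove_configured_pkgs(fs::path const &depot,
                              std::set<std::string> const &to_remove)
  {
      std::vector<fs::path> victims;

      for (auto const &user : fs::directory_iterator(depot)) {
          fs::path const pkg_dir = user.path() / "pkg";
          if (!fs::is_directory(pkg_dir))
              continue;
          for (auto const &name : fs::directory_iterator(pkg_dir))
              for (auto const &version : fs::directory_iterator(name.path())) {
                  /* keys look like "<user>/pkg/<name>/<version>" */
                  std::string const key = fs::relative(version.path(), depot).string();
                  if (to_remove.count(key))
                      victims.push_back(version.path());
              }
      }
      for (auto const &dir : victims)
          fs::remove_all(dir);
  }
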
Then, after having removed pkg archives, I'd read the content of all
remaining 'archives' files present in the depot, putting each line into
a dictionary (removing duplicates that way). Now we know all archives
that are still required.
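
To sketch that gathering step under the same assumptions:

  /* sketch only: collect the union of all lines of the remaining
     depot/<user>/pkg/<name>/<version>/archives files, dropping duplicates */
  #include <filesystem>
  #include <fstream>
  #include <set>
  #include <string>

  namespace fs = std::filesystem;

  std::set<std::string> required_archives(fs::path const &depot)
  {
      std::set<std::string> result;
      for (auto const &entry : fs::recursive_directory_iterator(depot))
          if (entry.path().filename() == "archives") {
              std::ifstream file(entry.path());
              for (std::string line; std::getline(file, line); )
                  if (!line.empty())
                      result.insert(line);   /* one archive path per line */
          }
      return result;
  }
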
With this list (dictionary) gathered, we can again go through the depot.
For each bin or raw archive, we'd check whether it is featured in our
list or not. If not, we can remove the subdirectory. For each pkg
archive, we check whether it is either featured in our list or tagged
as manually installed by the user. If neither is the case, we can remove
it as well, and remember that we should do another iteration of garbage
collection (now with the pkg removed, further removals may become possible).
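
The pkg part of such a pass could look roughly as follows (bin and raw
archives would be treated analogously, just without the tag check). I'm
assuming here, purely as a placeholder, that the manual-installation tag
is a marker file named '.manual' inside the pkg's version directory:

  /* sketch only: remove each pkg version that is neither referenced by a
     remaining 'archives' file nor tagged as manually installed; the return
     value tells the caller whether another pass may uncover more garbage */
  #include <filesystem>
  #include <set>
  #include <string>
  #include <vector>

  namespace fs = std::filesystem;

  bool collect_garbage_pkgs(fs::path const &depot,
                            std::set<std::string> const &required)
  {
      std::vector<fs::path> obsolete;
      for (auto const &entry : fs::recursive_directory_iterator(depot)) {
          if (!entry.is_directory() || !fs::exists(entry.path() / "archives"))
              continue;                                /* not a pkg version dir */

          std::string const key    = fs::relative(entry.path(), depot).string();
          bool        const manual = fs::exists(entry.path() / ".manual");
          if (!required.count(key) && !manual)
              obsolete.push_back(entry.path());
      }
      for (auto const &dir : obsolete)
          fs::remove_all(dir);                         /* may free further archives */
      return !obsolete.empty();
  }
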
By breaking the problem down this way, there is no need to build a graph
as an internal data structure.
Transitive dependencies are handled by iterating the whole process as
long as progress happens.
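
Combining the two sketches above, the driving loop could then be as
simple as:

  /* repeat the garbage collection as long as a pass removes something */
  while (collect_garbage_pkgs(depot, required_archives(depot))) { }
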
> When a package depends on another package, it will be coloured for
> deletion like any other dependency.
But what if a pkg was manually installed by the user (let's say
"blue_backdrop") and also happens to be a dependency of another
dependent pkg (like "blue_backdrop_with_logo") installed separately?
In this case, I would expect to keep the "blue_backdrop" when
uninstalling only the dependent pkg "blue_backdrop_with_logo". If the
"blue_backdrop" had been installed as a mere side effect of installing
"blue_backdrop_with_logo", I would expect to have it automatically
removed along with "blue_backdrop_with_logo".
To make this decision, I think we have to preserve the information of
how each pkg entered the depot. Hence, my suggestion to explicitly mark
the pkg archives that entered the depot by user intent.
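
Purely as a hypothetical illustration (the '.manual' name is my
placeholder from the sketch above, not anything that exists today), such
a mark could be an empty file created in the pkg's version directory
whenever the pkg is installed deliberately:

  #include <filesystem>
  #include <fstream>

  /* record that this pkg version entered the depot by explicit user intent */
  void mark_manually_installed(std::filesystem::path const &pkg_version_dir)
  {
      std::ofstream marker { pkg_version_dir / ".manual" };   /* an empty file suffices */
  }
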
> This way, I believe there is no need for persistent annotation of 'pkg'
> dependencies by the 'depot_download_manager'. I am concerned about the
> performance of such an algorithm and would have to finish a first
> implementation to be certain. As the dictionary is implemented with an
> AVL tree, it should perform in a reasonable time.
I would not be too concerned about performance at this point. The most
costly step (apart from the deletion of files) is probably the gathering
of the content of all 'archives' files found in the depot. To get a
feeling for what to expect, you may issue the following command in your
genode directory (with the depot you are currently working with):
genode$ cat depot/*/pkg/*/*/archives
Cheers
Norman
--
Dr.-Ing. Norman Feske
Genode Labs
https://www.genode-labs.com · https://genode.org
Genode Labs GmbH · Amtsgericht Dresden · HRB 28424 · Sitz Dresden
Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth