Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more parameters for caching #551

Open
sta-szek opened this issue Oct 19, 2023 · 3 comments
Open

Add more parameters for caching #551

sta-szek opened this issue Oct 19, 2023 · 3 comments
Labels
feature request New feature or request to improve the current logic needs eyes

Comments

@sta-szek
Copy link

sta-szek commented Oct 19, 2023

Description:
Expose actions/cache properties in actions/setup-java:

  • restore-keys -> cache-restore-keys
  • key -> cache-key

Justification:
Current implementation has some drawback that prevent us from using caching functionallity in setup-java.

The problem is that cache-key (docs) for maven is hardcoded to all pom.xml files in given repository.
It means, whenever somebody decide to change any property in pom like: name, description, version, scm details, build configuration the action will compute new hash and will create new cache.

Here I believe this is kind of trade-off, with assumption that changes to pom.xml contains only dependency changes, while in our case 80% of changes are versions/description.

Problem 1: 80% chache miss
That is problem when we have 100 active projects in one repo and every project at least once per day change such details (mostly version).
That causes the cache to be missed and new is created, which obviously causes that all dependencies are re-downloaded.
In given case, having cache is the same as not having cache -> in most cases the key is different and new is created.

Effect: 80% of time cache does not work.

Problem 2: Cache race condition
Another problem is that when there are multiple workflows for the same project, imagine two: build app, build docs.
That is common practice on our side to split workflows in order to be able to rerun individual failed WF, not whole build.

In such case, if both action use cache, then there will be created two cache entries:
Screenshot 2023-10-19 at 13 25 24

And I noticed that somehow this first (5mb) wins all the time, so next time when we try to use cache it again download missing 295mb of data...

Effect: 100% of time, the faster (less dependencies) workflow wins, causing the cache to be "almost empty" and we need to always download new files

Problem 3: Project race condition
On top of that, since the cache-key uses all pom.xml, we have single hash for many projects, which obviously have different dependencies.
E.g.
projectA - 100 dependencies from google
projectB - 50 dependencies from github

In given case, projectB will use the 5MB cache from projectA which are completely different.
So it will download all dependencies again and at the end cache wont be updated.

Where i think the problem lays:
I also found this line , which says:

// No "restoreKeys" is set, to start with a clear cache after dependency update (see #269)

Somehow i have the feeling the assumption was done having single-project repo in mind and also that in pom.xml we can change dependencies only.

Proposal 1: Reconfigure default cache-key and cache-restore-keys

Here the idea would be of building cache gradually, for all apps single cache.

cache-key: setup-java-${{ platform }}-${{ packageManager }}-${{ fileHash }}-${{ day-of-year }}-${{ timestamp }}
cache-restore-keys: 
  setup-java-${{ platform }}-${{ packageManager }}-${{ fileHash }}-${{ day-of-year }}
  setup-java-${{ platform }}-${{ packageManager }}-${{ fileHash }}
  setup-java-${{ platform }}-${{ packageManager }}

That would allow to:

  1. Get exact cache that was created by any app this day
  2. If not found, get cache for this set of files (pom fileHash)
  3. If not found (e.g. pom changed) then take any that is available and most fresh

This idea is not perfect - it can cause huge cache size and very often cache deleteion when cache size hits 10gb the old would be deleted.
But at least it allows to effectivly use caches, or rather to allow to use cache at all in huge monorepos.

Also, maybe day-of-year and timestamp are not the best. However the idea stays the same: try to find biggest cache and extend it, save at the end so next apps can extend more.

Proposal 2: Allow to configure cache-key and cache-restore-keys
So everybody can control on his own.
Currently i unfortunately have to add actions/cache on my own, but i will also have better control.
Still i think we could improve the situation with proposal 1 so more people can benefit.

Proposal 3: Forcibly override cache
setup-java-${{ platform }}-${{ packageManager }}-${{ fileHash }} would work if we can forcibly override it, so different apps can "add their dependencies" to cache.
By overriding forcibly i mean to really override if GH allows or delete and create.
Not sure how it will behave if we try to delete while another WF is downloading.

All in all, i have a bit feeling that the cache is working fine with singlerepo and monorepo which is having very simple workflows that run few times per day.
Once we put a bit more load 100+ project with 500+ builds where 300 of them are chaning pom.xml, then it does not work well. 🤔

What do you think?

@sta-szek sta-szek added feature request New feature or request to improve the current logic needs triage labels Oct 19, 2023
@IvanZosimov
Copy link
Contributor

Hi, @sta-szek 👋 Thanks for the feature request, we will take a look at it!

@ecki
Copy link

ecki commented Mar 15, 2024

I have a similar problem (it never finds the cache, maybe related to subdir, not sure). But while investigating this I wondered why it would need a cache key based on the POM at all. Because even if the Pom has changed the old m2 repo is more worth than none. In fact only drastic changes to the Pom (complete new repo url) would partially justify it.

so I had the idea you can (meanwhile?) specify the dependency file which is used for the hash. What I try is to use the .github/workflow file instead. This should stay constant most of the time. Would that solve your problem as well?

Minor thing the „cache not found“ message in setup Java should probably print the key (and also the matched dependency files), that would make my debugging easier. (edit, turning on debug modes prints the details. Turns out the not cached had two reasons, one that Pom changed nearly every time and the second that the cache which was hashed two times was meanwhile removed since so many other one shot cache uploads purged it)

@JackPGreen
Copy link

I have a similar problem (it never finds the cache, maybe related to subdir, not sure). But while investigating this I wondered why it would need a cache key based on the POM at all. Because even if the Pom has changed the old m2 repo is more worth than none. In fact only drastic changes to the Pom (complete new repo url) would partially justify it.

If you use an older cache as a base, dirty entries will never get removed and the cache will become too large (either for GitHub to store, or to be more efficient to restore than simply downloading what you actually need).

Related discussion here - #269

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request to improve the current logic needs eyes
Projects
None yet
Development

No branches or pull requests

4 participants