fs.find: cache path ids #286
@shcheklein, On Lines 484 to 489 in 4897344
So it uses |
okay, one more consideration here - I think it's still not optimal for DVC. It will be running an extra query on each find to fetch roots, right? maybe even two?
maybe also not an issue, depends on how we feed things to it ... do we ever ask find('files/md5') w/o any prefix? (that would mean an extra cost of getting to the children)
yes, we do
For
So, similarly here, fetching id for |
One cached initial call is fine. The problem is what happens next: we get an extra call every single time. We need to run a query to get all children of the 'files/md'. That will be happening again and again unless I'm missing something (?).
We cache only the id to name (path) mapping and back. We don't cache query results (like the list of subdirectories) afaiu.
It should be just one extra query because on subsequent Lines 484 to 489 in 4897344
yep. which is not that bad - but it's still the same query again and again. I don't remember by now if we do that in parallel and rapidly (I don't see a reason off the top of my head). The situation we want to avoid is one where that leads to 2x queries per second - that would be bad for us. If that's not the case, that's fine.
I am not sure I understand. That was the same case before too. We used to query for the union of all prefixes over and over again. We can think of using a dircache in the future, similar to what is implemented in s3fs/gcsfs/adlfs.
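For illustration only, the "dircache in the future" idea could look roughly like the sketch below: listing results are memoized per directory ID, so repeated `find` calls would not repeat the same children query. `ListingCache`, `list_children`, and the `queries` counter are hypothetical names, not part of this PR or of s3fs/gcsfs/adlfs.

```python
class ListingCache:
    """Hypothetical cache of directory listings, keyed by directory ID."""

    def __init__(self, list_children):
        # list_children is an assumed callable: dir_id -> iterable of child names.
        # In a real filesystem this would be the remote query.
        self._list_children = list_children
        self._dircache = {}
        self.queries = 0  # counts remote queries, for demonstration only

    def ls(self, dir_id, refresh=False):
        # Query the remote only on a cache miss or an explicit refresh.
        if refresh or dir_id not in self._dircache:
            self.queries += 1
            self._dircache[dir_id] = list(self._list_children(dir_id))
        # Return a copy so callers cannot mutate the cached listing.
        return list(self._dircache[dir_id])

    def invalidate(self, dir_id=None):
        # Drop one listing, or everything when no ID is given.
        if dir_id is None:
            self._dircache.clear()
        else:
            self._dircache.pop(dir_id, None)
```

The trade-off is staleness: any write under a cached directory would need to call `invalidate`, which is why this is only a sketch of the direction and not a drop-in change.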
no, here we are making an extra call now to get first the list of |
@shcheklein, on first
On subsequent |
So at the end, subsequent |
0464326 to b9fc857
LGTM, @skshetry! Let's merge it and release.
Thanks!
if not cached:
    dir_ids = self._path_to_item_ids(base)
    self._cache_path_id(base, *dir_ids)

dir_ids = [self._ids_cache["ids"].copy()]
sorry, coming back here to double check :) one potential race condition here: if _cache_path_id, running in some other thread, has got to the point where it has updated the first dictionary but not yet the second, then dir_ids might not contain the cache entry for base yet.
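The window described above can be reproduced deterministically with a small standalone sketch. It assumes the cache is a dict with "dirs" and "ids" sub-dictionaries and that "dirs" is written first; the ordering, the events, and the function names are illustrative assumptions, not the actual `_cache_path_id` code.

```python
import threading


def demo_race():
    """Force a reader to copy "ids" between the writer's two updates."""
    ids_cache = {"dirs": {}, "ids": {}}
    first_done = threading.Event()
    reader_done = threading.Event()
    snapshots = []

    def cache_path_id(path, item_id):
        # First dictionary is updated...
        ids_cache["dirs"].setdefault(path, []).append(item_id)
        first_done.set()
        # ...and the window is widened artificially so the reader runs here.
        reader_done.wait(timeout=5)
        # Second dictionary is updated only afterwards.
        ids_cache["ids"][item_id] = path

    def snapshot_ids():
        first_done.wait(timeout=5)
        # Mirrors the copy of the "ids" mapping taken in find.
        snapshots.append(ids_cache["ids"].copy())
        reader_done.set()

    writer = threading.Thread(target=cache_path_id, args=("files/md5", "id-1"))
    reader = threading.Thread(target=snapshot_ids)
    writer.start()
    reader.start()
    writer.join()
    reader.join()
    return snapshots[0], ids_cache["ids"]
```

Here the reader's snapshot misses the entry even though the path is already marked as cached in the other dictionary, which is the inconsistency the comment is worried about.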
With base, do you mean the root of the filesystem (aka self.base) or the path passed in find (aka base here)?
I mean the value of the local var base.
yeah, looks like we need to lock self._cache_path_id.
Maybe it'd have been okay if we used dirs here instead of ids. But I went with locking in #289.
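As a rough sketch of what locking could look like (the actual change in #289 is not shown here, so everything below is an assumption): one lock guards both dictionary updates and the snapshot, so a reader can never observe one dictionary updated without the other. `LockedIdsCache` and its method names are hypothetical.

```python
import threading


class LockedIdsCache:
    """Hypothetical path <-> ID cache whose updates and reads share one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ids_cache = {"dirs": {}, "ids": {}}

    def cache_path_id(self, path, *item_ids):
        # Both dictionaries change under the same lock, so the pair of
        # updates appears atomic to any reader that also takes the lock.
        with self._lock:
            for item_id in item_ids:
                self._ids_cache["dirs"].setdefault(path, []).append(item_id)
                self._ids_cache["ids"][item_id] = path

    def snapshot_ids(self):
        # Copy under the lock to avoid reading a half-applied update.
        with self._lock:
            return self._ids_cache["ids"].copy()
```

Note that the lock only helps if every reader goes through it as well; copying the dictionary outside the lock would reopen the same window.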
GDriveFileSystem was previously caching the dir ids of root, and was using those on fs.find(). This worked well when the remote cache was at the root, but now that dvc uses /files/md5/ by default, the dir ids are no longer in the cache and find ends up returning an empty list.
This PR checks if the path is cached, and if not, it caches the ID of the path.
Tests pass for dvc in iterative/dvc-gdrive#28
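A minimal, self-contained sketch of the cache-on-miss behaviour described above follows. It is not the real GDriveFileSystem: `PathIdCache`, `remote_tree`, `dir_ids_for`, and the `queries` counter are hypothetical, and the shape of the "dirs" mapping (path to list of IDs) is an assumption.

```python
class PathIdCache:
    """Toy stand-in for a path <-> ID cache that fills itself on a miss."""

    def __init__(self, remote_tree):
        # remote_tree maps a directory path to its ID and replaces real
        # remote queries in this sketch.
        self._remote = remote_tree
        self._ids_cache = {"dirs": {}, "ids": {}}
        self.queries = 0  # counts simulated remote lookups

    def _path_to_item_ids(self, path):
        self.queries += 1
        item_id = self._remote.get(path)
        return [item_id] if item_id is not None else []

    def _cache_path_id(self, path, *item_ids):
        for item_id in item_ids:
            self._ids_cache["dirs"].setdefault(path, []).append(item_id)
            self._ids_cache["ids"][item_id] = path

    def dir_ids_for(self, base):
        cached = base in self._ids_cache["dirs"]
        if not cached:
            # Cache miss: resolve the path once and remember its IDs.
            dir_ids = self._path_to_item_ids(base)
            self._cache_path_id(base, *dir_ids)
        return list(self._ids_cache["dirs"].get(base, []))
```

With a tree such as `{"files/md5": "id-1"}`, the first `dir_ids_for("files/md5")` performs one lookup and later calls are served from the cache; a path that does not exist is looked up again on every call, since the sketch does no negative caching.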