Slow open_datatree for zarr stores with many coordinate variables #9640
also from @aladinor in #9511 (comment):

Following up on my previous post, I found that going through `StoreBackendEntrypoint` for each group (the commented-out code below) was costing time. However, I discovered that calling `open_dataset(..., engine="zarr")` for each group instead is faster:

```python
for path_group, store in stores.items():
    # store_entrypoint = StoreBackendEntrypoint()
    #
    # with close_on_error(store):
    #     group_ds = store_entrypoint.open_dataset(
    #         store,
    #         mask_and_scale=mask_and_scale,
    #         decode_times=decode_times,
    #         concat_characters=concat_characters,
    #         decode_coords=decode_coords,
    #         drop_variables=drop_variables,
    #         use_cftime=use_cftime,
    #         decode_timedelta=decode_timedelta,
    #     )
    group_ds = open_dataset(
        filename_or_obj,
        store=store,
        group=path_group,
        engine="zarr",
        mask_and_scale=mask_and_scale,
        decode_times=decode_times,
        concat_characters=concat_characters,
        decode_coords=decode_coords,
        drop_variables=drop_variables,
        use_cftime=use_cftime,
        decode_timedelta=decode_timedelta,
        **kwargs,
    )
    group_name = str(NodePath(path_group))
    groups_dict[group_name] = group_ds
```

Running a test locally over the minimum reproducible example, I got the following results:

```
2024.9.1.dev23+g52f13d44
total time: 3.808659553527832
```

We went from ~5.2 to ~3.8 seconds (around 1.37x faster). Please let me know your thoughts.
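For reference, a timing like the one above can be reproduced with a small harness along these lines; the store path is a hypothetical stand-in, not the store from the issue:

```python
# Hypothetical timing harness; "radar.zarr" stands in for a real Zarr store.
import time

import xarray as xr

print(xr.__version__)
start = time.perf_counter()
dtree = xr.open_datatree("radar.zarr", engine="zarr")
print(f"total time: {time.perf_counter() - start}")
```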
from Martin on the pangeo discourse: I see that `_chunk_getitems` is being called 70 times. This corresponds to the number of coordinate variables in the tree. Each variable is having its values eagerly loaded, and this process happens serially: all the chunks of one coordinate variable are indeed fetched concurrently, but the next set of chunks isn’t requested until this is done. There are only a few chunks per coordinate; it would be totally possible to concurrently load all of the chunks in a single call. Zarr v3’s more pervasive async internal model may help with this, but I don’t know if xarray is (yet) ready to make use of it.
Eagerly loading the coordinate variables is necessary in order for us to do parent-child alignment in the new datatree model. But loading the coordinate variables serially seems unnecessarily wasteful. Maybe we could take advantage of zarr v3's new async interface to load these in parallel under the hood? (tagging @TomAugspurger @jhamman to tell me if that's not possible)
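A minimal sketch of the kind of change being suggested (not xarray's actual implementation): given per-group datasets whose coordinate variables are still lazy, trigger all of the coordinate loads at once instead of one variable at a time. A thread pool is used here as a stand-in for Zarr v3's async chunk fetching, and `groups_dict` is an assumed mapping of group paths to lazily opened datasets:

```python
# Sketch only: overlap the per-variable round trips by loading all coordinate
# variables across all groups concurrently, rather than serially.
from concurrent.futures import ThreadPoolExecutor

def load_coords_concurrently(groups_dict, max_workers=8):
    # Gather every (lazy) coordinate variable from every group.
    coord_vars = [
        coord for ds in groups_dict.values() for coord in ds.coords.values()
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each .load() fetches that coordinate's chunks; running the calls in
        # parallel avoids waiting for one variable before starting the next.
        list(pool.map(lambda var: var.load(), coord_vars))
    return groups_dict
```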
Originally posted by @aladinor in #9511 (comment)
Hi everyone,
I've been working with hierarchical structures to store weather radar data. We’re leveraging xradar and datatree to manage these datasets efficiently. Currently, we are using the standard WMO Cfradial2.1/FM301 format to build a datatree model using `xradar`. Then, the data is stored in `Zarr` format.

This data model stores historical weather radar datasets in `Zarr` format while supporting real-time updates as radar networks operate continuously. It leverages a Zarr-append pattern for seamless data integration (a rough sketch of this pattern follows below).

I think our data model works, at least in this beta stage; however, as the dataset grows, we’ve noticed longer load times when opening/reading the `Zarr` store using `open_datatree`. As shown in the following snippet, the time to open the dataset grows as its size increases:

- For ~15 GB in size, `open_datatree` takes around 5.73 seconds
- For ~80 GB in size, `open_datatree` takes around 11.6 seconds

I've worked with larger datasets, which take more time to open/read.
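As a rough illustration of the append pattern (the store path, group name, variable, and append dimension below are made up, not the project's actual layout), each new radar volume is written into its group with `to_zarr`, appending along a time-like dimension:

```python
# Illustrative sketch of a Zarr-append pattern for a hierarchical store.
import numpy as np
import pandas as pd
import xarray as xr

def make_volume(timestamp: str) -> xr.Dataset:
    """Tiny stand-in for one radar volume with a time-like dimension."""
    return xr.Dataset(
        {"reflectivity": (("vcp_time", "range"), np.random.rand(1, 100))},
        coords={"vcp_time": pd.to_datetime([timestamp]), "range": np.arange(100)},
    )

store = "radar.zarr"  # assumed local path

# The first write creates the group ...
make_volume("2024-10-01T00:00").to_zarr(store, group="sweep_0", mode="w")
# ... and subsequent volumes are appended along the time-like dimension.
make_volume("2024-10-01T00:05").to_zarr(
    store, group="sweep_0", mode="a", append_dim="vcp_time"
)
```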
The datatree structure contains 11 nodes, each representing a point where live-updating data is appended. I put together a minimal reproducible example with its output, in case you want to look at it.
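Not the exact example from the issue, but a rough sketch of that kind of reproduction — building a small hierarchical Zarr store with several groups, each carrying its own coordinate variables, and timing `open_datatree` on it — might look like this (node count, variable names, and sizes are made up):

```python
# Illustrative reproduction sketch: write many groups, each with its own
# coordinates, then time how long open_datatree takes on the whole store.
import time

import numpy as np
import xarray as xr

store = "demo_tree.zarr"  # hypothetical path

for i in range(11):  # the real tree has 11 nodes
    ds = xr.Dataset(
        {"reflectivity": (("time", "range"), np.random.rand(50, 500))},
        coords={"time": np.arange(50), "range": np.arange(500)},
    )
    ds.to_zarr(store, group=f"node_{i}", mode="w" if i == 0 else "a")

start = time.perf_counter()
dtree = xr.open_datatree(store, engine="zarr")
elapsed = time.perf_counter() - start
print(f"open_datatree took {elapsed:.2f} s for {len(dtree.children)} child groups")
```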
For more information about the data model, you can check the `raw2zarr` GitHub repo and the poster we presented at the SciPy conference.