Slow open_datatree for zarr stores with many coordinate variables #9640

Open
TomNicholas opened this issue Oct 17, 2024 · 3 comments
Labels: io, topic-backends, topic-DataTree, topic-performance, topic-zarr

Comments

@TomNicholas (Member) commented Oct 17, 2024

Originally posted by @aladinor in #9511 (comment)

Hi everyone,

I've been working with hierarchical structures to store weather radar data. We’re leveraging xradar and datatree to manage these datasets efficiently. Currently, we are using the standard WMO Cfradial2.1/FM301 format to build a datatree model using xradar. Then, the data is stored in Zarr format.

This data model stores historical weather radar datasets in Zarr format while supporting real-time updates as radar networks operate continuously. It leverages a Zarr-append pattern for seamless data integration.

I think our data model works, at least in this beta stage; however, as the dataset grows, we’ve noticed longer load times when opening/reading the Zarr store using open_datatree. As shown in the following snippet, the time to open the dataset grows as its size increases:

- For a store ~15 GB in size, open_datatree takes around 5.73 seconds.
- For a store ~80 GB in size, open_datatree takes around 11.6 seconds.

I've worked with larger datasets, which take more time to open/read.

The datatree structure contains 11 nodes, each representing a point where live-updating data is appended. Here is a minimal reproducible example, in case you want to take a look:

import s3fs
import xarray as xr
from time import time


def main():
    print(xr.__version__)
    st = time()
    # S3 bucket connection
    URL = 'https://js2.jetstream-cloud.org:8001/'
    path = 'pythia/radar/erad2024'
    fs = s3fs.S3FileSystem(anon=True,
                           client_kwargs=dict(endpoint_url=URL))
    file = s3fs.S3Map(f"{path}/zarr_radar/Guaviare_test.zarr", s3=fs)

    # opening datatree stored in zarr
    dtree = xr.backends.api.open_datatree(
        file,
        engine='zarr',
        consolidated=True,
        chunks={}
    )
    print(f"total time: {time() -st}")


if __name__ == "__main__":
    main()

and the output is

2024.9.1.dev23+g52f13d44
total time: 5.198976516723633

For more information about the data model, you can check the raw2zarr GitHub repo and the poster we presented at the SciPy conference.

@TomNicholas added the topic-performance, topic-zarr, topic-DataTree, and io labels on Oct 17, 2024
@TomNicholas (Member Author)

Also from @aladinor in #9511 (comment):

Following up on my previous post, I found that when using open_groups_as_dict, we create a StoreBackendEntrypoint() that is used to retrieve the dataset for each node.

https://github.com/pydata/xarray/blob/f01096fef402485092c7132dfd042cc8f467ed09/xarray/backends/zarr.py#L1367C2-L1382C47

However, I discovered that calling open_dataset instead of going through StoreBackendEntrypoint() improves the reading/opening time:

        for path_group, store in stores.items():
            # store_entrypoint = StoreBackendEntrypoint()
            # 
            # with close_on_error(store):
            #     group_ds = store_entrypoint.open_dataset(
            #         store,
            #         mask_and_scale=mask_and_scale,
            #         decode_times=decode_times,
            #         concat_characters=concat_characters,
            #         decode_coords=decode_coords,
            #         drop_variables=drop_variables,
            #         use_cftime=use_cftime,
            #         decode_timedelta=decode_timedelta,
            #     )

            group_ds = open_dataset(
                filename_or_obj,
                store=store,
                group=path_group,
                engine="zarr",
                mask_and_scale=mask_and_scale,
                decode_times=decode_times,
                concat_characters=concat_characters,
                decode_coords=decode_coords,
                drop_variables=drop_variables,
                use_cftime=use_cftime,
                decode_timedelta=decode_timedelta,
                **kwargs,
            )
            group_name = str(NodePath(path_group))
            groups_dict[group_name] = group_ds

I got the following results by running the minimal reproducible example locally:

2024.9.1.dev23+g52f13d44
total time: 3.808659553527832

We went from ~5.2 s to ~3.8 s (around 1.37x faster).

Please let me know your thoughts.

@TomNicholas (Member Author)

From Martin on the Pangeo Discourse:

I see that _chunk_getitems is being called 70 times. This corresponds to the number of coordinate variables in the tree. Each variable is having its values eagerly loaded, and this process happens serially: all the chunks of one coordinate variable are indeed fetched concurrently, but the next set of chunks isn’t requested until this is done. There are only a few chunks per coordinate; it would be totally possible to concurrently load all of the chunks in a single call.

Zarr v3’s more pervasive async internal model may help with this, but I don’t know if xarray is (yet) ready to make use of it.
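
For reference, the fetches can also be overlapped from user code after the tree is opened. This is only a minimal sketch, not xarray internals, and it assumes the public DataTree API (tree.subtree, per-node coords) plus a plain thread pool:

import xarray as xr
from concurrent.futures import ThreadPoolExecutor


def load_coords_concurrently(tree: xr.DataTree, max_workers: int = 16) -> None:
    # Collect every coordinate variable across the whole hierarchy.
    coord_vars = [
        coord.variable
        for node in tree.subtree
        for coord in node.coords.values()
    ]
    # Each .load() triggers that variable's chunk fetches; running them in a
    # thread pool lets the store requests overlap instead of being issued
    # one coordinate variable at a time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda v: v.load(), coord_vars))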

@TomNicholas (Member Author)

I see that _chunk_getitems is being called 70 times. This corresponds to the number of coordinate variables in the tree. Each variable is having its values eagerly loaded

This is necessary in order for us to do parent-child alignment in the new datatree model. But loading the coordinate variables serially seems unnecessarily wasteful. Maybe we could take advantage of zarr v3's new async interface to load them in parallel under the hood?

(tagging @TomAugspurger @jhamman to tell me if that's not possible)
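
A hypothetical sketch of what that could look like on top of zarr-python v3's asynchronous layer; the entry points used here (zarr.api.asynchronous.open_group, AsyncGroup.getitem, AsyncArray.getitem) are assumptions about the v3 API, and this is not how xarray currently works:

import asyncio

# Assumed zarr-python v3 async entry point; name/signature may differ.
from zarr.api.asynchronous import open_group


async def fetch_coord_arrays(store, coord_paths):
    # Open the root group through the async API.
    root = await open_group(store=store, mode="r")
    # Resolve all coordinate arrays concurrently...
    arrays = await asyncio.gather(*(root.getitem(path) for path in coord_paths))
    # ...then issue every array's chunk requests in one gather, instead of
    # waiting for one coordinate variable to finish before starting the next.
    return await asyncio.gather(*(arr.getitem(...) for arr in arrays))

The point of the sketch is only that a single gather over all coordinate arrays would let the store see every chunk request at once, which is what the serial per-variable loading currently prevents.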
