Optimize split_every #279
In the dataframe world, `split_every` is less about reducing graph size and more about reducing waiting on a single machine. It tends to be really important for clusters with many workers and datasets with many partitions. In practice, any moderate value is fine; 8-32 is probably typical. The graph size thing here tends not to be a big deal. We're talking about at most … I may not understand the situation in the xarray case though.
Interesting. Are there any network considerations? Do we prefer to send many small things over the network rather than fewer bigger things? It seems like the latter, unless the workers end up waiting too long? That National Water Model workload would be a good thing to experiment with: it has both many workers and reduces over many partitions. IIRC we can set …
Mostly, workers don't like it if 100 other machines try to send them stuff all at the same time. It's better to have other folks accept batches, reduce them down, and then pass the result on. The downside is more sequential hops, but only `log_k` as many. And obviously, if your intermediates are large, you won't want to send very many of them to one task.
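As a rough sketch of that trade-off (assuming a plain dask.array reduction; the array shape and chunking below are arbitrary), smaller `split_every` means fewer inputs per combine task but more sequential hops, and the total task count grows by roughly `n/k + n/k² + …` extra combine tasks:

```python
import math
import dask.array as da

# 100 input chunks along the reduced axis (arbitrary example sizes)
x = da.ones((1000, 1000), chunks=(10, 1000))

for split_every in (2, 8, 32):
    result = x.sum(axis=0, split_every=split_every)
    n_tasks = len(result.__dask_graph__())  # total tasks in the graph
    # depth of the combine tree: ceil(log_k(number of input blocks))
    depth = math.ceil(math.log(x.numblocks[0]) / math.log(split_every))
    print(f"split_every={split_every:>2}: {n_tasks} tasks, {depth} combine hops")
```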
I know a place where it's easy to experiment 🙂
The `split_every` parameter controls how many blocks are combined at every combine stage. If we know where the group labels are (so the number of groups in a block, and the number of elements per group in each block), we can estimate the memory use of the intermediates for a given reduction and optimize `split_every` to reduce the graph size.

@mrocklin do you think this would be a decent win?
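A rough sketch of what such an optimization could look like. The names and the memory target here are hypothetical, not part of any existing API: pick the largest `split_every` for which one combine task's collected intermediates stay under a memory budget.

```python
import math

def choose_split_every(n_groups_per_block, bytes_per_group,
                       memory_target=256e6, max_split=32):
    # Hypothetical helper: each intermediate holds roughly
    # n_groups_per_block * bytes_per_group bytes, and a combine task
    # collects split_every of them at once. Pick the largest split_every
    # that keeps that total under memory_target.
    intermediate_bytes = n_groups_per_block * bytes_per_group
    if intermediate_bytes <= 0:
        return max_split
    split = int(memory_target // intermediate_bytes)
    return max(2, min(split, max_split))

def tree_depth(n_blocks, split_every):
    # Number of sequential combine hops for a tree reduction over n_blocks.
    return max(1, math.ceil(math.log(n_blocks) / math.log(split_every)))

# Example: 10_000 groups per block, ~8 bytes per group intermediate,
# reducing over 500 blocks.
k = choose_split_every(10_000, 8)
print(k, tree_depth(500, k))
```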
@tomwhite Is there a version of this in cubed?