Optimize split_every #279
In the dataframe world, `split_every` is less about reducing graph size and more about reducing waiting on a single machine. It tends to be really important for clusters with many workers and datasets with many partitions. In practice, any moderate value is fine; 8-32 is probably typical. The graph size thing here tends not to be a big deal. We're talking about at most … I may not understand the situation in the xarray case though.
Interesting. Are there any network considerations? Do we prefer to send many small things over the network rather than fewer bigger things? It seems like the latter, unless the workers end up waiting too long? That National Water Model workload would be a good thing to experiment with: it has both many workers and reduces over many partitions. IIRC we can set …
Mostly, workers don't like it if 100 other machines try to send them stuff all at the same time. It's better to have other folks accept batches, reduce them down, and then pass the result on. The downside is more sequential hops, but only `log_k` as many. And obviously, if your intermediates are large, you won't want to send very many of them to one task.
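As a rough sketch of that trade-off (assuming a plain dask.array reduction; the array shape and chunking below are arbitrary), smaller `split_every` means fewer inputs per combine task but more sequential hops, and the total task count grows by roughly `n/k + n/k² + …` extra combine tasks:

```python
import math
import dask.array as da

# 100 input chunks along the reduced axis (arbitrary example sizes)
x = da.ones((1000, 1000), chunks=(10, 1000))

for split_every in (2, 8, 32):
    result = x.sum(axis=0, split_every=split_every)
    n_tasks = len(result.__dask_graph__())  # total tasks in the graph
    # depth of the combine tree: ceil(log_k(number of input blocks))
    depth = math.ceil(math.log(x.numblocks[0]) / math.log(split_every))
    print(f"split_every={split_every:>2}: {n_tasks} tasks, {depth} combine hops")
```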
I know a place where it's easy to experiment 🙂
The `split_every` parameter controls how many blocks are combined at every combine stage. If we know where the group labels are (so the number of groups in a block, and the number of elements per group in each block), we can estimate the memory use of the intermediates for a given reduction and optimize `split_every` to reduce the graph size.

@mrocklin do you think this would be a decent win?
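A rough sketch of what such an optimization could look like. The names and the memory target here are hypothetical, not part of any existing API: pick the largest `split_every` for which one combine task's collected intermediates stay under a memory budget.

```python
import math

def choose_split_every(n_groups_per_block, bytes_per_group,
                       memory_target=256e6, max_split=32):
    # Hypothetical helper: each intermediate holds roughly
    # n_groups_per_block * bytes_per_group bytes, and a combine task
    # collects split_every of them at once. Pick the largest split_every
    # that keeps that total under memory_target.
    intermediate_bytes = n_groups_per_block * bytes_per_group
    if intermediate_bytes <= 0:
        return max_split
    split = int(memory_target // intermediate_bytes)
    return max(2, min(split, max_split))

def tree_depth(n_blocks, split_every):
    # Number of sequential combine hops for a tree reduction over n_blocks.
    return max(1, math.ceil(math.log(n_blocks) / math.log(split_every)))

# Example: 10_000 groups per block, ~8 bytes per group intermediate,
# reducing over 500 blocks.
k = choose_split_every(10_000, 8)
print(k, tree_depth(500, k))
```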
@tomwhite Is there a version of this in cubed?