Support for valid examples #158
Comments
Agree that there should be a way to filter out invalid values. There's a newer duplicate issue at #162 on having a predicate function (had to look up https://dcl-prog.stanford.edu/function-predicate.html to know that predicate functions are those that return a True/False value). At #162 (comment), @cmdupuis3 showed an example code snippet.
This code can be summarized as 3 main steps: generate the batches, apply the predicate to each batch, and concatenate the valid batches back together.
The fact that someone has to concat the tensors together after having already used BatchGenerator is not ideal. That said, we could theoretically add a predicate option to BatchGenerator.
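For illustration, a minimal sketch of that filter-after-generation pattern (not the original snippet; the DataArray `da`, the dim names, and the `is_valid` predicate are all assumed):

```python
import numpy as np
import xarray as xr
import xbatcher

# Hypothetical predicate: keep a batch only if the target variable
# (assumed to be index 0 along 'variable') is non-NaN at the centre pixel.
def is_valid(batch: xr.DataArray) -> bool:
    return bool(np.all(~np.isnan(batch.isel(variable=0, x=5, y=5))))

# Step 1: generate the batches.
gen = xbatcher.BatchGenerator(da, input_dims={"x": 10, "y": 10})

# Step 2: apply the predicate to each batch (this runs sequentially).
valid_batches = [b for b in gen if is_valid(b)]

# Step 3: concatenate the valid batches back together along a new dim.
training_da = xr.concat(valid_batches, dim="sample")
```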
@weiji14 thanks for showing interest in this problem! The term 'predicate function' makes way more sense and I should have used that terminology from the start.

The main issue I see with the three additional steps is that the predicate gets applied to the batches sequentially, so we lose the parallel and potentially distributed power of dask, which is critical for decently-scaled ML problems. I sometimes have >1 TB DataArrays with dims (variable, y, x) where only 10% of the xy coordinates are valid. The target variable that is sparse might be 10+ GB, and all of that would have to come down sequentially just to apply the predicate.

Instead of trying to get BatchGenerator to solve this, I create "Training Datasets" in advance, with the first dimension being the batched dimension. We persist to zarr or to cluster memory because we also shuffle, which is a relatively expensive operation. Then we can iterate over the first dim for batching.

Not to open a can of worms, but I think adding a concept like "Training Dataset" to xbatcher, to precompute costly predicate functions, reshaping/windowing, and shuffling, could help decouple the preprocessing from batch serving and be more performant. Then again, anyone can do this in advance and then use the BatchGenerator over the first dim of that dataset. We still don't do this because, even with all those ops out of the way, BatchGenerator still only loads one batch into memory at a time unless the data is already persisted (if that can be afforded). This could be fine if the dataset is persisted, but it is limited. This is obviously out of scope, but relates to #161
Hi @ljstrnadiii, thanks for elaborating on your workflow. Do you have something working now? I'm curious to see what you had to do to get this working in a parallel-performant way.
The biggest gain for my use case comes from computing the training dataset in advance, where the first dim is the dim to batch over. Something like:
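A rough sketch of that kind of precomputation, assuming an input DataArray `da` with dims (variable, x, y); the helper name `extract_valid_chips`, the chip size, and the zarr path are all made up for illustration:

```python
import numpy as np
import xarray as xr
import xbatcher

def extract_valid_chips(da: xr.DataArray, size: int = 10) -> xr.DataArray:
    """Stack every window whose centre target pixel is valid along a new 'sample' dim."""
    chips = []
    for i in range(0, da.sizes["x"] - size + 1, size):
        for j in range(0, da.sizes["y"] - size + 1, size):
            chip = da.isel(x=slice(i, i + size), y=slice(j, j + size))
            if not np.isnan(chip.isel(variable=0, x=size // 2, y=size // 2)):
                chips.append(chip.drop_vars(["x", "y"], errors="ignore"))
    return xr.concat(chips, dim="sample")

# Do the expensive filtering, reshaping, and shuffling once, up front.
training = extract_valid_chips(da)  # dims: (sample, variable, x, y)
shuffled = training.isel(sample=np.random.permutation(training.sizes["sample"]))
shuffled.to_dataset(name="chips").to_zarr("training.zarr", mode="w")  # persist to zarr (or cluster memory)

# Serving batches is now just slicing along the precomputed leading dim.
gen = xbatcher.BatchGenerator(shuffled, input_dims={"sample": 32})
```

Here the predicate, windowing, and shuffle run once (and can be parallelized with dask), while the generator only has to slice the persisted array.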
Does that add any clarification?
Yeah, that's a lot clearer, thank you!
Is your feature request related to a problem?
There is currently no support for serving only the batches that satisfy some validity criteria. It would be nice to filter out batches based on criteria such as whether the target variable contains valid (non-NaN) values.
Consider this dataset:
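A minimal stand-in (the dimension names, sizes, and sparsity are assumed): a (variable, x, y) DataArray whose 'target' layer is mostly NaN while the 'feature' layer is dense.

```python
import numpy as np
import xarray as xr

nx, ny = 100, 100
target = np.full((nx, ny), np.nan)
target[np.random.rand(nx, ny) < 0.1] = 1.0  # only ~10% of pixels carry a target value
feature = np.random.rand(nx, ny)

da = xr.DataArray(
    np.stack([target, feature]),
    dims=("variable", "x", "y"),
    coords={"variable": ["target", "feature"]},
)
```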
Suppose we are serving this to a machine learning process and we only care about locations where we have target data. Many of these examples will not be valid, i.e. there will be no target value to use for training.
Describe the solution you'd like
I would like to see something like:
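For example (note that the `filter` argument sketched here is purely hypothetical; nothing like it exists in xbatcher today):

```python
import numpy as np
import xbatcher

def is_valid(batch) -> bool:
    # Serve the batch only when the centre pixel of the target variable is non-NaN.
    return bool(np.all(~np.isnan(batch[:, 0, 5, 5])))

# NOTE: 'filter' is a hypothetical keyword argument, not part of the current API.
gen = xbatcher.BatchGenerator(da, input_dims={"x": 10, "y": 10}, filter=is_valid)
```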
where each served batch satisfies:
np.all(~np.isnan(batch[:,0,5,5]))
Describe alternatives you've considered
see: https://discourse.pangeo.io/t/efficiently-slicing-random-windows-for-reduced-xarray-dataset/2447
I typically extract all valid "chips" or "patches" in advance and persist them as a "training dataset" to get all the computation out of the way. The dims would look something like {'i': number of valid chips, 'variable': 2, 'x': 10, 'y': 10}. I could then simply use xbatcher to batch over the 'i' dimension.
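For example, assuming the precomputed dataset is called `training_ds`:

```python
import xbatcher

# training_ds has dims {'i': n_valid_chips, 'variable': 2, 'x': 10, 'y': 10};
# batching is then just slicing chips along 'i', with the other dims kept whole.
gen = xbatcher.BatchGenerator(training_ds, input_dims={"i": 64})
for batch in gen:
    ...  # every batch already contains only valid chips
```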
Additional context
No response