Approximate PR/ROC AUC Metrics #2319
We've updated the Colab notebook with more metrics and updated the test code with pytest methods.
I searched a little bit and found that for computing AUC there are three approaches:
By space measures I mean the space being used between two consecutive batch updates, not the temporary space being used during a batch update. @JustinSzeto, @EricZimmermann, what are your thoughts on that?
I'm a little confused by what you're asking. The biggest use case for this class is the aggregation of large sets of labels over time that cannot be stored in RAM, done as a confusion matrix over discrete elements. The confusion matrix allows you to compute a set of statistics, all in O(T), at the end of an epoch or run. If you use the Wilcoxon-Mann-Whitney statistic, how would you aggregate it over mini-batches, assuming you cannot cache outputs from batch to batch? What would you do for all the other stats? I see the application for just ROC AUC, but then you have to cache everything in memory until the end of an epoch, which is what we try to avoid as well.

Ex: we want to perform image segmentation on large images. Each image has 1e6 labels at 16 bits each. If your dataset is large, storing this in RAM is costly. Secondly, at the end of each epoch, computing all statistics would take up cycles and slow things down considerably.

I think we could probably modify how the confusion matrix is done, considering there is a recursive-threshold approach that can work without the sorting, in O(TN) (values between t_{i} and t_{i+1} would be values less than t_{i+1} minus values less than t_{i}). We could look into how tf does it, but I imagine this is how it's done.

I also want to note that these time complexities are a little tough to analyze in a parallelized system. We initially relied on the sort, which is O(NlogN), actually being done much faster, since operations are not sequential. After that, the cumsum is O(N) and the search is O(TlogN) (again, parallel).
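A minimal sketch of that recursive-threshold idea (hypothetical code, not the PR's implementation; `BinnedCounts` and every name in it are made up): bin each batch with `torch.bucketize` and accumulate per-bin counts, so values between t_{i} and t_{i+1} are just the bin counts, and values less than t_{i} fall out of a prefix sum:

```python
import torch

class BinnedCounts:
    """Hypothetical sketch of sort-free count accumulation over batches."""

    def __init__(self, thresholds: torch.Tensor):
        self.thresholds = thresholds  # 1-D, monotonically increasing
        n_bins = len(thresholds) + 1  # extra bin for values above the last threshold
        self.pos = torch.zeros(n_bins, dtype=torch.long, device=thresholds.device)
        self.neg = torch.zeros(n_bins, dtype=torch.long, device=thresholds.device)

    @torch.no_grad()
    def update(self, scores: torch.Tensor, targets: torch.Tensor):
        scores, targets = scores.flatten(), targets.flatten()
        # parallel binary search per element; no O(NlogN) sort over the batch
        bins = torch.bucketize(scores, self.thresholds)
        self.pos += torch.bincount(bins[targets == 1], minlength=len(self.pos))
        self.neg += torch.bincount(bins[targets == 0], minlength=len(self.pos))
        # counts "less than t_i" are then prefix sums: self.pos.cumsum(0)
```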
You're right that trying to compute the metric at the end of a run is excruciating both in terms of time and in terms of memory, given large data samples. Wilcoxon-Mann-Whitney could still be used without facing the problems you mentioned by making a minor modification to it: specifically, by comparing predictions in a batch only with themselves and not with those in other batches. This way we only keep a single float in memory between batches. Am I correct?

About your implementation vs. that of TensorFlow: I think yours is better because it not only uses less memory during computation of a batch (O(N) vs. O(NT)) but also takes less time if T > log(N), which might often be the case. Is this analysis right?
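For concreteness, a minimal sketch of this batch-local modification (class name and details are illustrative, not from the PR). In practice it keeps two running scalars rather than literally one float, since the pair count is needed for the final ratio:

```python
import torch

class BatchLocalWMW:
    """Hypothetical sketch: within-batch Wilcoxon-Mann-Whitney with running scalars."""

    def __init__(self):
        self.concordant = 0.0  # pairs where a positive outscores a negative (+0.5 for ties)
        self.pairs = 0.0       # total within-batch positive-negative pairs seen so far

    @torch.no_grad()
    def update(self, scores: torch.Tensor, targets: torch.Tensor):
        scores, targets = scores.flatten(), targets.flatten()
        pos, neg = scores[targets == 1], scores[targets == 0]
        if len(pos) == 0 or len(neg) == 0:
            return  # no comparable pairs in this batch
        # all within-batch pairs; this temporary is freed after the update
        diff = pos.unsqueeze(1) - neg.unsqueeze(0)
        self.concordant += ((diff > 0).sum() + 0.5 * (diff == 0).sum()).item()
        self.pairs += diff.numel()

    def compute(self) -> float:
        # approximate ROC AUC as the fraction of concordant pairs
        return self.concordant / self.pairs if self.pairs else 0.5
```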
@sadra-barikbin Yes, you're right! We wrote this with medical image segmentation in mind (along the lines of BraTS), so these are relatively large assumptions, which seem to work out in this use case.
How would you integrate this with the rest of the PR? The whole basis of this is to also have access to everything you could compute with a confusion matrix. I see this being nice somewhere... but I'm not sure it fits the remainder of the requirements.
What requirements do we have other than computing AUC itself?
We have all the confusion-matrix-derived stats (F1, MCC, balanced accuracy, etc.)
You're right that using Wilcoxon-Mann-Whitney we cannot compute those you mentioned, but their implementation (O(N) in time) does already exist in Ignite, so the user could use that. What advantages does this implementation provide for those derived stats?
For segmentation there are class imbalances, so looking at PR AUC is useful. A big topic is detection / segmentation model calibration, so this allows us to track these things over time.
Your first statement is correct. About the second one: by tracking things over time, do you mean having those derived stats over epochs?
over batch updates with minimal compute requirements |
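For illustration, a sketch (assumed names, not the PR's code) of how per-threshold confusion counts, read off the accumulated histogram, yield the derived stats in O(T) after any batch update:

```python
import torch

def derived_stats(tp, fp, fn, tn, eps=1e-12):
    # tp/fp/fn/tn: float tensors of shape (T,), one entry per threshold,
    # recoverable from the accumulated histogram in O(T) at any point
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)               # sensitivity / TPR
    specificity = tn / (tn + fp + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    balanced_acc = 0.5 * (recall + specificity)
    mcc = (tp * tn - fp * fn) / torch.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn) + eps
    )
    return f1, balanced_acc, mcc
```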
🚀 Feature
Taken from this Colab notebook by @EricZimmermann
Description
Problem
For large tensors, computing AUC metrics over multiple thresholds is expensive and slow. For a sufficiently large dataset, caching or saving outputs is too expensive and must be done in post-processing.
Solution
Assuming the distribution of values is known and bounded, approximate the integral via a Riemann sum over a set of fixed or variable step sizes.
Let a set of monotonically increasing thresholds $T = \{t_1, t_2, \dots, t_n\}$ be the step sizes used to approximate an integral (AUC). For each iteration, cache the counts (build a histogram) of each value falling into a bin between two thresholds. When complete, approximate the statistic using a ratio of counts (see the sketch after this list).
Place the calculation on the GPU for speed, removing the CPU bottleneck at each iteration
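A minimal sketch of the final step, assuming per-bin positive/negative counts have been accumulated as described above (`approx_roc_auc` is an illustrative name, not the notebook's API):

```python
import torch

def approx_roc_auc(pos: torch.Tensor, neg: torch.Tensor) -> float:
    # pos/neg: accumulated per-bin counts of positive/negative samples (T+1 bins);
    # suffix sums give counts of scores above each of the T thresholds
    tp = pos.flip(0).cumsum(0).flip(0)[1:].float()
    fp = neg.flip(0).cumsum(0).flip(0)[1:].float()
    tpr = tp / pos.sum().clamp(min=1)  # true positive rate at each threshold
    fpr = fp / neg.sum().clamp(min=1)  # false positive rate at each threshold
    # Riemann / trapezoidal sum over the approximate ROC curve
    return torch.trapz(tpr.flip(0), fpr.flip(0)).item()
```

PR AUC follows the same pattern, with precision and recall derived from the same counts in place of TPR and FPR.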
Applications
Optimizing a model for voxel-level PR AUC / ROC AUC (each voxel treated as an independent sample), e.g. semantic pathology segmentation
Enable the user to validate the model at its best operating point (e.g. F1 at a non-0.5 threshold)
Limitations
The domain (as thresholds) must be known ahead of time. This can be accounted for by setting many thresholds over a very wide range, where num thresholds << num voxels in the output tensor
Future work could use a heuristic to widen / narrow the number of thresholds based on previous iterations
Context:
Code: