speed benchmark: IterDataPipe noticeably slower than MapDataPipe #492
As @NivekT pointed out, the profiler overhead is significant; this PR would potentially resolve the performance regression.

I see! This sounds like the main reason for the speed difference. I will eagerly wait for the PR to be approved and merged, and will then be happy to re-run my speed benchmarks.

Since the PR has landed, do you want to test the nightly releases of PyTorch and TorchData?

Thanks! I'll try to test this out soon.

Hi! Can you also share the code you are using for the benchmarks?
🐛 Describe the bug
I ran a series of speed benchmarks comparing 3 methods of building a datapipe. This is a continuation of the conversation from #454 (comment).

Brief context: I start with a `.csv` containing the unique ID of each data sample. I apply a series of maps & filters. Finally, I generate the tensors needed by my model for training. The last step is expensive in both compute & memory. I'm in the cheminformatics space, where we generate various features for a given molecule or protein. We may, for example, generate Morgan fingerprints of dimension 2048, or a large 3D surface mesh of the protein (which can be thousands of elements per sample). Just to be clear, this is a very common motif in deep learning workflows in this space, and so it would be generally applicable to the domain.

I am using `IterDataPipe` for all the operations until the last step. For the last step, I have tried 3 cases (sketched in the snippet below):

1. `IterDataPipe`
2. `MapDataPipe`
3. `IterDataPipe`, BUT run `.to_map_datapipe()` to convert it to a `MapDataPipe` AFTER generating the tensors.

Training time per epoch (in seconds) on 100k samples, as the average of 5 epochs (includes forward & backward pass + backprop on a fixed model, which is deliberately kept very simple).
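Schematically, the three setups look something like this (a simplified sketch: `read_ids_from_csv`, `keep_valid`, and `featurize` are hypothetical stand-ins for my actual maps, filters, and feature generation, not the real code):

```python
import torch
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper
from torchdata.datapipes.map import SequenceWrapper

# Hypothetical stand-ins for the real pipeline:
def read_ids_from_csv(path):
    return ["mol_0", "mol_1", "mol_2"]   # dummy sample IDs

def keep_valid(sample_id):
    return True                          # dummy filter

def featurize(sample_id):
    return torch.zeros(2048)             # dummy 2048-dim "fingerprint"

ids = read_ids_from_csv("samples.csv")

# (1) IterDataPipe end to end -- featurization happens lazily in __iter__.
dp1 = IterableWrapper(ids).filter(keep_valid).map(featurize)

# (2) Same cheap upstream steps, but the expensive last step lives in a MapDataPipe.
dp2 = SequenceWrapper([i for i in ids if keep_valid(i)]).map(featurize)

# (3) Same as (1), but converted to a MapDataPipe AFTER the tensors are generated;
#     this materializes every featurized sample in memory.
dp3 = (IterableWrapper(ids)
       .filter(keep_valid)
       .map(lambda i: (i, featurize(i)))   # (key, value) pairs for the converter
       .to_map_datapipe())

dl = DataLoader(dp1, batch_size=32)        # same DataLoader call for each variant
```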
I also measured the time to set up the datapipe, `dp = build_datapipe(args)`, going from `.csv` to `torch.Tensor`s, followed by `dl = DataLoader(dp)`. For all 3 methods it is the same, at 18 seconds; `MapDataPipe` (2) didn't take longer to set up than `IterDataPipe` (1).
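Not the exact benchmark script, but a simplified sketch of what is being timed (with dummy `build_datapipe` and `train_one_epoch` helpers standing in for the real pipeline and training loop):

```python
import time
import torch
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def build_datapipe(csv_path):
    # dummy: in reality this goes from the .csv to ready-to-train tensors
    return IterableWrapper([(torch.zeros(2048), torch.zeros(1)) for _ in range(8)])

def train_one_epoch(model, dl):
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    for x, y in dl:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# Setup time: building the datapipe and wrapping it in a DataLoader (~18 s here).
t0 = time.perf_counter()
dp = build_datapipe("samples.csv")
dl = DataLoader(dp, batch_size=32)
print("setup:", time.perf_counter() - t0)

# Per-epoch time, averaged over 5 epochs, on a deliberately simple model.
model = torch.nn.Linear(2048, 1)
epoch_times = []
for _ in range(5):
    t0 = time.perf_counter()
    train_one_epoch(model, dl)
    epoch_times.append(time.perf_counter() - t0)
print("epoch (avg of 5):", sum(epoch_times) / len(epoch_times))
```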
There are obvious benefits to `IterDataPipe` over `MapDataPipe`. The main one that I'm most concerned about is error handling: if I am somehow unable to generate the feature matrices & tensors for a given data sample, I can simply skip it and not `yield` anything in `__iter__`. With `MapDataPipe`, I am forced to return something, like `None`, which complicates things as I have to handle this `None` later, e.g. in `collate_fn`.
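A rough sketch of the difference (again hypothetical: `featurize` and `FeaturizationError` stand in for my real feature generation and whatever can go wrong in it):

```python
import torch
from torch.utils.data import IterDataPipe, MapDataPipe, default_collate

class FeaturizationError(Exception):
    """Hypothetical: raised when a sample cannot be featurized."""

def featurize(sample_id):
    """Hypothetical stand-in for the expensive tensor-generation step."""
    if sample_id == "bad":
        raise FeaturizationError(sample_id)
    return torch.zeros(2048)

class FeaturizeIterDataPipe(IterDataPipe):
    """Iter version: a failing sample is skipped by simply not yielding it."""
    def __init__(self, ids):
        self.ids = ids

    def __iter__(self):
        for sample_id in self.ids:
            try:
                yield featurize(sample_id)
            except FeaturizationError:
                continue  # skip -- nothing is yielded for this sample

class FeaturizeMapDataPipe(MapDataPipe):
    """Map version: __getitem__ has to return *something* for every index."""
    def __init__(self, ids):
        self.ids = ids

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        try:
            return featurize(self.ids[idx])
        except FeaturizationError:
            return None  # has to be dealt with downstream

def collate_fn(batch):
    # With the Map version, the Nones have to be filtered out here (or elsewhere).
    batch = [sample for sample in batch if sample is not None]
    return default_collate(batch)
```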
Note that (3) is very memory intensive and simply infeasible even for a relatively small dataset (100k samples), since we need to load the whole `IterDataPipe` into memory in `self._load_map()` (see #454).

Versions
As I'm on NixOS, I'm not directly using `pip`/`conda`, and it is difficult for me to run the `collect_env.py` script as is. However, I can still provide the versions printed by `<pkg>.__version__`:

- 1.11.0+cu113
- 0.3.0
- 1.21.5
- 1.4.2

The benchmarks are done on an RTX 3080 GPU with 32 GB RAM and an 8-core CPU. Please let me know if this is insufficient.