speed benchmark: IterDataPipe noticeably slower than MapDataPipe #492

Open · linminhtoo opened this issue on Jun 2, 2022 · 5 comments

linminhtoo commented Jun 2, 2022

🐛 Describe the bug

I ran a series of speed benchmarks comparing 3 methods of building a datapipe. This is a continuation of the conversation from #454 (comment).

Brief context:
I start with a .csv containing the unique ID of each data sample. I apply a series of maps & filters. Finally, I generate the tensors needed by my model for training. This last step is expensive in both compute & memory. I'm in the cheminformatics space, where we generate various features for a given molecule or protein; we may, for example, generate Morgan fingerprints of dimension 2048, or a large 3D surface mesh of the protein (which can be thousands of elements per sample). To be clear, this is a very common motif in deep learning workflows in this space, so the issue is generally applicable to the domain.

I am using IterDataPipe for all the operations until the last step. For the last step, I have tried 3 cases (see the sketch after this list):

  1. define my own IterDataPipe
  2. define my own MapDataPipe
  3. use my defined IterDataPipe, BUT run `.to_map_datapipe()` to convert it to a MapDataPipe AFTER generating the tensors.
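
For concreteness, here is a minimal sketch of the three variants, assuming a hypothetical `featurize` function that stands in for the expensive tensor generation (this is not the actual benchmark code):

```python
import torch
from torchdata.datapipes.iter import IterableWrapper, IterDataPipe
from torchdata.datapipes.map import MapDataPipe

def featurize(sample_id: str) -> torch.Tensor:
    # hypothetical stand-in for the expensive step,
    # e.g. computing a 2048-dim Morgan fingerprint
    return torch.rand(2048)

# (1) a custom IterDataPipe around the expensive step
class FeaturizeIterDataPipe(IterDataPipe):
    def __init__(self, source_dp):
        self.source_dp = source_dp

    def __iter__(self):
        for sample_id in self.source_dp:
            yield featurize(sample_id)

# (2) a custom MapDataPipe over a materialized list of IDs
class FeaturizeMapDataPipe(MapDataPipe):
    def __init__(self, sample_ids):
        self.sample_ids = sample_ids

    def __getitem__(self, index):
        return featurize(self.sample_ids[index])

    def __len__(self):
        return len(self.sample_ids)

ids = [f"id_{i}" for i in range(100_000)]

dp1 = FeaturizeIterDataPipe(IterableWrapper(ids))  # case (1)
dp2 = FeaturizeMapDataPipe(ids)                    # case (2)
dp3 = dp1.enumerate().to_map_datapipe()            # case (3): featurize, then convert
```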

Training time per epoch (in seconds) on 100k samples, averaged over 5 epochs (includes the forward pass & backprop on a fixed model, which is deliberately kept very simple):

| num_workers | (1) IterDataPipe | (2) MapDataPipe | (3) MapDataPipe after expensive step |
|---|---|---|---|
| 0 | 149 | 128 | 21 |
| 4 | 87 | 62 | 25 |
| 8 | 89 | 69 | 25 |
| 16 | 115 | 70 | 31 |
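
For reference, per-epoch time is measured roughly along these lines; this is a hypothetical sketch with a placeholder model, loss, and optimizer, not the actual benchmark code:

```python
import time
from torch.utils.data import DataLoader

def avg_epoch_seconds(dp, model, loss_fn, opt, num_workers, n_epochs=5):
    # average seconds per epoch: forward pass + backprop + optimizer step
    dl = DataLoader(dp, batch_size=64, num_workers=num_workers)
    times = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        for batch in dl:
            opt.zero_grad()
            loss = loss_fn(model(batch))  # hypothetical loss on the model output
            loss.backward()
            opt.step()
        times.append(time.perf_counter() - start)
    return sum(times) / n_epochs
```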

I also measured the time to set up the datapipe, `dp = build_datapipe(args)`, going from .csv to `torch.Tensor`s, followed by `dl = DataLoader(dp)`. It is the same for all 3 methods, at 18 seconds; MapDataPipe (2) didn't take longer to set up than IterDataPipe (1).

There are obvious benefits to IterDataPipe over MapDataPipe. The one I'm most concerned about is error handling: if I am somehow unable to generate the feature matrices & tensors for a given data sample, I can simply skip it and not yield anything in `__iter__`. With MapDataPipe, I am forced to return something, like `None`, which complicates things, as I have to handle this `None` later, e.g. in `collate_fn`.
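
To illustrate (reusing the hypothetical `featurize` from the sketch above, now assumed to raise on samples it cannot featurize):

```python
from torch.utils.data import default_collate  # public since torch 1.11
from torchdata.datapipes.iter import IterDataPipe
from torchdata.datapipes.map import MapDataPipe

class SkippingIterDataPipe(IterDataPipe):
    # an IterDataPipe can simply not yield a sample that fails
    def __init__(self, source_dp):
        self.source_dp = source_dp

    def __iter__(self):
        for sample_id in self.source_dp:
            try:
                yield featurize(sample_id)
            except ValueError:
                continue  # skip the bad sample; downstream never sees it

class NoneReturningMapDataPipe(MapDataPipe):
    # a MapDataPipe must return *something* for every index
    def __init__(self, sample_ids):
        self.sample_ids = sample_ids

    def __getitem__(self, index):
        try:
            return featurize(self.sample_ids[index])
        except ValueError:
            return None  # placeholder that must be handled downstream

    def __len__(self):
        return len(self.sample_ids)

def collate_skip_none(batch):
    # custom collate_fn that drops the None placeholders
    return default_collate([b for b in batch if b is not None])
```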

Note that (3) is very memory-intensive and simply infeasible even for a relatively small dataset (100k samples), since the whole IterDataPipe needs to be loaded into memory in `self._load_map()` (see #454).
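
For intuition, the converter has to exhaust the source pipe and hold every (key, value) pair in a dict before it can serve a single index; a simplified sketch of the idea, not the actual torchdata implementation:

```python
def _load_map(self):
    # simplified: materialize the entire source IterDataPipe up front
    self._map = {}
    for key, value in self._datapipe:
        self._map[key] = value
```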

Versions

As I'm on NixOS, I'm not directly using pip/conda, and it is difficult for me to run the `collect_env.py` script as-is. However, I can still provide the versions printed by `<pkg>.__version__`:

  • torch: 1.11.0+cu113
  • torchdata: 0.3.0
  • numpy: 1.21.5
  • pandas: 1.4.2

The benchmarks were run on an RTX 3080 GPU with 32 GB of RAM and an 8-core CPU.

Please let me know if this is insufficient.

linminhtoo changed the title from "speed benchmark: IterDataPipe is slower than MapDataPipe" to "speed benchmark: IterDataPipe noticeably slower than MapDataPipe" on Jun 2, 2022
ejguan (Contributor) commented Jun 2, 2022

As @NivekT pointed out, the overhead of the profiler is significant; this PR would potentially solve the performance regression for IterDataPipe: pytorch/pytorch#78674

linminhtoo (Author) commented

> As @NivekT pointed out, the overhead of the profiler is significant; this PR would potentially solve the performance regression for IterDataPipe: pytorch/pytorch#78674

I see! This sounds like the main reason for the speed difference. I will eagerly wait for the PR to be approved and merged, and will be happy to re-run my speed benchmarks then.

ejguan (Contributor) commented Jun 7, 2022

Since the PR has landed, would you like to test the nightly releases of PyTorch and TorchData?
You can install them via `pip install --pre torch torchdata -f https://download.pytorch.org/whl/nightly/cpu`

linminhtoo (Author) commented

Thanks! I'll try to test this out soon.

VitalyFedyunin (Contributor) commented

Hi! Can you also share the code you are using for benchmarks?
