Efficient Reshard Tool #49

blester125 · 2024-01-12T20:06:18Z

Some of the preprocessing can remove all document content (and some documents in some sources are blank to begin with). We should have an efficient way to remove these documents (there is an example in the dolma docs) and reshard them so that the resulting shards are balanced.
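A minimal sketch of what such a tool could look like, assuming the shards are gzipped JSON Lines files where each document carries a "text" field. The names `iter_nonempty_docs`, `reshard`, and `max_bytes` are hypothetical and not part of dolma; this is a single-process illustration, not the script from the dolma docs.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_nonempty_docs(shard_paths: Iterable[Path]) -> Iterator[dict]:
    """Yield documents whose "text" field is non-empty after stripping whitespace."""
    for path in shard_paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines in the shard itself
                doc = json.loads(line)
                # Treat a missing or null "text" the same as an empty one.
                if (doc.get("text") or "").strip():
                    yield doc


def reshard(shard_paths: Iterable[Path], out_dir: Path, max_bytes: int = 1_000_000_000) -> List[Path]:
    """Rewrite surviving docs into new shards of at most ~max_bytes (uncompressed) each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    out, size = None, 0
    for doc in iter_nonempty_docs(shard_paths):
        line = json.dumps(doc, ensure_ascii=False) + "\n"
        n = len(line.encode("utf-8"))
        # Start a new shard when the current one would overflow
        # (a single oversized doc still gets a shard of its own).
        if out is None or (size > 0 and size + n > max_bytes):
            if out is not None:
                out.close()
            path = out_dir / f"{len(written):05d}.jsonl.gz"
            out = gzip.open(path, "wt", encoding="utf-8")
            written.append(path)
            size = 0
        out.write(line)
        size += n
    if out is not None:
        out.close()
    return written
```

Capping each output shard at a fixed uncompressed size keeps all shards except possibly the last one roughly equal, regardless of how uneven the inputs were. An efficient version would likely need to parallelize over input shards, which this sketch does not attempt.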

haileyschoelkopf · 2024-01-15T16:17:03Z

Took a glance at this to try to familiarize myself with the Dolma library. Would an efficient solution be to run dolma tag to tag with the char_length_v1 tag, then run dolma mix to exclude docs for which this tag value is 0, to create a filtered copy?

Or would it be preferred to run this in-place? If a filtered copy is fine, I could implement this ASAP. If in-place is preferred, I can look some more at how dolma does (re)sharding.

EDIT: ahh, sorry, just found https://github.com/allenai/dolma/blob/main/scripts/remove_empty_docs.py , taking a look at that now.

blester125 · 2024-05-24T16:54:47Z

Example where this would be useful: https://huggingface.co/datasets/blester125/foodista-dolma/tree/main/v0/documents

For the foodista data, the raw HTML is first saved into the dolma format and the content is parsed out with a dolma parallel processor. This results in "text" fields that are much smaller, and the resulting dolma shards aren't really worth being shards lol

craffel added the infrastructure label May 6, 2024
