Efficient Reshard Tool #49

Open
blester125 opened this issue Jan 12, 2024 · 2 comments

Comments

@blester125
Collaborator

Some of the preprocessing can remove all document content (and some documents in some sources are blank to begin with). We should have an efficient way to remove these documents (there is an example in the dolma docs) and reshard the remaining ones so that the resulting shards are balanced.
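A minimal sketch of what such a tool could look like, not the dolma implementation. It assumes shards are gzipped JSONL files with one document per line and a "text" field, treats whitespace-only text as empty, and reads "balanced" as "each output shard is capped at roughly the same uncompressed size" (the last shard may be smaller). The names `iter_docs`, `reshard`, and `max_bytes` are hypothetical.

```python
import gzip
import json
import os
from typing import Iterable, Iterator, List


def iter_docs(paths: Iterable[str]) -> Iterator[dict]:
    """Stream documents from gzipped JSONL shards, one JSON object per line."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


def reshard(paths: Iterable[str], out_dir: str, max_bytes: int = 1_000_000_000) -> List[str]:
    """Drop documents with empty "text" and rewrite the rest into new shards.

    Each output shard holds at most ~max_bytes of uncompressed JSONL
    (assumption: size-based balancing is what we want).
    """
    os.makedirs(out_dir, exist_ok=True)
    written: List[str] = []
    out, size = None, 0
    for doc in iter_docs(paths):
        # Assumption: a missing, empty, or whitespace-only "text" counts as empty.
        if not (doc.get("text") or "").strip():
            continue
        line = json.dumps(doc, ensure_ascii=False) + "\n"
        n = len(line.encode("utf-8"))
        # Start a new shard when the current one would exceed the cap.
        if out is None or (size > 0 and size + n > max_bytes):
            if out is not None:
                out.close()
            path = os.path.join(out_dir, f"{len(written):05d}.jsonl.gz")
            out = gzip.open(path, "wt", encoding="utf-8")
            written.append(path)
            size = 0
        out.write(line)
        size += n
    if out is not None:
        out.close()
    return written
```

This streams one document at a time, so memory use stays flat regardless of input size. It writes a filtered copy rather than modifying the inputs in place.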

@haileyschoelkopf
Collaborator

haileyschoelkopf commented Jan 15, 2024

Took a glance at this to try to familiarize myself with the Dolma library. Would an efficient solution be to run dolma tag to tag with the char_length_v1 tag, then run dolma mix to exclude docs for which this tag value is 0, to create a filtered copy?

Or would it be preferred to run this in place? If not, and the above is OK, I could implement this ASAP. If in-place is preferred, I can look some more at how dolma does (re)sharding.

EDIT: ahh, sorry, just found https://github.com/allenai/dolma/blob/main/scripts/remove_empty_docs.py , taking a look at that now.

@blester125
Collaborator Author

An example where this would be useful: https://huggingface.co/datasets/blester125/foodista-dolma/tree/main/v0/documents

For the foodista data, the raw HTML is first saved into the dolma format and the content is then parsed out with a dolma parallel processor. This results in "text" fields that are much smaller, and the resulting dolma shards aren't really worth being shards lol
