Some of the preprocessing can remove all document content (and some documents in some sources are blank to begin with). We should have an efficient way to remove these documents (there is an example in the dolma docs) and reshard the output so that the resulting shards are balanced.
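For small- to medium-sized sources, the filtering-plus-resharding step can also be done directly on the dolma-format JSONL shards. A minimal sketch (not the dolma toolkit's own pipeline; the function name, `docs_per_shard` knob, and shard naming are all illustrative):

```python
import gzip
import json
import os


def reshard_nonempty(input_paths, output_dir, docs_per_shard=10000):
    """Drop documents whose "text" field is empty/whitespace and rewrite
    the survivors into evenly sized gzip'd JSONL shards.

    This is a single-process sketch; for large corpora the dolma
    tagger/mixer tooling is the better fit.
    """
    os.makedirs(output_dir, exist_ok=True)
    buf = []
    shard_idx = 0
    written = 0

    def flush():
        nonlocal buf, shard_idx
        if not buf:
            return
        path = os.path.join(output_dir, f"shard-{shard_idx:04d}.jsonl.gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for doc in buf:
                f.write(json.dumps(doc) + "\n")
        shard_idx += 1
        buf = []

    for path in input_paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                # Skip documents that preprocessing emptied out.
                if doc.get("text", "").strip():
                    buf.append(doc)
                    written += 1
                    if len(buf) >= docs_per_shard:
                        flush()
    flush()
    return written
```

Because every output shard except possibly the last holds exactly `docs_per_shard` documents, the rebalancing falls out of the rewrite for free.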
Took a glance at this to try to familiarize myself with the Dolma library. Would an efficient solution be to run `dolma tag` with the `char_length_v1` tagger, then run `dolma mix` to exclude docs for which this tag value is 0, producing a filtered copy?
Or would it be preferred to run this in place? If the tag-then-mix approach is OK, I could implement it ASAP; if in-place is preferred, I can look some more at how dolma does (re)sharding.
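For the tag-then-mix route, the mixer step would be driven by a config along these lines. This is a sketch only: the paths are placeholders, and the exact attribute key produced by the `char_length_v1` tagger (assumed here to be `char_length_v1__char_length_v1__length`) should be checked against the actual attribute files it writes before relying on the filter expression.

```yaml
streams:
  - name: filtered_source
    documents:
      - /data/source/documents/*.jsonl.gz   # placeholder input path
    attributes:
      - char_length_v1                       # must match the tagger run
    output:
      path: /data/source_filtered/documents  # placeholder output path
      max_size_in_bytes: 1073741824          # ~1 GiB per output shard
    filter:
      # Keep only docs whose char-length score is nonzero; the attribute
      # key and span layout ([start, end, score]) are assumptions to verify.
      include:
        - "$.attributes[?(@.char_length_v1__char_length_v1__length[0][2] > 0)]"
processes: 8
```

Since the mixer caps output files by size, this would also take care of rebalancing the shards in the same pass.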
For the foodista data, the raw HTML is first saved into the dolma format, and the content is then parsed out with a dolma parallel processor. This results in "text" fields that are much smaller, and the resulting dolma shards aren't really worth being shards lol