
Data Provenance Initiative Errors #89

Open
nkandpa2 opened this issue Aug 13, 2024 · 8 comments


@nkandpa2
Collaborator

I tried running the DPI pipeline and encountered a few errors:

  1. When running python download.py --include include.csv, I get some errors where DataProvenanceInitiative/Ultra_Permissive_Test cannot be found on HuggingFace. This is most likely an issue on HF's end and can be avoided by cloning the dataset to the local disk with git clone https://huggingface.co/datasets/DataProvenanceInitiative/Ultra_Permissive_Test

  2. Even with DataProvenanceInitiative/Ultra_Permissive_Test available locally, running python download.py --include include.csv --hf <path to local dataset> fails because some data sources listed in include.csv are missing from DataProvenanceInitiative/Ultra_Permissive_Test. Specifically, the flan_sni, xp3x, and oig sources are missing. @shayne-longpre can you advise whether these sources should be added to the HF dataset or removed from include.csv?

  3. The final issue is with running python to-dolma.py --include include.csv. This script references the MPL 2.0 and EPL 1.0 licenses, which are not defined as permissive licenses in common-pile/licensed_pile/licenses.py. @craffel @blester125 @shayne-longpre these appear to be weak copyleft licenses and I don't think they have non-commercial clauses. Should they be added to the list of allowed licenses for this project?
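Putting the workarounds for (1) and (2) together, the local-clone route might look like the sketch below. This is an assumption-laden sketch, not a tested recipe: the https clone URL is a standard HuggingFace dataset remote (the thread quotes an SSH-style URL), and the relative path passed to --hf is a placeholder for wherever the clone lands.

```shell
# Workaround for (1): clone the dataset to local disk instead of
# fetching it from the HF Hub at download time.
git clone https://huggingface.co/datasets/DataProvenanceInitiative/Ultra_Permissive_Test

# Then point download.py at the local copy (path is hypothetical).
# Note: per (2), this still fails until flan_sni, xp3x, and oig are
# reconciled between include.csv and the HF dataset.
python download.py --include include.csv --hf ./Ultra_Permissive_Test
```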

@shayne-longpre
Collaborator

@nkandpa2 thanks for flagging! I am working on (2).

For (1), what fix are you recommending, or do you just want to document the one you suggested? And for (3), I believe at the time we looked at them and thought they were good. Aviya put me in touch with someone (I'm forgetting now) and we reviewed them.

@nkandpa2
Collaborator Author

For (1), agreed, I think we just document that fix in the README.

For (3), if someone knowledgeable about these licenses has signed off on them then that sounds good to me. Looking back through the history for licenses.py, it doesn't seem that anyone explicitly removed these licenses. Rather, these two licenses were never added in the first place. We can just add those two to the PermissiveLicenses object in licenses.py and the to-dolma.py script should run just fine.
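The proposed fix above might look roughly like the sketch below. The member names, string values, and the exact shape of PermissiveLicenses are assumptions; the real entries should be matched against common-pile/licensed_pile/licenses.py before opening a PR.

```python
from enum import Enum


class PermissiveLicenses(Enum):
    """Hypothetical sketch of the PermissiveLicenses enum in licenses.py.

    Existing members are elided; the two entries at the bottom are the
    proposed additions so to-dolma.py can resolve MPL 2.0 and EPL 1.0.
    """

    # ...existing members (illustrative examples only)...
    MIT = "MIT License"
    APACHE_2 = "Apache License 2.0"

    # Proposed additions (member names and values are assumptions):
    MPL_2 = "Mozilla Public License 2.0"
    EPL_1 = "Eclipse Public License 1.0"
```

With these members defined, a lookup such as `PermissiveLicenses.MPL_2` no longer raises, which is all the to-dolma.py mapping should need.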

@conceptofmind
Contributor

conceptofmind commented Aug 15, 2024

I have run all the DPI sets. We are going to convert them to dolma format and upload ASAP, just so this work is not duplicated. Coordinating with @shayne-longpre on it.

@craffel
Collaborator

craffel commented Aug 15, 2024

@conceptofmind I assume you had to fix the code then? Can you make a PR?

@conceptofmind
Contributor

conceptofmind commented Aug 15, 2024

> @conceptofmind I assume you had to fix the code then? Can you make a PR?

Shayne and I are having a call today to get it all sorted out and will open a PR when resolved.

@conceptofmind
Contributor

conceptofmind commented Aug 15, 2024

Sets are here without dolma format: https://huggingface.co/datasets/DataProvenanceInitiative/common_pile_subset

And the dolma-format ones are pending our evaluation of flan_sni/oig.

@conceptofmind
Contributor

conceptofmind commented Aug 15, 2024

And this is with dolma: https://huggingface.co/datasets/DataProvenanceInitiative/common_pile_subset_dolma

Pending the oig/flan_sni addition, but this should work for a v0.

@conceptofmind
Contributor

conceptofmind commented Sep 9, 2024

PR: #90

Token count:

Found 13 files to process
characters: 6.96Gc [00:12, 573Mc/s]
bytes:      7.12Gb [00:12, 586Mb/s]
tokens:     1.15Gt [00:12, 95.1Mt/s]
documents:  9.69Md [00:12, 798kd/s]
shards:     13.0s  [00:12, 1.07s/s]
