
Data Provenance Initiative Errors #89

Open
nkandpa2 opened this issue Aug 13, 2024 · 8 comments


@nkandpa2
Collaborator

I tried running the DPI pipeline and encountered a few errors:

  1. When running python download.py --include include.csv, I get some errors where DataProvenanceInitiative/Ultra_Permissive_Test cannot be found on HuggingFace. This is most likely an issue on HF's end and can be avoided by cloning the dataset to the local disk with git clone https://huggingface.co/datasets/DataProvenanceInitiative/Ultra_Permissive_Test

  2. Even with DataProvenanceInitiative/Ultra_Permissive_Test available locally, running python download.py --include include.csv --hf <path to local dataset> fails because some data sources listed in include.csv are missing from DataProvenanceInitiative/Ultra_Permissive_Test. Specifically, the flan_sni, xp3x, and oig sources are missing. @shayne-longpre can you advise whether these sources should be added to the HF dataset or removed from include.csv?

  3. The final issue is with running python to-dolma.py --include include.csv. This script references the MPL 2.0 and EPL 1.0 licenses, which are not defined as permissive licenses in common-pile/licensed_pile/licenses.py. @craffel @blester125 @shayne-longpre these appear to be weak copyleft licenses and I don't think they have non-commercial clauses. Should they be added to the list of allowed licenses for this project?
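Putting the workarounds for (1) and (2) together, the local-clone route might look like the sketch below. This is an assumption-laden sketch, not a tested recipe: the https clone URL is a standard HuggingFace dataset remote (the thread quotes an SSH-style URL), and the relative path passed to --hf is a placeholder for wherever the clone lands.

```shell
# Workaround for (1): clone the dataset to local disk instead of
# fetching it from the HF Hub at download time.
git clone https://huggingface.co/datasets/DataProvenanceInitiative/Ultra_Permissive_Test

# Then point download.py at the local copy (path is hypothetical).
# Note: per (2), this still fails until flan_sni, xp3x, and oig are
# reconciled between include.csv and the HF dataset.
python download.py --include include.csv --hf ./Ultra_Permissive_Test
```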

@shayne-longpre
Collaborator

@nkandpa2 thanks for flagging! I am working on (2).

For (1), what fix are you recommending, or do you just want to document the one you suggested? And for (3), I believe at the time we looked at them and thought they were good. Aviya put me in touch with someone (I'm forgetting now) and we reviewed them.

@nkandpa2
Collaborator Author

For (1), agreed, I think we just document that fix in the README.

For (3), if someone knowledgeable about these licenses has signed off on them then that sounds good to me. Looking back through the history for licenses.py, it doesn't seem that anyone explicitly removed these licenses. Rather, these two licenses were never added in the first place. We can just add those two to the PermissiveLicenses object in licenses.py and the to-dolma.py script should run just fine.
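The proposed fix above might look roughly like the sketch below. The member names, string values, and the exact shape of PermissiveLicenses are assumptions; the real entries should be matched against common-pile/licensed_pile/licenses.py before opening a PR.

```python
from enum import Enum


class PermissiveLicenses(Enum):
    """Hypothetical sketch of the PermissiveLicenses enum in licenses.py.

    Existing members are elided; the two entries at the bottom are the
    proposed additions so to-dolma.py can resolve MPL 2.0 and EPL 1.0.
    """

    # ...existing members (illustrative examples only)...
    MIT = "MIT License"
    APACHE_2 = "Apache License 2.0"

    # Proposed additions (member names and values are assumptions):
    MPL_2 = "Mozilla Public License 2.0"
    EPL_1 = "Eclipse Public License 1.0"
```

With these members defined, a lookup such as `PermissiveLicenses.MPL_2` no longer raises, which is all the to-dolma.py mapping should need.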

@conceptofmind
Contributor

conceptofmind commented Aug 15, 2024

I have run all the DPI sets. We are going to convert them to dolma format and upload ASAP, just so this work is not duplicated. Coordinating with @shayne-longpre on it.

@craffel
Collaborator

craffel commented Aug 15, 2024

@conceptofmind I assume you had to fix the code then? Can you make a PR?

@conceptofmind
Contributor

conceptofmind commented Aug 15, 2024

> @conceptofmind I assume you had to fix the code then? Can you make a PR?

Shayne and I are having a call today to get it all sorted out and will open a PR when resolved.

@conceptofmind
Contributor

conceptofmind commented Aug 15, 2024

Sets are here without dolma format: https://huggingface.co/datasets/DataProvenanceInitiative/common_pile_subset

And the dolma-format ones are pending our evaluation of flan_sni/oig.

@conceptofmind
Contributor

conceptofmind commented Aug 15, 2024

And this is with dolma: https://huggingface.co/datasets/DataProvenanceInitiative/common_pile_subset_dolma

Pending the oig/flan_sni addition, but this should work for a v0.

@conceptofmind
Contributor

conceptofmind commented Sep 9, 2024

PR: #90

Token count:

Found 13 files to process
characters: 6.96Gc [00:12, 573Mc/s]
bytes:      7.12Gb [00:12, 586Mb/s]
tokens:     1.15Gt [00:12, 95.1Mt/s]
documents:  9.69Md [00:12, 798kd/s]
shards:     13.0s  [00:12, 1.07s/s]
