Data Provenance Initiative Errors #89
Comments
@nkandpa2 thanks for flagging! I am working on (2). For (1), what fix are you recommending, or do you just want to document the one you suggest? And for (3), I believe we looked at them at the time and thought they were good. Aviya put me in touch with someone (I'm forgetting who now) and we reviewed them.
For (1), agreed, I think we just document that fix in the README. For (3), if someone knowledgeable about these licenses has signed off on them then that sounds good to me. Looking back through the history for
I have run all the DPI sets. We are going to convert them to dolma format and upload ASAP. Mentioning it here so this work is not duplicated. Coordinating with @shayne-longpre on it.
@conceptofmind I assume you had to fix the code then? Can you make a PR?
Shayne and I are having a call today to get it all sorted out and will open a PR when resolved.
The sets without dolma format are here: https://huggingface.co/datasets/DataProvenanceInitiative/common_pile_subset. The dolma ones are pending our eval, given flan-sni/oig.
And this is with dolma: https://huggingface.co/datasets/DataProvenanceInitiative/common_pile_subset_dolma. It is pending the oig/sni addition but should work for a v0.
PR: #90 Token count:
I tried running the DPI pipeline and encountered a few errors:

1. When running `python download.py --include include.csv`, I get some errors where `DataProvenanceInitiative/Ultra_Permissive_Test` cannot be found on HuggingFace. This is most likely an issue on HF's end and can be avoided by cloning the dataset to the local disk with `git clone [email protected]:datasets/DataProvenanceInitiative/Ultra_Permissive_Test`.
2. Even with `DataProvenanceInitiative/Ultra_Permissive_Test` available locally, running `python download.py --include include.csv --hf <path to local dataset>` fails, since some data sources are present in `include.csv` but are not in `DataProvenanceInitiative/Ultra_Permissive_Test`. Specifically, the `flan_sni`, `xp3x`, and `oig` sources are missing. @shayne-longpre can you advise as to whether these sources should be added to the HF dataset or removed from `include.csv`?
3. The final issue is with running `python to-dolma.py --include include.csv`. This script references the MPL 2.0 and EPL 1.0 licenses, which are not defined as permissive licenses in `common-pile/licensed_pile/licenses.py`. @craffel @blester125 @shayne-longpre these appear to be permissive copyleft licenses and I don't think they have non-commercial clauses. Should they be added to the list of allowed licenses for this project?
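
For (2), one way to see exactly which sources are affected before running the download is to compare the include list against the sources the local dataset actually provides. The following is only a minimal sketch: the `source` column name, the `missing_sources` helper, and the `example_present_source` entry are hypothetical and are not taken from `download.py` or the real `include.csv`, whose schema may differ.

```python
import csv
import io


def missing_sources(include_csv_text, available, column="source"):
    """Return names listed in the include CSV but absent from `available`.

    `column` is an assumed header name; adjust it to the real include.csv.
    Names are returned once each, in the order they first appear.
    """
    reader = csv.DictReader(io.StringIO(include_csv_text))
    seen = set()
    missing = []
    for row in reader:
        name = (row.get(column) or "").strip()
        if name and name not in available and name not in seen:
            seen.add(name)
            missing.append(name)
    return missing


# Hypothetical include list mirroring the three sources reported as missing.
example_csv = "source\nflan_sni\nxp3x\noig\nexample_present_source\n"
# Hypothetical set of sources found in the locally cloned dataset.
example_available = {"example_present_source"}

print(missing_sources(example_csv, example_available))
```

A check like this could run before the download step, so that a mismatch between `include.csv` and the dataset is reported as a list of names rather than as a failure partway through.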