Wiki Data #82
Comments
The wikiarchive wikis are chugging along; I think the main bottleneck is I/O contention, so there isn't much we can do to speed it up. The wikiarchive wikis don't tend to include the talk pages, which is another reason we will want to re-scrape in v2. The wikimedia data I have processed (wikipedia, wiktionary, etc.) didn't have talk pages either; I'm updating the code, redownloading, and reprocessing to include the talk pages (I'll post some stats about how much more data that is when done).

wikitext parsing updates

wtf_wikipedia seems to work well. One place it leaves weird artifacts is images with really long captions (which could be caused by newlines), where it leaves the "File:...|caption" text in the output.

I've also been working on handling math parsing. wtf_wikipedia is pretty inconsistent in how it handles math: it handles simple stuff well but strips out anything more complex. When processing 1 shard of the wikipedia dataset, 49% of the …

My current approach is to:
Some mathlike templates (for example …) …

An open question is the handling of unicode symbols. Lots of math articles have symbols like π or θ directly. In a latex conversion, would we want to convert these to their LaTeX equivalents? I might convert all the dumps with just wtf_wikipedia as a v0 to get approximate token counts while I work on refining the math handling.

The end of the wikipedia page is generally a long list of references, links to related pages, etc. wtf_wikipedia converts these to a more plaintext format, but it is still pretty long and unnatural. I'm thinking of looking into removing these final sections.
Nice! I dug out my old code and notes on wtf_wikipedia processing. Apart from the math stuff you raised, I also found that it would dump out stuff like …
Maybe that stuff has changed by now though. Regarding standalone symbols, I think it's appropriate to leave them as unicode rather than try to convert them to LaTeX. I think when I was messing with wtf_wikipedia before I also just had code that manually stripped out the references section. If it's not natural text, let's just remove it.
I've seen some of the image thing and am looking into how to fix that. I haven't seen the In:/Out: thing yet. I've updated stuff to remove the references sections (also things like external links, see also, etc.). I found an issue that comes up in how editors tend to style math: basically the indentation is handled wrong, and text from below an equation can end up before it. I opened an issue, spencermountain/wtf_wikipedia#577, but I don't know JS/wtf_wikipedia well enough to solve it myself right now. My current workaround is to add newlines between …
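A minimal sketch of that kind of workaround, assuming it means padding the ":"-indented <math> blocks in the raw wikitext with blank lines before parsing (the helper name and the pattern here are assumptions, not the actual fix):

```python
import re

# Assumption: the workaround pads ":"-indented <math> blocks with blank lines
# so the parser no longer folds the text below an equation in front of it.
INDENTED_MATH = re.compile(r"(?m)^(:+\s*<math\b.*?</math>)", re.DOTALL)

def pad_math_blocks(wikitext: str) -> str:
    """Surround each indented math block with blank lines before parsing."""
    return INDENTED_MATH.sub(r"\n\1\n", wikitext)
```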
hey! just stumbled on this, and am happy to help with any issues - it's really useful to find examples where things break. There is a max-length for images - happy to increase this, if there are known issues. I've been blessed to never learn latex, and this is probably the cause of the shy parsing. Happy to tweak a regex so more things pass through. I always assumed …
Hi Spencer, I think the issue with images is just that in our use case we don't want the "File:V Train AC.JPG|Class 310" text. I can see why that would be reasonable to include otherwise though.
I have some math template munging that seems good enough for now. The general approach is to use a regex to find where to start editing, then iterate forward in the text to find the end of the scope; then you loop over a bunch of these edits (a rough sketch is included below). Only a few cases support nesting of the same template (mset and abs are the main ones that needed it), but it does support nesting of different templates (you can have an overline template inside of a strong template, for example). Examples:
I extracted all the …

There are also a few cases where the output for … There are some cases that I don't handle at the moment; for example, one template is … There are also a few symbols that get stripped out of …
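A rough sketch of that find-the-start-then-scan-forward approach (the template names and helpers are assumptions, and it just unwraps a template's argument rather than reproducing the real per-template rewrites):

```python
import re

# Hypothetical sketch: a regex finds where a template opens, a forward scan
# finds where that template's scope ends (counting nested "{{"/"}}" pairs),
# and the rewrites are applied one after another.
TEMPLATE_START = re.compile(r"\{\{\s*(?:abs|mset|overline|strong)\s*\|", re.IGNORECASE)

def find_scope_end(text: str, start: int) -> int:
    """Return the index just past the '}}' that closes the template at `start`."""
    depth, i = 0, start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth, i = depth + 1, i + 2
        elif pair == "}}":
            depth, i = depth - 1, i + 2
            if depth == 0:
                return i
        else:
            i += 1
    return len(text)  # unterminated template: the scope runs to the end

def munge_templates(text: str) -> str:
    """Rewrite matched templates; different templates may nest inside each other."""
    while True:
        m = TEMPLATE_START.search(text)
        if m is None:
            return text
        end = find_scope_end(text, m.start())
        inner = munge_templates(text[m.end():max(end - 2, m.end())])
        text = text[:m.start()] + inner + text[end:]
```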
This seems reasonable for v1!
I've been fixing a bug where a malformed template stops processing of all the other templates later in the document. Now the malformed one is removed and the rest is processed correctly. Working on running the preprocessor on all the data next.
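The shape of that fix, as a hedged sketch (the helper names are hypothetical): each template is rewritten independently, and one that fails to parse is dropped instead of aborting everything after it.

```python
# Hypothetical helpers: find_next_template returns a (start, end) span or None,
# rewrite_template raises ValueError on malformed input.
def munge_all(text, find_next_template, rewrite_template):
    out, pos = [], 0
    while True:
        span = find_next_template(text, pos)
        if span is None:
            out.append(text[pos:])
            return "".join(out)
        start, end = span
        out.append(text[pos:start])
        try:
            out.append(rewrite_template(text[start:end]))
        except ValueError:
            pass  # malformed template: remove it and keep going
        pos = end
```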
I'm finished processing the MediaWiki wikis, and am working on the wikiteam archives next (delayed by having to move data around). The data so far is here: https://huggingface.co/datasets/blester125/wiki. The MediaWikis (which include talk pages) have ~14 billion tokens. Some stats: …
Nice. It seems like there's a lot of markup in the …
🤔 which shard/example is this?
It looks like huggingface was picking up the uncleaned files. I uploaded a metadata file that restricts the dataset viewer to the `.../v0/documents/*jsonl.gz` files; now it seems to be showing the cleaned versions.
It doesn't seem to be updated for me. Here's the first entry in the dataset viewer: …
It looks like wtf_wikipedia only removed some of the in-line html/css. The original version has a lot more markup: …
Got it. In that case we may want to do a simple HTML removal pass (via bs4 or whatever). Also, is it an artifact of the dataset viewer that there are no newlines?
Yeah it must be, the actual data has newlines in it.
I tried this out; it seems a bit non-trivial, and bs4 isn't well suited to removing html fragments from the text. The main issues are: …

It also doesn't seem very consistent. For example, the div in the example above gets removed, but this div in another example doesn't: …

Not sure what the difference is. It isn't something like the divs get closed in one example but not the other; they are not closed in either example.
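For reference, a minimal sketch of the kind of bs4 pass being discussed (an assumed shape, not the actual code); html.parser tolerates unclosed tags, but it also treats anything that merely looks like a tag, such as the <int> in example code, as markup and drops it:

```python
from bs4 import BeautifulSoup

def strip_html_fragments(text: str) -> str:
    """Parse the document text leniently and return it with tags removed."""
    # html.parser is forgiving about unclosed <div>/<font> fragments, but it
    # also swallows tag-like tokens in example code (e.g. "vector<int>").
    return BeautifulSoup(text, "html.parser").get_text()
```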
I'm less worried about the example code getting stripped out (a small price to pay to make the rest of the text much more "natural"). I'm surprised the parser is brittle to things that "look like" an HTML tag, that's too bad - I would also guess that's parser-dependent though? And it's bizarre that one of the divs is removed and not the other. I don't want you to go down a rabbit hole, so I will ask around a little.
Poking around a little more, I'm mostly seeing unclosed tags. Separately, there are many very short pages (many of which are almost empty user talk pages). Probably worth doing some heuristic filtering to remove them (though this could be done before training, not to the dataset itself).
Are most of these unclosed tags coming from any particular source/namespace/shard?
Poking around a little I see them on wikinews, wikibooks, wikiversity, etc... and apparently across shards.
To be conservative, I only stripped out div and font tags to start with (see the sketch below). There were 115,201 tags that got stripped from one shard. There didn't seem to be any false positives (i.e. real-looking … that gets removed), but a few have a …

I'll look through it later, but if you noticed any other tag types that were common, let me know. I won't be able to actually run the processing until later.
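A sketch of the kind of pattern that conservative pass could use (the exact regex is an assumption):

```python
import re

# Only remove <div ...>, </div>, <font ...>, and </font> tags, leaving the
# text between them untouched. (Assumed form; not the original pattern.)
DIV_FONT_TAG = re.compile(r"</?\s*(?:div|font)\b[^>]*>", re.IGNORECASE)

def strip_div_font(text: str) -> str:
    return DIV_FONT_TAG.sub("", text)
```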
Thanks. If we want to be more thorough, we could write a little script that searches for HTML tags and dumps out the most common of them (if there are 115k div/font tags in one shard, they are probably overrepresented compared to tags that would appear otherwise in example code).
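Something along these lines would do it; the shard format (gzipped jsonl with a "text" field) is an assumption:

```python
import collections
import gzip
import json
import re
import sys

# Count everything that looks like an HTML tag in a .jsonl.gz shard and print
# the most common tag names.
TAG = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")

def tag_counts(shard_path: str) -> collections.Counter:
    counts = collections.Counter()
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            text = json.loads(line).get("text", "")
            counts.update(m.group(1).lower() for m in TAG.finditer(text))
    return counts

if __name__ == "__main__":
    for tag, n in tag_counts(sys.argv[1]).most_common(30):
        print(f"{n:>10}  {tag}")
```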
This issue tracks all wiki processing, as the different sources have been unified. Will be closing #7 and #1.
There are three sources of wiki data we will be using: archives, dumps, and scrapes.
Sources to create a list of wikis:
Data Processing Pipeline:
Data Collection
Archive Sources:
- The wikiteam collection. When you allow the wikicollections tag too it jumps to ~4 million wikis.
- Work is split across workers with --worker_id and --num_workers …
- The metadata.identifier field is used as the key, but our actual "source" field will use the domain name.
- The three source names: wiki/archive, wiki/dump, and wiki/scrape.
- Moving from ${target_dir}/${id} to ${target_dir}/${id}[:n]/${id}[n:2n]/.../${id} (see the sketch after this list). Things like os.path.exists, used to check for data that has already been processed, are currently really slow (~40 minutes to check whether all ~325,000 exist on disk); this should help speed things up.
- The "text" field has wikitext in it.
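A sketch of that nested layout (the prefix width and depth used here are assumptions):

```python
import os

# Spread the ~325,000 wiki ids over nested prefix directories so existence
# checks don't hit one giant flat directory. Prefix width n and depth are
# assumptions; e.g. nested_path("wiki/archive", "abcdwiki") ->
# "wiki/archive/ab/cd/abcdwiki".
def nested_path(target_dir: str, wiki_id: str, n: int = 2, depth: int = 2) -> str:
    parts = [wiki_id[i * n:(i + 1) * n] for i in range(depth)]
    return os.path.join(target_dir, *parts, wiki_id)
```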
Dump Sources:
- The "text" field has wikitext in it.
- Currently unclear whether a dump grabbed based on IA metadata should be saved into the same area as the wikimedia dumps, where processing is based on what lives in the dirs, or whether it should stay in the IA area, where the metadata file dictates what is processed.
Scrape Sources:
- Wikis with the wikicollection tag in addition to the wikiteam tag …

WikiText parsing:
Now that everything is in a unified sharded format, actual parsing of wikitext is essentially infinitely horizontally scalable.
- Convert the "text" field to plain text.
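A sketch of how that horizontal scaling could look, reusing the --worker_id/--num_workers style of sharding mentioned for the downloads (the round-robin striping and the glob are assumptions):

```python
import argparse
import glob

# Assumption: shards are striped round-robin across workers using the
# --worker_id/--num_workers flags mentioned above; the glob is made up.
def my_shards(pattern, worker_id, num_workers):
    shards = sorted(glob.glob(pattern))
    return [s for i, s in enumerate(shards) if i % num_workers == worker_id]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--worker_id", type=int, default=0)
    parser.add_argument("--num_workers", type=int, default=1)
    parser.add_argument("--shard_glob", default="data/*/*.jsonl.gz")
    args = parser.parse_args()
    for shard in my_shards(args.shard_glob, args.worker_id, args.num_workers):
        print(shard)  # each worker hands its shards to the wikitext -> plain text step
```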