US Government Publishing Office #64
The usgpo branch has some initial code for collecting this data. The main "collections" containing text files are the following:
There are other collections but the data in these are mostly PDFs. If we have a good way of extracting text from these we can consider the other collections as well. I've run the code against data from 2023-01-01 to current day and found 17K documents with 300M tokens. If we go with a larger date range like all documents since 2000, this extrapolates out to about 5B tokens. Could be more depending on our appetite for going further back in time. TODO
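The date-range harvesting described above can be sketched against govinfo's listing API. This is a minimal sketch, assuming the `/published/{start}/{end}` endpoint shape and `nextPage`/`packages` response keys of api.govinfo.gov as publicly documented; `GOVINFO_API_KEY` is a hypothetical environment variable for an api.data.gov key, and details should be verified against the live API.

```python
"""Sketch: list govinfo packages for one collection over a date range."""
import json
import os
import urllib.request

BASE = "https://api.govinfo.gov"

def published_url(start_date: str, end_date: str, collection: str,
                  api_key: str, page_size: int = 100) -> str:
    """Build the /published listing URL for one collection and date range."""
    return (f"{BASE}/published/{start_date}/{end_date}"
            f"?collection={collection}&pageSize={page_size}"
            f"&offsetMark=*&api_key={api_key}")

def iter_packages(start_date, end_date, collection, api_key):
    """Yield package metadata dicts, following nextPage pagination."""
    url = published_url(start_date, end_date, collection, api_key)
    while url:
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = json.load(resp)
        yield from data.get("packages", [])
        # Assumed: responses carry an absolute nextPage URL, or None when done.
        url = data.get("nextPage")

if __name__ == "__main__":
    key = os.environ["GOVINFO_API_KEY"]  # hypothetical env var name
    n = sum(1 for _ in iter_packages("2023-01-01", "2023-01-31", "CHRG", key))
    print(f"{n} hearing packages listed for January 2023")
```

Extrapolating token counts from a one-month sample like this is how the 2023-to-present estimate above would scale out to a since-2000 crawl.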
Did you try the USGPO Gov.info bulk data service I mentioned in this issue? Might be less work to download and process.
I had briefly looked into this, but based on the file names and modification dates of the bulk data, it seems to be a subset of what's actually published by USGPO. It's probably a good idea to check whether there's anything in the bulk data that I missed scraping, as that would be easy to incorporate.
Is this ready for a PR?
@alon-albalak are you working on this? This was one of the two examples of high-priority sources I sent you last week.
Not yet. I'm currently at ICLR; I will get in touch with @nkandpa2 next week to see what still needs to be done!
Like I mentioned in our recent call, I discovered that the "other collections" you mention, @nkandpa2, do in fact have massive plain text files available, but they are hidden beneath a second API layer called granules. I found this layer while looking at the Congressional Hearings collection, because I was considering transcribing the hearing recordings and wanted to check how automated transcripts differ from the official ones. So let's take the hearings as an example. Like other govinfo collections, the hearings are divided by the API into packages. However, there is a difference between serial and non-serial collections in the download links returned by the package summary endpoint. Here is the JSON response for the package
The JSON response above contains direct download links.
However, as you can see, the serial package JSON contains an additional key linking to the package's granules.
Packages from the hearings collection usually contain only one granule per package, with the granule ID being identical to the package ID, but other serial collections, like the Federal Register, can contain many granules per package.
The single hearing transcript is then available through the granule summary's download links.
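The package → granules → granule-summary walk described above can be sketched as follows. This is a hedged sketch, not a confirmed implementation: the `/packages/{id}/granules` and `/granules/{id}/summary` paths, and the `download.txtLink` key, are assumptions based on this thread's description of the api.govinfo.gov granules layer, and the key names should be checked against a real response.

```python
"""Sketch: resolve plain text hidden behind the granules API layer."""
import json
import urllib.request

BASE = "https://api.govinfo.gov"

def granules_url(package_id: str, api_key: str, page_size: int = 100) -> str:
    """List the granules nested inside a serial package."""
    return (f"{BASE}/packages/{package_id}/granules"
            f"?pageSize={page_size}&offsetMark=*&api_key={api_key}")

def granule_summary_url(package_id: str, granule_id: str, api_key: str) -> str:
    """Summary for one granule, expected to carry its download links."""
    return (f"{BASE}/packages/{package_id}/granules/{granule_id}/summary"
            f"?api_key={api_key}")

def _get_json(url: str):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def fetch_granule_texts(package_id: str, api_key: str):
    """Yield (granule_id, text) for each granule advertising a plain-text link."""
    listing = _get_json(granules_url(package_id, api_key))
    for g in listing.get("granules", []):
        summary = _get_json(granule_summary_url(package_id, g["granuleId"], api_key))
        # "download"/"txtLink" are assumed key names for the text download.
        txt_link = summary.get("download", {}).get("txtLink")
        if txt_link:
            with urllib.request.urlopen(f"{txt_link}?api_key={api_key}",
                                        timeout=60) as resp:
                yield g["granuleId"], resp.read().decode("utf-8", "replace")
```

For hearings this loop would usually yield a single granule per package, while serial collections like the Federal Register would yield many, which is why a crawler has to descend into granules rather than stopping at the package summary.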
US GPO is the agency responsible for publishing documents authored by the US federal government (which are therefore public domain), and it provides an API for accessing these documents and their associated metadata.