feat: Insert history and versions endpoint #33
Conversation
bf11102 to 46b4c8d
This looks really good! I can't see any significant issues!
The biggest thing to me is actually the commit history. I think it'd be useful to start getting into the habit of `git commit --amend` and `git rebase --interactive` so that, as much as possible, the commits tell a narrative-like story. I guess you'll be doing a squash rebase to merge this PR? Of course that's fine, but I think it'd be worth bearing in mind how the commit history reads as you're actually in the process of developing the PR. I'm guessing that with the help of `--amend` and `rebase --interactive` this PR could probably be represented in something like ~5 commits? Well-constructed commits are in themselves documentation: think of how `git blame` can add context for future developers (including yourself!) when doing refactors or bug fixing. And of course expressive, story-telling commits can add another medium of context and communication to help the reviewer.
With such habits it can also be good to try to make sure that every commit passes CI. This adds a layer of confidence to each commit, expressing to future developers the reliability of the given code changes. It can also help with merge conflicts, especially those arising from rebased merges, where each commit's conflicts can be more reliably resolved by running the test suite at each step of the rebase.
src/db/models/publication.rs
Outdated
#[allow(clippy::unwrap_in_result, clippy::unwrap_used)]
fn from_row(row: &AnyRow) -> Result<Self, sqlx::Error> {
    Ok(Self {
        name: row.try_get("name").unwrap(),
Is there a reason not to use `?` instead of `unwrap()`?
Was not sure `sqlx` integrates with anyhow this well, but it worked! Thank you
        date: row.try_get("date").unwrap(),
        stele: row.try_get("stele").unwrap(),
        revoked: row.try_get("revoked").unwrap(),
        last_valid_publication_name: row.try_get("last_valid_publication_name").ok(),
Same here, is `?` not possible?
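As a rough sketch of the suggested change (using a hypothetical in-memory `Row` type in place of sqlx's `AnyRow`, since the real trait needs a live database row): with `?`, a missing column surfaces as an `Err` to the caller, where `unwrap()` would panic.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a database row; the real code reads sqlx's AnyRow.
struct Row(HashMap<String, String>);

impl Row {
    fn try_get(&self, col: &str) -> Result<String, String> {
        self.0
            .get(col)
            .cloned()
            .ok_or_else(|| format!("missing column: {col}"))
    }
}

struct Publication {
    name: String,
    date: String,
}

// With `?` the error propagates to the caller; `unwrap()` would panic instead.
fn from_row(row: &Row) -> Result<Publication, String> {
    Ok(Publication {
        name: row.try_get("name")?,
        date: row.try_get("date")?,
    })
}
```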
src/db/models/publication_version.rs
Outdated
#[allow(clippy::unwrap_in_result, clippy::unwrap_used)]
fn from_row(row: &AnyRow) -> Result<Self, sqlx::Error> {
    Ok(Self {
        version: row.try_get("version").unwrap(),
`?` instead of `unwrap()`?
DatabaseKind::Sqlite => {
    let mut connection = conn.pool.acquire().await?;

    let statement = format!("
Have you seen the `sqlx::query_file_as!` macro? https://github.com/launchbadge/sqlx/tree/main/examples/postgres/files It allows you to keep SQL statements in `.sql` files. I think it's something to bear in mind as SQL statements get bigger. It's obviously not critical, but I think being able to write SQL in actual `.sql` files helps keep SQL sane, because then we get things like editor syntax highlighting and linters like https://github.com/sqlfluff/sqlfluff
Did not see it, that looks awesome. I've got a second PR ready for review hopefully soon, so I'd defer adding separate SQL files until later. Created an issue.
src/history/rdf/graph.rs
Outdated
/// Stelae representation of an RDF graph.
pub struct StelaeGraph {
    /// The underlying graph.
    pub g: FastGraph,
What about `fast_graph` instead of `g`?
src/history/rdf/graph.rs
Outdated
loop {
    let el_uri = format!("http://www.w3.org/1999/02/22-rdf-syntax-ns#_{i}");
    let elem_iri = SimpleTerm::Iri(IriRef::new_unchecked(MownStr::from_str(&el_uri)));
    if self
I think this would be easier to read if it were first put in a variable, like you do with `item` a few lines below.
Thanks, changed it
src/history/rdf/graph.rs
Outdated
pub fn items(&self) -> anyhow::Result<Vec<SimpleTerm>> {
    let container = &self.uri;
    let mut i = 1_u32;
    let mut l_ = vec![];
A more descriptive name might be useful.
Changed
src/server/publish.rs
Outdated
"error: could not connect to database. Confirm that DATABASE_URL env var is set correctly." | ||
); | ||
tracing::error!("Error: {:?}", err); | ||
process::exit(1); |
This is a minor point, but generally we only want one central call to `exit()` in a program. The idea being that we bubble up all errors with `?` and only at some common overarching point in the code do we decide to exit based on the received error. This can get hard with threads and async, etc. But ideally, centralising the `exit()` can help us avoid some unexpected bugs and generally improve architecture. But don't worry about it too much, it's just something to bear in mind.
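A small self-contained sketch of that shape (the `connect`/`run` names are hypothetical, not from this codebase): lower layers only return `Result` and bubble errors with `?`, and a single function owns the decision to call `exit()`.

```rust
use std::process;

// Hypothetical fallible step; fails when no database URL is given.
fn connect(database_url: &str) -> Result<(), String> {
    if database_url.is_empty() {
        return Err("could not connect to database".into());
    }
    Ok(())
}

// Intermediate layers bubble errors up with `?` -- no exit() calls here.
fn run(database_url: &str) -> Result<(), String> {
    connect(database_url)?;
    Ok(())
}

// The single central exit point: everything above only returns Results,
// and only this function (called once, e.g. from main) maps Err to exit(1).
fn exit_on_error(result: Result<(), String>) -> ! {
    match result {
        Ok(()) => process::exit(0),
        Err(err) => {
            eprintln!("Error: {err:?}");
            process::exit(1);
        }
    }
}
```

The benefit is that every error takes the same path out of the program, so logging and exit codes stay consistent.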
Thanks, that makes a lot of sense. I've created an issue for it for now.
src/server/publish.rs
Outdated
@@ -220,7 +251,7 @@ pub fn init_app(
     "Failed to initialize routes for root Stele: {}",
     root.get_qualified_name()
 );
-std::process::exit(1);
+process::exit(1);
Minor point about centralising exit()
src/server/publish.rs
Outdated
@@ -259,7 +290,7 @@ pub fn init_app(
     "Failed to initialize routes for Stele: {}",
     guarded_stele.get_qualified_name()
 );
-std::process::exit(1);
+process::exit(1);
Minor point about centralising exit()
Also, would you be interested in a commit lint? It's really easy to add to CI, add a …
*use sqlx*

Intended for serving historical documents and loading data from Postgres or SQLite.

*add generic connection over database*

- Introduce a `db` module responsible for managing the database
- Add a `connect` function that's generic over SQLite and Postgres

*init database and apply migrations*

Defaults to a local SQLite database if `DATABASE_URL` is missing. After connecting, run any/all migrations that are necessary.

*generate initial empty migration*

To manage migrations, it's first necessary to install `sqlx-cli`: `cargo install sqlx-cli`. Then, `sqlx migrate add <name>` to add a new migration.

*introduce build.rs build script*

Adding new migrations won't trigger cargo recompilation, so sqlx strongly suggests we trigger a recompilation on new migrations.

*add top level db module*

*connect to db on app start*

After adding another field to the `AppState` struct the integration tests started failing. However, for the archive tests we don't really care about having a database connection. To resolve, introduce a trait `GlobalState` that `init_app` implements and initializes the archive. This way we're flexible in our testing framework and can use different structs to initialize the global app state.

*create db.sqlite3 if not exists*

*add postgres directory*

We are not sure whether we'll support Postgres, but add the directory
- Add `sophia` dependency for working with RDF
- Add `history` module
- Use opaque `Any` type for the database

This abstracts away the concrete database pool used by sqlx underneath. The trade-off is that we need the database URL to differentiate between databases.
The purpose of the submodule is to store any/all functions that work with RDF. Currently the underlying implementation relies on `sophia` to parse the graph and retrieve triples. Add a `namespaces.rs` module that stores RDF namespaces; currently it stores the `oll` and `dcterms` ontology. I expected `dcterms` to be supported out-of-the-box, but `sophia` supports only the most frequently used namespaces.

*Add sqlx chrono feature*

Will be necessary to work with dates in the database.

*Add statements module for inserts and queries in db*

The idea being that this module stores query/insert statements for interacting with both databases.

*Add chrono as dependency*

Necessary because chrono supports the serde Serialize and Deserialize traits. We then use these traits to map sqlx queries to structs.

*Add models mod for sqlx structs*

Created a centralized place for all database-related structs. This approach is convenient because the sqlx queries will then serialize/deserialize to these structs using serde.

*Rename from upsert to create*

Much easier to reason about what the function is supposed to do.

*Wrap insert changes into an atomic transaction*
The idea being we iterate over all publications using libgit2, and then for each publication we parse `.rdf` XML files into an RDF graph. After the graph is loaded, we iterate over document and collection versions from the graph and insert each into the db.

*Add more sqlx structs for database*

*Wrap sophia graph with our StelaeGraph*

The goal is to move similar code for working with the RDF graph into a separate wrapper struct. This simplifies working with graphs as it hides away the complexity of RDF graph results.

*Use composite keys instead of autoincrement synthetic keys*

*Split functionalities for working with rdf graph on*

Since the graph queries will vary on the type of information that we need, the easiest is to create utility functions that return (s, p, o) depending on the queries.
This commit has the following significant changes:

*Bulk insert multiple document and collection change objects into db*

The idea being we create a query builder and insert multiple values into an insert statement. On Windows, sqlx throws an error when inserting 4000+ rows at once. To resolve, chunk the rows to 2000 per bulk insert. Add a `BATCH_SIZE` constant, indicating how many rows will get inserted at once. Set to 2000 per insert.

*Upgrade rustc compiler version*

*Add utility methods for working with the `sophia` graph*

RDF elements can be stored in a container, e.g.

<rdf:_1 resource="...">
<rdf:_2 resource="...">

To resolve, introduce a `Bag` struct with an `items` helper method. The idea being we iterate over all elements inside a container. This logic is similar to rdflib's implementation of `items` in Python. To better understand this iteration logic, please see `rdflib/container.py` in the rdflib Python library.

*Add missing sqlx structs to map objects to database rows*

*Revoke same date publications*

Set a boolean `revoked` to true for publications that exist on the same date. Revoked publications aren't accounted for when querying for versions of a publication.

*Support for publication partial update*

Re-use the existing insert function, as most of the code is the same. The only difference is that we skip parsing/loading publications which are already inserted in the db. For publication directories which aren't in the DB, we continue inserting changes.
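The chunking described in the bulk-insert commit, sketched with plain slices (the `ChangeRow` type and `batch_sizes` helper are hypothetical; the real code would drive a sqlx query builder per chunk rather than returning sizes):

```rust
/// Rows per bulk INSERT; the commit notes sqlx erroring out on Windows
/// at 4000+ rows in one statement, so batches are capped at 2000.
const BATCH_SIZE: usize = 2000;

// Hypothetical stand-in for a document/collection change row.
struct ChangeRow;

// Split the rows into batches; the real code would build and execute one
// INSERT statement per chunk instead of collecting the chunk sizes.
fn batch_sizes(rows: &[ChangeRow]) -> Vec<usize> {
    rows.chunks(BATCH_SIZE).map(|chunk| chunk.len()).collect()
}
```

For example, 4500 rows would be split into batches of 2000, 2000, and 500, keeping every statement under the size that triggered the error.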
6445909 to bffdd23
Makes the variables easier to read.
@tombh Thank you, this has been insightful! Agree about the commit history, so I've brought the PR down from 98 to 7 commits. Also a commit linter would be amazing, but I'd say we add that in another PR. I don't think we need to address all of these comments right now, since I have another PR ready for review hopefully soon. I've created a couple of issues [1, 2, 3].
Just a couple of `.ok()`s to look at. Otherwise it looks good!
        stele: row.try_get("stele")?,
        revoked: row.try_get("revoked")?,
        last_valid_publication_name: row.try_get("last_valid_publication_name").ok(),
        last_valid_version: row.try_get("last_valid_version").ok(),
The `.ok()`s can be `?` as well, right?
Looks like sqlx can't map `?` to `Option` in the `FromRow` trait by default. So it can't be `?` right now :-(
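A minimal illustration of the distinction (again with a hypothetical in-memory `Row` type rather than sqlx's): `?` fails the whole conversion when a column is absent, while `.ok()` folds the error into `None`, which is the behaviour a nullable column needs.

```rust
use std::collections::HashMap;

// Hypothetical row type; the real code reads from sqlx's AnyRow.
struct Row(HashMap<String, String>);

impl Row {
    fn try_get(&self, col: &str) -> Result<String, String> {
        self.0
            .get(col)
            .cloned()
            .ok_or_else(|| format!("missing column: {col}"))
    }
}

struct PublicationVersion {
    version: String,
    // Nullable column: absence is expected data, not an error.
    build_reason: Option<String>,
}

fn from_row(row: &Row) -> Result<PublicationVersion, String> {
    Ok(PublicationVersion {
        version: row.try_get("version")?,               // required: bail with `?`
        build_reason: row.try_get("build_reason").ok(), // optional: Err -> None
    })
}
```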
        version: row.try_get("version")?,
        publication: row.try_get("publication")?,
        stele: row.try_get("stele")?,
        build_reason: row.try_get("build_reason").ok(),
`.ok()` to `?`
src/db/statements/queries.rs
Outdated
.bind(name)
.fetch_one(&mut *connection)
.await
.ok()
`.ok()` to `?`
.bind(stele)
.fetch_one(&mut *connection)
.await
.ok()
`.ok()` to `?`
- Introduce a `routing` module, intended as a central place to map all routing in the stelae app. Introduced the `_api` scope with the versions endpoint
- feat(api): initial versions endpoint
- feat(app): call register_api routing function
- feat(api): extract stele name from request
This commit introduces a new endpoint in stelae to get all the meaningful dates on which a document or a collection changed. Below is a summary of significant changes:

- Add a new `api` module as a centralized place to look at all the endpoints stelae has.
- refact: rename publish to app
- feat(db): add Manager traits for models. The reasoning being it's easier to visually figure out which database queries work with which data structure.
- feat(db): add version struct
- feat(db): implement document and collection version queries. The idea being we query for all the versions in which a document or a collection changed.
- feat(api): get all versions for a publication. Executes the database query for document or library versions.
- feat(api): map request/response body to structs. The idea being it's more legible to reason about the structure of the API. Rely on serde to serialize/deserialize values from and into JSON.
- Introduce the `requests` and `response` modules for mapping the endpoint response to historical data for presentation
- feat(api): support for historical messages. In addition to the versions, we also display HTML messages that: i) show which historical publication the user is currently on, and ii) show through which dates the currently viewed version was valid (start and end date). If comparing versions, we show how many changes happened between the compare and version dates.
- fix(api): assign Current value to variable after historical messages
- refact: separate formatting logic into associated function `format_display_date`
- lots of clippy lints
For the API, move the `versions.rs` into `versions/mod.rs`. The two modules for request and response are now in `request/mod.rs` and `response/mod.rs`. Historical messages are in the `response/messages.rs` module
These modules should make the code base more legible. These changes include:

- Moving the `serve` current-documents endpoint into the `server/api/serve` module. This way we'll centralize all future endpoint logic in the `api` module
- Centralize routing into the `routes` mod. We expect to add new 'static' routing, while also keeping our dynamic logic in the same place.
- Centralize global app state into the `state` mod.
- Small updates following the re-org of the current codebase

refact: rename structs in `state.rs` module to fix linting issues

Address clippy lint warnings/pedantic errors related to generics in function arguments
Add `initialize_guarded_dynamic_routes` and `initialize_dynamic_routes` functions which, depending on whether the guarded header exists, initialize the app with or without the guarded header.
For queries which we want to exit out of early, use `?`. However, I've left `.ok()` for queries which should continue executing (i.e. where a null value is expected), instead of propagating an error to a caller that doesn't expect one.
Thanks for the rebase PR! I've now rebased your #40 into this PR and addressed comments from the other PR and most of the …
Closes #17, #24
This PR is the first-pass implementation for inserting data from an RDF git repo into the database. We refer to the type of data in RDF that we work with as changes, versions or historical information about documents or collections. We refer to documents as legal material in a stele. Documents can get organized/categorized into (sub)directories, which we call collections. In addition to tracking changes to documents, we also track changes to collections of documents.
Rationale
The end goal is to implement an API that lists all the significant dates (that we refer to also as versions) that affected legal documents or collections for a stele across history.
Notes
To support the end goal, this PR adds the following. For database access we used `sqlx`; for parsing/working with the RDF, we used `sophia`.
Overview
- `db` module that's supposed to be generic over database connections. Currently supports connecting and inserting changes into a SQLite database with `sqlx`. There's also Postgres, because we left open the possibility to easily extend into Postgres. Postgres support is still TBD. If we do decide that only SQLite is the way to go, we'll remove the Postgres code in a separate PR.
- `changes` module parses the changes into a graph and inserts the data into the db. The `rdf` module has utility methods meant to reduce boilerplate and make working with the graph easier.
- `migrations` directory has the initial database schema/tables.
- `models` contains sqlx structs that map to database tables.
- `statements` module has SQL `inserts` and `queries` modules for SQL statements.

Introduce the `/_api/versions` endpoint. The idea being we query the database and, depending on the URL request, return either all document or all collection changes. The changes are mapped onto a `response::Versions` JSON response.

To implement in a separate PR

- Manager traits for `models`. Something like a `PublicationManager`, for example. Using traits could make the `inserts` and `queries` more legible.

We'll need to update the insert changes logic again, so I vote to defer this and this issue at that point.