Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise details handling for GraphQL #3134

Merged
merged 6 commits into from
Feb 19, 2025
Merged

Conversation

brucebolt
Copy link
Member

@brucebolt brucebolt commented Feb 11, 2025

This makes a number of performance improvements to processing of details for GraphQL queries. When we did profiling, we found DetailsPresenter is home to some of the most invoked and slowest methods within our own code when serving GraphQL responses.

The change in response time is only marginal for the document types we have already migrated to GraphQL, but has potential for larger improvements when we migrate documents containing a large amount of data in details, particularly those with extensive nesting.

Change 1: Only parse details that are requested

At the moment, we are putting the entire contents of the details field through the DetailsPresenter, which is sub-optimal as we are parsing values that clients are not requesting.

By using lookaheads, we can determine which fields from details have been requested and only parse those.

In a previous prototype, we had only transformed the govspeak in the body field. However some schemas permit mixed govspeak/HTML content in other details fields, or even nested within other fields. We therefore need to recursively parse all items within details, as is already done using the DetailsPresenter, to ensure no raw govspeak to presented to the client.

Note: the code here could've been simpler (i.e. not duplicate the object, just slice the details hash) but the ContentEmbedPresenter requires a full Edition object.

Change 2: Do not loop through details unless embed links exist

At the moment, we are looping through each item in details to look for any embedded content references. This is sub-optimal as we are searching through content for a tag that we know won't be there, if there is nothing to embed in the document.

We should skip doing this if there is nothing that could possibly be embedded, which we know by looking in the document's links.

Change 3: Reduce lines of code executed in DetailsPresenter

This presenter currently performs some of the same code in different methods that determine the type of content in the field, and looks for content types even when we know there won't be any present.

Therefore refactoring this presenter to perform as little code execution as possible to order to determine the correct outcome for the content type given.

Change 4: Remove useless memoization

Everywhere the details method is being called is not re-computing the response, therefore the memoization is pointless.

Removing it here, to avoid the unnecessary compute needed to check whether the instance variable has already been set.

Trello card

@brucebolt brucebolt force-pushed the optimise-details-graphql branch from 1d713b1 to e939446 Compare February 11, 2025 17:09
@brucebolt brucebolt changed the title Only parse details fields that are requested Optimise details handling for GraphQL Feb 11, 2025
@brucebolt brucebolt force-pushed the optimise-details-graphql branch 12 times, most recently from d10f907 to 71c2777 Compare February 17, 2025 13:01
@brucebolt brucebolt marked this pull request as ready for review February 17, 2025 14:36
@brucebolt brucebolt force-pushed the optimise-details-graphql branch from 71c2777 to 69ceda5 Compare February 17, 2025 14:47
Copy link
Member

@yndajas yndajas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - no particularly significant comments

@@ -108,25 +108,6 @@
it { is_expected.to match(expected_result) }
end

context "when we're passed hashes rather than arrays" do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: is there no code change required to restrict this from happening? (Given the tests pass, presumably it works, even if it goes against the ADR?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see any schemas with this format currently in use, nor any suggestion we've ever used a hash that isn't inside an array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more meaning do we need to make any code changes in the presenter itself to restrict this from being introduced?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I don't think the presenter would be the right place to restrict this, since the content has already been accepted by Publishing API at this point. Looking in WellFormedContentTypesValidator and the associated tests, there doesn't seem to be a case where we ever intended for well-formed content types to not be wrapped in an array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added an additional check into that validator and a test to ensure we don't get content like this published in the future.

Comment on lines +34 to +32
if obj.is_a?(Array) && obj.all?(Hash) && (parsed_obj = parsed_content(obj))
parsed_obj
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what's the reason for putting the assignment in the conditional expression rather than directly in the return value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a possibility parsed_content can return nil, in which case we'll then proceed onto the next elsif.

I've done the assignment in the conditional expression so we don't need to evaluate parsed_content unless both the first two conditions are true.

Comment on lines -14 to +17
@details ||=
begin
updated = content_embed(content_item_details).presence || content_item_details
updated = recursively_transform_govspeak(updated)
updated[:change_history] = change_history if change_history.present?
updated
end
updated = content_embed(content_item_details).presence || content_item_details
updated = recursively_transform_govspeak(updated)
updated[:change_history] = change_history if change_history.present?
updated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: did you just do a search of the codebase to work out that nowhere is calling this method twice? Is the check for an existing instance variable non-negligible? Given this is a public method it feels plausible that future code could call it twice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. DetailsPresenter is only being used in three places and the details method is immediately called (and only once) in each place. I can't really think of any reason why we'd need to call it multiple times in this application.

Memoization probably uses only minimal compute here and won't have any significant effect on the overall response time, but does seem a little pointless.

@brucebolt brucebolt force-pushed the optimise-details-graphql branch 2 times, most recently from 1701e5c to fd49232 Compare February 18, 2025 17:22
In the `EditionType` specs, we are currently not requesting any fields
from `details`. This works at the moment, since the `EditionType`
processes the entire content of the `details` hash before filtering out
those selected by the client.

In a future commit, we will only process those the client has selected.
Therefore updating these tests to specify which fields should be
obtained from `details`.

The resolution context has been removed from the `includes` line, since
that version of `run_graphql_field` does not support `lookahead` as an
argument [1].

1: https://github.com/rmosolgo/graphql-ruby/blob/ec47e903e619923715de225c36fafeb28620a91e/lib/graphql/testing/helpers.rb#L125
At the moment, we are putting the entire contents of the `details` field
through the `DetailsPresenter`, which is sub-optimal as we are parsing
values that clients are not requesting.

By using lookaheads, we can determine which fields from `details` have
been requested and only parse those.

In a previous prototype, we had only transformed the govspeak in the
`body` field. However the schemas can permit mixed govspeak/HTML content
in any details field. We therefore need to parse all items within
`details`, as is already done using the `DetailsPresenter`.

This also allows removal of the optimisations for `change_history` added
in 2caf00e since we can now filter the
details to only include that field when requested.
At the moment, we are looping through each item in `details` to look for
any embedded content references. This is sub-optimal as we are searching
through content for a tag that we know won't be there, if there is
nothing to embed in the document.

We should skip doing this if there is nothing that could possibly be
embedded, which we know by looking in the document's links.
There is no schema that permits this format for specifying the content
is govspeak (all schemas require this in an array), so the test can be
removed.

This behaviour is confired by ADR-003 [1] and the definition of
`multiple_content_types` in the shared Publishing API base schema
definition.

Removing this test (and support for multi-part content to not be
included in an array)  allows us to simplify the `DetailsPresenter`, as
we will only ever be dealing with strings or arrays of hashes.

Also updating a validator to ensure no content with this incorrect
syntax can be published in the future.

1: https://github.com/alphagov/publishing-api/blob/main/docs/arch/adr-003-representation-for-multiple-content-types.md
This presenter currently performs some of the same code in different
methods that determine the type of content in the field, and looks for
content types even when we know there won't be any present.

Therefore refactoring this presenter to perform as little code execution
as possible to order to determine the correct outcome for the content
type given.

This performance improvement is particularly important for GraphQL, as
we will be converting govspeak to HTML at render-time.
Everywhere the `details` method is being called is not re-computing the
response, therefore the memoization is pointless.

Removing it here, to avoid the unnecessary compute needed to check
whether the instance variable has already been set.
@brucebolt brucebolt force-pushed the optimise-details-graphql branch from fd49232 to a89216a Compare February 19, 2025 08:53
@brucebolt brucebolt merged commit 41084aa into main Feb 19, 2025
12 checks passed
@brucebolt brucebolt deleted the optimise-details-graphql branch February 19, 2025 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants