
[ActiveRecord] Add Attributes for Async Queries #1219

Open

arielvalentin opened this issue Oct 31, 2024 · 7 comments · May be fixed by #1351
@arielvalentin (Collaborator)

ActiveRecord 7.1 introduced async query methods¹, which leverage concurrency primitives via a Promise API to defer query execution.

We would like to enrich ActiveRecord-related spans to indicate whether or not they were executed from an async context.

Additional Notes

The sql.active_record notification payload includes an async attribute:
https://github.com/rails/rails/blob/616d3a7675c18ae2d38d3116de47e2e7308cfbd8/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb#L1135

We may be able to enrich the ActiveRecord span and record it as a shared DB attribute for the driver itself; however, we do not currently use notifications in this instrumentation.
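
As a hedged illustration of the notification route (the subscription placement and attribute name below are assumptions, not an agreed design):

```ruby
require 'active_support/notifications'
# Assumes the opentelemetry-api gem is already loaded, as it would be inside
# this instrumentation.

# Hypothetical sketch: copy the payload's :async flag onto the current span.
# Caveat: for a truly async query this block fires on the executor's thread,
# so Trace.current_span may not be the ActiveRecord span we want to enrich;
# that is exactly the difficulty discussed further down this thread.
ActiveSupport::Notifications.subscribe('sql.active_record') do |event|
  span = OpenTelemetry::Trace.current_span
  span.set_attribute('rails.active_record.async', event.payload[:async] == true)
end
```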

Worst case, we monkey-patch the existing async methods and concurrency primitives.

Footnotes

  1. https://github.com/rails/rails/pull/44446

@arielvalentin added the feature and instrumentation labels on Oct 31, 2024
@bensheldon (Contributor)

Thank you for opening this! 🙇🏻

It would also be great if we could include the lock_wait attribute, which represents how much time is spent in the foreground waiting for the query to complete (for a truly async query, it would be 0.0ms):

https://github.com/rails/rails/blob/df9912116b29423b5bde3771b9829447a07d3f6c/activerecord/lib/active_record/log_subscriber.rb#L23-L24
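
A minimal, hedged fragment of what recording that could look like (the attribute name is illustrative; payload[:lock_wait] is the value Rails' log subscriber renders in milliseconds):

```ruby
require 'active_support/notifications'

# Hypothetical sketch: guard on nil, since :lock_wait is absent or nil for
# synchronous queries. 'db.lock_wait_ms' is an invented attribute name.
ActiveSupport::Notifications.subscribe('sql.active_record') do |event|
  lock_wait = event.payload[:lock_wait]
  OpenTelemetry::Trace.current_span.set_attribute('db.lock_wait_ms', lock_wait) if lock_wait
end
```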

github-actions bot commented Dec 1, 2024

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

github-actions bot added the stale label on Dec 1, 2024
@arielvalentin added the keep label and removed the stale label on Dec 1, 2024
@zvkemp commented Dec 13, 2024

I can work on this.

@kaylareopelle (Contributor)

Thanks, @zvkemp! I've assigned this issue to you.

@zvkemp commented Jan 15, 2025

> We would like to enrich ActiveRecord-related spans to indicate whether or not they were executed from an async context

@arielvalentin I have a few clarifying questions on this.

The current ActiveRecord::Instrumentation patches _query_by_sql, which happens before any decisions about asynchrony are made (the caller may request that the query be async, but can't require it). If async: true is passed and the other conditions are met, the query asks to be added as a scheduled task to a thread pool (either global or per-connection, depending on ActiveRecord's config). The thread pool may not be running, or the queue may be full, in which case the fallback action is executed (running the task on the calling thread, so not async). All of the above happens within the existing User query span (let's call this SpanA). If the task scheduling succeeded, SpanA measures the time it took to build and enqueue the query; if not, it measures the entire query execution.
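
For orientation, a condensed sketch of that patch shape (module wiring and the span name are simplified; this is not the verbatim instrumentation code):

```ruby
# Condensed sketch of the Querying patch described above. With async: true and
# a free worker, `super` merely enqueues the task, so this span (SpanA)
# measures scheduling time rather than query execution time.
module QueryingPatchSketch
  def _query_by_sql(...)
    OpenTelemetry.tracer_provider
                 .tracer('active_record-sketch')
                 .in_span('ActiveRecord _query_by_sql (SpanA)') do
      super
    end
  end
end

# Illustrative wiring only; _query_by_sql is a class-level method.
# ActiveRecord::Base.singleton_class.prepend(QueryingPatchSketch)
```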

The second wrinkle is in retrieving the result from the promise — even if the task is successfully scheduled, if it still hasn't been picked up by the time the original thread requests the result, the query is executed synchronously. At this point SpanA is finished (so it's not permitted to change its name or other attributes), so the execution should be wrapped in a new span. If the task has started but hasn't returned a result yet, the caller thread waits until it can acquire the task's mutex (indicating the thread pool is done with it), and records the lock_wait time.
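
A hedged sketch of wrapping that late, synchronous path (ActiveRecord::FutureResult#execute_or_wait is a private internal, so both the patch target and the span name are assumptions):

```ruby
# Hypothetical sketch: give the fallback execution / lock wait its own span,
# since SpanA has already finished by the time the result is requested.
module FutureResultWaitSketch
  private

  def execute_or_wait
    OpenTelemetry.tracer_provider
                 .tracer('active_record-sketch')
                 .in_span('ActiveRecord execute_or_wait') do
      super
    end
  end
end

ActiveRecord::FutureResult.prepend(FutureResultWaitSketch)
```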

So onto the question of how to enrich SpanA — in the case that it was actually async, there should probably be:

  • a span indicating the task was scheduled (SpanA, with a different name or attributes)
  • a new span that wraps the async task (let's call this SpanB)
  • a new span wrapping execute_or_wait tracking the lock_wait time (SpanC). This would have an end_timestamp no earlier than SpanB's end_timestamp, but could start before SpanB (if the thread pool is very busy and hasn't started on this task yet — should SpanC be the parent of SpanB in this case?)

The parentage of SpanB is a somewhat tricky problem — it's fairly straightforward to wrap the task using SpanA as a parent, though because of asynchrony, SpanB may be started after SpanA is finished (is this even valid? The spec doesn't seem to prohibit it), like this:

|----SpanA------|
                        |-SpanB------------------|
                          |-----SpanC (lock_wait)--|

In the case where the task was scheduled but not picked up yet:

|----SpanA------|
                      |-----SpanC (lock_wait)-----|
                        |-SpanB------------------|

^ In this case, SpanB is still a child of SpanA, though in reality it would be more correct to make it a child of SpanC. I'm not sure how that re-parenting would work though, unless we delay deciding on a span context until the query is actually fired.
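
One way to defer that decision is to capture the scheduling thread's context and only establish SpanB when the task actually runs; a sketch, assuming we control the task body (execute_query is a hypothetical placeholder):

```ruby
# Hypothetical sketch: capture the context in which SpanA is current at
# schedule time, then restore it on the executor thread so SpanB is parented
# on SpanA even if SpanA has already finished when the task runs.
tracer = OpenTelemetry.tracer_provider.tracer('active_record-sketch')
parent_context = OpenTelemetry::Context.current

task = proc do
  OpenTelemetry::Context.with_current(parent_context) do
    tracer.in_span('ActiveRecord async query (SpanB)') do |span_b|
      execute_query # hypothetical placeholder for the real query execution
    end
  end
end
```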

@arielvalentin (Collaborator, Author)

@zvkemp these are all great points, and thank you for the illustrations, because they highlight the complexity around how our current implementation does not work well with concurrent or parallel workflows. open-telemetry/opentelemetry-ruby#1766

Knowing that AR et al. internals are using Concurrent Ruby is what prompted me to add its instrumentation as a dependency for Rails.
The Concurrent Ruby instrumentation attempts to do that but uses OTel objects instead of using the propagator API.

I do not have an answer here except that it may mean we need to model this differently, i.e. potentially having multiple spans, and/or implement this in a way that uses context propagation differently between threads and fibers, which would then require the backend to stitch the graph together properly — but we would not be able to enrich parent spans the way this issue would like.
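
As a hedged illustration of the stitch-it-together idea (not something this instrumentation does today), span links could record the relationship without relying on parent/child nesting:

```ruby
# Hypothetical sketch: record SpanA's context as a link on SpanB instead of
# making it the parent, leaving the backend to stitch the graph together.
tracer = OpenTelemetry.tracer_provider.tracer('active_record-sketch')
span_a_context = OpenTelemetry::Trace.current_span.context
link = OpenTelemetry::Trace::Link.new(span_a_context)

tracer.in_span('ActiveRecord async query (SpanB)', links: [link]) do |span_b|
  # execute the query here
end
```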

@zvkemp commented Jan 16, 2025

I'll further add that lock_wait doesn't tell the whole story of why you might have waited; heavy load on the global thread pool can cause this as well, but time spent in the queue isn't tracked.

So here's a clearer picture of the 5 scenarios that can happen using load_async:

  1. The happy path: short query time and no waiting (lock_wait == 0, async):
|--schedule-query--|
                     |-query-|
                                # call .to_a
                                |-execute-or-wait-| (returns quickly)
  2. Long query time (lock_wait > 0, async):
|--schedule-query--|
                     |--query-fires-here---------------|
                          # call .to_a
                          |--execute-or-wait------------|

^ This is the only scenario where ActiveRecord would have reported a non-zero/non-nil lock_wait time: the query was successfully scheduled, and it started executing (but didn't finish) before its result was requested.

  3. Busy executor (all workers busy, queue has slots available, lock_wait == nil, not async):
|--schedule-query--|
                       # call .to_a
                       |--execute_or_wait---------|
                        |--query-fires-here------|

(to_a was called before a worker became available, so the executor was pre-empted). This may also happen if to_a is called immediately, before the executor has any chance to spawn a worker or assign it work (e.g. User.all.load_async.to_a). The call to Thread#pass is supposed to prevent this (but I'm not sure why that's even necessary, as you'll be waiting the same amount of time anyway).

  4. Blocked executor (queue is full, not async, lock_wait == 0):
|--schedule-query---------------|
       |--query-fires-here--|
                                   # call.to_a
                                   |-execute-or-wait-| (returns quickly)

(executor executes the task inline instead of enqueuing it).

  5. There's an additional scenario where the executor is not configured, in which case the async argument is ignored.

So lock_wait isn't even reported in 4 out of 5 scenarios, and in the one where it is, you're not really waiting to acquire the lock (I mean, you are, but only to determine when the executor has finished); the lock_wait time is equal to the time between calling to_a and whenever the query finishes. I think the three spans shown in all of these examples are worth including; lock_wait on its own, probably not.

As to the question of how to tell whether the span was async — we can certainly annotate that it was requested to be asynchronous (meaning that under some conditions it probably would be), but short of comparing thread object_ids, it's not really possible to reliably tell from the callsite or arguments whether the query was executed in the request thread or in the executor's thread pool. I think there are some improvements to be made here, the main one being that currently, async loads (scenarios 1 and 2) have a span that only wraps the scheduler, not the query itself, so the traces are misleading at best.
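
For what it's worth, that object_id comparison could look roughly like this (everything here, including the attribute name, is hypothetical):

```ruby
# Hypothetical sketch: remember which thread scheduled the query, then compare
# it against the thread that actually executes it.
class AsyncExecutionMarker
  def initialize
    @scheduling_thread_id = Thread.current.object_id
  end

  # True only when called from a different thread than the scheduling one.
  def executed_async?
    Thread.current.object_id != @scheduling_thread_id
  end
end

marker = AsyncExecutionMarker.new
# ... later, inside whatever wraps the actual execution:
# span.set_attribute('rails.active_record.executed_async', marker.executed_async?)
```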

@zvkemp linked a pull request (#1351) on Jan 17, 2025 that will close this issue