Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure to call close() on Scope returned from io.opentelemetry.context.Context.makeCurrent() #924

Open
wants to merge 17 commits into
base: series/2.x
Choose a base branch
from

Conversation

joost-de-vries
Copy link

@joost-de-vries joost-de-vries commented Dec 30, 2024

The documentation for io.opentelemetry.context.Context.makeCurrent() states
Every makeCurrent() must be followed by a Scope#close(). Breaking these rules may lead to memory leaks and incorrect scoping.

Running zio-opentelemetry with -Dio.opentelemetry.context.enableStrictContext=true leads to an error

Dec 30, 2024 11:08:42 AM io.opentelemetry.context.StrictContextStorage$PendingScopes run
SEVERE: Scope garbage collected before being closed.
java.lang.AssertionError: Thread [ZScheduler-Worker-2] opened a scope of OtelContext{currentSpan=datadog.opentelemetry.shim.trace.OtelSpan$NoopSpanContext@1f7a631e, rootSpan=datadog.opentelemetry.shim.trace.OtelSpan$NoopSpanContext@1f7a631e} here:
	at datadog.opentelemetry.shim.context.OtelContext.makeCurrent(OtelContext.java:90)
	at zio.telemetry.opentelemetry.context.ContextStorage$Native$.set$$anonfun$1(ContextStorage.scala:63)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1027)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1063)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1090)
	at zio.internal.FiberRuntime.evaluateEffect(FiberRuntime.scala:412)
	at zio.internal.FiberRuntime.evaluateMessageWhileSuspended(FiberRuntime.scala:488)
	at zio.internal.FiberRuntime.drainQueueOnCurrentThread(FiberRuntime.scala:249)
	at zio.internal.FiberRuntime.run(FiberRuntime.scala:137)
	at zio.internal.ZScheduler$$anon$3.run(ZScheduler.scala:380)

This PR uses zio.Scope to make sure that the otel Scope returned by Context.makeCurrent is always closed.

@joost-de-vries joost-de-vries requested a review from a team as a code owner December 30, 2024 10:21
@CLAassistant
Copy link

CLAassistant commented Dec 30, 2024

CLA assistant check
All committers have signed the CLA.

@grouzen
Copy link
Contributor

grouzen commented Dec 30, 2024

@joost-de-vries Thanks for spotting this! I'll look into it when I have time and tell you what I think about your solution. The only concern I have is that these changes are not backward-compatible.

@joost-de-vries
Copy link
Author

@grouzen thank you for your quick response.

Yes, too bad that the Baggage api is affected. I couldn't think of a solution that doesn't. Maybe you have an idea?

I've tested this fix on datadog.
In our code we call after a http response Tracing.setAttribute("http.status_code", response.status.code.toString).
That status code was always missing in the span.
When I deployed this fix it started showing up. You can see the exact moment of the deploy
result-of-fix-on-datadog

@joost-de-vries
Copy link
Author

There's still a failing assertion java.lang.IllegalStateException: Thread [ZScheduler-Worker-7] opened scope, but thread [ZScheduler-Worker-3] closed it
The assertion occurs at the end of the 2nd (zio) argument of Tracing.root(...)(zio)
So the otel Scope is getting closed. But from a different thread.
I'm not sure whether that is an issue for zio code or no. The otel strict code expects to be called by a blocking java call.

@grouzen
Copy link
Contributor

grouzen commented Dec 30, 2024

Now I realize that it is required to change the Baggage API this way to reflect the .makeCurrent semantics.

@grouzen
Copy link
Contributor

grouzen commented Dec 30, 2024

@joost-de-vries Please pull the changes from series/2.x branch to fix the CI error.

@joost-de-vries
Copy link
Author

@grouzen I wonder how to test the fix. One way would be to mock de java OpenTelemetry Context.makeCurrent. And check that close() is called.

@joost-de-vries
Copy link
Author

joost-de-vries commented Jan 2, 2025

Thinking about the changed api of Baggage: I guess I can introduce a new trait. And add deprecation warnings to the existing.
Of course that raises the question what to call the new trait. And what to call the alternative to the OpenTelemetry function

 def baggage(logAnnotated: Boolean = false): URLayer[ContextStorage, Baggage] =
    Baggage.live(logAnnotated)

I guess we could call the existing trait BaggageDeprecated

@grouzen
Copy link
Contributor

grouzen commented Jan 2, 2025

Thinking about the changed api of Baggage: I guess I can introduce a new trait. And add deprecation warnings to the existing. Of course that raises the question what to call the new trait. And what to call the alternative to the OpenTelemetry function

 def baggage(logAnnotated: Boolean = false): URLayer[ContextStorage, Baggage] =
    Baggage.live(logAnnotated)

I guess we could call the existing trait BaggageDeprecated

Frankly, I'm still thinking about how to avoid having scoped baggage API. I need to find some time to read the Baggage OTEL spec. I assume it is fine to have nested OTEL scopes when dealing with baggage data because, as far as I remember, according to the spec, it stores the baggage data in the trace context rather than the span, which makes perfect sense to me.
Sorry for this stream of consciousness if it doesn't make any sense ;) Just don't have enough time at the moment to dig deeply into it.

@grouzen
Copy link
Contributor

grouzen commented Jan 2, 2025

@grouzen I wonder how to test the fix. One way would be to mock de java OpenTelemetry Context.makeCurrent. And check that close() is called.

Sounds good. I think it is a sane way of testing this kind of stuff.

@joost-de-vries
Copy link
Author

Thinking about the changed api of Baggage: I guess I can introduce a new trait. And add deprecation warnings to the existing. Of course that raises the question what to call the new trait. And what to call the alternative to the OpenTelemetry function

 def baggage(logAnnotated: Boolean = false): URLayer[ContextStorage, Baggage] =
    Baggage.live(logAnnotated)

I guess we could call the existing trait BaggageDeprecated

Frankly, I'm still thinking about how to avoid having scoped baggage API. I need to find some time to read the Baggage OTEL spec. I assume it is fine to have nested OTEL scopes when dealing with baggage data because, as far as I remember, according to the spec, it stores the baggage data in the trace context rather than the span, which makes perfect sense to me. Sorry for this stream of consciousness if it doesn't make any sense ;) Just don't have enough time at the moment to dig deeply into it.

Yes, I'm chewing on that too.
In my mind the Tracing working correctly is much more important than Baggage. Thankfully that api is not affected. So I would like the fix to Tracing to not get hold up by the Baggage api.
We can leave the Baggage api as is. (Unless we think of a transparent solution.) And introduce a ScopedBaggage trait. Or for the affected methods a scoped variant. Like def setScoped(name: String, value: String)(implicit trace: Trace): URIO[Scope, Unit]. That way current usage is unaffected.

Come to think of it: the latter solution is simplest probably.

@joost-de-vries
Copy link
Author

@grouzen I wonder how to test the fix. One way would be to mock de java OpenTelemetry Context.makeCurrent. And check that close() is called.

Sounds good. I think it is a sane way of testing this kind of stuff.

I've looked into this:
It seems the test for ContextStorage.native doesn't initialize the otel OpenTelemetry. So I don't know if the strict jvm argument will work.
Also for a jvm argument to be applied to the test it probably needs to run forked in a separate jvm.
I don't think we can mock the ThreadLocal.

@joost-de-vries
Copy link
Author

@grouzen I've implemented the latter approach. Existing code will compile unchanged.

(of course that raises the question what the implementation of the existing methods should be. I've implemented it with a local ZIO.scoped. If we want to keep the old behaviour we should provide a scope that is never finalized. )

@joost-de-vries
Copy link
Author

I've added a test to verify that ContextStorage.native adheres to the contract of otel Context.makeCurrent:

Context prevCtx = Context.current(); 
try (Scope ignored = ctx.makeCurrent()) {   
  assert Context.current() == ctx; 
  ... 
} 
assert Context.current() == prevCtx;

Most of those tests fail on master. But not all.

@grouzen
Copy link
Contributor

grouzen commented Jan 7, 2025

Hey! Sorry for being silent, don't have time to work on this atm.
I want to take a timeout to think more since I don't quite like the breaking changes and the scoped API in general. However, I still hope it may be possible to avoid this.
Here are some links to have a look if you are interested:

@joost-de-vries
Copy link
Author

joost-de-vries commented Jan 7, 2025

@grouzen I understand.
I agree it would be nice if we don't need to change the api.

Are you on zio discord? I can share our experiences with zio + datadog.

@grouzen
Copy link
Contributor

grouzen commented Jan 7, 2025

@grouzen I understand. I agree it would be nice if we don't need to change the api.

Are you on zio discord? I can share our experiences with zio + datadog.

Yeah, you can find me on the zio-telemetry channel as well. It would be lovely to hear about your experience!

@joost-de-vries
Copy link
Author

@grouzen we can split this issue in two: 1. the above fix for Tracing only. That won't affect current api 2. a fix for Baggage. We can then take a bit more time in coming up with an appropriate api.
What do you think?

It would be nice to get the fix for Tracing merged.

@grouzen
Copy link
Contributor

grouzen commented Feb 10, 2025

@grouzen we can split this issue in two: 1. the above fix for Tracing only. That won't affect current api 2. a fix for Baggage. We can then take a bit more time in coming up with an appropriate api. What do you think?

It would be nice to get the fix for Tracing merged.

Yes, lets do the first PR for Tracing. I spent lots of time researching this and now I think we need to change how we manage the context and context storage completely.

injectLogAnnotations *> modifyBuilder(_.put(name, value)).unit

// to preserve the existing behavior where the otel Scope returned from Context.makeCurrent is not closed
// see io.opentelemetry.context.Context.makeCurrent
private def unclosedScope[A](io: URIO[Scope, A]): UIO[A] =
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grouzen this seems to me the way to preserve existing behaviour. What do you think?

Also: perhaps good to link to an issue in that comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be checked with some end-to-end test. One of the options could be a manual verification using an optelemetry-instrumentation example. Additionally, it would be good to check how it works with dd-trace-java because it is the reason why this issue was created in the first place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just afraid we close the current span this way, and I believe this is not what we want to do while modifying baggage data, right? That's why I think we need to do manual tests. Also, adding a unit test to prove/disprove my assumption would be great.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of implementation would you propose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't be fixed this way, as I said before. I think the whole idea of having specific ZIO2 instrumentation in opentelemetry-java-instrumentation and dd-trace-java was a mistake, hence the current implementation of ContextStorage.native was a mistake too. It must be done differently and I'm working on this right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the better solution in your situation would be to use your fork with the fix for your case while I'm figuring out how to fix it properly to make everyone happy.

Copy link
Author

@joost-de-vries joost-de-vries Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the previous supervisor based approach?
For us it would be nice not to be dependent on the zio support in dd-trace-java. I get the sense that that's not maintained actively.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution is quite straightforward - map between FiberRef-based and ThreadLocal-based ContextStorage implementations when needed manually. This automatically allows getting rid of finicky Supervisor-based implementations

@joost-de-vries
Copy link
Author

A different way of managing context storage. Interesting!

I removed the scoped methods from Baggage.
There's the question what to do with the implementation of existing methods. Can you have a look at that?

@grouzen
Copy link
Contributor

grouzen commented Feb 10, 2025

One more thing. I should have asked about it in the very beginning. Could you please provide a repo with a minimal example of reproducing this issue? I'm curious about how your setup differs from everyone else's and why nobody has experienced this issue before.
I think it is legit to have such a repo due to the severe backward incompatible changes you are asking to make.
What do you think?

@joost-de-vries
Copy link
Author

joost-de-vries commented Feb 11, 2025

I've looked into the telemetry tests in zio-telemetry for one that involves 1. global otel 2. jvm context storage 3. auto instrumentation 4. zio auto intrumentation. But didn't see one.
I do have a test like that with datadog auto instrumentation (dd-trace-java). That was quite a bit of work to setup.
An expected span attribute does show up with the above fix. But not without. So that shows the issue.
Unfortunately that's in a private repo. I guess I can show you in a call.

@grouzen
Copy link
Contributor

grouzen commented Feb 14, 2025

I've looked into the telemetry tests in zio-telemetry for one that involves 1. global otel 2. jvm context storage 3. auto instrumentation 4. zio auto intrumentation. But didn't see one. I do have a test like that with datadog auto instrumentation (dd-trace-java). That was quite a bit of work to setup. An expected span attribute does show up with the above fix. But not without. So that shows the issue. Unfortunately that's in a private repo. I guess I can show you in a call.

An open telemetry-instrumentation-example module may be considered an end-to-end test involving all the parts you mentioned.
In order to fix your case, we need either a complete example that reproduces your issue with dd-trace-java or a general understanding of your current setup. In any case, someone will need to implement such an e2e test at the end of the day :) I think jumping on a quick call could be a good start.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants