Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for ActionController::Live #772

Closed

Conversation

thomasmarshall
Copy link

This commit adds a patch to ensure ActionController::Live actions, which are processed in a new thread, are instrumented as a child span in the current trace. Previously, we would lose the current context and instrumented code inside ActionController::Live actions would appear as separate traces.

This is a patch rather than an ActiveSupport::Notifications handler. Although the action_controller.process_action event is processed in the new thread, we don't have a reference to the controller or anything to indicate that it is an ActionController::Live controller.

The patch needs to be included rather than prepended, as ActionController::Live is a module that is included in each controller as required by the host app, rather than a class to inherit from. As such, include is the only way to get our #process_action method into the right place.

We need to manually unset the :__opentelemetry_context_storage__ thread local, as this gets copied across from the parent thread and points to the same stack array. Calling #clear would modify that shared stack array, so we need to unset it instead.

Example

Before After
Screenshot 2023-12-19 at 11 22 39 Screenshot 2023-12-19 at 11 23 00
Instrumented code appears as separate traces Instrumented code is nested within the original span

Copy link

linux-foundation-easycla bot commented Dec 19, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: thomasmarshall / name: Thomas Marshall (d278eb4)

Copy link
Collaborator

@arielvalentin arielvalentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution.

I don't quite understand why setting the last context in the stack as the "root context" fixes this problem. What happens when the parent thread/fiber closes the context?

Doesn't clearing out the fiber local span context stack cause issues for the current thread or fiber?

Would using context propagation be safer here?

@arielvalentin
Copy link
Collaborator

Oh boy I did just look at the link you shared and Rails is copying fiber local variables to the child thread 😢

@open-telemetry/ruby-maintainers given that's Rails is potentially causing problems for the SDK here what options do we have?

I am not keen on breaking encapsulation of the storage mechanism used by the SDK. My instincts here are to use context propagation anytime a thread is created but I think in this case the code is trying to enrich the rack span, which is bound to a different thread.

Enriching a span bound to a different thread also makes me a bit uneasy because it may be finished at anytime and the attributes won't ever be amended to it if they were. I'm also worried about the context being closed and detached in the parent thread; but then again the fact that the Rails framework is copying the fiber local stack means this problem already exists for alive controller requests.

@thomasmarshall
Copy link
Author

Thanks for taking a look @arielvalentin!

My instincts here are to use context propagation anytime a thread is created

Do you mean using something like a tracepoint to propagate the context whenever a thread is created, rather than trying to just fix for this one specific case?

Is there a good example of how to instrument threaded code through context propagation elsewhere? I must admit I'm relatively new to OpenTelemetry and this codebase, but I'd love to work on a more suitable fix if this proposed patch has issues.

@arielvalentin
Copy link
Collaborator

Contexts are intended to be thread local and state should not be shared across threads.

What I mean when I say "context propagation" is to use the OpenTelemetry context propagators to extract the trace context to the child thread and treat any child threads as if they were their own "sub service" akin to messaging, http, or rpc.

This is the approach we use for Golang programs at GH that make heavy use of go-routines. I thought this model would have been a good fit for alive but now I don't know if it will work given the rack span is something we want to share with the process controller subscriber so I'm unsure what to do next.

If we use context propagation it will result in a new "ingress" point and make it look like the code running in the thread is its own service so I'm not confident in what I think the best thing to do given Rails is doing something I don't expect in sharing the parent threads fiber local variables.

@hannahramadan
Copy link
Contributor

Hi @thomasmarshall  and @arielvalentin! I work with OTel approver @kaylareopelle who mentioned this PR. Very cool work y'all are doing 🎉 I want to share another potential avenue for ActionController::Live instrumentation.

Recently, instrumentation was added for ActionController::Live#send_stream, one of the three public APIs for the class. One idea that moves away from patching would be to make a similar contribution to Rails for the remaining two methods. Not sure about any context sharing blockers with this approach but could be worth considering.

@arielvalentin
Copy link
Collaborator

@hannahramadan is that being backported to rails 6.1?

@hannahramadan
Copy link
Contributor

@arielvalentin Ah no — ActionController::Live#send_stream was a 7.0 introduction and wasn't backported.

@arielvalentin
Copy link
Collaborator

Thanks for the explanation. I'm not familiar with streaming responses so I'm not sure what the differences are between Live in Rails 6.1 and 7.

Assuming we're able to use the notifications for Rails 7 what would we provide users of 6.1?

@thomasmarshall
Copy link
Author

Enriching a span bound to a different thread also makes me a bit uneasy because it may be finished at anytime and the attributes won't ever be amended to it if they were.

@arielvalentin do you mean the current instrumentation that adds attributes to the rack span in a process_action.action_controller handler? With the previous approach where we were patching Metal#dispatch, this would have happened in the original thread, but in the handler approach it happens in the new thread (for Live actions).

My understanding is that the response is only sent when we start writing to the stream, so I would expect adding attributes to the rack span in Handlers::ActionController#start to be fine. Not sure about recording the exception in Handlers::ActionController#finish though 😕

I am not keen on breaking encapsulation of the storage mechanism used by the SDK.

A few options I can think of:

  • Fix Rails so it doesn't copy across this thread local. I'm not sure exactly how we'd do that, but might be possible.
  • Make unsetting the thread local slightly less hacky by exposing the STACK_KEY constant, so at least we wouldn't be referencing :__opentelemetry_context_storage__ directly.
  • Add a mechanism to the SDK to "reset" the context by deleting the thread local instead of clearing the array.

@hannahramadan thanks for the suggestion! I could definitely look into adding Rails instrumentation for Live actions. At the moment, process_action.action_controller does already happen in the subthread, so instrumentation kinda already exists. Maybe we could hook into that and use something like this for the live-specific logic, instead of patching Live#process_action in this PR:

if payload[:controller].constantize < ::ActionController::Live

Though with a notification handler we wouldn't be able to use block forms like Context.with_current or Tracer#in_span, but maybe that's not a big deal. Do you think that would be suitable?

@arielvalentin
Copy link
Collaborator

@thomasmarshall I am referring to the Span injected by the Rack instrumentation. Depending on the version you are using it is either injected by a "classic" Rack Middleware or a Rack::Events Handler:

https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/rack/lib/opentelemetry/instrumentation/rack/middlewares/tracer_middleware.rb#L82

https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/rack/lib/opentelemetry/instrumentation/rack/middlewares/event_handler.rb#L59C48-L59C48

The ActionPack instrumentation uses a Helper method Rack.current_span and enriches it with Rails specific attributes:

https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/action_pack/lib/opentelemetry/instrumentation/action_pack/handlers/action_controller.rb#L25

If we go back in time the previous version did the same thing, i.e. it loads the Rack span instead of creating a new one:

Both of these cases rely on Fiber Local Variables. If the same Rack Span and Context local stack is being shared across multiple unrelated requests, then that is a problem.

I do not know enough about SSE and Rack's implementation to understand this well enough to make an informed decision here.

BUT If I understand this comment correctly...

      # This processes the action in a child thread. It lets us return the
      # response code and headers back up the Rack stack, and still process
      # the body in parallel with sending data to the client.

https://github.com/rails/rails/blob/6b93fff8af32ef5e91f4ec3cfffb081d0553faf0/actionpack/lib/action_controller/metal/live.rb#L278C1-L280C62

The client is bound to a single TCP connection. In the case where the server is multi-threaded it will be a connection per thread.

That implies there is a single Rack span and fiber local context is created for the TCP Connection.

Rails then copies the fiber local variables to the child "handler" thread, which would end up including the Rack Span that was created in the thread handling the TCP connection:

https://github.com/rails/rails/blob/6b93fff8af32ef5e91f4ec3cfffb081d0553faf0/actionpack/lib/action_controller/metal/live.rb#L287C13-L287C13

Since the Live::Response has a specialized buffered response, Rack is able to send over initial http headers and line by line responses:

https://github.com/rails/rails/blob/main/actionpack/lib/action_dispatch/http/response.rb#L538

... But wait... this is using buffered responses!!!

Buffered Responses and the "Classic" Middleware

The Rack Classic Middleware implementation does not support buffered responses correctly!

It closes the span too early, before the response is completed because it first exits the middleware stack running the ensure block before it starts to stream the reply to the client:

https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/rack/lib/opentelemetry/instrumentation/rack/middlewares/tracer_middleware.rb#L91

Assuming the behavior/functionality is the same with Live::Response as it is with Rack::BodyProxy, then the Rack span is closed and it takes with it the context potentially leaving the SSE spans as Root/Disconnected traces.

In order to support buffered responses you will have to use the Rack::Events API:

https://www.rubydoc.info/gems/opentelemetry-instrumentation-rack/0.23.1/OpenTelemetry/Instrumentation/Rack/Middlewares/EventHandler

You can enable this feature by setting use_rack_events: true to your Rack Instrumentation configuration, which will be the default in v0.24.

What about the Fiber local problem?

I don't know what to do about this. I imagine this would cause a problem for any number of gems that relies on Fiber local storage to do things and I am quite frankly shocked this isn't causes a whole bunch of problems for other gems.

I need the @open-telemetry/ruby-maintainers to chime in here but if my hunch is right, then using Rack::Events is going to fix the problem where your spans are disconnected from one another and it will fix the timings on your Rack Spans to include the entire time spent streaming the response to the client.

What do we do next?

Would you be amenable to giving a go to using Rack::Events and reporting back results to us?

@thomasmarshall
Copy link
Author

Would you be amenable to giving a go to using Rack::Events and reporting back results to us?

@arielvalentin using the latest main (with the use_rack_events: true default) the spans are still disconnected. Is that what you meant for me to try? Or do you mean in combination with the fix in this PR for the broken context?

Screenshot 2024-01-03 at 09 23 31

@arielvalentin
Copy link
Collaborator

Nope. Just wanted to see if the Rack Events version works without the patches here.

Going to calls in @open-telemetry/ruby-maintainers once again. Any ideas? Thoughts?

@thomasmarshall thomasmarshall requested a review from a team January 5, 2024 05:44
This commit adds a patch to ensure `ActionController::Live` actions,
which are processed in a new thread, are instrumented as a child span
in the current trace.

This is a patch rather than an `ActiveSupport::Notifications` handler.
Although the `action_controller.process_action` event is processed in
the new thread, we don't have a reference to the controller or anything
to indicate that it is an `ActionController::Live` controller.

The patch needs to be included rather than prepended, as
`ActionController::Live` is a module that is included in each controller
as required by the host app, rather than a class to inherit from. As
such, `include` is the only way to get our `#process_action` method into
the right place.

We need to manually unset the `:__opentelemetry_context_storage__`
thread local, as this gets copied across from the parent thread and
points to the same stack array. Calling `#clear` would modify that
shared stack array, so we need to unset it instead.
Comment on lines +17 to +18
# Unset thread local to avoid modifying stack array shared with parent thread
Thread.current[:__opentelemetry_context_storage__] = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bad. We shouldn't have to reach into the API internals in this way. This happens because OpenTelemetry::Context.clear is implemented as stack.clear, which empties the backing array instead of replacing it with a new empty instance.

I think it'd be worthwhile to change the implementation of OpenTelemetry::Context.clear to be Thread.current[STACK_KEY] = [].

On the other hand, I think there is a race condition when fetching the current context. Specifically, since the thread-locals are copied, the backing array for the context stack is shared with the parent thread, and if that parent thread modifies its context before this thread retrieves it, the context used in this thread will be incorrect.

Side note: the tracing library we built at Shopify before OpenTelemetry Ruby patched ActionController::Live in a very similar way, other than unsetting the thread local. It worked there because the thread-local was a Span instead of an array, and we managed the "stack" as a linked list of spans.

I think it might be preferable to explicitly propagate context in some way, but I'm not familiar enough with ActionController::Live to know how to accomplish that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could leverage/patch ActiveSupport::IsolatedExecutionState.share_with in some way? 🤔

That would be a good place to "propagate" the context, probably (if we fix Context.clear) by:

current_context = OpenTelemetry::Context.current
OpenTelemetry::Context.clear
OpenTelemetry::Context.attach(current_context)

It's a bit of a hack, in that we're not going to Context.detach. The patch for Live could call OpenTelemetry::Context.clear after the in_span call.

I.e. our patch for ActiveSupport::IsolatedExecutionState.share_with would be:

def share_with(other)
  super
  current_context = OpenTelemetry::Context.current
  OpenTelemetry::Context.clear
  OpenTelemetry::Context.attach(current_context)
end

And our patch for Live would be simply:

attributes = {
  OpenTelemetry::SemanticConventions::Trace::CODE_NAMESPACE => self.class.name,
  OpenTelemetry::SemanticConventions::Trace::CODE_FUNCTION => action_name
}
Instrumentation.instance.tracer.in_span("#{self.class.name}##{action_name} stream", attributes: attributes) do
  super
end
OpenTelemetry::Context.clear

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fbogsany. I will open a PR in opentelemetry-ruby to propose your suggested change to Context.clear. I'll also probably close this PR and reopen a new one with a different approach, using explicit propagation like you suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a linked list implementation of the context stack in open-telemetry/opentelemetry-ruby#1597 but the performance impact was significant (~9% with the interpreter, as much as 24% with yjit).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open-telemetry/opentelemetry-ruby#1598 is merged. Once we cut a new release of the API, we can use that as a minimum version here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasmarshall - A new version of the API has been released! Could you incorporate it into this PR?

Copy link
Contributor

👋 This pull request has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this pull request will be closed eventually by the stale bot

@github-actions github-actions bot added the stale Marks an issue/PR stale label Mar 17, 2024
@arielvalentin arielvalentin added keep Ensures stale-bot keeps this issue/PR open and removed stale Marks an issue/PR stale labels Apr 1, 2024
@thomasmarshall
Copy link
Author

Would you be amenable to giving a go to using Rack::Events and reporting back results to us?

I tried the latest versions of opentelemetry-instrumentation-rack (0.24.0) and opentelemetry-instrumentation-action_pack (0.9.0) and this appears to be working correctly now with the Rack Events API, as you suspected @arielvalentin. I must have made a mistake when testing it on main when you asked me to before—apologies.

As such I beleive this PR is no longer needed, so I'll close it. Thank everyone for your help with this!

Screenshot 2024-04-02 at 17 28 51

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keep Ensures stale-bot keeps this issue/PR open
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants