Integrate OpenTelemetry tracing in Cody Webview #6100

Merged 1 commit into main from trace-export on Nov 21, 2024

Conversation

@arafatkatze (Contributor) commented Nov 8, 2024

Enhance the chat application by integrating OpenTelemetry for tracing and context management. This update includes:

  • Added OpenTelemetry dependencies for context and tracing.
  • Implemented tracing spans for chat interactions and feedback submission in the webview.
  • Managed the active chat context using OpenTelemetry's context API.
  • Introduced a telemetry service for managing tracing configuration.
  • Added a new class in ChatController for exporting trace data from the webview.

These changes aim to improve observability and debugging capabilities by providing detailed trace information for chat interactions and operations.

Test plan

  • Run a Sourcegraph instance locally
  • Run sg start otel
  • Run the debugger for VS Code Cody locally on this branch
  • Perform some chat operations
  • Go to http://localhost:16686 to see if Jaeger is running
  • Select Cody-Client as the service
  • See a trace with the title chat-interaction; this is a trace coming from the webview
[Screenshot: Jaeger UI showing the chat-interaction trace from the webview]

Here is a video
https://www.loom.com/share/557b0ea9dffd4561810f8d67879f7dfb

Changelog

github-actions bot commented Nov 8, 2024

‼️ Hey @sourcegraph/cody-security, please review this PR carefully as it introduces the usage of an unsafe_ function or abuses PromptString.

@arafatkatze arafatkatze closed this Nov 9, 2024
@arafatkatze arafatkatze reopened this Nov 9, 2024
@arafatkatze arafatkatze changed the title Trace export Integrate OpenTelemetry tracing in Cody Webview Nov 13, 2024
@dominiccooney (Contributor) left a comment

A good start, feedback inline.

Instead of adding lots of low-quality spans, let's concentrate on one good chat-latency span: from hitting submit to when we paint the first token. This is interesting because:

  • Core product scenario.
  • Has an extension-side metric we can compare and contrast with, to learn a bit about webview latency specifically.
  • Easy, in the sense that it starts and ends in the webview.
  • ...but still interesting. Spans that only measure "how long does it take to run postMessage" don't reflect user latency.
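
A minimal sketch of such a submit-to-first-paint span using @opentelemetry/api; this illustrates the measurement being asked for, not the exact code the PR ends up with (see the SpanManager snippet later in the thread):

    import { trace } from '@opentelemetry/api'

    // Assumes a tracer provider has already been registered for the webview.
    const tracer = trace.getTracer('cody-webview')

    // Start the span the moment the user hits submit.
    const chatSpan = tracer.startSpan('chat-interaction', {
        attributes: { 'chat.trigger': 'submit' },
    })

    // End it when the first assistant token is painted.
    function onFirstTokenPainted(): void {
        chatSpan.addEvent('first-token-painted')
        chatSpan.end()
    }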

Review threads (outdated, resolved):
  • vscode/src/chat/chat-view/ChatController.ts (5)
  • vscode/webviews/App.tsx (2)
  • vscode/webviews/Chat.tsx (3)
@arafatkatze arafatkatze force-pushed the trace-export branch 4 times, most recently from 5deba34 to 6c5c8c2 on November 16, 2024 02:39
@arafatkatze arafatkatze marked this pull request as ready for review November 19, 2024 05:01

// Determines if a span group is complete for export (currently only for chat interactions).
// TODO: Extend with YAML configuration to support all span groups, not just chat interactions.
private isSpanGroupComplete(spanGroup: Set<ReadableSpan>): boolean {
@arafatkatze (Contributor, Author):

Added a TODO here, but this looks okay-ish for now.
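
For context, a minimal sketch of what such a completeness check might look like, under the assumption that a group counts as complete once its root chat-interaction span is present and every span in the group has ended; the actual criteria in the PR may differ:

    import type { ReadableSpan } from '@opentelemetry/sdk-trace-base'

    // Assumed rule: the group is exportable when the chat-interaction root span exists
    // and no span in the group is still open.
    function isSpanGroupComplete(spanGroup: Set<ReadableSpan>): boolean {
        const spans = [...spanGroup]
        const hasRoot = spans.some(s => s.name === 'chat-interaction' && !s.parentSpanId)
        const allEnded = spans.every(s => s.ended)
        return hasRoot && allEnded
    }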

@arafatkatze arafatkatze force-pushed the trace-export branch 2 times, most recently from 1131458 to 8fbcd4b on November 19, 2024 05:30
@dominiccooney (Contributor) left a comment

This is shaping up very nicely.

The conditions for what state we're in look arduous, any ideas about abstractions there so it is simpler?

This is still not user latency in the sense that HumanMessageEditor onSubmitClick is probably when we could start measuring something. Looks like we have already undergone some state changes by the time we start a span in what you have here?

Review threads (outdated, resolved): vscode/src/services/open-telemetry/CodyTraceExportWeb.ts (2)
const { auth } = await currentResolvedConfig()
if (!auth.accessToken) {
    logError('TraceSender', 'Cannot send trace data: not authenticated')
    throw new Error('Not authenticated')
@dominiccooney (Contributor):

Why do we require this?

What's the plan for performance around login, which we will probably want to do soon.

@arafatkatze (Contributor, Author):

Why do we require this?

    const traceUrl = new URL('/-/debug/otlp/v1/traces', auth.serverEndpoint).toString()
    const response = await fetch(traceUrl, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            ...(auth.accessToken ? { Authorization: `token ${auth.accessToken}` } : {}),
        },
        body: spanData,
    })

When we send the telemetry data to the OTel Collector endpoint, it requires the auth token, so that's why.

What's the plan for performance around login, which we will probably want to do soon.

I've got to think this through, to be honest; I can't tell you much right now unless I explore the code first.

Review threads (outdated, resolved): vscode/webviews/chat/Transcript.tsx (2)
timeToFirstTokenSpan.current = undefined
hasRecordedFirstToken.current = true

// Also set on parent span for backwards compatibility
@dominiccooney (Contributor):

Backwards compatibility with what? This is a new trace...

@arafatkatze (Contributor, Author):

Sorry, it was just a bad LLM comment, so I removed it. Note: it's still a child of the parent chat-interaction trace.
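
For reference, a minimal sketch of how first-token recording around those refs might work, wrapped in a hypothetical React hook; names other than the two refs quoted above are assumptions:

    import { useCallback, useRef } from 'react'
    import type { Span } from '@opentelemetry/api'

    // Hypothetical hook mirroring the refs in the quoted snippet above.
    function useFirstTokenTiming() {
        const timeToFirstTokenSpan = useRef<Span | undefined>(undefined)
        const hasRecordedFirstToken = useRef(false)

        // Call when the first assistant token renders: ends the child span exactly once.
        const recordFirstToken = useCallback(() => {
            if (!timeToFirstTokenSpan.current || hasRecordedFirstToken.current) {
                return
            }
            timeToFirstTokenSpan.current.addEvent('first-token-rendered')
            timeToFirstTokenSpan.current.end()
            timeToFirstTokenSpan.current = undefined
            hasRecordedFirstToken.current = true
        }, [])

        return { timeToFirstTokenSpan, recordFirstToken }
    }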

private agentIDE?: CodyIDE
private extensionAgentVersion?: string
constructor() {
    if (!WebviewOpenTelemetryService.instance) {
@dominiccooney (Contributor):

The design of this singleton and guards could be better, I think, although these things are always messy...

Why not have the singleton, initialize it, and make it convenient to use. Then the constructor doesn't need conditional side effects...

If people want to new up other ones, is there any harm in that? Why have configure exit early?

@arafatkatze (Contributor, Author) commented Nov 20, 2024:

Why not have the singleton, initialize it, and make it convenient to use. Then the constructor doesn't need conditional side effects...

Great question: I actually spent some time on this.
So, for this code:

    const webviewTelemetryService = useMemo(() => {
        const service = WebviewOpenTelemetryService.getInstance()
        return service
    }, [])

    useEffect(() => {
        if (config) {
            webviewTelemetryService.configure({
                isTracingEnabled: true,
                debugVerbose: true,
                agentIDE: config.clientCapabilities.agentIDE,
                extensionAgentVersion: config.clientCapabilities.agentExtensionVersion,
            })
        }
    }, [config, webviewTelemetryService])

I had to wait for config to fully load before I could read config.clientCapabilities.agentIDE and config.clientCapabilities.agentExtensionVersion, which are needed to figure out which OS and IDE the request is coming from. So I had to use useEffect to wait for config to load fully, and only then do the configuration (although ideally I would have just done the initialization once and been done with it). This seemed like a good enough way to pass the config values into the configuration.

Why have configure exit early?

Once the configuration is loaded and set up, I want it to be globally accessible for the webview. This approach ensures that subsequent changes—such as those triggered by useEffect—don’t introduce unnecessary complications that make debugging issues (like traces not rendering correctly) more difficult. By enforcing a singleton pattern and using a fixed configuration after initialization, we can guarantee the stability of the trace pipeline. This way, you only need to focus on potential errors in trace handling, rather than worrying about misconfigurations or unstable configurations.
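
A minimal sketch of the pattern being described, with the internals (the isConfigured flag and the option shape) assumed rather than taken from the PR:

    class WebviewOpenTelemetryService {
        private static instance: WebviewOpenTelemetryService | undefined
        private isConfigured = false

        public static getInstance(): WebviewOpenTelemetryService {
            if (!WebviewOpenTelemetryService.instance) {
                WebviewOpenTelemetryService.instance = new WebviewOpenTelemetryService()
            }
            return WebviewOpenTelemetryService.instance
        }

        // Exit early once configured, so later calls (e.g. from useEffect re-runs)
        // can't destabilize an already-working trace pipeline.
        public configure(options: { isTracingEnabled: boolean; agentIDE?: string }): void {
            if (this.isConfigured) {
                return
            }
            this.isConfigured = true
            // ...set up the tracer provider, exporter, and context manager here.
        }
    }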

}
}

public dispose(): void {
@dominiccooney (Contributor):

I see you did guard some methods, but not this one. Is it simpler and better to just leave the global alone in these instance methods?

@arafatkatze (Contributor, Author):

Thanks for pointing that out! I've added a guard clause to the dispose method to ensure it only affects the global instance when appropriate. This should help maintain simplicity and prevent any unintended modifications to the global state.
Let me know if you have any other suggestions!
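
A minimal sketch of what that guard could look like, continuing the illustrative class from the earlier sketch (the real implementation may differ):

    public dispose(): void {
        // Only tear down global state if this object is the registered global instance.
        if (WebviewOpenTelemetryService.instance !== this) {
            return
        }
        // ...flush pending spans and shut down the tracer provider here.
        WebviewOpenTelemetryService.instance = undefined
        this.isConfigured = false
    }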

@arafatkatze (Contributor, Author) commented Nov 20, 2024

The conditions for what state we're in look arduous, any ideas about abstractions there so it is simpler?

Yes, I simplified the code a little bit by separating it out into functions and hopefully that's better.

This is still not user latency in the sense that HumanMessageEditor onSubmitClick is probably when we could start measuring something. Looks like we have already undergone some state changes by the time we start a span in what you have here?

Fixed that.

@arafatkatze arafatkatze force-pushed the trace-export branch 2 times, most recently from 9416e1c to d3a12dc on November 20, 2024 06:09
@dominiccooney (Contributor) left a comment

Some feedback inline. Let's land this and iterate on any niggling details.

Review threads (outdated, resolved): vscode/webviews/Chat.tsx, vscode/webviews/chat/Transcript.tsx

const onEditSubmit = useCallback(
    (editorValue: SerializedPromptEditorValue, intentFromSubmit?: ChatMessage['intent']): void => {
        startSpanAndSubmit('edit', editorValue, intentFromSubmit)
@dominiccooney (Contributor):

I mentioned this before, but you're not including the editor serialization, etc. I think we should start the span when the user hits enter or clicks the button. (Even that is late... ideally we'd use this and count the time from the actual input to when we get the event: https://developer.mozilla.org/en-US/docs/Web/API/PerformanceEventTiming but I'm fine if we don't do that yet, it might create headaches for Cody Web. But we should start as soon as we get the key press/click event.)
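
For reference, a minimal sketch of how the PerformanceEventTiming API mentioned above could be wired up; this is an assumption about one possible approach, not something the PR implements:

    // Observe input events and record how long they waited before our handler ran.
    const observer = new PerformanceObserver(list => {
        for (const entry of list.getEntries() as PerformanceEventTiming[]) {
            if (entry.name === 'click' || entry.name === 'keydown') {
                // Delay between the user's actual input and the start of event processing.
                const inputDelayMs = entry.processingStart - entry.startTime
                console.debug(`input delay for ${entry.name}: ${inputDelayMs}ms`)
            }
        }
    })
    // 16ms is the minimum durationThreshold the Event Timing API allows.
    observer.observe({ type: 'event', durationThreshold: 16 })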

@arafatkatze (Contributor, Author):

Fair point, so I added it like this:


    const onUserAction = (action: 'edit' | 'submit', intentFromSubmit?: ChatMessage['intent']) => {
        // Start the span as soon as the user initiates the action
        const startMark = performance.mark('startSubmit')
        const spanManager = new SpanManager('cody-webview')
        const span = spanManager.startSpan('chat-interaction', {
            attributes: {
                sampled: true,
                'render.state': 'started',
                'startSubmit.mark': startMark.startTime,
            },
        })

        if (!span) {
            throw new Error('Failed to start span for chat interaction')
        }

        const spanContext = trace.setSpan(context.active(), span)
        setActiveChatContext(spanContext)

        // Serialize the editor value after starting the span
        const editorValue = humanEditorRef.current?.getSerializedValue()
        if (!editorValue) {
            console.error('Failed to serialize editor value')
            return
        }

        const commonProps = {
            editorValue,
            intent: intentFromSubmit || intentResults.current?.intent,
            intentScores: intentFromSubmit ? undefined : intentResults.current?.allScores,
            manuallySelectedIntent: !!intentFromSubmit,
        }

        if (action === 'edit') {
            editHumanMessage({
                messageIndexInTranscript: humanMessage.index,
                ...commonProps,
            })
        } else {
            submitHumanMessage({
                ...commonProps,
            })
        }
    }

so that serialization is within the span. Hope that's better than before?

@arafatkatze (Contributor, Author):

Happy to make more changes in a follow-up PR too.

Review threads (all resolved): vscode/webviews/chat/Transcript.tsx (4, of which 2 outdated)
@dominiccooney (Contributor) commented:

Kudos for writing a detailed test plan.

It may make sense to unit test the span construction. It's fine to do that in a follow-up PR. The whole mechanism will be continuously observed by the alerts, so an integration test is less critical for this.
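
A minimal sketch of what such a unit test could look like, assuming @opentelemetry/sdk-trace-base 1.x and an illustrative span name; an actual test would exercise the PR's real SpanManager and exporter classes:

    import {
        BasicTracerProvider,
        InMemorySpanExporter,
        SimpleSpanProcessor,
    } from '@opentelemetry/sdk-trace-base'

    // Capture spans in memory instead of sending them anywhere.
    const exporter = new InMemorySpanExporter()
    const provider = new BasicTracerProvider()
    provider.addSpanProcessor(new SimpleSpanProcessor(exporter))

    const tracer = provider.getTracer('cody-webview-test')
    const span = tracer.startSpan('chat-interaction')
    span.setAttribute('render.state', 'started')
    span.end()

    // Assert on the constructed span rather than on wire output.
    const finished = exporter.getFinishedSpans()
    console.assert(finished.length === 1)
    console.assert(finished[0].name === 'chat-interaction')
    console.assert(finished[0].attributes['render.state'] === 'started')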

@arafatkatze arafatkatze merged commit f747f6e into main Nov 21, 2024
18 of 20 checks passed
@arafatkatze arafatkatze deleted the trace-export branch November 21, 2024 00:43
vovakulikov added a commit that referenced this pull request Nov 22, 2024
vovakulikov added a commit that referenced this pull request Nov 22, 2024
Revert of #6100

#6100 introduces a Cody Web regression because it pulls zlib into the bundle, and apparently the browser version of zlib doesn't work in the web-worker context. Hence, we have a runtime error. I suggest we revert the changes from #6100 and investigate further into what exactly is happening in the web-worker Cody Web bundle with the new OpenTelemetry reporter.

```
Uncaught TypeError: Cannot read properties of undefined (reading 'prototype')
    at ../../../cody/web/dist/agent.worker-Cnq9eXnn.mjs (50be7d08-d01b-472b-8932-6af884d43a68:488831:61)
    at __init (50be7d08-d01b-472b-8932-6af884d43a68:11:56)
    at 50be7d08-d01b-472b-8932-6af884d43a68:501380:1
```


## Test plan
- Check that the Cody Web demo works (meaning no runtime error on the initial run)