Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix local OTEL. Change OTEL to use tracer and ctx instead of span #2447

Merged
merged 21 commits into from
May 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .envrc
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,8 @@ export PARAMETER_STORE_MODE='LOCAL'
export ALLOWED_IP_ADDRESSES='127.0.0.1'
export JWT_SECRET='3fd2e448ed2cec1fa46520f1b64bcb243c784f68db41ea67ef9abc45c12951d3e770162829103c439f01d2b860d06ed0da1a08895117b1ef338f1e4ed176448a' # pragma: allowlist secret

export REACT_APP_OTEL_COLLECTOR_URL='http://localhost:4318/v1/traces'
export REACT_APP_OTEL_COLLECTOR_URL='http://localhost:3030/local/otel'
export API_APP_OTEL_COLLECTOR_URL='http://localhost:4318/v1/traces'
export REACT_APP_LD_CLIENT_ID='this-value-can-be-set-in-local-if-desired'
export REACT_APP_NR_ACCOUNT_ID='this-value-is-ignored-local-but-used-in-deployed-env'
export REACT_APP_NR_TRUST_KEY='this-value-is-ignored-local-but-used-in-deployed-env'
Expand Down
30 changes: 13 additions & 17 deletions docker-compose.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,21 +3,17 @@ services:
# Jaeger
jaeger-all-in-one:
image: jaegertracing/all-in-one:latest
environment:
- COLLECTOR_OTLP_ENABLED=true
- COLLECTOR_ZIPKIN_HOST_PORT=:9411
ports:
- '16686:16686'
- '14268'
- '14250:14250'

# Collector
otel-collector:
image: otel/opentelemetry-collector:latest
command: ['--config=/etc/otel-collector-config.yaml']
volumes:
- ${PWD}/otel-collector-config.yml:/etc/otel-collector-config.yaml
ports:
- '4318:4318'
- '4317:4317'
- '55680:55680'
- '55681:55681'
depends_on:
- jaeger-all-in-one
- 6831:6831/udp
- 6832:6832/udp
- 5778:5778
- 16686:16686
- 14268:14268
- 14269:14269
- 14250:14250
- 9411:9411
- 4317:4317
- 4318:4318
35 changes: 28 additions & 7 deletions docs/technical-design/monitoring.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,22 +5,42 @@ title: Monitoring
# How to use and validate monitoring tools

## Background

We gather performance data and pay attention to metrics related to our production application. We receive alerts for particular error conditions so that we can investigate errors before they are reported by users. This document explains more about the monitoring tools in use in our application.

## Open Telemetry (OTEL)
Forthcoming.

We use [OTEL](https://opentelemetry.io/) for distributed tracing across services in our application. This includes both auto-instrumentation from various OTEL packages, like `@opentelemetry/instrumentation-aws-lambda'`, and manual instrumentation that developers can invoke to record events.

OTEL works around the concept of a trace, which is made up of one or more spans. The spans can come from both the front-end and back-end, and all actions that have been completed that have the same parent trace id will be displayed in New Relic under a single view. For example, if a user save's a submission as draft, a trace will be started on the front end and the backend will receive an HTTP request that include the trace ID in it's http headers. We then extract that trace ID and the invoked Apollo resolver will create a span with that trace ID as it's parent. When the trace is viewed in New Relic you will see the lifecycle of `app-web` connected to `app-api`.

### AWS Distro for OTEL

We use the [AWS Distro for OTEL](https://aws-otel.github.io/docs/introduction) to load a sidecar OTEL collector next to all of our lambdas. This is configured in `app-api` using the `collector.yml`, which is copied into our lambda's environments useing webpack. We can then just send all spans to `localhost`, which then ship them off to New Relic for display.

### OTEL in app-web

Our OTEL setup is stood up in `app-web` from the `TraceContext.tsx` file. We use the `AWSXRayPropagator` to generate and propagate trace IDs in our HTTP headers. We also use the `ZoneContextManager` which is recommended for dealing with async web operations. Since we can't send spans directly to our display system (New Relic) due to CORS issues, we have a small proxy in `app-api` that receives spans from our frontend and sends them off to the local collector to be shipped to New Relic.

### OTEL in app-api

We use the `otel_handler.ts` to create an OTEL tracer in `app-api`. In our Apollo GraphQL handler's `contextForRequestForFetcher` we put the tracer and the extracted parent trace ID that comes down in HTTP headers into our Apollo context, which is then passed to our resolvers. In each resolver we can then use the tracer to create a child span, which then can record either exceptions or success information about the request.

### Jaeger

See [Jaeger docs](https://www.jaegertracing.io/docs/1.35/getting-started/) or visit port [16686](http://localhost:16686) on local dev.

## New Relic

We export all monitoring and peformance data to New Relic. In order to access our New Relic dashboard, you'll first need to request access from someone on the team. Once access has been granted, the following should be used to sign in to your account:

### Setup
- New Relic browser monitoring is currently set up in `App.tsx` via inline script established in `newRelic.ts`. More about this approach in the [instrumentation for browser monitoring](https://docs.newrelic.com/docs/browser/new-relic-browser/page-load-timing-resources/instrumentation-browser-monitoring/).
- New Relic is hooked in OTEL data exports via `collector.yml`

- New Relic browser monitoring is currently set up in `App.tsx` via inline script established in `newRelic.ts`. More about this approach in the [instrumentation for browser monitoring](https://docs.newrelic.com/docs/browser/new-relic-browser/page-load-timing-resources/instrumentation-browser-monitoring/).
- New Relic is hooked in OTEL data exports via `collector.yml`

### Steps to check new relic dashboard

1. Open https://one.newrelic.com and log in to your account using your `@teamtrussworks.com` email address.
2. You will be redirected to the CMS SSO for Active Directory in Azure.
3. Log in with an email address consisting of your EUA ID + `@cloud.cms.gov`.
Expand All @@ -29,7 +49,8 @@ We export all monitoring and peformance data to New Relic. In order to access ou
6. Enter your MFA token. You should now be at our New Relic dashboard where all our OTEL metrics are being displayed.

### How to test monitoring
- Login as admin user
- Navigate to `/settings?test=crash` to test a full frontend crash
- Navigate to `/settings?test=error` to test logging an error
- To test backend crash - instructions forthcoming.

- Login as admin user
- Navigate to `/settings?test=crash` to test a full frontend crash
- Navigate to `/settings?test=error` to test logging an error
- To test backend crash - instructions forthcoming.
22 changes: 0 additions & 22 deletions otel-collector-config.yml

This file was deleted.

10 changes: 5 additions & 5 deletions services/app-api/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@
"@aws-sdk/lib-storage": "^3.529.1",
"@launchdarkly/node-server-sdk": "9.4.0",
"@opentelemetry/exporter-trace-otlp-http": "^0.45.1",
"@opentelemetry/instrumentation-aws-lambda": "^0.41.1",
"apollo-server-core": "^3.11.1",
"apollo-server-lambda": "^3.5.0",
"archiver": "^7.0.1",
Expand All @@ -58,15 +59,14 @@
"@graphql-tools/jest-transform": "^2.0.0",
"@opentelemetry/api": "^1.6.0",
"@opentelemetry/core": "^1.18.1",
"@opentelemetry/resources": "^1.18.1",
"@opentelemetry/sdk-trace-base": "^1.18.1",
"@opentelemetry/sdk-trace-node": "^1.18.1",
"@opentelemetry/semantic-conventions": "^1.18.1",
"@opentelemetry/exporter-trace-otlp-http": "^0.45.1",
"@opentelemetry/id-generator-aws-xray": "^1.2.1",
"@opentelemetry/instrumentation": "^0.45.1",
"@opentelemetry/instrumentation-http": "^0.45.1",
"@opentelemetry/propagator-aws-xray": "^1.3.1",
"@opentelemetry/resources": "^1.18.1",
"@opentelemetry/sdk-trace-base": "^1.18.1",
"@opentelemetry/sdk-trace-node": "^1.18.1",
"@opentelemetry/semantic-conventions": "^1.18.1",
"@prisma/client": "^4.6",
"@types/archiver": "^6.0.2",
"@types/aws-lambda": "^8.10.83",
Expand Down
14 changes: 7 additions & 7 deletions services/app-api/serverless.yml
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ custom:
qa_uploads_bucket_arn: ${env:CF_CONFIG_IGNORED_LOCALLY, cf:uploads-${sls:stage}.QAUploadsBucketArn}
secretsManagerSecret: aurora_postgres_${sls:stage} #pragma: allowlist secret
reactAppAuthMode: ${env:REACT_APP_AUTH_MODE}
reactAppOtelCollectorUrl: ${env:REACT_APP_OTEL_COLLECTOR_URL, ssm:/configuration/react_app_otel_collector_url}
apiAppOtelCollectorUrl: ${env:API_APP_OTEL_COLLECTOR_URL, ssm:/configuration/api_app_otel_collector_url}
dbURL: ${env:DATABASE_URL}
ldSDKKey: ${env:LD_SDK_KEY, ssm:/configuration/ld_sdk_key_feds}
allowedIpAddresses: ${env:ALLOWED_IP_ADDRESSES, ssm:/configuration/allowed_ip_addresses}
Expand Down Expand Up @@ -150,7 +150,7 @@ provider:
stage: ${sls:stage}
DATABASE_URL: ${self:custom.dbURL}
REACT_APP_AUTH_MODE: ${self:custom.reactAppAuthMode}
REACT_APP_OTEL_COLLECTOR_URL: ${self:custom.reactAppOtelCollectorUrl}
API_APP_OTEL_COLLECTOR_URL: ${self:custom.apiAppOtelCollectorUrl}
SECRETS_MANAGER_SECRET: ${self:custom.secretsManagerSecret}
EMAILER_MODE: ${self:custom.emailerMode}
PARAMETER_STORE_MODE: ${self:custom.parameterStoreMode}
Expand Down Expand Up @@ -189,7 +189,7 @@ functions:
path: otel
method: post
cors: true

graphql:
handler: src/handlers/apollo_gql.graphqlHandler
events:
Expand All @@ -208,15 +208,15 @@ functions:
method: post
cors: true
authorizer:
name: third_party_api_authorizer
identitySource: method.request.header.Authorization
name: third_party_api_authorizer
identitySource: method.request.header.Authorization
- http:
path: v1/graphql/external
method: get
cors: true
authorizer:
name: third_party_api_authorizer
identitySource: method.request.header.Authorization
name: third_party_api_authorizer
identitySource: method.request.header.Authorization
timeout: 60 # aurora cold start can be long
vpc:
securityGroupIds:
Expand Down
4 changes: 2 additions & 2 deletions services/app-api/src/authn/thirdPartyAuthn.ts
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,10 @@ export async function userFromThirdPartyAuthorizer(
userId: string
) {
// setup otel tracing
const otelCollectorURL = process.env.REACT_APP_OTEL_COLLECTOR_URL
const otelCollectorURL = process.env.API_APP_OTEL_COLLECTOR_URL
if (!otelCollectorURL || otelCollectorURL === '') {
const errMsg =
'Configuration Error: REACT_APP_OTEL_COLLECTOR_URL must be set'
'Configuration Error: API_APP_OTEL_COLLECTOR_URL must be set'
throw errMsg
}

Expand Down
65 changes: 15 additions & 50 deletions services/app-api/src/handlers/apollo_gql.ts
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import type { Span, Tracer } from '@opentelemetry/api'
import { propagation, ROOT_CONTEXT, SpanKind } from '@opentelemetry/api'
import type { Context as OTELContext, Span, Tracer } from '@opentelemetry/api'
import { propagation, ROOT_CONTEXT } from '@opentelemetry/api'
import { ApolloServer } from 'apollo-server-lambda'
import type {
APIGatewayProxyEvent,
Expand All @@ -20,7 +20,6 @@ import { NewPostgresStore } from '../postgres/postgresStore'
import { configureResolvers } from '../resolvers'
import { configurePostgres } from './configuration'
import { createTracer } from '../otel/otel_handler'
import { SemanticAttributes } from '@opentelemetry/semantic-conventions'
import {
newAWSEmailParameterStore,
newLocalEmailParameterStore,
Expand All @@ -29,21 +28,20 @@ import type { LDService } from '../launchDarkly/launchDarkly'
import { ldService, offlineLDService } from '../launchDarkly/launchDarkly'
import type { LDClient } from '@launchdarkly/node-server-sdk'
import * as ld from '@launchdarkly/node-server-sdk'
import { newJWTLib } from '../jwt'
import {
ApolloServerPluginLandingPageLocalDefault,
ApolloServerPluginLandingPageDisabled,
ApolloServerPluginLandingPageLocalDefault,
} from 'apollo-server-core'
import { newJWTLib } from '../jwt'

const requestSpanKey = 'REQUEST_SPAN'
let tracer: Tracer

let ldClient: LDClient

// The Context type passed to all of our GraphQL resolvers
export interface Context {
user: UserType
span?: Span
tracer?: Tracer
ctx?: OTELContext
}

// This function pulls auth info out of the cognitoAuthenticationProvider in the lambda event
Expand All @@ -57,9 +55,10 @@ function contextForRequestForFetcher(userFetcher: userFromAuthProvider): ({
}) => Promise<Context> {
return async ({ event, context }) => {
// pull the current span out of the LAMBDA context, to place it in the APOLLO context
// eslint-disable-next-line @typescript-eslint/no-explicit-any
const anyContext = context as any
const requestSpan = anyContext[requestSpanKey]
const stageName = process.env.stage
const tracer = createTracer('app-api-' + stageName)
const ctx = propagation.extract(ROOT_CONTEXT, event.headers)

const authProvider =
event.requestContext.identity.cognitoAuthenticationProvider
// This handler is shared with the third_party_API_authorizer
Expand Down Expand Up @@ -104,7 +103,8 @@ function contextForRequestForFetcher(userFetcher: userFromAuthProvider): ({
if (!userResult.isErr()) {
return {
user: userResult.value,
span: requestSpan,
tracer: tracer,
ctx: ctx,
}
} else {
throw new Error(
Expand Down Expand Up @@ -179,37 +179,6 @@ function ipRestrictionMiddleware(
}
}

// Tracing Middleware
function tracingMiddleware(wrapped: Handler): Handler {
return async function (event, context, completion) {
// get the parent context from headers
const ctx = propagation.extract(ROOT_CONTEXT, event.headers)
const span = tracer.startSpan(
'handleRequest',
{
kind: SpanKind.SERVER,
attributes: {
[SemanticAttributes.AWS_LAMBDA_INVOKED_ARN]:
context.invokedFunctionArn,
},
},
ctx
)

// Put the span into the LAMBDA context, in order to pass it into the APOLLO context in contextForRequestForFetcher
// We have to use any here because this context's type is not under our control.
// eslint-disable-next-line @typescript-eslint/no-explicit-any
const anyContext = context as any
anyContext[requestSpanKey] = span

const result = await wrapped(event, context, completion)

span.end()

return result
}
}

// This asynchronous function is started on the cold-load of this script
// and is awaited by our handler function
// Pattern is explained here https://serverlessfirst.com/function-initialisation/
Expand All @@ -225,7 +194,7 @@ async function initializeGQLHandler(): Promise<Handler> {
const stageName = process.env.stage
const applicationEndpoint = process.env.APPLICATION_ENDPOINT
const emailerMode = process.env.EMAILER_MODE
const otelCollectorUrl = process.env.REACT_APP_OTEL_COLLECTOR_URL
const otelCollectorUrl = process.env.API_APP_OTEL_COLLECTOR_URL
const parameterStoreMode = process.env.PARAMETER_STORE_MODE
const ldSDKKey = process.env.LD_SDK_KEY
const allowedIpAddresses = process.env.ALLOWED_IP_ADDRESSES
Expand Down Expand Up @@ -253,7 +222,7 @@ async function initializeGQLHandler(): Promise<Handler> {

if (otelCollectorUrl === undefined || otelCollectorUrl === '') {
throw new Error(
'Configuration Error: REACT_APP_OTEL_COLLECTOR_URL is required to run app-api'
'Configuration Error: API_APP_OTEL_COLLECTOR_URL is required to run app-api'
)
}

Expand Down Expand Up @@ -452,11 +421,7 @@ async function initializeGQLHandler(): Promise<Handler> {
},
})

// init tracer and set the middleware. tracer needs to be global.
tracer = createTracer('app-api-' + stageName)
const combinedHandler = ipRestrictionMiddleware(allowedIpAddresses)(
tracingMiddleware(handler)
)
const combinedHandler = ipRestrictionMiddleware(allowedIpAddresses)(handler)

// Locally, we wrap our handler in a middleware that returns 403 for unauthenticated requests
const isLocal = authMode === 'LOCAL'
Expand Down
4 changes: 2 additions & 2 deletions services/app-api/src/handlers/postgres_migrate.ts
Original file line number Diff line number Diff line change
Expand Up @@ -7,10 +7,10 @@ import { migrate, newDBMigrator } from '../dataMigrations/dataMigrator'

export const main: Handler = async (): Promise<APIGatewayProxyResultV2> => {
// setup otel tracing
const otelCollectorURL = process.env.REACT_APP_OTEL_COLLECTOR_URL
const otelCollectorURL = process.env.API_APP_OTEL_COLLECTOR_URL
if (!otelCollectorURL || otelCollectorURL === '') {
const errMsg =
'Configuration Error: REACT_APP_OTEL_COLLECTOR_URL must be set'
'Configuration Error: API_APP_OTEL_COLLECTOR_URL must be set'
return fmtMigrateError(errMsg)
}
const serviceName = 'postgres-migrate'
Expand Down
8 changes: 4 additions & 4 deletions services/app-api/src/otel/otel_handler.ts
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,9 @@ import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http'
import { AWSXRayPropagator } from '@opentelemetry/propagator-aws-xray'
import { AWSXRayIdGenerator } from '@opentelemetry/id-generator-aws-xray'
import { AwsLambdaInstrumentation } from '@opentelemetry/instrumentation-aws-lambda'

export function createTracer(serviceName: string): Tracer {
const provider = new NodeTracerProvider({
Expand All @@ -20,7 +20,7 @@ export function createTracer(serviceName: string): Tracer {

// log to console and send to New Relic
const exporter = new OTLPTraceExporter({
url: process.env.REACT_APP_OTEL_COLLECTOR_URL,
url: process.env.API_APP_OTEL_COLLECTOR_URL,
headers: {},
})

Expand All @@ -32,8 +32,8 @@ export function createTracer(serviceName: string): Tracer {
})

registerInstrumentations({
instrumentations: [new HttpInstrumentation()],
instrumentations: [new AwsLambdaInstrumentation()],
})

return trace.getTracer('app-api')
return trace.getTracer(serviceName)
}
Loading
Loading