Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File Descriptor Leak in PingSource #8188

Open
liuyuxuan0723 opened this issue Sep 9, 2024 · 5 comments
Open

File Descriptor Leak in PingSource #8188

liuyuxuan0723 opened this issue Sep 9, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@liuyuxuan0723
Copy link

Problem Description:

I followed the documentation https://knative.dev/docs/eventing/sources/ping-source/ to deploy and create several PingSources to trigger my functions. After running for a few days, all triggers stopped working. The logs from pingsource-mt-adapter are as follows:

2024-09-09 10:03:03	
{"level":"error","ts":"2024-09-09T02:03:03.737Z","logger":"pingsource-mt-adapter","caller":"mtping/runner.go:157","msg":"failed to send cloudevent result: {result 26 0  Post \"http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task\": dial tcp 172.16.53.129:80: connect: cannot assign requested address (5x)} {source 15 0 /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21321-8be55ff981d6e3e8 <nil>} {target 15 0 http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task <nil>} {id 15 0 1118f752-e468-4fa4-8971-11b634b92413 <nil>}","commit":"cf0f9d5","stacktrace":"knative.dev/eventing/pkg/adapter/mtping.(*cronJobsRunner).cronTick.func1\n\tknative.dev/eventing/pkg/adapter/mtping/runner.go:157\ngithub.com/robfig/cron/v3.FuncJob.Run\n\tgithub.com/robfig/cron/v3@v3.0.1/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\tgithub.com/robfig/cron/v3@v3.0.1/cron.go:312"}

2024-09-09 10:03:03	
{"level":"error","ts":"2024-09-09T02:03:03.585Z","logger":"pingsource-mt-adapter","caller":"mtping/runner.go:157","msg":"failed to send cloudevent result: {result 26 0  Post \"http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/push-shadow-vin-down\": dial tcp 172.16.53.129:80: connect: cannot assign requested address (5x)} {source 15 0 /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21248-8be55ff981d6e3e8 <nil>} {target 15 0 http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/push-shadow-vin-down <nil>} {id 15 0 5e1acd95-26a3-44f1-a25f-2630609bc3a4 <nil>}","commit":"cf0f9d5","stacktrace":"knative.dev/eventing/pkg/adapter/mtping.(*cronJobsRunner).cronTick.func1\n\tknative.dev/eventing/pkg/adapter/mtping/runner.go:157\ngithub.com/robfig/cron/v3.FuncJob.Run\n\tgithub.com/robfig/cron/v3@v3.0.1/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\tgithub.com/robfig/cron/v3@v3.0.1/cron.go:312"}
2024-09-09 10:03:03	
{"level":"error","ts":"2024-09-09T02:03:03.585Z","logger":"pingsource-mt-adapter","caller":"mtping/runner.go:157","msg":"failed to send cloudevent result: {result 26 0  Post \"http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/slipping-point/\": dial tcp 172.16.53.129:80: connect: cannot assign requested address (5x)} {source 15 0 /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21228-2195be836486b1a6 <nil>} {target 15 0 http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/slipping-point/ <nil>} {id 15 0 158a4a33-56f4-4929-ad32-144035504ee8 <nil>}","commit":"cf0f9d5","stacktrace":"knative.dev/eventing/pkg/adapter/mtping.(*cronJobsRunner).cronTick.func1\n\tknative.dev/eventing/pkg/adapter/mtping/runner.go:157\ngithub.com/robfig/cron/v3.FuncJob.Run\n\tgithub.com/robfig/cron/v3@v3.0.1/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\tgithub.com/robfig/cron/v3@v3.0.1/cron.go:312"}

Monitoring Information:

  • File Descriptor Monitoring:
    image-20240909112639239
  • CPU Monitoring:
    image-20240909112739582
  • Memory Monitoring:
    image-20240909112717317

Environment:

  • Knative release version: 1.11.2
  • k8s version: 1.24.0

Expected Behavior:

The pingsource-mt-adapter should run normally, and there should be no leaks.

@liuyuxuan0723 liuyuxuan0723 added the kind/bug Categorizes issue or PR as related to a bug. label Sep 9, 2024
@maschmid
Copy link
Contributor

I cannot reproduce the problem with "normally" behaving functions.

Trying it with the current Fission 1.20.4 , and eventing 1.11 installed from the Serverless Operator 1.32.2 on OpenShift

Creating a hello-js from the fission examples

fission env create --name nodejs --image ghcr.io/fission/node-env
curl -LO https://raw.githubusercontent.com/fission/examples/main/nodejs/hello.js
fission function create --name hello-js --env nodejs --code hello.js
fission route create --function hello-js --url /hello-js --name hello-js --method GET --method POST

and having several PingSources like

apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: test-ping-source-1
  namespace: foobar
spec:
  contentType: application/json
  data: '{"message": "Hello world from 1!"}'
  schedule: '*/1 * * * *'
  sink:
    uri: http://router.fission.svc.cluster.local/hello-js
  timezone: Asia/Shanghai

shows only a stable number of open TCP connections.

I can reproduce the problem, kind of, if I add very long timeouts to the function

module.exports = async function(context) {
    await new Promise(r => setTimeout(r, 518400000));
    return {
        status: 200,
        body: "hello, world!\n"
    };
}

and adding a fntimeout to the fission function to override the default 60s

fission function create --name hello-js --env nodejs --code hello.js --fntimeout 604800

making the functions to never return does show that the pingsource does not have any timeouts on its own and the number of open TCP connections would be steadily rising if the sink would keep the connections open indefinitely and never return an HTTP response.

Can you see the number of open files rising on the fission side as well? (I'd expect the similar number of open files on the fission router)

@liuyuxuan0723
Copy link
Author

liuyuxuan0723 commented Sep 18, 2024

@maschmid
I have checked the file descriptors of the Fission router and indeed encountered the same issue. The regular trigger interval being shorter than the function execution time does lead to a continuous increase in the number of open file descriptors.

Fission router monitoring:
image

Currently, I have deployed about 4 or 5 functions, all with a timeout set to 60 seconds. The trigger interval is set to once per minute. I haven't observed any long-running executions or cases where an HTTP response is not returned. However, quite a few functions returned a 504 due to execution timeouts. Could this situation be causing the file descriptors to continuously increase?

Here is the pingsource YAML:

apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  annotations:
    sources.knative.dev/creator: system:serviceaccount:fission-system:lifaas
    sources.knative.dev/lastModifier: system:serviceaccount:fission-system:lifaas
  creationTimestamp: "2024-09-05T10:02:18Z"
  generation: 1
  labels:
    functionName: dev-gen-task
  name: fnt-21321-8be55ff981d6e3e8
  namespace: faasprod-dev
  resourceVersion: "6207907587"
  uid: 8b60514e-6366-4ab5-89b5-8ca186c097c1
spec:
  contentType: application/json
  schedule: 0/1 * * * *
  sink:
    uri: http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task
status:
  ceAttributes:
  - source: /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21321-8be55ff981d6e3e8
    type: dev.knative.sources.ping
  conditions:
  - lastTransitionTime: "2024-09-10T09:48:38Z"
    status: "True"
    type: Deployed
  - lastTransitionTime: "2024-09-10T09:48:38Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-05T10:02:18Z"
    status: "True"
    type: SinkProvided
  observedGeneration: 1
  sinkUri: http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task

@liuyuxuan0723
Copy link
Author

liuyuxuan0723 commented Sep 18, 2024

@maschmid
Can I control the timeout or retry policy for the pingsource? I suspect that the retries might be causing the leak.

@maschmid
Copy link
Contributor

I am not aware of any mechanism there, AFAICT there is no timeout on the context that sends the events
@pierDipi any thoughts? We should probably guard the dispatchers against a sink that keep connections indefinitely and never replies.

@pierDipi
Copy link
Member

pierDipi commented Dec 5, 2024

there is an environment variable with the timeout here:

- name: K_SINK_TIMEOUT
value: '-1'
maybe changing that to set a timeout can help?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

3 participants