Memory leak #785

Open
applike-ss opened this issue Nov 8, 2024 · 13 comments
Comments

@applike-ss

I believe I have found a memory leak in the image-automation-controller.
Here's a screenshot from my Grafana dashboard showing memory usage of the image-automation-controller over the last 7 days:
[Grafana screenshot: memory usage of image-automation-controller, last 7 days]
It only seems to affect the current leader (which makes sense, as it is the one doing the actual work).
The restarts visible as drops in memory usage are not crashes as far as I can tell (there are no logs about crashes), but seem to be related to instance scaling.

Image used: ghcr.io/fluxcd/image-automation-controller:v0.39.0
Args: --events-addr=http://notification-controller.flux-system.svc.cluster.local./ --watch-all-namespaces=true --log-level=info --log-encoding=json --enable-leader-election

There were ~80 image updates during these days; however, they don't seem to be the cause, as memory also increased on days without any image updates.
[screenshot omitted]

@kingdonb
Member

kingdonb commented Nov 21, 2024

I am running the image-automation-controller with a single image update automation and enabled metrics-server to check on the memory usage. My baseline (startup) memory usage is about 10 MB, and my usage after about 3 days was only 12 MB.

I will continue to monitor and do some more experiments on my end. If you can run flux stats and include the output here, that will give us a clear picture of how many of each type of image resource you have on your cluster; that is a starting point.

Can you say whether the 80 image updates all came from one ImageUpdateAutomation, whether you have more than one, etc.? The flux stats output will give us some of this information. If you can characterize your environment in some more detail (e.g. single IUA, multiple tenants, etc.), that will also help.

There is also some guidance on how to extract a profile, which may help us debug this issue. Can you please follow the instructions from the debugging guide and let us know what you find? Specifically, see the section "Collecting a profile"; the profile should be easy to obtain from the metrics port:

https://fluxcd.io/flux/gitops-toolkit/debugging/

Other information which may be useful: what are the intervals set to on the Image resources, the ImageUpdateAutomation in particular? What does the spec look like? (Are the checkout branch and commit branch different?) Is the Git repository that the ImageUpdateAutomation points at big or small? Do you see anything abnormal in the logs?

Anything else you can tell us that differentiates your environment from the Image Update guide may prove meaningful.

@applike-ss
Author

applike-ss commented Nov 22, 2024

Here's what I got from flux stats -A:

```
RECONCILERS           RUNNING  FAILING  SUSPENDED  STORAGE
GitRepository         18       0        0          1.0 MiB
OCIRepository         100      1        0          287.8 KiB
HelmRepository        22       0        0          19.0 MiB
HelmChart             137      0        0          3.1 MiB
Bucket                0        0        0          -
Kustomization         148      1        0          -
HelmRelease           136      0        0          -
Alert                 0        0        0          -
Provider              0        0        0          -
Receiver              9        0        0          -
ImageUpdateAutomation 19       0        0          -
ImagePolicy           51       0        0          -
ImageRepository       52       0        0          -
```

> Can you say whether the 80 image updates all came from one ImageUpdateAutomation

No, it was the total number across all image update automations.

> If you can characterize your environment in some more detail (e.g. single IUA, multiple tenants, etc.), that will also help.

We have a base repository for all of our clusters with the common applications that are always needed, e.g. ingress controllers, logging, etc.
Then we have a Kustomization that adds multiple tenants for different workloads.
Each tenant has its own IUA and multiple image policies.
Source and target branch are always the same.
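
For illustration, one of our tenant automations looks roughly like this (names, namespaces, and paths are placeholders rather than our real values):

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: tenant-a            # placeholder name
  namespace: tenant-a       # each tenant has its own namespace and IUA
spec:
  interval: 1m              # our current interval, see below
  sourceRef:
    kind: GitRepository
    name: fleet-repo        # placeholder, the shared Git source
  git:
    checkout:
      ref:
        branch: main        # source branch
    commit:
      author:
        name: fluxcdbot
        email: fluxcdbot@example.com
      messageTemplate: "chore: automated image update"
    push:
      branch: main          # target branch, same as the checkout branch
  update:
    path: ./flux            # only the whitelisted Flux folder is updated
    strategy: Setters
```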

> There is also some guidance on how to extract a profile, which may help us debug this issue. Can you please follow the instructions from the debugging guide and let us know what you find? Specifically, see the section "Collecting a profile"; the profile should be easy to obtain from the metrics port...

Here's the heap file; at this point the controller had around 37 MB of memory usage after running for 42 hours:
heap.zip

> what are the intervals set to on the Image resources, the ImageUpdateAutomation in particular

Usual intervals are 1m for both ImageRepository and ImageUpdateAutomation.

> Is the Git repository that the ImageUpdateAutomation points at big or small?

Since there are quite a few of them, it is likely that at least one is not small. We do, however, use the spec.ignore property to whitelist only a folder specific to Flux resources, like so:

```
# exclude all
/*
# include compacter
!/flux
```
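
For context, those rules live in spec.ignore on the GitRepository that the automations point at, roughly like this (name, URL, and secret are placeholders):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet-repo          # placeholder name
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.example.com/org/fleet-repo   # placeholder URL
  ref:
    branch: main
  secretRef:
    name: fleet-repo-auth   # placeholder credentials secret
  ignore: |
    # exclude all
    /*
    # include compacter
    !/flux
```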

> Do you see anything abnormal in the logs?

Nothing so far

> Anything else you can tell us that differentiates your environment from the Image Update guide may prove meaningful.

In fact, I don't see anything that would be different from the guide (https://fluxcd.io/flux/guides/image-update).

@stefanprodan
Member

stefanprodan commented Nov 22, 2024

A 1m interval for ImageUpdateAutomation means the controller needs to run a Git ls-remote every minute, then a Git clone, checkout, etc. The spec.ignore does not help with IUA because the Git checkout brings the whole repo into tmp storage. My guess is that the Go GC doesn't have time to clean up memory because the controller is constantly running Git operations, which are probably exhausting the CPU threads. The default memory limit is 1Gi; did you lower that? Did you also change the CPU limit? In any case, I suggest setting the IUA interval to hours, not minutes. When there is an image update, Flux issues an event and the IUA triggers instantly, so there is no need to DDoS the Git server.
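
As a minimal sketch (names are illustrative), the change is just the interval on the automation:

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: tenant-a        # placeholder name
  namespace: tenant-a
spec:
  interval: 1h          # instead of 1m; event-driven triggers still push updates right away
  # ...rest of the spec stays the same
```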

@applike-ss
Author

The resources set for our image-automation-controller are as follows:

```yaml
resources:
  limits:
    cpu: '1'
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 64Mi
```

Thanks for your input @stefanprodan, I will try raising the interval and see if that changes the memory usage metrics.

@stefanprodan
Member

stefanprodan commented Nov 22, 2024

Have you seen the controller reach 1Gi, and are you sure the restart is due to OOM? From what you've posted here I see no evidence of OOM; if the controller gets to 100Mi, that's normal, as the GC sees there is lots of free memory. An OOM is logged by the kubelet; did that actually happen?

@applike-ss
Author

I haven't seen it reach 1Gi, and I never said (nor do I think) that the restarts are OOM related.
I'm pretty sure the restarts happen during instance scaling, which is fine.

@stefanprodan
Member

A memory leak would eventually always result in an OOM, hence the issue title is confusing to me.

@applike-ss
Author

applike-ss commented Nov 22, 2024

It cannot result in an OOM when the memory is rising slowly enough that our instances are replaced by new ones before an OOM could occur.

@stefanprodan
Member

The extra memory you see can be reclaimed by the OS if it needs it; if you look at the memory dump you'll see this. There is no evidence of a memory leak as far as I can tell.

@applike-ss
Author

Adjusting the interval from 1 minute to 5 minutes has already helped a lot and decreased the memory growth by ~60% over the same period of time.
However, the peaks are still getting bigger over time (comparing against the point where the pod takes over the leader election lease).
While this is not critical for us, it still indicates to me that something might be leaking memory over time. I see ~16 MB of growth in 4 days.

@kingdonb
Member

> When there is an image update, Flux issues an event and the IUA triggers instantly, so there is no need to DDoS the Git server.

It's great to hear that lengthening the interval for ImageUpdateAutomation from 1m to 5m had an impact, but you can set it even longer and the impact on end-user experience should be minimal. Let me explain why...

Where you should keep a short interval is on the ImageRepository resource. It acts like a source: when the ImageRepository reconciles and finds a new set of tags, it gets a new resource version in its metadata/status, which automatically reconciles the ImagePolicy resources downstream of it. Those ImagePolicy resources, upon finding that a newer tag now fulfills the policy, will in turn automatically trigger any ImageUpdateAutomation in the same namespace that is downstream of them. So nobody is waiting 1h for that interval.
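
As a rough sketch (names and the semver range are only examples), the short interval belongs on the ImageRepository; the ImagePolicy downstream has no interval of its own and reconciles when the repository does:

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: app                 # placeholder name
  namespace: tenant-a
spec:
  interval: 1m              # scanning tags is cheap, so a short interval is fine here
  image: ghcr.io/example/app
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: app
  namespace: tenant-a
spec:
  imageRepositoryRef:
    name: app
  policy:
    semver:
      range: 1.x            # example policy
```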

We appreciate the report, and I agree there might be something we can do to handle this better; perhaps a bug in the controller around the edge cases where reconciles arrive faster than the GC can clear the old data out of the heap. I'm not familiar enough with the codebase to dive in and narrow it down. But please try setting the ImageUpdateAutomation to an even longer interval, either 10m or 1h, and either set a short interval on the ImageRepository or set up a Receiver.

You should get a better end-user experience that way: no long delays between commits, and without triggering any leak issue.

@kingdonb
Member

To expand a little bit, I didn't want to bring up Receiver right away, but the broader characterization I often make is that "apply" operations in Flux are typically more expensive than "check for changes" source-type operations.

We can trigger appliers indirectly by making sure the source kind updates often, by polling it at a short interval. When there is no change in a source, the source controller has caching opportunities, so it can make that fetch inexpensive and basically a no-op. Similarly, it doesn't cost the Image Reflector Controller very much to fetch a list of tags and compare it to the previously observed list. Polling a source frequently is far less intensive than doing a dry-run on the cluster or the full git clone that Image Update must do. A full clone is required in order to be able to push commits.

So IUA's clone is a lot more expensive than the Source Controller's clone, which only needs to fetch the head commit.

The Flux resources which apply from a source (or, in IUA's case, generate a commit from a list of tags) are all configured automatically to create an internal watch on their upstreams, so when your GitRepository updates after finding a new revision, it automatically notifies downstream Kustomizations so they can reconcile immediately instead of waiting for an interval. There is a similar relationship between ImageRepository/ImagePolicy/ImageUpdateAutomation.

So when people are tuning intervals, and they haven't set up Receivers at all, but they want changes to go to the cluster fast... I say don't set the Kustomization to a short interval, because it will DDoS your cluster's control plane with unnecessary dry-runs. The same goes for IUA and full Git clones: when nothing on the source has changed and/or nothing in the target repository needs to change, the only purpose of the dry-run is drift correction. If you aren't worried about someone overwriting the tag in the Git repository with an older one, then there's very little reason to reconcile the IUA so frequently.

You can also make this behavior of sources triggering downstream resources instant, without setting any short intervals at all, by setting up a Receiver: in this case, configure GitHub with a package webhook and a Receiver that expects the package event with an ImageRepository target, so no resource has to use a shorter interval. This is the best solution for scaling to many image sources, where polling them all on a short interval runs the risk of exhausting cluster resources.
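
A minimal sketch of such a Receiver, assuming it lives in the same namespace as the ImageRepository and a webhook token secret already exists (all names are placeholders):

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: ghcr-package-events   # placeholder name
  namespace: tenant-a
spec:
  type: github
  events:
    - "ping"
    - "package"               # the GitHub package webhook event
  secretRef:
    name: webhook-token       # token shared with the GitHub webhook configuration
  resources:
    - apiVersion: image.toolkit.fluxcd.io/v1beta2
      kind: ImageRepository
      name: app               # placeholder, the repository to reconcile on package events
```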

I recently revisited the receiver guide from end-to-end to ensure that it works with ImageRepository, GitRepository, OCIRepository, and also cert-manager: https://fluxcd.io/flux/guides/webhook-receivers/

But even when Receivers aren't configured, there should be basically no case where a short interval like 1m or 5m for ImageUpdateAutomation is needed. It won't detect changes from upstream any faster; those changes come from the ImageRepository, which is the resource that must reconcile first in order for the change to flow downstream through the policy (and then, in turn, to the image update resource).

So that's where you should set your short interval, if that's the issue that you're trying to solve.

I think it is some kind of thread exhaustion issue, like Stefan suggested: the resources aren't getting cleaned up, due to a timeout or something similar, and that's what is causing the slow memory growth over a long period of time. If you can prevent the exhaustion/timeout from happening by setting a longer interval, then you shouldn't see any memory growth at all.

My image-automation-controller, running for several days with just one resource, still uses only 12 MB. I will set up more repositories, larger repositories, and shorter intervals to try to stress-test it, but you shouldn't need that configuration unless you have special circumstances. I have set up one Receiver with the package event, so even with default intervals, newly published images are committed by the IUA very fast; they are then deployed just as quickly via a second Receiver connecting the GitRepository to the push event, so a commit from the ImageUpdateAutomation can be detected and reconciled instantly.
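
The second Receiver is the same idea pointed at the Git source; again just a sketch with placeholder names:

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: fleet-repo-push       # placeholder name
  namespace: flux-system
spec:
  type: github
  events:
    - "ping"
    - "push"                  # trigger on pushes, including the IUA's own commits
  secretRef:
    name: webhook-token       # placeholder secret
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: fleet-repo        # placeholder, the repository the Kustomizations apply from
```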

@applike-ss
Author

Thank you for the detailed explanation. I will adjust accordingly.
