
Fix CallbackStack.Node leak #1318

Merged

aartigao merged 1 commit into fd4s:series/3.x from abestel:fix/memory_leak on Apr 17, 2024

Conversation

abestel
Contributor

@abestel abestel commented Apr 12, 2024

Some context to start: Cats Effect has been having memory leaks in CallbackStack since version 3.4.3.
See for example: typelevel/cats-effect#3935

I've been facing this memory leak in an application using fs2-kafka, and found that I'm not the only one (typelevel/cats-effect#3973).

Using a simple consumer > produce stream application, I monitored the size of the CallbackStack using the following command:

```
while sleep 1; do jcmd <pid> GC.class_histogram | grep 'cats.effect.CallbackStack$Node' ; done
```

I found that swapping the `F.race(shutdown, fetch)` for `fetch` stops the memory leak. This should not be an issue because the Stream is interrupted anyway by `.interruptWhen(F.race(shutdown, stopReqs.get).void.attempt)`, but I'm not 100% convinced of this.
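For illustration only (this is not the PR diff; `shutdown`, `stopReqs`, and `fetch` below are stand-ins for the KafkaConsumer internals), a minimal sketch contrasting the two shapes: the leaking variant races every fetch against the long-lived shutdown fiber's join, while the variant proposed here runs the fetch directly and leaves cancelation to `interruptWhen`:

```
import cats.effect._
import scala.concurrent.duration._

// Hedged sketch, not the real KafkaConsumer code: `shutdown` and `stopReqs`
// never complete here, and `fetch` is a placeholder effect.
object RaceVsInterruptSketch extends IOApp.Simple {
  override def run: IO[Unit] =
    Deferred[IO, Unit].flatMap { stopReqs =>
      IO.never[Unit].start.flatMap { shutdownFiber =>
        val shutdown = shutdownFiber.joinWithUnit
        val fetch    = IO.sleep(1.milli)

        fs2.Stream
          // leaking shape: .repeatEval(IO.race(shutdown, fetch))
          .repeatEval(fetch) // shape proposed in this PR
          .interruptWhen(IO.race(shutdown, stopReqs.get).void.attempt)
          .compile
          .drain
      }
    }
}
```

(The sketch runs forever; it only exists to show where the repeated `IO.race` against a fiber join sits.)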
@aartigao
Contributor

That's interesting... I'm busy this weekend, but I'll try to dig deeper next week.

Thanks for taking the time to investigate and find a solution!

@aartigao
Contributor

@abestel I wonder if this issue should be reported directly to CE. From your original message I understand that downgrading to 3.5.2 fixes the issue, am I wrong? If that's the case, the bug is on CE's side. OTOH, if downgrading doesn't help, then it's definitely on us.

But the thing is that I don't see anything inherently harmful, or any misuse of the CE API, in `F.race`. The code is difficult to follow (I'm working on refactoring this), but other than that, I don't see a reason for it to leak. There were, however, some fundamental changes regarding CallbackStack in 3.5.3, according to the release notes 👉🏽 #3943

If you can try that with your sample app, that would be awesome. If not, can you at least share the code to reproduce? 🙏🏽

Pinging @armanbilge just in case this rings a bell

@armanbilge

From your original message I understand that downgrading to 3.5.2 fixes the issue, am I wrong?

How is that? I understood that the memory leak has been occurring since 3.4.3:

Some context to start: Cats Effect has been having memory leaks in CallbackStack since version 3.4.3.

That makes sense, because v3.4.3 is the release where we changed the implementation of Deferred.

https://github.com/typelevel/cats-effect/releases/tag/v3.4.3

If the leak persists even with the latest Cats Effect v3.5.4, then I'm inclined to say that this is a Cats Effect bug that needs to be fixed.

@abestel
Contributor Author

abestel commented Apr 14, 2024

From your original message I understand that downgrading to 3.5.2 fixes the issue, am I wrong?

As Arman said above, this has been happening since 3.4.3 (or at least before 3.5.x). I've been experiencing weird behaviours in my fs2-kafka applications at work for quite a while, and the main issue is that even running in Kubernetes does not fully "fix" the problem by rebooting the app: sometimes the GC just gets into a weird state where it consumes all the CPU without throwing an OOM, and the only way out is to manually kill the pod.

If the leak persists even with the latest Cats Effect v3.5.4, then I'm inclined to say that this is a Cats Effect bug that needs to be fixed.

It does.

My thinking in opening this PR is that the change should have no functional impact while fixing a weird behaviour that may very well be a bug in Cats Effect. If we can mitigate the memory leak in fs2-kafka while troubleshooting Cats Effect, that would be a first win, but I understand if you'd rather skip this first step and troubleshoot the root cause directly.

If you can try that with your sample app, that would be awesome. If not, can you at least share the code to reproduce? 🙏🏽

I was actually trying to minimize the problem by taking fs2-kafka out of the loop, but this is proving more complicated than I expected. I'll try to extract the reproducer I was using into a minimal repo and share it with you (most likely tomorrow).

Thank you both for the prompt replies!

@abestel
Contributor Author

abestel commented Apr 15, 2024

Here is a simple repository to reproduce the memory leak: https://github.com/abestel/fs2-kafka-memory-leak

@abestel
Contributor Author

abestel commented Apr 15, 2024

Digging deeper into this, I managed to get a minimal reproduction that is leaking pretty fast (easily millions of nodes in the CallbackStack in a few seconds):

```
import cats.effect._

object MemoryLeak extends IOApp.Simple {
  override def run: IO[Unit] =
    Resource.make(IO.never.start)(_.cancel.start.void).use { fiber =>
      fs2.Stream
        .repeatEval(
          // every race registers (and then abandons) a listener on the fiber's join
          IO.race(fiber.joinWithUnit, IO.unit)
        )
        .compile
        .drain
    }
}
```

This is inspired by:

  1. The background Kafka process created in https://github.com/fd4s/fs2-kafka/blob/series/3.x/modules/core/src/main/scala/fs2/kafka/KafkaConsumer.scala#L111
  2. The fiber of this process is then used in the constructor of the `KafkaConsumer` in https://github.com/fd4s/fs2-kafka/blob/series/3.x/modules/core/src/main/scala/fs2/kafka/KafkaConsumer.scala#L706
  3. The fiber is used in the definition of `awaitTermination` in https://github.com/fd4s/fs2-kafka/blob/series/3.x/modules/core/src/main/scala/fs2/kafka/KafkaConsumer.scala#L660
  4. And `awaitTermination` is used in the shutdown "aggregate" in https://github.com/fd4s/fs2-kafka/blob/series/3.x/modules/core/src/main/scala/fs2/kafka/KafkaConsumer.scala#L159

At this point, however, I have no idea if this is a misuse of the CE API or a bug in CE.

@aartigao
Contributor

Awesome findings @abestel! 🙌🏽

Here's a slightly simpler version that also leaks 👇🏽

```
import cats.effect._

object Main extends IOApp.Simple {

  override def run: IO[Unit] = {

    def loop(fiber: FiberIO[Unit]): IO[Unit] = IO.race(fiber.joinWithUnit, IO.unit) >> loop(fiber)

    IO.never[Unit].start.flatMap(loop)
  }

}
```

With 3.5.2:

```
a.artigao@bcn-alan-artigao:~$ while sleep 1; do jcmd 191436 GC.class_histogram | grep 'cats.effect.CallbackStack' ; done
   1:      11363549      272725176  cats.effect.CallbackStack
 806:             1             16  cats.effect.CallbackStack$
   1:      11690946      280582704  cats.effect.CallbackStack
 806:             1             16  cats.effect.CallbackStack$
   1:      12029339      288704136  cats.effect.CallbackStack
 807:             1             16  cats.effect.CallbackStack$
   1:      12372064      296929536  cats.effect.CallbackStack
 807:             1             16  cats.effect.CallbackStack$
   1:      12656206      303748944  cats.effect.CallbackStack
 805:             1             16  cats.effect.CallbackStack$
```

With >= 3.5.3:

```
a.artigao@bcn-alan-artigao:~$ while sleep 1; do jcmd 192856 GC.class_histogram | grep 'cats.effect.CallbackStack$Node' ; done
   1:       2943526       70644624  cats.effect.CallbackStack$Node
   1:       3203883       76893192  cats.effect.CallbackStack$Node
   1:       3482526       83580624  cats.effect.CallbackStack$Node
   1:       3759762       90234288  cats.effect.CallbackStack$Node
   1:       4060133       97443192  cats.effect.CallbackStack$Node
   1:       4358185      104596440  cats.effect.CallbackStack$Node
   1:       4670690      112096560  cats.effect.CallbackStack$Node
   1:       4752425      114058200  cats.effect.CallbackStack$Node
   1:       5051627      121239048  cats.effect.CallbackStack$Node
   1:       5372415      128937960  cats.effect.CallbackStack$Node
   1:       5699771      136794504  cats.effect.CallbackStack$Node
   1:       6042548      145021152  cats.effect.CallbackStack$Node
   1:       6287672      150904128  cats.effect.CallbackStack$Node
   1:       6611084      158666016  cats.effect.CallbackStack$Node
```

@armanbilge Given that both leak, it's not related to the latest changes in CallbackStack. But... the question is: does it leak because this is an abuse of the CE API? WDYT?

@armanbilge

armanbilge commented Apr 16, 2024

Yes, we consider this a misuse of the API. Fiber and Deferred deliberately make different trade-offs:

  • `Deferred` should not leak when there are repeated canceled `get`s, at a small performance cost to keep the data structure compact
  • `Fiber` optimizes for performance instead, so repeatedly canceling `join`s will be a memory leak

So perhaps that code should be re-written to use Deferred instead of Fiber. Note that this is how Deferred and Fiber have always worked (modulo the recent bugs in Deferred).
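A rough sketch of that direction, assuming a made-up helper (`backgroundWithSignal` is not part of fs2-kafka or Cats Effect): the fiber's termination is surfaced through a `Deferred`, so callers race against `done.get`, which cleans up canceled listeners, rather than against `fiber.joinWithUnit`, which keeps accumulating `CallbackStack` nodes:

```
import cats.effect._

object DeferredInsteadOfJoin {
  // Hypothetical helper, for illustration only: complete a Deferred when the
  // background effect terminates (normally, with an error, or via cancelation)
  // and hand that Deferred to callers instead of the fiber itself.
  def backgroundWithSignal[A](fa: IO[A]): Resource[IO, Deferred[IO, Unit]] =
    for {
      done <- Resource.eval(Deferred[IO, Unit])
      _    <- Resource.make(fa.guarantee(done.complete(()).void).start)(_.cancel)
    } yield done
}
```

A caller could then run `IO.race(done.get, fetch)` in a loop, which, per the trade-off above, should not pile up listeners the way repeatedly canceled joins do.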

However, this still wouldn't explain why the memory leak appeared in v3.4.3 which only changed Deferred. Unless the leak has been here all along?

@abestel
Contributor Author

abestel commented Apr 16, 2024

However, this still wouldn't explain why the memory leak appeared in v3.4.3 which only changed Deferred. Unless the leak has been here all along?

That may be a mistake on my side. I started noticing the memory leak quite a while ago, and it seemed to coincide with my upgrading of dependencies (in particular a big jump in fs2-kafka, hence a big jump in CE).

Now, even when trying the minimal reproductions with 3.4.2, the memory leak still happens, so it has most likely been there all along.

@abestel
Contributor Author

abestel commented Apr 16, 2024

So perhaps that code should be re-written to use Deferred instead of Fiber.

Coming back to this, I'm not sure we need that change rather than what I'm suggesting in this PR. As far as I understand, the "culprit" is the repeated evaluation of `F.race(shutdown, fetch)`, and I understand this as "cancel a potentially long fetch if the background process has been stopped".
The fetch itself is part of a stream that is interrupted on shutdown:

```
Stream
  .repeatEval {
    stopReqs
      .tryGet
      .flatMap {
        case None =>
          fetchPartition

        case Some(()) =>
          // Prevent issuing additional requests after partition is
          // revoked or shutdown happens, in case the stream isn't
          // interrupted fast enough
          F.unit
      }
  }
  .interruptWhen(F.race(shutdown, stopReqs.get).void.attempt)
  .compile
  .drain
```

So the fetch is canceled anyway when shutdown completes. Doing something like this does cancel the `IO.never` after 10 seconds:

```
Resource.make(IO.sleep(10.seconds).void.start)(_.cancel.start.void).use { fiber =>
  fs2.Stream
    .repeatEval(
      IO.never.onCancel(IO(println("canceled")))
    )
    .interruptWhen(fiber.joinWithUnit.attempt)
    .compile
    .drain
}
```

@aartigao
Contributor

I agree. This can be simplified as you suggest. But I'd like to look at it in more detail, just to be sure I don't miss anything.

If everything goes as expected, tomorrow I'll review and merge this PR. I won't make a release though, because I want to use the generated SNAPSHOT in our Staging env first.

@aartigao aartigao merged commit fb082a0 into fd4s:series/3.x Apr 17, 2024
8 checks passed
@aartigao
Contributor

@samspills You might be interested in testing 3.4.0-8-fb082a0-SNAPSHOT to see if your leak is related to ☝🏽 (which it likely is...)

@abestel abestel deleted the fix/memory_leak branch April 17, 2024 09:24
@abestel
Contributor Author

abestel commented Apr 17, 2024

I'm also testing the snapshot today, and so far it looks like it's fixing my issue (even though the memory leak only really acts up over longer time frames). Same load to handle, different versions of fs2-kafka.

Before

Memory (G1 Old Gen and committed heap) has an upward trend, slow but steady. The garbage collector slowly gets less and less efficient, using more and more CPU.

Screenshot 2024-04-17 at 15 03 33

After

Pretty stable so far :-)
(I'm not sure what that random spike is, but it looks like Kubernetes doing something, so I'm treating it as an outlier.)

Screenshot 2024-04-17 at 15 04 10

@samspills

samspills commented Apr 17, 2024

Our leak was quite slow BUT tentatively our app is also looking good and no longer leaky :D :D

Thank you so much @abestel and @aartigao ❤️

@aartigao aartigao changed the title from "fix: memory leak in Cats Effects" to "Fix CallbackStack.Node leak" on Apr 18, 2024
@erikvanoosten

We want to try out the snapshot as well. Where can we find it? (I can't find it in Sonatype's snapshot repo.)

@abestel
Contributor Author

abestel commented Apr 18, 2024

We want to try out the snapshot as well. Where can we find it? (I can't find it in Sonatype's snapshot repo.)

You can try the release directly: https://github.com/fd4s/fs2-kafka/releases/tag/v3.5.0
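For reference, the usual sbt coordinates for that release (group ID assumed from fs2-kafka's published artifacts) would be:

```
libraryDependencies += "com.github.fd4s" %% "fs2-kafka" % "3.5.0"
```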

@Jasper-M

so repeatedly canceling `join`s will be a memory leak

This sounds like a trap that is a bit too easy to fall into. For instance, `background` returns an IO that is in fact a `join` on the background fiber. It is not so extraordinary to want to race against that IO in case the background fiber stops unexpectedly.
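A hypothetical illustration of that trap (not code from this PR): the IO handed out by `.background` is effectively a `join` on the background fiber, so racing against it in a loop accumulates `CallbackStack` nodes just like the reproducers above.

```
import cats.effect._

// Illustration only: each canceled race leaves a stale listener registered on
// the background fiber, so this loop leaks in the same way.
object BackgroundRaceTrap extends IOApp.Simple {
  def run: IO[Unit] =
    IO.never[Unit].background.use { outcome =>
      IO.race(outcome, IO.unit).foreverM
    }
}
```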
