
Max loading factor steps reached: 100/100 #402
Open · bartman64 opened this issue Aug 26, 2022 · 19 comments

bartman64 commented Aug 26, 2022

Hi,

Today we noticed an issue with our service. It stopped processing data, and when we looked into the logs we saw a lot of messages stating:
"isPoolQueueLow(): Max loading steps reached 100/100"

After restarting the service it ran properly again, but I would still like to know whether these log messages were an indicator of the error, or whether something else was going on. When does this WARN message occur?

[Screenshot of the log output: 2022-08-26_08h24_57]

astubbs (Contributor) commented Sep 2, 2022

Can you provide more logs / did you capture them? Earlier entries? (You can email them to me.)

bartman64 (Author)

Attached log, maybe it helps.

com-abc-rest-caller-testenv.log

bartman64 (Author)

Any idea when this WARN message could occur?

astubbs (Contributor) commented Sep 23, 2022

Hi, yes, it's emitted when processing is going faster than records can be queued up for processing, which is usually an indication that something is going wrong. This issue is in the queue and I will look into it more soon; however, it's a bit of a doozy. I don't suppose you noticed anything weird with the record processing when this happened? Was everything else appearing to be fine?

We need to add the metrics endpoint / info so that when this happens we can get more direct information about the internal state of PC.
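For context on where the loading factor and maxConcurrency fit, here is a minimal sketch of a Parallel Consumer (0.5.2.x) setup; the topic name, bootstrap address, and maxConcurrency value are illustrative and not taken from the reporter's service:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.List;
import java.util.Properties;

public class PcSetupSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // illustrative
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");             // illustrative
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // PC manages offsets itself

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                .consumer(consumer)
                .ordering(ProcessingOrder.KEY)
                .maxConcurrency(16) // the dynamic loading factor sizes the internal work queue relative to this
                .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);
        processor.subscribe(List.of("input-topic")); // illustrative topic

        // The user function runs in this callback; the warning discussed above fires when
        // these callbacks complete faster than the consumer can fetch and enqueue new records.
        processor.poll(context -> {
            // process the record(s) carried by the PollContext here
        });
    }
}
```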

astubbs (Contributor) commented Sep 23, 2022

How often has this happened?
Are you able to reproduce it?
Was throughput going slow when this happened?

The max for the loading factor is relatively arbitrary, but usually 3-5x is enough; 100 should be far more than is ever needed. The cap prevents potential OOM errors from the queue sizes.

bartman64 (Author)

Hi, unfortunately we didn't notice anything strange before this happened.
The last time it happened was yesterday, and I tried to find some kind of indicator as to why this is happening, but
from the logs everything seems to work fine until, out of nowhere, those 100/100 log messages appear.

The way this works for us is that a batch of data arrives throughout the night for our services to process,
and in the morning we check whether everything worked as expected.
When we noticed that no data had been processed that night, it was usually because of this error.
But it is something we can't really reproduce, since after restarting the service it suddenly works again.

"when the processing is going faster than records can be queued up for processing" - this is interesting information,
because our service checks via a flag whether we even have to send the data to an endpoint or not. If we don't, processing is probably much faster than calling the endpoint and waiting for a response.
So if a batch of that type arrives, processing could be much faster than the queueing up of new messages...

But why does processing stop completely when it reaches the max?

How often has this happened?
On average once a week, I think.

Was throughput going slow when this happened?
Didn't have an eye on that yet.

Thx for the response :)
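To make the flag scenario above concrete, here is a purely hypothetical sketch of such a user function; the header name "forward" and the callRestEndpoint helper are invented for illustration and are not from the actual service:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

import java.nio.charset.StandardCharsets;

class FlaggedRecordHandler {

    void handle(ConsumerRecord<String, String> record) {
        // Hypothetical flag: a record header decides whether the downstream call is needed.
        Header flag = record.headers().lastHeader("forward");
        boolean shouldForward =
                flag != null && "true".equals(new String(flag.value(), StandardCharsets.UTF_8));

        if (!shouldForward) {
            // Fast path: returns almost immediately, so the processing pool drains its
            // queue faster than the consumer can refill it and the loading factor climbs.
            return;
        }

        // Slow path: blocks for the HTTP round trip to the downstream endpoint.
        callRestEndpoint(record.value());
    }

    void callRestEndpoint(String payload) {
        // stand-in for the real REST call
    }
}
```

In a Parallel Consumer service this kind of handler would be invoked from the poll callback, e.g. `processor.poll(context -> ...)`.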

astubbs (Contributor) commented Sep 23, 2022

So when this happens - no messages are getting processed? That's not what I expected 🤔

"flag"

Ah - if this flag is set, your function returns immediately, without really doing much? Yes, that could cause it. However, it shouldn't prevent any progress being made - it's just the system telling you that it's not going to increase the loading factor beyond 100 (which means the number of queued messages is about 100 * the maxConcurrency setting)...

But why does the processing completely stop when it reaches max?

Yeah, that shouldn't be the case... what indication do you see that no progress is being made?
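As a rough worked example of that sizing: assuming a maxConcurrency of 16 (the default), a loading factor capped at 100 would mean PC targets on the order of 100 × 16 = 1,600 queued records. The exact internal sizing may differ, so treat these numbers as illustrative only.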

bartman64 (Author)

So, we are using KafkaHQ to access our Kafka topics, and there we see that our service's consumer group has a lag which doesn't decrease. We also no longer see requests arriving at the target endpoint. Maybe the real reason processing stops is something else, and the WARN log message is just a consequence of it, because the queue doesn't get filled...
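For cross-checking the lag observation outside KafkaHQ, a generic sketch using the Kafka AdminClient is shown below; the group id and bootstrap address are illustrative, and this is plain Kafka tooling rather than anything specific to PC:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the (illustrative) consumer group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-service-group")
                         .partitionsToOffsetAndMetadata().get();

            // Latest end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            // Lag = end offset minus committed offset, per partition.
            committed.forEach((tp, offset) -> {
                if (offset == null) {
                    return; // no committed offset for this partition
                }
                long lag = latest.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```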

astubbs (Contributor) commented Oct 10, 2022

Which version are you running?

Are you not getting any failed processing logs?

Can you show your processing code?

bartman64 (Author)

We were running version 0.5.2.0 and have just upgraded to 0.5.2.3.
I wasn't able to capture any failed-processing logs before.
Maybe it would make sense to organize a call at some point, where we could discuss this and go through the code?

astubbs (Contributor) commented Oct 20, 2022

Yes, sure, we can do that - please email me directly, or contact me on the community Slack.

bartman64 (Author)

OK, thanks, I will ping you, but I'm currently busy with other topics ;)

astubbs (Contributor) commented Dec 9, 2022

FYI, I'm going to add a rate limiter so that the logs aren't spammed with this.
And the whole system is going to be removed in upcoming versions, so it won't be an issue then.

colinkuo commented Jan 20, 2023

Hi, @astubbs.
We are experiencing this issue now with version 0.5.2.3. When it happened, no messages were consumed from the assigned partitions, just like what @bartman64 encountered previously. Furthermore, it's relatively easy to reproduce. Restarting the problematic services didn't solve it, because the issue soon came back. Do you have any suggestions to mitigate this?
Thanks!

astubbs (Contributor) commented Jan 21, 2023

Thanks @colinkuo for the report! Just to check - was PC still making progress? And you just see it spamming the logs with this error?
.5 (releasing next week) has added a rate limiter to the warning. It’s not something to be too concerned about - it should only be performance related.

Can you tell me what your processing function does, or was doing at the time? Is it possible your function was completing records immediately with no delay?

I’m very much looking forward to merging the changes in 0.6 which will remove this system :)

colinkuo

Hi, @astubbs
When the error was spamming the logs, the CPU usage of the pods was low and the offsets of the assigned partitions were not moving fast enough. The outcome we've seen is that lag built up only on the partitions assigned to the problematic Kubernetes pods. I cannot tell whether processing was stuck or just slow.

Our user function takes about five ms on average, so there is some delay when processing a message in the user function.

Could you please elaborate on how removing that system will address this issue? Thanks!
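One way to verify the ~5 ms figure independently of PC's internals would be to wrap the user function in a simple timer, along the lines of the hypothetical sketch below (the 10 ms threshold is arbitrary):

```java
import java.util.function.Consumer;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Wraps any user function and logs invocations that take unusually long.
class TimedHandler<T> implements Consumer<T> {
    private static final Logger log = LoggerFactory.getLogger(TimedHandler.class);

    private final Consumer<T> delegate;

    TimedHandler(Consumer<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void accept(T recordContext) {
        long start = System.nanoTime();
        try {
            delegate.accept(recordContext);
        } finally {
            long micros = (System.nanoTime() - start) / 1_000;
            if (micros > 10_000) { // flag invocations slower than 10 ms (arbitrary threshold)
                log.warn("Slow user function invocation: {} µs", micros);
            }
        }
    }
}
```

The poll callback could then be passed as `processor.poll(new TimedHandler<>(realHandler))`, which would show whether latency spikes coincide with the periods when the lag stops decreasing.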

rkolesnev (Member) commented Sep 12, 2023

Hi @colinkuo, @bartman64 - do you still experience this issue with version 0.5.2.7, which has fixes for #547 and #606?

I am wondering if the issue here is the same as the one reported in #547...

#637 seems to be somewhat related, in that the same warning gets logged, but it appears to be caused by a commit during rebalance handling rather than by PC not dropping stale work from its queues correctly.

bartman64 (Author)

Hi, we haven't run into that particular issue anymore :)

colinkuo

Hi @rkolesnev, we are starting to test 0.5.2.7 and will keep you posted on any updates. Thanks!
