
Max loading factor steps reached: 100/100 #402
Open · bartman64 opened this issue Aug 26, 2022 · 19 comments

bartman64 commented Aug 26, 2022

Hi,

Today we noticed an issue with our service. It stopped processing data, and when we looked into the logs we saw a lot of messages stating:
"isPoolQueueLow(): Max loading steps reached 100/100"

After restarting the service it ran properly again, but I would still like to know whether these log messages were an indicator of the error, or whether something else was going on. When does this WARN message occur?

[Screenshot of the log output: 2022-08-26_08h24_57]

astubbs (Contributor) commented Sep 2, 2022

Can you provide more logs / did you capture them? Earlier entries? (You can email them to me.)

bartman64 (Author)

Attached log, maybe it helps.

com-abc-rest-caller-testenv.log

bartman64 (Author)

Any idea when this WARN message could occur?

astubbs (Contributor) commented Sep 23, 2022

Hi, yes, it's emitted when processing is going faster than records can be queued up for processing, which is usually an indication that something is going wrong. This issue is in the queue and I will look into it more soon; however, it's a bit of a doozy. I don't suppose you noticed anything weird with the record processing when this happened? Was everything else appearing to be fine?

We need to add the metrics endpoint / info so that when this happens we can get more direct information about the internal state of PC.
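For context on where the loading factor and maxConcurrency fit, here is a minimal sketch of a Parallel Consumer (0.5.2.x) setup; the topic name, bootstrap address, and maxConcurrency value are illustrative and not taken from the reporter's service:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.List;
import java.util.Properties;

public class PcSetupSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // illustrative
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");             // illustrative
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // PC manages offsets itself

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                .consumer(consumer)
                .ordering(ProcessingOrder.KEY)
                .maxConcurrency(16) // the dynamic loading factor sizes the internal work queue relative to this
                .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);
        processor.subscribe(List.of("input-topic")); // illustrative topic

        // The user function runs in this callback; the warning discussed above fires when
        // these callbacks complete faster than the consumer can fetch and enqueue new records.
        processor.poll(context -> {
            // process the record(s) carried by the PollContext here
        });
    }
}
```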

astubbs (Contributor) commented Sep 23, 2022

How often has this happened?
Are you able to reproduce it?
Was throughput going slow when this happened?

The max for the loading factor is relatively arbitrary, but usually 3-5x is enough; 100 should be far more than is ever needed. The cap prevents potential OOM errors from the queue sizes.

bartman64 (Author)

Hi, unfortunately we didn't notice anything strange before this happened.
The last time it happened was yesterday, and I tried to find some kind of indicator as to why this is happening, but
from the logs everything seems to work fine until, out of nowhere, those 100/100 log messages appear.

The way this works for us is that a batch of data arrives throughout the night for our services to process,
and in the morning we check whether everything worked as expected.
When we noticed that no data had been processed that night, it was usually because of this error.
But it is something we can't really reproduce, since after restarting the service it suddenly works again.

"when the processing is going faster than records can be queued up for processing" - this is interesting information,
because our service checks via a flag whether we even have to send the data to an endpoint or not. If we don't, processing is probably much faster than calling the endpoint and waiting for a response.
So if a batch of that type arrives, processing could be much faster than the queueing up of new messages...

But why does processing stop completely when it reaches the max?

How often has this happened?
On average once a week, I think.

Was throughput going slow when this happened?
Didn't have an eye on that yet.

Thx for the response :)
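To make the flag scenario above concrete, here is a purely hypothetical sketch of such a user function; the header name "forward" and the callRestEndpoint helper are invented for illustration and are not from the actual service:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

import java.nio.charset.StandardCharsets;

class FlaggedRecordHandler {

    void handle(ConsumerRecord<String, String> record) {
        // Hypothetical flag: a record header decides whether the downstream call is needed.
        Header flag = record.headers().lastHeader("forward");
        boolean shouldForward =
                flag != null && "true".equals(new String(flag.value(), StandardCharsets.UTF_8));

        if (!shouldForward) {
            // Fast path: returns almost immediately, so the processing pool drains its
            // queue faster than the consumer can refill it and the loading factor climbs.
            return;
        }

        // Slow path: blocks for the HTTP round trip to the downstream endpoint.
        callRestEndpoint(record.value());
    }

    void callRestEndpoint(String payload) {
        // stand-in for the real REST call
    }
}
```

In a Parallel Consumer service this kind of handler would be invoked from the poll callback, e.g. `processor.poll(context -> ...)`.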

astubbs (Contributor) commented Sep 23, 2022

So when this happens - no messages are getting processed? That's not what I expected 🤔

"flag"

Ah - if this flag is set, your function returns immediately, without really doing much? Yes, that could cause it. However, it shouldn't prevent any progress being made - it's just the system telling you that it's not going to increase the loading factor beyond 100 (which means the number of queued messages is about 100 * the maxConcurrency setting)...

But why does the processing completely stop when it reaches max?

Yeah, that shouldn't be the case... what indication do you see that no progress is being made?
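As a rough worked example of that sizing: assuming a maxConcurrency of 16 (the default), a loading factor capped at 100 would mean PC targets on the order of 100 × 16 = 1,600 queued records. The exact internal sizing may differ, so treat these numbers as illustrative only.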

bartman64 (Author)

So, we are using KafkaHQ to access our Kafka topics, and there we see that our service's consumer group has a lag which doesn't decrease. We also no longer see requests arriving at the target endpoint. Maybe the real reason processing stops is something else, and the WARN log message is just a consequence of it, because the queue doesn't get filled...
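For cross-checking the lag observation outside KafkaHQ, a generic sketch using the Kafka AdminClient is shown below; the group id and bootstrap address are illustrative, and this is plain Kafka tooling rather than anything specific to PC:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the (illustrative) consumer group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-service-group")
                         .partitionsToOffsetAndMetadata().get();

            // Latest end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            // Lag = end offset minus committed offset, per partition.
            committed.forEach((tp, offset) -> {
                if (offset == null) {
                    return; // no committed offset for this partition
                }
                long lag = latest.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```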

astubbs (Contributor) commented Oct 10, 2022

Which version are you running?

Are you not getting any failed processing logs?

Can you show your processing code?

bartman64 (Author)

We were running version 0.5.2.0 and have just upgraded to 0.5.2.3.
I wasn't able to capture any failed-processing logs before.
Maybe it would make sense to organize a call at some point, where we could discuss this and go through the code?

astubbs (Contributor) commented Oct 20, 2022

Yes, sure, we can do that - please email me directly, or contact me on the community Slack.

bartman64 (Author)

OK, thanks, I will ping you, but I'm currently busy with other topics ;)

astubbs (Contributor) commented Dec 9, 2022

FYI, I'm going to add a rate limiter so that the logs aren't spammed with this.
And the whole system is going to be removed in upcoming versions, so it won't be an issue then.

colinkuo commented Jan 20, 2023

Hi, @astubbs.
We are experiencing this issue now with version 0.5.2.3. When it happened, no messages were consumed from the assigned partitions, just like what @bartman64 encountered previously. Furthermore, it's relatively easy to reproduce. Restarting the problematic services didn't solve it, because the issue soon came back. Do you have any suggestions to mitigate this?
Thanks!

astubbs (Contributor) commented Jan 21, 2023

Thanks @colinkuo for the report! Just to check - was PC still making progress? And you just see it spamming the logs with this error?
.5 (releasing next week) has added a rate limiter to the warning. It’s not something to be too concerned about - it should only be performance related.

Can you tell me what your processing function does, or was doing at the time? Is it possible your function was completing records immediately with no delay?

I’m very much looking forward to merging the changes in 0.6 which will remove this system :)

colinkuo

Hi, @astubbs
When the error was spamming the logs, the CPU usage of the pods was low and the offsets of the assigned partitions were not moving fast enough. The outcome we've seen is that lag built up only on the partitions assigned to the problematic Kubernetes pods. I cannot tell whether processing was stuck or just slow.

Our user function takes about five ms on average, so there is some delay when processing a message in the user function.

Could you please elaborate on how removing that system will address this issue? Thanks!
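One way to verify the ~5 ms figure independently of PC's internals would be to wrap the user function in a simple timer, along the lines of the hypothetical sketch below (the 10 ms threshold is arbitrary):

```java
import java.util.function.Consumer;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Wraps any user function and logs invocations that take unusually long.
class TimedHandler<T> implements Consumer<T> {
    private static final Logger log = LoggerFactory.getLogger(TimedHandler.class);

    private final Consumer<T> delegate;

    TimedHandler(Consumer<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void accept(T recordContext) {
        long start = System.nanoTime();
        try {
            delegate.accept(recordContext);
        } finally {
            long micros = (System.nanoTime() - start) / 1_000;
            if (micros > 10_000) { // flag invocations slower than 10 ms (arbitrary threshold)
                log.warn("Slow user function invocation: {} µs", micros);
            }
        }
    }
}
```

The poll callback could then be passed as `processor.poll(new TimedHandler<>(realHandler))`, which would show whether latency spikes coincide with the periods when the lag stops decreasing.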

rkolesnev (Member) commented Sep 12, 2023

Hi @colinkuo, @bartman64 - do you still experience this issue with version 0.5.2.7, which has fixes for #547 and #606?

I am wondering if the issue here is the same as the one reported in #547...

#637 seems to be somewhat related, in that the same warning gets logged, but it appears to be caused by a commit during rebalance handling rather than by PC not dropping stale work from its queues correctly.

bartman64 (Author)

Hi, we haven't run into that particular issue anymore :)

colinkuo

Hi @rkolesnev, we are starting to test 0.5.2.7 and will keep you posted on any updates. Thanks!
