100% CPU usage (maybe caused by new poll implementation?) (help wanted) #807
Comments
@hathawsh also saw high CPU usage from the new poller implementation and submitted #589. That was merged before 3.3.0 was released.
It's possible this is the main loop spinning. Can you run 3.3.0 with the blather log level enabled and post the supervisord log?
There weren't any changes to logging between 3.2.3 and 3.3.0. Since you said 3.2.3 works for you, I wouldn't suspect logging. |
Hi!
I haven't opened the |
Hi, me again, here is the supervisord log with blather level enabled: |
Me too. Every time I log in over HTTP and restart a program, my server's CPU immediately goes to 100%. |
This is 100% reproducible on 3.3.1 (anaconda package on OS X). Bummer... this is a show stopper for what I need. Let me know if any other info is needed. I'm going to have to downgrade and see if an older version doesn't show this behavior. Update: I downgraded to 3.1.3 and this behavior is not present. Something definitely broke between those two versions, but I don't have any of the 3.2.x branch available to test easily. |
I can confirm this. Very frustrating 😞 |
@igorsobreira A number of users have reported high CPU usage from supervisord and believe it was introduced by your patch in #129. Another case of high CPU usage was found in #589 and confirmed to be caused by the patch. Could you please look into this? There are reproduction instructions above in #807 (comment). |
I want to add that of the 76 servers that I have running supervisord, I had to stop/start supervisord on seven of them in the past four days because of this bug. |
Now an eighth, with a bit more detail this time. I am running supervisor 3.3.0. I installed it last May as soon as it came out because it had my patch in it that I really wanted. I restarted supervisor on all my hosts to get the new version and it has been going just fine. Last week, I made a global configuration change. I changed
On the host in particular that is crashing right now, these are the fds:
|
On about 2/3rds of my servers I have now upgraded to 3.3.1 and installed the patch. If it takes over a CPU again I'll let you know but it might be a few days. Sorry I didn't notice the conversation in the pull request. |
That patch did not solve my problem. I almost immediately had one of the updated systems (not the same one as before) go into a poll craziness:
|
Can do. I'll move to commenting over there instead. |
I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts. I just wanted to bring up that this is still happening. |
@plockaby I suggest you try to use |
@plockaby I was able to reproduce one cause of high CPU and observe that #875 did fix it, which is why I merged #875. I haven't observed another cause of high CPU usage yet. If you are still experiencing problems with #875 applied, we are going to need reproduction instructions to show us how to get supervisord into that state so we can work on it. |
Supervisor 3.3.2 was released to PyPI and it includes the poller fixes (#875, #885). The only high CPU problem that I was able to reproduce is fixed by this release. If you are experiencing a high CPU problem on Supervisor 3.3.2, please open a new issue and include instructions to reproduce it. |
I've been on vacation so I'm just getting to this now. I tried 3.3.2 and it didn't solve the CPU spikes. I actually rolled all of my hosts back to 3.2.3 under the assumption that it was a problem with the new polling code. I'm still experiencing supervisord going into polling loops that use 100% CPU. I understand that you can't fix a problem that you can't reproduce. If I had any way to reproduce this consistently, I would show you. All I can give you, really, is situations where it is happening. Right now I can tell you that this is not caused by changes in 3.3.x. I can also tell you that it is (probably) not caused by multiple processes writing to their logs simultaneously as this doesn't happen with any more frequency on our heavily loaded hosts as compared to our lightly loaded hosts. In fact it has happened on hosts that have just one process controlled by supervisord just as often as it has happened on hosts that have forty processes. If I ever do figure out how to reproduce it then I will create a new ticket. Would you mind if I continue to post information about when I experience the issue? |
Of course not. Please do that. |
Interestingly I've discovered that it often recovers from the polling problem. I identified one of my hosts that had started spinning CPU on polling but it wasn't convenient to restart the supervisord at the time so I just left top running in a window on the side. Six days of 100% CPU and it just stopped as suddenly as it started. Another one of the hosts that I identified ran at 100% CPU for nearly five days before it stopped but then it started again six hours later. (I've identified all of these by looking at collectd CPU graphs.) I have an strace running on one of the hosts where the problem has cropped up several times and where I don't mind if the disk fills. When it starts on that host then I should have a bit more information about what is starting and/or stopping the problem. I have also noticed that it is not happening on Centos7 hosts. It so far has only happened on Centos6 hosts. Unfortunately most of my hosts are still on Centos6. :) |
I've narrowed the problem down even further to be specific to the interaction between supervisord and an event listener on the communications channel between the two (i.e. my event listener's stdout). The file descriptor that is generating all of the poll events is my event listener's stdout file. This explains why the problem would disappear: my event listener is configured to restart itself at random intervals and indeed when I restart it the 100% CPU spike goes away. I'm still examining the strace to figure out why the problem starts in the first place. The log itself isn't blowing up in size with insane numbers of writes. |
I was going to write a big long description with quotes from my strace but I think I discovered the problem. The problem happens when my event listener's log is rotated. Because it is an event listener and I really don't care about stdout for event listeners, my configuration is like this:
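(The configuration block itself was not captured in this copy of the thread. Below is a hypothetical sketch of the kind of [eventlistener:x] section being described; the section name, command, event types, paths, and sizes are all assumptions, chosen so that the stdout log is small and rotates frequently.)

[eventlistener:mylistener]                 ; hypothetical section name
command=/usr/local/bin/mylistener          ; hypothetical listener command
events=PROCESS_STATE,TICK_60
stdout_logfile=/var/log/supervisor/mylistener.log
stdout_logfile_maxbytes=1MB                ; small log, so it rotates often
stdout_logfile_backups=0                   ; old logs are discarded on rotation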
When the strace says that my program started going crazy, this appears in the strace:
That is the log file for my event listener and "5" is the file descriptor for its log file that I am currently spinning on polling:
I hope that is enough to go on? |
I have just discovered that this is not a problem specific to event listeners. I had a non-event listener program just trigger the same problem. However that's the first time that has happened that I can tell. |
Still an issue - just hit it on 4.1.0. In this case, I cannot start supervisorctl:

[pryzbyj@telsasoft ~]$ while sleep 1; do ps h -ww -o time,comm 27978; done
POLLOUT will return immediately if you're connecting to an unresponsive peer, causing heavy CPU usage. |
|
Can you please answer the question about epoll that I asked in the closed issue?
|
Hi, I don't have any answers to that, sorry, I only closed it to keep all the comments on this issue in one place. |
Ok! |
A patch was proposed three times identically (originally here, the other two marked as duplicate):

def poll(self, timeout):
    fds = self._poll_fds(timeout)
    readables, writables = [], []
    time.sleep(1)  # it was proposed to add this sleep()

I don't recommend users apply this patch because it is probably just masking an underlying problem. This may appear to work but it may also cause
Reading through this thread, it seems that there have been several different causes of this issue, at least one of which was solved with a linked commit. I don't think any current maintainers know how to reproduce it. It would be extremely helpful if someone could send a minimal reproduction case that causes |
Thanks for the moderation, and for the clear explanation concerning the delay implied by this quick patch (which is not meant to solve the real issue). I forgot that it was me who posted, a year ago, how I avoided the CPU spike; anyway, my objective was only to point out the issue to a maintainer who might be able to really fix it. I posted the sleep fix to allow people who were in the same situation as me, with production constraints, to avoid 100% CPU and keep using supervisord, at the cost of some delay for supervisorctl commands. I put 1s just to make sure people realize it has this cost.

Regarding your last paragraph, I'm not sure all CPU spikes are the result of that, but in my case it was happening when supervisorctl requested the stop of a process that caught the signal and needed to wait for some tasks to finish before quitting, such as a database shutdown or active webserver requests (it was basically catching the signal and not exiting as quickly as supervisor expected). This is actually how the problem can be replicated, and it is systematic; it was mentioned in the post a year ago, but that was not clear enough, and I thought a maintainer would find some technical description useful.

Basically, if you want to replicate the issue, just make the process that you ask to stop ignore the stop signal (a minimal sketch of such a process follows the configuration excerpts below). supervisorctl will then be continuously calling supervisord, which will respond continuously without any break, creating a loop that eats up 100% of one CPU until the underlying process decides to quit (if you have many CPUs, then it is not 100% overall). Maybe the only reason not many people complain about this is that not many people manage processes that they are willing to wait more than 30 minutes for. The setup I'm dealing with is meant to allow processes to take that long to quit, and I can't afford to have one CPU out during all that time.

[program:pgsql]
# if error then wait 10s before next retry
command=/supervisor/once_10.sh '/pgsql.sh postgres'
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
autostart=true
autorestart=true
# only considers agent started after 10s
startsecs=10
# restart indefinitely
startretries=999999999999
stopwaitsecs=%(ENV_pod_stop_timeout)s
killasgroup=true
# https://www.postgresql.org/docs/9.4/server-shutdown.html
stopsignal=TERM

[program:uwsgi]
# if error then wait 10s before next retry
command=/supervisor/%(ENV_P_UWSGI_POLICY)s_10.sh /uwsgi.sh uwsgi ${P_UWSGI_CMD}
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
autostart=true
autorestart=true
# only considers agent started after 10s
startsecs=10
# restart indefinitely
startretries=999999999999
stopwaitsecs=%(ENV_pod_uwsgi_stop_timeout)s
killasgroup=true |
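As an editorial aside, here is a minimal sketch of the reproduction described in the comment above: a process that ignores supervisord's stop signal. The file name and the idea of pairing it with a long stopwaitsecs are assumptions for illustration; this is not code from the thread.

# ignore_stop_signal.py -- hypothetical reproduction helper (not from the thread).
# Run it under a [program:x] section with a large stopwaitsecs, then issue
# "supervisorctl stop x". While supervisord waits for the process to exit, the
# commenter above reports supervisorctl and supervisord looping at 100% CPU.
import signal
import time

# Ignore SIGTERM, supervisord's default stopsignal, so the stop request can
# only be satisfied once stopwaitsecs expires (or a SIGKILL is sent).
signal.signal(signal.SIGTERM, signal.SIG_IGN)

while True:
    time.sleep(60)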
I have some guesses; I don't know if they are correct. Can you help take a look? @mnaberez

Some cases of supervisord at 100% CPU on Linux may be caused by the reuse of file descriptors. supervisorctl uses a socket to communicate with supervisord, and the socket file descriptor in supervisord is usually the lowest-numbered file descriptor not currently open in the process. If supervisorctl asks to restart a process, the new process's log file may be opened with a reused file descriptor number. In the loop, poll then returns quickly (because there really is data to process on that descriptor) and CPU usage increases to 100% (busy wait). (A small illustrative sketch of this file-descriptor-reuse effect is included after the traces below.)

I get this a lot in supervisor 3.4.0. Maybe the master branch also has this problem. Here are some traces:

~ supervisord --version
3.4.0
~ cat /etc/supervisord.d/subprocess.ini
[program:subprocess]
command=bash -c "while true; do date; ls -lh /proc/self/fd; sleep 1; done"
autostart=true
autorestart=unexpected
startsecs=0
exitcodes=0
redirect_stderr=true
stdout_logfile_maxbytes=128KB
stdout_logfile_backups=0
stderr_logfile_maxbytes=128KB
stderr_logfile_backups=0
...
~ ls -lh /proc/`supervisorctl pid`/fd
total 0
lrwx------ 1 root root 64 Jul 20 17:36 0 -> /dev/null
l-wx------ 1 root root 64 Jul 20 17:36 1 -> pipe:[64823090]
l-wx------ 1 root root 64 Jul 20 17:37 13 -> pipe:[244841693]
lr-x------ 1 root root 64 Jul 20 18:07 18 -> pipe:[244841694]
l-wx------ 1 root root 64 Jul 20 17:36 2 -> pipe:[64823091]
l-wx------ 1 root root 64 Jul 20 17:37 3 -> /var/tmp/log/supervisor/supervisord.log
lrwx------ 1 root root 64 Jul 20 17:37 4 -> socket:[64899115]
lr-x------ 1 root root 64 Jul 20 17:37 7 -> /dev/urandom
l-wx------ 1 root root 64 Jul 20 17:37 9 -> /tmp/subprocess-stdout---supervisor-ChnhnK.log
...
~ strace -Ttt -f -p `supervisorctl pid`
strace: Process 96368 attached
strace: [ Process PID=96368 runs in x32 mode. ]
strace: [ Process PID=96368 runs in 64 bit mode. ]
17:53:54.026947 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000063>
17:53:54.027209 wait4(-1, 0x7ffc085cdd44, WNOHANG, NULL) = 0 <0.000050>
17:53:54.027413 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000040>
17:53:54.027504 wait4(-1, 0x7ffc085cdd44, WNOHANG, NULL) = 0 <0.000013>
17:53:54.027585 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000011>

supervisord's CPU usage increasing to 100% usually occurs after |
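To make the theory above concrete, here is a small standalone sketch (an editorial illustration, not supervisord code) of how a stale poll registration can turn into a busy loop once its file descriptor number is reused by a regular file; the temporary-file setup and names are assumptions.

# fd_reuse_demo.py -- illustrative sketch only.
import os
import select
import tempfile

r, w = os.pipe()                       # pretend r is a child process's stdout pipe
poller = select.poll()
poller.register(r, select.POLLIN | select.POLLPRI | select.POLLHUP)

# Simulate losing track of the descriptor: the pipe end is closed (e.g. the
# child is restarted) but the poller registration is never removed.
os.close(r)

# On Linux, the next open() reuses the lowest free descriptor number, so the
# new log file usually gets the same number the pipe had.
log_fd, log_path = tempfile.mkstemp()
print("pipe fd was %d, new log fd is %d" % (r, log_fd))

for _ in range(5):
    # The stale registration now refers to a regular file, which always reports
    # POLLIN, so poll() returns immediately instead of sleeping for the timeout.
    # (If the number had not been reused, poll() would instead report POLLNVAL
    # immediately -- either way, a main loop built on this spins at 100% CPU.)
    print(poller.poll(1000))           # 1000 ms timeout is never reached

os.close(w)
os.close(log_fd)
os.unlink(log_path)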
Evidence that the FDs are getting confused: I found this in a stderr logfile of one of our processes.
The process was running, but not able to write to stdout - in its logging output, I saw: |
A pull request has been submitted to fix the high CPU usage: #1581 If you are experiencing this issue and are able to try the patch, please do so and send your feedback. |
Thanks! We saw no issues since I deployed this to our customers in April.

It seems like it's lost track of its FDs, probably during log rotation.

lseek(5, 0, SEEK_CUR) = -1 EBADF (Bad file descriptor)

$ sudo ls -l /proc/26111/fd/5
Here's an (other?) error that I found in a stderr logfile for an application running under supervisor:

Exception ignored in: <function tail_f_producer.__del__ at 0x7f92f5918f70>

This is with the patch from [2a93d6b]. To be clear, I think this is a separate problem from the one that was fixed (and I've got no reason to think that the patch caused this issue). Maybe this issue should be closed and a new one created for other FD issues...

Edit: I also observed this on an instance with v4.2.5 without the patch applied. |
I'm observing a process currently using 100% CPU doing poll(), with (I checked) the patch applied. It has currently used ~27hr of CPU time.

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=33, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}, {fd=60, events=POLLIN|POLLPRI|POLLHUP}, {fd=63, events=POLLIN|POLLPRI|POLLHUP}, {fd=66, events=POLLIN|POLLPRI|POLLHUP}, {fd=69, events=POLLIN|POLLPRI|POLLHUP}, {fd=71, events=POLLIN|POLLPRI|POLLHUP}, {fd=57, events=POLLIN|POLLPRI|POLLHUP}, {fd=34, events=POLLIN|POLLPRI|POLLHUP}, {fd=37, events=POLLIN|POLLPRI|POLLHUP}, {fd=52, events=POLLIN|POLLPRI|POLLHUP}, {fd=51, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=49, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=14, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=29, events=POLLIN|POLLPRI|POLLHUP}], 27, 1000) = 1 ([{fd=29, revents=POLLNVAL}])

There's nothing in the supervisor logfile. |
Did you resolve this Python exception error? I'm also facing the same issue. |
On Wed, Sep 13, 2023 at 11:12:23AM -0700, Jagadeesh B wrote:
> Did you resolve this Python exception error? I'm also facing the same issue.

I didn't resolve it, no.

> ***@***.*** cachelogic]$ supervisord --version
> 4.2.5

Is that with the patch applied (or git HEAD) or unpatched 4.2.5?
|
I was experiencing this issue with Supervisor, and to address it, I attempted to upgrade Supervisor from version 4.2.2 to the latest version (4.2.5) using the command:

pip3 install --upgrade supervisor

However, after upgrading to the latest version, the issue still persists. I checked the issues section and found suggestions that upgrading Supervisor would resolve it, but unfortunately, it didn't work in my case. I'm not sure whether this version has the patch applied or not. If you have any insights or suggestions on how to resolve this issue, please feel free to share. |
yum install, version is 3.4.0 |
What do you mean when you say "yum install..." ? Did you mean to say that you hit the issue with that version or ?? |
Thanks @mnaberez for the recommendation. I had been seeing the issue intermittently, where CPU usage spikes to 100%, and I've been able to replicate it after a few attempts of concurrent

Here are my observations, pre and post patch:
Hopefully this helps, as I understand there is some demand for a new version including this change - ref #1635.

Details of the patch:

index 0a4f3e6..2265db9 100755
--- a/supervisor/supervisord.py
+++ b/supervisor/supervisord.py
@@ -222,6 +222,14 @@ class Supervisor:
raise
except:
combined_map[fd].handle_error()
+ else:
+ # if the fd is not in combined_map, we should unregister it. otherwise,
+ # it will be polled every time, which may cause 100% cpu usage
+ self.options.logger.warn('unexpected read event from fd %r' % fd)
+ try:
+ self.options.poller.unregister_readable(fd)
+ except:
+ pass
for fd in w:
if fd in combined_map:
@@ -237,6 +245,12 @@ class Supervisor:
raise
except:
combined_map[fd].handle_error()
+ else:
+ self.options.logger.warn('unexpected write event from fd %r' % fd)
+ try:
+ self.options.poller.unregister_writable(fd)
+ except:
+ pass
for group in pgroups:
group.transition() |
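For readers who do not want to parse the diff above, the guard it adds boils down to the following condensed, hypothetical restatement (not the literal supervisord code; the dispatcher and poller method names are taken from the diff and from supervisord's main loop, and the function name and parameters here are mine):

def dispatch_readables(readable_fds, combined_map, poller, logger):
    # combined_map maps registered fds to their dispatchers; fds reported by
    # poll() that are no longer in it are stale registrations.
    for fd in readable_fds:
        if fd in combined_map:
            combined_map[fd].handle_read_event()
        else:
            # A stale fd (e.g. left over from a rotated log or restarted child)
            # would otherwise be reported on every poll() call and spin the
            # loop, so drop it from the poller.
            logger.warn('unexpected read event from fd %r' % fd)
            try:
                poller.unregister_readable(fd)
            except Exception:
                pass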
Hello, |
More evidence of FD confusion. Edit: I should've added that this instance does not have the patch applied. |
|
We were using supervisor 3.3.0 on Ubuntu 14.04 LTS.
Recently, on some of our nodes on AWS, we have spotted very high CPU usage from supervisord. To get around it, we have to reload supervisord, but it may happen again within a day.
Reading the strace output, we saw excessive calls to both 'gettimeofday' and 'poll', so we chose to downgrade supervisor to 3.2.3. I see there was #581, but I think it's irrelevant here; our wild guess is that this is caused by the new poll implementation introduced in 3.3.0 (and maybe triggered by simultaneous log output?)...
Thanks in advance!