Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

100% CPU usage (maybe caused by new poll implementation?) (help wanted) #807

Open
timonwong opened this issue Aug 6, 2016 · 75 comments
Open

Comments

@timonwong
Copy link

We were using supervisor 3.3.0 on Ubuntu 14.04 LTS.

Recently on some of our nodes on AWS, we spot very high cpu usage about supervisord, to get around that, we have to reload it, but it may happen again in one day.

Reading from strace, we spot there are excessive calls to both 'gettimeofday' and 'poll', so after that, we have to choose to downgrade supervisor to 3.2.3.

I see there was #581, but I think it's irrelevant here, our wild guess is it just caused by the new poll implementation introduced in 3.3.0 (and maybe caused by simultaneous log outputs?)...

Thanks in advance!

@mnaberez
Copy link
Member

mnaberez commented Aug 6, 2016

Recently on some of our nodes on AWS, we spot very high cpu usage about supervisord, to get around that, we have to reload it, but it may happen again in one day.

@hathawsh also saw high CPU usage from the new poller implementation and submitted #589. That was merged before 3.3.0 was released.

Reading from strace, we spot there are excessive calls to both 'gettimeofday' and 'poll', so after that, we have to choose to downgrade supervisor to 3.2.3.

It's possible this is the main loop spinning since poller.poll() does a poll and below it tick() does a gettimeofday.

Can you run 3.3.0 with loglevel=blat? That will produce as much debugging information as possible. Perhaps it will have some clues to why this is happening.

and maybe caused by simultaneous log outputs?

There weren't any changes to logging between 3.2.3 and 3.3.0. Since you said 3.2.3 works for you, I wouldn't suspect logging.

@timonwong
Copy link
Author

timonwong commented Aug 8, 2016

Hi!
Today I spot the problem again, here gz'd strace log (duration ~1s):
strace.log.tar.gz
And here is fds:

0 -> /dev/null
1 -> /dev/null
10 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
100 -> pipe:[32356299]
101 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
102 -> pipe:[32356298]
103 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
104 -> pipe:[32356300]
105 -> /data/supervisor_log/bg_task_00-stderr.log
106 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
107 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
108 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
109 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
11 -> pipe:[3024]
110 -> /data/supervisor_log/bg_task_00-stdout.log
111 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
112 -> /data/supervisor_log/bg_task_00-stderr.log
113 -> pipe:[32356302]
114 -> pipe:[32356301]
115 -> pipe:[32356303]
116 -> /data/supervisor_log/bg_task_00-stdout.log.1
117 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
118 -> /data/supervisor_log/bg_task_00-stderr.log
119 -> /data/supervisor_log/bg_task_00-stdout.log.8
12 -> pipe:[17721748]
121 -> pipe:[32356304]
124 -> /data/supervisor_log/bg_task_00-stderr.log
13 -> pipe:[17721749]
14 -> pipe:[11775]
15 -> pipe:[3025]
16 -> pipe:[11776]
17 -> /data/supervisor_log/bg_task_02-stdout.log
18 -> /data/supervisor_log/bg_task_02-stderr.log
19 -> pipe:[17721751]
2 -> /dev/null
20 -> pipe:[11777]
21 -> pipe:[17721750]
22 -> /data/supervisor_log/bg_task_03-stdout.log
23 -> /data/supervisor_log/bg_task_03-stderr.log
24 -> pipe:[11827]
25 -> pipe:[17721752]
26 -> pipe:[11828]
27 -> /data/supervisor_log/bg_task_01-stdout.log.1
28 -> /data/supervisor_log/bg_task_01-stderr.log.2
29 -> pipe:[17721754]
3 -> /data/supervisor_log/supervisord.log
30 -> pipe:[11829]
31 -> pipe:[17721753]
32 -> /data/supervisor_log/bg_task_04-stdout.log
33 -> /data/supervisor_log/bg_task_04-stderr.log
34 -> pipe:[17721755]
35 -> /data/supervisor_log/bg_task_01-stdout.log
36 -> /data/supervisor_log/bg_task_01-stderr.log.1
37 -> pipe:[17721757]
38 -> pipe:[17721756]
39 -> pipe:[17721758]
4 -> socket:[13073]
40 -> /data/supervisor_log/bg_task_01-stdout.log.3
41 -> /data/supervisor_log/bg_task_01-stderr.log
42 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
43 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
44 -> pipe:[17721759]
45 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
46 -> pipe:[17719642]
47 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
48 -> pipe:[17719643]
49 -> /data/supervisor_log/bg_task_01-stdout.log.2
5 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
50 -> pipe:[17719644]
51 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
52 -> /data/supervisor_log/bg_task_01-stderr.log.3
53 -> /data/supervisor_log/bg_task_05-stdout.log
54 -> /data/supervisor_log/bg_task_05-stderr.log
55 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
56 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
57 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
58 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
59 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
6 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
60 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
61 -> pipe:[30456289]
62 -> pipe:[30456290]
63 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
64 -> /data/supervisor_log/node_exporter.log
65 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
66 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
67 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
68 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
69 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
7 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
70 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
71 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
72 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
73 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
74 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
75 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
76 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
77 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
78 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
79 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
8 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
80 -> /data/supervisor_log/bg_task_00-stderr.log
81 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
82 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
83 -> /data/supervisor_log/bg_task_00-stderr.log
84 -> /data/supervisor_log/bg_task_00-stdout.log.2
85 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
86 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
87 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
88 -> /data/supervisor_log/bg_task_00-stderr.log
89 -> /data/supervisor_log/bg_task_00-stderr.log
9 -> pipe:[3023]
90 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
91 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
92 -> /data/supervisor_log/bg_task_00-stdout.log.8
93 -> pipe:[32356293]
94 -> pipe:[32356294]
95 -> pipe:[32356296]
96 -> pipe:[32356295]
97 -> pipe:[32356297]
98 -> /data/supervisor_log/bg_task_00-stdout.log.3
99 -> /data/supervisor_log/bg_task_00-stderr.log

I haven't open the blather level yet, I'll try it now, and will see what's going on.

@timonwong
Copy link
Author

Hi, me again, here is the supervisord log with blather level enabled:
supervisord.log.tar.gz

@blusewang
Copy link

blusewang commented Dec 21, 2016

Me too.every times when i login with http and restart a program.my server's CPU immediately go to 100%.
I wasted so much time on this bug!
I used it to manage my laravel queue. I hate it, but I have no choose!

@gonesurfing
Copy link

gonesurfing commented Dec 26, 2016

This is 100% reproducible on 3.3.1 (anaconda package on OSX).
-Start supervisord with http server on localhost:9001
-open new ipython console and import xmlrpclib and type
server = xmlrpclib.Server('http://localhost:9001')
-now run any server.getMethod and instantly cpu goes to 100% and stays there
-del(server) and cpu goes back to normal.

Bummer...This is a show stopper for what I need. let me know if any other info is needed. I'm going to have to downgrade and see if an older version doesn't show this behavior.

Update: I downgraded to 3.1.3 and this behavior is not present. Something definitely broke between those two versions, but I don't have any of the 3.2.x branch available to test easily.

@cenkalti
Copy link

I can confirm this. Very frustrating 😞

@mnaberez
Copy link
Member

@igorsobreira A number of users have reported high CPU usage from supervisord and believe it was introduced by your patch in #129. Another case of high CPU usage was found in #589 and confirmed to be caused by the patch. Could you please look into this? There's reproduce instructions above in #807 (comment).

@plockaby
Copy link
Contributor

I want to add that of the 76 servers that I have running supervisord, I had to stop/start supervisord on seven of them in the past four days for this bug.

@plockaby
Copy link
Contributor

plockaby commented Apr 16, 2017

Now an eighth. A bit more detail now. I am running supervisor 3.3.0. I installed it last May as soon as it came out because it had my patch in it that I really wanted. I restarted supervisor on all my hosts to get the new version and it has been going just fine. Last week, I made a global configuration change. I changed minfds to 100000 (because a specific process wanted it) and I set childlogdir which had not been set before. I installed that new configuration on all my hosts and went around restarting supervisord. Now hosts are going into this poll spin loop at random. This is the strace log. These are spinning like crazy, eating up a whole CPU.

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=22, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}, {fd=36, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=41, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 1 ([{fd=20, revents=POLLIN}])
gettimeofday({1492359703, 190995}, NULL) = 0
gettimeofday({1492359703, 191062}, NULL) = 0
gettimeofday({1492359703, 194873}, NULL) = 0
gettimeofday({1492359703, 194973}, NULL) = 0
gettimeofday({1492359703, 195072}, NULL) = 0
gettimeofday({1492359703, 195108}, NULL) = 0
gettimeofday({1492359703, 195153}, NULL) = 0
gettimeofday({1492359703, 195224}, NULL) = 0
gettimeofday({1492359703, 195254}, NULL) = 0
gettimeofday({1492359703, 195299}, NULL) = 0
gettimeofday({1492359703, 195327}, NULL) = 0
gettimeofday({1492359703, 195378}, NULL) = 0
gettimeofday({1492359703, 195446}, NULL) = 0
wait4(-1, 0x7ffea7758d04, WNOHANG, NULL) = 0
gettimeofday({1492359703, 195526}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=22, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}, {fd=36, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=41, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 1 ([{fd=20, revents=POLLIN}])
gettimeofday({1492359703, 195874}, NULL) = 0
gettimeofday({1492359703, 195936}, NULL) = 0
gettimeofday({1492359703, 196000}, NULL) = 0
gettimeofday({1492359703, 196092}, NULL) = 0
gettimeofday({1492359703, 196166}, NULL) = 0
gettimeofday({1492359703, 196256}, NULL) = 0
gettimeofday({1492359703, 196336}, NULL) = 0
gettimeofday({1492359703, 196380}, NULL) = 0
gettimeofday({1492359703, 196520}, NULL) = 0
gettimeofday({1492359703, 196557}, NULL) = 0
gettimeofday({1492359703, 196599}, NULL) = 0
gettimeofday({1492359703, 196643}, NULL) = 0
gettimeofday({1492359703, 196689}, NULL) = 0
wait4(-1, 0x7ffea7758d04, WNOHANG, NULL) = 0
gettimeofday({1492359703, 196787}, NULL) = 0

On the host in particular that is crashing right now, these are the fds:

total 0
lr-x------ 1 root root 64 Apr 11 15:42 0 -> /dev/null
lrwx------ 1 root root 64 Apr 11 15:42 1 -> socket:[74731265]
lr-x------ 1 root root 64 Apr 11 15:42 10 -> pipe:[74731311]
lr-x------ 1 root root 64 Apr 11 15:42 11 -> pipe:[74731313]
l-wx------ 1 root root 64 Apr 11 15:42 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 11 15:42 14 -> pipe:[77877459]
lr-x------ 1 root root 64 Apr 11 15:42 15 -> pipe:[74731314]
lr-x------ 1 root root 64 Apr 11 15:42 16 -> pipe:[77877460]
l-wx------ 1 root root 64 Apr 11 15:42 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 11 15:42 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 11 15:42 19 -> pipe:[74731318]
lrwx------ 1 root root 64 Apr 11 15:42 2 -> socket:[74731265]
l-wx------ 1 root root 64 Apr 11 15:42 20 -> /data/logs/supervisor/dart-agent.err
lr-x------ 1 root root 64 Apr 11 15:42 21 -> pipe:[74731319]
lr-x------ 1 root root 64 Apr 11 15:42 22 -> pipe:[77877461]
l-wx------ 1 root root 64 Apr 11 15:42 24 -> pipe:[74731321]
lr-x------ 1 root root 64 Apr 11 15:42 25 -> pipe:[74731320]
lr-x------ 1 root root 64 Apr 11 15:42 26 -> pipe:[74731322]
l-wx------ 1 root root 64 Apr 11 15:42 27 -> /data/logs/supervisor/statsd.log
l-wx------ 1 root root 64 Apr 11 15:42 28 -> /data/logs/supervisor/statsd.err
l-wx------ 1 root root 64 Apr 16 09:27 29 -> pipe:[74731324]
l-wx------ 1 root root 64 Apr 11 15:42 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 09:27 30 -> pipe:[74731323]
lr-x------ 1 root root 64 Apr 16 09:27 31 -> pipe:[74731325]
l-wx------ 1 root root 64 Apr 16 09:27 32 -> /data/logs/supervisor/redis.log
l-wx------ 1 root root 64 Apr 16 09:27 33 -> /data/logs/supervisor/redis.err
l-wx------ 1 root root 64 Apr 16 09:27 34 -> pipe:[74731327]
lr-x------ 1 root root 64 Apr 16 09:27 35 -> pipe:[74731326]
lr-x------ 1 root root 64 Apr 16 09:27 36 -> pipe:[74731328]
l-wx------ 1 root root 64 Apr 11 15:42 37 -> /data/logs/supervisor/dmca-admin-web.log
l-wx------ 1 root root 64 Apr 11 15:42 38 -> /data/logs/supervisor/dmca-admin-web.err
l-wx------ 1 root root 64 Apr 16 09:27 39 -> pipe:[74731330]
lrwx------ 1 root root 64 Apr 11 15:42 4 -> socket:[74731287]
lr-x------ 1 root root 64 Apr 16 09:27 40 -> pipe:[74731329]
lr-x------ 1 root root 64 Apr 11 15:42 41 -> pipe:[74731331]
l-wx------ 1 root root 64 Apr 11 15:42 42 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 11 15:42 43 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 11 15:42 44 -> /data/logs/supervisor/dart-agent.log
lr-x------ 1 root root 64 Apr 11 15:42 45 -> pipe:[74731332]
l-wx------ 1 root root 64 Apr 11 15:42 47 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 11 15:42 48 -> /data/logs/supervisor/pgwatch.err
l-wx------ 1 root root 64 Apr 11 15:42 5 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 11 15:42 6 -> pipe:[74731309]
lr-x------ 1 root root 64 Apr 11 15:42 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 11 15:42 8 -> pipe:[74731310]
l-wx------ 1 root root 64 Apr 11 15:42 9 -> pipe:[74731312]

@mnaberez
Copy link
Member

@plockaby We're testing a fix for this over in #875. Are you able to try that patch?

@plockaby
Copy link
Contributor

On about 2/3rds of my servers I have now upgraded to 3.3.1 and installed the patch. If it takes over a CPU again I'll let you know but it might be a few days. Sorry I didn't notice the conversation in the pull request.

@plockaby
Copy link
Contributor

That patch did not solve my problem. I almost immediately had one of the updated systems (not the same one as before) go into a poll craziness:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 593446}, NULL) = 0
gettimeofday({1492367896, 593515}, NULL) = 0
gettimeofday({1492367896, 593584}, NULL) = 0
gettimeofday({1492367896, 593637}, NULL) = 0
gettimeofday({1492367896, 593719}, NULL) = 0
gettimeofday({1492367896, 593788}, NULL) = 0
gettimeofday({1492367896, 593826}, NULL) = 0
gettimeofday({1492367896, 593863}, NULL) = 0
gettimeofday({1492367896, 593886}, NULL) = 0
gettimeofday({1492367896, 593906}, NULL) = 0
gettimeofday({1492367896, 593948}, NULL) = 0
gettimeofday({1492367896, 593987}, NULL) = 0
gettimeofday({1492367896, 594031}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 594103}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 594331}, NULL) = 0
gettimeofday({1492367896, 594415}, NULL) = 0
gettimeofday({1492367896, 594491}, NULL) = 0
gettimeofday({1492367896, 594561}, NULL) = 0
gettimeofday({1492367896, 594642}, NULL) = 0
gettimeofday({1492367896, 594663}, NULL) = 0
gettimeofday({1492367896, 594699}, NULL) = 0
gettimeofday({1492367896, 594739}, NULL) = 0
gettimeofday({1492367896, 594769}, NULL) = 0
gettimeofday({1492367896, 594808}, NULL) = 0
gettimeofday({1492367896, 594836}, NULL) = 0
gettimeofday({1492367896, 594887}, NULL) = 0
gettimeofday({1492367896, 594934}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 595005}, NULL) = 0
lr-x------ 1 root root 64 Apr 16 10:37 0 -> /dev/null
lrwx------ 1 root root 64 Apr 16 10:37 1 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 10 -> pipe:[36606043]
lr-x------ 1 root root 64 Apr 16 10:37 11 -> pipe:[36606045]
l-wx------ 1 root root 64 Apr 16 10:37 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 16 10:37 13 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 16 10:37 14 -> pipe:[36606047]
lr-x------ 1 root root 64 Apr 16 10:37 15 -> pipe:[36606046]
lr-x------ 1 root root 64 Apr 16 10:37 16 -> pipe:[36606048]
l-wx------ 1 root root 64 Apr 16 10:37 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 16 10:37 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 16 10:37 19 -> pipe:[36606050]
lrwx------ 1 root root 64 Apr 16 10:37 2 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 20 -> pipe:[36606049]
lr-x------ 1 root root 64 Apr 16 10:37 21 -> pipe:[36606051]
l-wx------ 1 root root 64 Apr 16 10:37 22 -> /data/logs/supervisor/dart-agent.log
l-wx------ 1 root root 64 Apr 16 10:37 24 -> pipe:[36606053]
lr-x------ 1 root root 64 Apr 16 10:37 25 -> pipe:[36606052]
lr-x------ 1 root root 64 Apr 16 10:37 26 -> pipe:[36606054]
l-wx------ 1 root root 64 Apr 16 10:37 27 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 16 10:37 28 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 16 10:37 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 10:37 30 -> pipe:[36606055]
l-wx------ 1 root root 64 Apr 16 10:37 32 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 16 10:37 33 -> /data/logs/supervisor/pgwatch.err
lrwx------ 1 root root 64 Apr 16 10:37 4 -> socket:[36606038]
l-wx------ 1 root root 64 Apr 16 10:37 5 -> /data/logs/supervisor/dart-agent.err
l-wx------ 1 root root 64 Apr 16 10:37 6 -> pipe:[36606041]
lr-x------ 1 root root 64 Apr 16 10:37 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 16 10:37 8 -> pipe:[36606042]
l-wx------ 1 root root 64 Apr 16 10:37 9 -> pipe:[36606044]

@mnaberez
Copy link
Member

@plockaby I posted this over in #875. Can you please work with us there to reproduce?

@plockaby
Copy link
Contributor

Can do. I'll move to commenting over there instead.

@plockaby
Copy link
Contributor

I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts. I just wanted to bring up that this is still happening.

@blusewang
Copy link

@plockaby I suggest you try to use systemd instead supervisor!

@mnaberez
Copy link
Member

I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts.

@plockaby I was able to reproduce one cause of high CPU and observe that #875 did fix it, which is why I merged #875. I haven't observed another cause of high CPU usage yet. If you are still experiencing problems with #875 applied, we are going to need reproduce instructions to show us how to get supervisord into that state so we can work on it.

@mnaberez
Copy link
Member

mnaberez commented Jun 3, 2017

Supervisor 3.3.2 was released to PyPI and it includes the poller fixes (#875, #885).

The only high CPU problem that I was able to reproduce is fixed by this release.

If you are experiencing a high CPU problem on Supervisor 3.3.2, please open a new issue and include instructions to reproduce it.

@mnaberez mnaberez closed this as completed Jun 3, 2017
@plockaby
Copy link
Contributor

I've been on vacation so I'm just getting to this now.

I tried 3.3.2 and it didn't solve the CPU spikes. I actually rolled all of my hosts back to 3.2.3 under the assumption that it was a problem with the new polling code. I'm still experiencing supervisord going into polling loops that use 100% CPU.

I understand that you can't fix a problem that you can't reproduce. If I had any way to reproduce this consistently, I would show you. All I can give you, really, is situations where it is happening. Right now I can tell you that this is not caused by changes in 3.3.x. I can also tell you that it is (probably) not caused by multiple processes writing to their logs simultaneously as this doesn't happen with any more frequency on our heavily loaded hosts as compared to our lightly loaded hosts. In fact it has happened on hosts that have just one process controlled by supervisord just as often as it has happened on hosts that have forty processes.

If I ever do figure out how to reproduce it then I will create a new ticket. Would you mind if I continue to post information about when I experience the issue?

@mnaberez
Copy link
Member

Would you mind if I continue to post information about when I experience the issue?

Of course not. Please do that.

@plockaby
Copy link
Contributor

plockaby commented Jun 28, 2017

Interestingly I've discovered that it often recovers from the polling problem. I identified one of my hosts that had started spinning CPU on polling but it wasn't convenient to restart the supervisord at the time so I just left top running in a window on the side. Six days of 100% CPU and it just stopped as suddenly as it started. Another one of the hosts that I identified ran at 100% CPU for nearly five days before it stopped but then it started again six hours later. (I've identified all of these by looking at collectd CPU graphs.)

I have an strace running on one of the hosts where the problem has cropped up several times and where I don't mind if the disk fills. When it starts on that host then I should have a bit more information about what is starting and/or stopping the problem.

I have also noticed that it is not happening on Centos7 hosts. It so far has only happened on Centos6 hosts. Unfortunately most of my hosts are still on Centos6. :)

@plockaby
Copy link
Contributor

plockaby commented Jun 30, 2017

I've narrowed the problem down even further to be specific to the interaction between supervisord and an event listener on the communications channel between the two (i.e. my event listener's stdout). The file descriptor that is generating all of the poll events is my event listener's stdout file.

This explains why the problem would disappear: my event listener is configured to restart itself at random intervals and indeed when I restart it the 100% CPU spike goes away. I'm still examining the strace to figure out why the problem starts in the first place. The log itself isn't blowing up in size with insane numbers of writes.

@mnaberez mnaberez reopened this Jun 30, 2017
@plockaby
Copy link
Contributor

plockaby commented Jun 30, 2017

I was going to write a big long description with quotes from my strace but I think I discovered the problem. The problem happens when my event listener's log is rotated. Because it is an event listener and I really don't care about stdout for event listeners, my configuration is like this:

stdout_logfile = /data/logs/supervisor/dart-agent.log
stdout_logfile_maxbytes = 1MB
stdout_logfile_backups = 0

When the strace says that my program started going crazy, this appears in the strace:

open("/data/logs/supervisor/dart-agent.log", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0

That is the log file for my event listener and "5" is the file descriptor for its log file that I am currently spinning on polling:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=12, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}], 14, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1498848417, 352015}, NULL) = 0
gettimeofday({1498848417, 352079}, NULL) = 0
gettimeofday({1498848417, 352141}, NULL) = 0
gettimeofday({1498848417, 352204}, NULL) = 0
gettimeofday({1498848417, 352268}, NULL) = 0
gettimeofday({1498848417, 352331}, NULL) = 0
gettimeofday({1498848417, 352483}, NULL) = 0
gettimeofday({1498848417, 352590}, NULL) = 0
gettimeofday({1498848417, 352769}, NULL) = 0
gettimeofday({1498848417, 352858}, NULL) = 0
wait4(-1, 0x7fff3600b334, WNOHANG, NULL) = 0
gettimeofday({1498848417, 352992}, NULL) = 0

I hope that is enough to go on?

@plockaby
Copy link
Contributor

I have just discovered that this is not a problem specific to event listeners. I had a non-event listener program just trigger the same problem. However that's the first time that has happened that I can tell.

@justinpryzby
Copy link

Still an issue - just hit it on 4.1.0. In this case, I cannot start supervisorctl:
supervisorctl -c .../supervisord.conf
error: <class 'ConnectionResetError'>, [Errno 104] Connection reset by peer: file: /usr/lib64/python3.6/socket.py line: 586

[pryzbyj@telsasoft ~]$ while sleep 1; do ps h -ww -o time,comm 27978; done
07:47:51 supervisord
07:47:52 supervisord
07:47:53 supervisord
07:47:54 supervisord

@kewde
Copy link

kewde commented Mar 21, 2022

POLLOUT will return immediately if you're connecting to an unresponsive peer - cause heavy CPU usage.

@mnaberez
Copy link
Member

#1515 (comment)

We found high CPU consumption (our java application 70%, supervisor 30%) when using supervisor
After running strace, it turned out that this consumption is associated with redirecting application's stdout to log file - supervisor using system calls read/write/select in python2 and poll in python3 - and apparently the main cpu time is spent on select syscall. Also we can't use kqueue since we use ubuntu

We need logs, so we want to reduce CPU consumption without turning them off
So far, the only thing that comes to mind is using epoll instead of select/poll to wait for descriptors in the hope that it will help

Are there any plans to add epoll to pollers list?
If so, when will it happen? And if not, is it worth adding its support on my own? Are there any pitfalls here?

Or maybe you can suggest how to solve this problem in some other way?

@aleph0naught
Copy link

@mnaberez

Can you please answer the question about epoll that I asked in closed issue?

So far, the only thing that comes to mind is using epoll instead of select/poll to wait for descriptors in the hope that it will help

Are there any plans to add epoll to pollers list?
If so, when will it happen? And if not, is it worth adding its support on my own? Are there any pitfalls here?

@mnaberez
Copy link
Member

Hi, I don't have any answers to that, sorry, I only closed it to keep all the comments on this issue in one place.

@aleph0naught
Copy link

Hi, I don't have any answers to that, sorry, I only closed it to keep all the comments on this issue in one place.

Ok!
I reopened new issue with more specific title
I think this is independent task from the supervisor's point of view (task about adding epoll poller), despite the fact that it is related to this issue

@ollofx

This comment was marked as duplicate.

@mnaberez
Copy link
Member

A patch was proposed three times identically (originally here, the other two marked as duplicate):

   def poll(self, timeout):
        fds = self._poll_fds(timeout)
        readables, writables = [], []
        time.sleep(1)  # it was proposed to add this sleep()

I don't recommend users apply this patch because it is probably just masking an underlying problem. This may appear to work but it may also cause supervisord to become sluggish. For example, sending supervisorctl commands may take 1 second longer to respond. Log files may appear to freeze and then dump a lot of data at once after 1 second. In the worse case, processes running under supervisord may hang or crash if their pipes fill up because supervisord can't empty them fast enough. The interval could be lowered, e.g. time.sleep(0.1), but supervisord is a single-threaded event loop that's not supposed to block. I don't think we could merge a patch like that for these reasons.

poll_fds is intended to return immediately only if there is any change on the file descriptors it monitors, or a timeout occurs. If the timeout occurs, supervisord does some very minimally processing and immediately goes back into poll_fds. In the case of supervisord spinning on poll_fds, it means poll_fds is always returning immediately. It's not supposed to do that. The fix would be figure out why it is always returning immediately and solve it.

Reading through this thread, it seems that there have been several different causes of this issue, at least one of which was solved with a linked commit. I don't think any current maintainers know how to reproduce it. It would be extremely helpful if someone could send a minimal reproduce case that causes supervisord to spin at 100% CPU. For example: with this minimal configuration file, run supervisord on this operating system, and you will see it spin. This problem doesn't always occur, e.g. it has never been seen in pre-release testing or on CI, which is why reproduce instructions are needed.

@ollofx
Copy link

ollofx commented Jun 29, 2022

thanks for the moderation, and the clear explanation concerning the delay implied with this quick patch (not meant to solve the real issue).

I forgot that it was me who posted a year ago how I avoided CPU spike, anyway my objective was to only point out the issue to a maintainer who might be able to really fix it.

I posted the sleep fix, to allow people who were in the same case as me with production constraints, to avoid 100% CPU and keep using supervisord, at the cost of having some delay for supervisorctrl command, I've put 1s just to make sure people realize it has this cost.

for your last paragraph, I'm not sure all CPU spikes are the results of that, but for my case, it was happening when supervisorctrl request stopping of a process, that catch the signal and needed to wait for some tasks to finish before quitting, as database shutdown, or webserver active requests to finish (it was basically catching signal and not exiting quickly as supervisor expected)

actually this is how the problem can be replicated, and it is systematic, and it was mentioned in the post 1 year ago, but was not clear enough, I though a maintainer would find some technical description useful.

basically if you want to replicate the issue, just make the process for which you request to stop ignore the stop signal.

the supervisorctl will be continuously calling supervisord which will respond continuously without any break, and creates a continuous loop that eat up 100% of CPU till the underlying process decide to quit.(hopefully you have many CPUs then it is not 100%)

maybe the only reason not so many people complain about this, is that not so many people manage process such that they wait for them more then 30mins to quit, actually the setup I'm dealing with is meant to allow processes waiting such long time to quit, and I can't afford to have 1 CPU out during all that time.

[program:pgsql]
# if error then wait 10s before next retry
command=/supervisor/once_10.sh '/pgsql.sh postgres'
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
autostart=true
autorestart=true
# only considers agent started after 10s
startsecs=10
# restart indefinetly
startretries=999999999999
stopwaitsecs=%(ENV_pod_stop_timeout)s
killasgroup=true
# https://www.postgresql.org/docs/9.4/server-shutdown.html
stopsignal=TERM
[program:uwsgi]
# if error then wait 10s before next retry
command=/supervisor/%(ENV_P_UWSGI_POLICY)s_10.sh /uwsgi.sh uwsgi ${P_UWSGI_CMD}
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
autostart=true
autorestart=true
# only considers agent started after 10s
startsecs=10
# restart indefinetly
startretries=999999999999
stopwaitsecs=%(ENV_pod_uwsgi_stop_timeout)s
killasgroup=true

@LinJinghua
Copy link

I have some guesses, I don't know if they are correct. Can you help take a look? @mnaberez

Some cases of supervisord 100% CPU in Linux may be caused by the reuse of file descriptors.

Supervisorctl uses socket to communicate with supervisord. Usually the socket file descriptor in supervisord is lowest-numbered file descriptor not currently open for the process.
Soon supervisorctl quit and the socket file descriptor is closed. But combined_map has added this fd(combined_map.update(socket_map)) and register in poller(self.options.poller.register_readable(fd)). The code is https://github.com/Supervisor/supervisor/blob/3.4.0/supervisor/supervisord.py#L185..L208 .

If supervisorctl ask for restart process, and the process will open a log file as stdout. At this point, this log file descriptor(fd) is most likely the one held by the previous socket. Because this file descriptor is valid, it will never be removed by _ignore_invalid( eventmask & select.POLLNVAL will always be false). PollPoller.readables will always container this file descriptor, and poll for read. But at this time, combined_map no longer containers this file descriptor( dispatcher.del_channel will delete it from socket_map ). The event corresponding to this file descriptor will never be processed.

In the loop, poll returns quickly(because there is really data to process) and CPU increased to 100%(busy wait).

I get this a lot in supervisor 3.4.0. Maybe the master branch also has this problem.

Here are some traces:

~ supervisord --version
3.4.0
~ cat /etc/supervisord.d/subprocess.ini
[program:subprocess]
command=bash -c "while true; do date; ls -lh /proc/self/fd; sleep 1; done"
autostart=true
autorestart=unexpected
startsecs=0
exitcodes=0
redirect_stderr=true
stdout_logfile_maxbytes=128KB
stdout_logfile_backups=0
stderr_logfile_maxbytes=128KB
stderr_logfile_backups=0
...

~ ls -lh /proc/`supervisorctl pid`/fd
total 0
lrwx------ 1 root root 64 Jul 20 17:36 0 -> /dev/null
l-wx------ 1 root root 64 Jul 20 17:36 1 -> pipe:[64823090]
l-wx------ 1 root root 64 Jul 20 17:37 13 -> pipe:[244841693]
lr-x------ 1 root root 64 Jul 20 18:07 18 -> pipe:[244841694]
l-wx------ 1 root root 64 Jul 20 17:36 2 -> pipe:[64823091]
l-wx------ 1 root root 64 Jul 20 17:37 3 -> /var/tmp/log/supervisor/supervisord.log
lrwx------ 1 root root 64 Jul 20 17:37 4 -> socket:[64899115]
lr-x------ 1 root root 64 Jul 20 17:37 7 -> /dev/urandom
l-wx------ 1 root root 64 Jul 20 17:37 9 -> /tmp/subprocess-stdout---supervisor-ChnhnK.log
...

~ strace -Ttt -f -p `supervisorctl pid`
strace: Process 96368 attached
strace: [ Process PID=96368 runs in x32 mode. ]
strace: [ Process PID=96368 runs in 64 bit mode. ]
17:53:54.026947 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000063>
17:53:54.027209 wait4(-1, 0x7ffc085cdd44, WNOHANG, NULL) = 0 <0.000050>
17:53:54.027413 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000040>
17:53:54.027504 wait4(-1, 0x7ffc085cdd44, WNOHANG, NULL) = 0 <0.000013>
17:53:54.027585 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000011>

CPU usage of supervisord increased to 100% usually occurs after supervisorctl retart subprocess. It is generally impossible for the log file(fd=9) to be registered to PollPoller. I guess it was introduced after the socket descriptor was reused.

@justinpryzby
Copy link

Evidence that the FDs are getting confused: I found this in a stderr logfile of one of our processes.

POST /RPC2 HTTP/1.1
Host: localhost
Accept-Encoding: identity
User-Agent: Python-xmlrpc/3.8
Content-Type: text/xml
Accept: text/xml
Content-Length: 122

<?xml version='1.0'?>
<methodCall>
<methodName>supervisor.getAllProcessInfo</methodName>
<params>
</params>
</methodCall>
POST /RPC2 HTTP/1.1
Host: localhost
Accept-Encoding: identity
User-Agent: Python-xmlrpc/3.8
Content-Type: text/xml
Accept: text/xml
Content-Length: 122

<?xml version='1.0'?>
<methodCall>
<methodName>supervisor.getAllProcessInfo</methodName>
<params>
</params>
</methodCall>

The process was running, but not able to write to stdout - in its logging output, I saw:
BrokenPipeError: [Errno 32] Broken pipe

@mnaberez mnaberez pinned this issue Mar 5, 2023
@mnaberez mnaberez changed the title 100% CPU usage (maybe caused by new poll implementation?) 100% CPU usage (maybe caused by new poll implementation?) (help wanted) Mar 5, 2023
@mnaberez
Copy link
Member

A pull request has been submitted to fix the high CPU usage: #1581

If you are experiencing this issue and are able to try the patch, please do so and send your feedback.

@justinpryzby
Copy link

justinpryzby commented Jun 3, 2023

Thanks! We saw no issues since I deployed this to our customers in April.
But just now I see a process stuck writing to its stderr, which supervisor is failing to read from.
This may be a separate independent error than the one that was fixed...

It seems like it's lost track of its FDs, probably during log rotation.
2023-06-02 23:47:00,246 CRIT uncaptured python exception, closing channel <POutputDispatcher at 139775577852032 for <Subprocess at 139775577618176 with name xxxxxxxxx in state RUNNING> (stdout)> (<class 'OSError'>:[Errno 9] Bad file descriptor [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/supervisord.py|runforever|218] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|handle_read_event|281] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|record_output|215] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|_log|184] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|info|327] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|log|345] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|emit|227] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|doRollover|264])

lseek(5, 0, SEEK_CUR) = -1 EBADF (Bad file descriptor)
write(5, "\n>>Sat Jun 3 00:53:57 2023%a\n-E"..., 4079) = -1 EBADF (Bad file descriptor)

$ sudo ls -l /proc/26111/fd/5
ls: cannot access /proc/26111/fd/5: No such file or directory

@justinpryzby
Copy link

justinpryzby commented Jul 6, 2023

Here's an (other?) error that I found in a stderr logfile for an application running under supervisor.

Exception ignored in: <function tail_f_producer.del at 0x7f92f5918f70>
Traceback (most recent call last):
File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/http.py", line 649, in del
self._close()
File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/http.py", line 675, in _close
self.file.close()
OSError: [Errno 9] Bad file descriptor
Exception ignored in: <function tail_f_producer.del at 0x7f92f5918f70>

This is with the patch from [2a93d6b]. To be clear, I think this is a separate problem than the one that was fixed (and I've got no reason to think that the patch caused this issue). Maybe this issue should be closed and a new one created for other FD issues...

Edit: I also observed this on an instance with v4.2.5 without the patch applied.

@justinpryzby
Copy link

I'm observing a process currently using 100% CPU use doing poll() with (I checked) the patch applied.

Currently used ~27hr CPU time
root 2533 1 5 jun21 ? 1-03:22:30 /usr/bin/python3 /usr/local/bin/supervisord -c /home/telsasoft/server/supervisord/supervisord.conf

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=33, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}, {fd=60, events=POLLIN|POLLPRI|POLLHUP}, {fd=63, events=POLLIN|POLLPRI|POLLHUP}, {fd=66, events=POLLIN|POLLPRI|POLLHUP}, {fd=69, events=POLLIN|POLLPRI|POLLHUP}, {fd=71, events=POLLIN|POLLPRI|POLLHUP}, {fd=57, events=POLLIN|POLLPRI|POLLHUP}, {fd=34, events=POLLIN|POLLPRI|POLLHUP}, {fd=37, events=POLLIN|POLLPRI|POLLHUP}, {fd=52, events=POLLIN|POLLPRI|POLLHUP}, {fd=51, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=49, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=14, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=29, events=POLLIN|POLLPRI|POLLHUP}], 27, 1000) = 1 ([{fd=29, revents=POLLNVAL}])
wait4(-1, 0x7ffdd4f1cf20, WNOHANG, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=33, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}, {fd=60, events=POLLIN|POLLPRI|POLLHUP}, {fd=63, events=POLLIN|POLLPRI|POLLHUP}, {fd=66, events=POLLIN|POLLPRI|POLLHUP}, {fd=69, events=POLLIN|POLLPRI|POLLHUP}, {fd=71, events=POLLIN|POLLPRI|POLLHUP}, {fd=57, events=POLLIN|POLLPRI|POLLHUP}, {fd=34, events=POLLIN|POLLPRI|POLLHUP}, {fd=37, events=POLLIN|POLLPRI|POLLHUP}, {fd=52, events=POLLIN|POLLPRI|POLLHUP}, {fd=51, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=49, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=14, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=29, events=POLLIN|POLLPRI|POLLHUP}], 27, 1000) = 1 ([{fd=29, revents=POLLNVAL}])
wait4(-1, 0x7ffdd4f1cf20, WNOHANG, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=33, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}, {fd=60, events=POLLIN|POLLPRI|POLLHUP}, {fd=63, events=POLLIN|POLLPRI|POLLHUP}, {fd=66, events=POLLIN|POLLPRI|POLLHUP}, {fd=69, events=POLLIN|POLLPRI|POLLHUP}, {fd=71, events=POLLIN|POLLPRI|POLLHUP}, {fd=57, events=POLLIN|POLLPRI|POLLHUP}, {fd=34, events=POLLIN|POLLPRI|POLLHUP}, {fd=37, events=POLLIN|POLLPRI|POLLHUP}, {fd=52, events=POLLIN|POLLPRI|POLLHUP}, {fd=51, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=49, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=14, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=29, events=POLLIN|POLLPRI|POLLHUP}], 27, 1000) = 1 ([{fd=29, revents=POLLNVAL}])

There's nothing in the supervisor logfile.
Let me know if I can collect more info somehow.

@Jagadeeshftw
Copy link

Thanks! We saw no issues since I deployed this to our customers in April. But just now I see a process stuck writing to its stderr, which supervisor is failing to read from. This may be a separate independent error than the one that was fixed...

It seems like it's lost track of its FDs, probably during log rotation. 2023-06-02 23:47:00,246 CRIT uncaptured python exception, closing channel <POutputDispatcher at 139775577852032 for <Subprocess at 139775577618176 with name xxxxxxxxx in state RUNNING> (stdout)> (<class 'OSError'>:[Errno 9] Bad file descriptor [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/supervisord.py|runforever|218] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|handle_read_event|281] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|record_output|215] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/dispatchers.py|_log|184] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|info|327] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|log|345] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|emit|227] [/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/supervisor/loggers.py|doRollover|264])

lseek(5, 0, SEEK_CUR) = -1 EBADF (Bad file descriptor) write(5, "\n>>Sat Jun 3 00:53:57 2023%a\n-E"..., 4079) = -1 EBADF (Bad file descriptor)

$ sudo ls -l /proc/26111/fd/5 ls: cannot access /proc/26111/fd/5: No such file or directory

Did you resolve this Python exception error? I'm also facing the same issue.
[velocix@ac-0 cachelogic]$ supervisord --version
4.2.5
[velocix@ac-0 cachelogic]$ tail -1 /var/log/supervisor/supervisord.log
2023-09-13 18:00:08,052 CRIT uncaptured python exception, closing channel <POutputDispatcher at 140414136427688 for <Subprocess at 140414136085808 with name vxsnmpcdnselfcheck in state RUNNING> (stderr)> (<class 'OSError'>:[Errno 29] Illegal seek [/usr/local/lib/python3.6/site-packages/supervisor/supervisord.py|runforever|218] [/usr/local/lib/python3.6/site-packages/supervisor/dispatchers.py|handle_read_event|276] [/usr/local/lib/python3.6/site-packages/supervisor/dispatchers.py|record_output|210] [/usr/local/lib/python3.6/site-packages/supervisor/dispatchers.py|_log|179] [/usr/local/lib/python3.6/site-packages/supervisor/loggers.py|info|327] [/usr/local/lib/python3.6/site-packages/supervisor/loggers.py|log|345] [/usr/local/lib/python3.6/site-packages/supervisor/loggers.py|emit|227] [/usr/local/lib/python3.6/site-packages/supervisor/loggers.py|doRollover|264])
[velocix@ac-0 cachelogic]$

@justinpryzby
Copy link

justinpryzby commented Sep 13, 2023 via email

@Jagadeeshftw
Copy link

Jagadeeshftw commented Sep 14, 2023

@justinpryzby

I was experiencing this issue with Supervisor, and to address it, I attempted to upgrade Supervisor from version 4.2.2 to the latest version(4.2.5) using the command:

pip3 install --upgrade supervisor

However, after upgrading to the latest version, the issue I encountered still persists. I checked the issue section and found suggestions that upgrading the Supervisor would resolve it, but unfortunately, it didn't work in my case.

I'm not sure about whether this version is patch-applied or not.

If you have any insights or suggestions on how to resolve this issue, please feel free to share.
#1602

@mouday
Copy link

mouday commented Mar 21, 2024

yum install, version is 3.4.0

@justinpryzby
Copy link

yum install, version is 3.4.0

What do you mean when you say "yum install..." ? Did you mean to say that you hit the issue with that version or ??

@hedgedawg
Copy link

A pull request has been submitted to fix the high CPU usage: #1581

If you are experiencing this issue and are able to try the patch, please do so and send your feedback.

Thanks @mnaberez for the recommendation.

I had been seeing the issue intermittently where CPU usage spikes to 100%, and I've been able to replicate it after a few attempts of concurrent supervisorctl restart commands for multiple processes managed by supervisor.

Here are my observations, pre and post patch:

Hopefully this helps, as I understand there is some demand for a new version including this change - ref #1635.

Details of the patch:

index 0a4f3e6..2265db9 100755
--- a/supervisor/supervisord.py
+++ b/supervisor/supervisord.py
@@ -222,6 +222,14 @@ class Supervisor:
                         raise
                     except:
                         combined_map[fd].handle_error()
+                else:
+                    # if the fd is not in combined_map, we should unregister it. otherwise,
+                    # it will be polled every time, which may cause 100% cpu usage
+                    self.options.logger.warn('unexpected read event from fd %r' % fd)
+                    try:
+                        self.options.poller.unregister_readable(fd)
+                    except:
+                        pass
 
             for fd in w:
                 if fd in combined_map:
@@ -237,6 +245,12 @@ class Supervisor:
                         raise
                     except:
                         combined_map[fd].handle_error()
+                else:
+                    self.options.logger.warn('unexpected write event from fd %r' % fd)
+                    try:
+                        self.options.poller.unregister_writable(fd)
+                    except:
+                        pass
 
             for group in pgroups:
                 group.transition()

@andreigabreanu
Copy link

Hello,
It seems there is still an issue with the high CPU, not sure if related to this ticket.
We're on version: 4.2.5-1 - Debian 12. After a restart, the CPU is between 0 and 20% (normal usage). It happens every few weeks, where the supervisor process goes to 100% cpu and remains there.

@justinpryzby
Copy link

justinpryzby commented Jul 26, 2024

More evidence of FD confusion.
TelsaSupervisor> reload
Really restart the remote supervisord process y/N? y
error: <class 'http.client.BadStatusLine'>, 2024-07-26 14:34:46 UDP: [10.254.1.52]:55053->[10.28.1.45]:162 [UDP: [10.254.1.52]:55053->[10.28.1.45]:162]:
: file: /usr/lib64/python3.6/http/client.py line: 289

Edit: I should've added that this instance does not have the patch applied.

@mnaberez
Copy link
Member

#1581 (comment):

Encountered similar issue where supervisord is taking high CPU because it is busy processing POLLERR on pipe fd for which the other end of the pipe does not exist. Running version 4.0.3. Question: 1. Not sure if this fix will help here as we don't handle POLLERR as part of def _ignore_invalid(self, fd, eventmask): can we add logic to handle POLLERR as part of def _ignore_invalid(self, fd, eventmask): if (eventmask & select.POLLNVAL) or (eventmask & select.POLLERR): 2. Another symptom is restarting one of the process on our system resolves the issue, which indicates that some fd cleanup post chile supervisor process restarts kicks in and cleans up that fd. Also the issue happens after 50+ days on our system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.