100% CPU usage (maybe caused by new poll implementation?) (help wanted) #807
Comments
@hathawsh also saw high CPU usage from the new poller implementation and submitted #589. That was merged before 3.3.0 was released.
It's possible this is the main loop spinning. Can you run 3.3.0 with the blather log level enabled and post the supervisord log?
There weren't any changes to logging between 3.2.3 and 3.3.0. Since you said 3.2.3 works for you, I wouldn't suspect logging. |
Hi!
I haven't opened the |
Hi, me again, here is the supervisord log with blather level enabled: |
Me too. Every time I log in over HTTP and restart a program, my server's CPU immediately goes to 100%. |
This is 100% reproducible on 3.3.1 (anaconda package on OS X). Bummer... this is a show stopper for what I need. Let me know if any other info is needed. I'm going to have to downgrade and see if an older version doesn't show this behavior. Update: I downgraded to 3.1.3 and this behavior is not present. Something definitely broke between those two versions, but I don't have any of the 3.2.x branch available to test easily. |
I can confirm this. Very frustrating 😞 |
@igorsobreira A number of users have reported high CPU usage from supervisord and believe it was introduced by your patch in #129. Another case of high CPU usage was found in #589 and confirmed to be caused by the patch. Could you please look into this? There are reproduction instructions above in #807 (comment). |
I want to add that of the 76 servers that I have running supervisord, I had to stop/start supervisord on seven of them in the past four days because of this bug. |
Now an eighth, with a bit more detail this time. I am running supervisor 3.3.0. I installed it last May as soon as it came out because it had my patch in it that I really wanted. I restarted supervisor on all my hosts to get the new version and it has been going just fine. Last week, I made a global configuration change. I changed
On the host in particular that is crashing right now, these are the fds:
|
On about 2/3rds of my servers I have now upgraded to 3.3.1 and installed the patch. If it takes over a CPU again I'll let you know but it might be a few days. Sorry I didn't notice the conversation in the pull request. |
That patch did not solve my problem. I almost immediately had one of the updated systems (not the same one as before) go into a poll craziness:
|
Can do. I'll move to commenting over there instead. |
I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts. I just wanted to bring up that this is still happening. |
@plockaby I suggest you try to use |
@plockaby I was able to reproduce one cause of high CPU and observe that #875 did fix it, which is why I merged #875. I haven't observed another cause of high CPU usage yet. If you are still experiencing problems with #875 applied, we are going to need reproduction instructions to show us how to get supervisord into that state so we can work on it. |
Supervisor 3.3.2 was released to PyPI and it includes the poller fixes (#875, #885). The only high CPU problem that I was able to reproduce is fixed by this release. If you are experiencing a high CPU problem on Supervisor 3.3.2, please open a new issue and include instructions to reproduce it. |
I've been on vacation so I'm just getting to this now. I tried 3.3.2 and it didn't solve the CPU spikes. I actually rolled all of my hosts back to 3.2.3 under the assumption that it was a problem with the new polling code. I'm still experiencing supervisord going into polling loops that use 100% CPU. I understand that you can't fix a problem that you can't reproduce. If I had any way to reproduce this consistently, I would show you. All I can give you, really, is situations where it is happening. Right now I can tell you that this is not caused by changes in 3.3.x. I can also tell you that it is (probably) not caused by multiple processes writing to their logs simultaneously as this doesn't happen with any more frequency on our heavily loaded hosts as compared to our lightly loaded hosts. In fact it has happened on hosts that have just one process controlled by supervisord just as often as it has happened on hosts that have forty processes. If I ever do figure out how to reproduce it then I will create a new ticket. Would you mind if I continue to post information about when I experience the issue? |
Of course not. Please do that. |
Interestingly I've discovered that it often recovers from the polling problem. I identified one of my hosts that had started spinning CPU on polling but it wasn't convenient to restart the supervisord at the time so I just left top running in a window on the side. Six days of 100% CPU and it just stopped as suddenly as it started. Another one of the hosts that I identified ran at 100% CPU for nearly five days before it stopped but then it started again six hours later. (I've identified all of these by looking at collectd CPU graphs.) I have an strace running on one of the hosts where the problem has cropped up several times and where I don't mind if the disk fills. When it starts on that host then I should have a bit more information about what is starting and/or stopping the problem. I have also noticed that it is not happening on Centos7 hosts. It so far has only happened on Centos6 hosts. Unfortunately most of my hosts are still on Centos6. :) |
I've narrowed the problem down even further to be specific to the interaction between supervisord and an event listener on the communications channel between the two (i.e. my event listener's stdout). The file descriptor that is generating all of the poll events is my event listener's stdout file. This explains why the problem would disappear: my event listener is configured to restart itself at random intervals and indeed when I restart it the 100% CPU spike goes away. I'm still examining the strace to figure out why the problem starts in the first place. The log itself isn't blowing up in size with insane numbers of writes. |
I was going to write a big long description with quotes from my strace but I think I discovered the problem. The problem happens when my event listener's log is rotated. Because it is an event listener and I really don't care about stdout for event listeners, my configuration is like this:
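(The configuration block itself was not captured in this copy of the thread. Below is a hypothetical sketch of the kind of [eventlistener:x] section being described; the section name, command, event types, paths, and sizes are all assumptions, chosen so that the stdout log is small and rotates frequently.)

[eventlistener:mylistener]                 ; hypothetical section name
command=/usr/local/bin/mylistener          ; hypothetical listener command
events=PROCESS_STATE,TICK_60
stdout_logfile=/var/log/supervisor/mylistener.log
stdout_logfile_maxbytes=1MB                ; small log, so it rotates often
stdout_logfile_backups=0                   ; old logs are discarded on rotation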
When the strace says that my program started going crazy, this appears in the strace:
That is the log file for my event listener and "5" is the file descriptor for its log file that I am currently spinning on polling:
I hope that is enough to go on? |
I have just discovered that this is not a problem specific to event listeners. I had a non-event listener program just trigger the same problem. However that's the first time that has happened that I can tell. |
Still an issue - just hit it on 4.1.0. In this case, I cannot start supervisorctl:

[pryzbyj@telsasoft ~]$ while sleep 1; do ps h -ww -o time,comm 27978; done
POLLOUT will return immediately if you're connecting to an unresponsive peer, causing heavy CPU usage. |
|
Can you please answer the question about epoll that I asked in the closed issue?
|
Hi, I don't have any answers to that, sorry, I only closed it to keep all the comments on this issue in one place. |
Ok! |
A patch was proposed three times identically (originally here, the other two marked as duplicate):

def poll(self, timeout):
    fds = self._poll_fds(timeout)
    readables, writables = [], []
    time.sleep(1)  # it was proposed to add this sleep()

I don't recommend users apply this patch because it is probably just masking an underlying problem. This may appear to work but it may also cause
Reading through this thread, it seems that there have been several different causes of this issue, at least one of which was solved with a linked commit. I don't think any current maintainers know how to reproduce it. It would be extremely helpful if someone could send a minimal reproduction case that causes |
Thanks for the moderation, and for the clear explanation concerning the delay implied by this quick patch (which is not meant to solve the real issue). I forgot that it was me who posted, a year ago, how I avoided the CPU spike; anyway, my objective was only to point out the issue to a maintainer who might be able to really fix it. I posted the sleep fix to allow people who were in the same situation as me, with production constraints, to avoid 100% CPU and keep using supervisord, at the cost of some delay for supervisorctl commands. I put 1s just to make sure people realize it has this cost.

Regarding your last paragraph, I'm not sure all CPU spikes are the result of that, but in my case it was happening when supervisorctl requested the stop of a process that caught the signal and needed to wait for some tasks to finish before quitting, such as a database shutdown or active webserver requests (it was basically catching the signal and not exiting as quickly as supervisor expected). This is actually how the problem can be replicated, and it is systematic; it was mentioned in the post a year ago, but that was not clear enough, and I thought a maintainer would find some technical description useful.

Basically, if you want to replicate the issue, just make the process that you ask to stop ignore the stop signal (a minimal sketch of such a process follows the configuration excerpts below). supervisorctl will then be continuously calling supervisord, which will respond continuously without any break, creating a loop that eats up 100% of one CPU until the underlying process decides to quit (if you have many CPUs, then it is not 100% overall). Maybe the only reason not many people complain about this is that not many people manage processes that they are willing to wait more than 30 minutes for. The setup I'm dealing with is meant to allow processes to take that long to quit, and I can't afford to have one CPU out during all that time.

[program:pgsql]
# if error then wait 10s before next retry
command=/supervisor/once_10.sh '/pgsql.sh postgres'
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
autostart=true
autorestart=true
# only considers agent started after 10s
startsecs=10
# restart indefinitely
startretries=999999999999
stopwaitsecs=%(ENV_pod_stop_timeout)s
killasgroup=true
# https://www.postgresql.org/docs/9.4/server-shutdown.html
stopsignal=TERM

[program:uwsgi]
# if error then wait 10s before next retry
command=/supervisor/%(ENV_P_UWSGI_POLICY)s_10.sh /uwsgi.sh uwsgi ${P_UWSGI_CMD}
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
autostart=true
autorestart=true
# only considers agent started after 10s
startsecs=10
# restart indefinitely
startretries=999999999999
stopwaitsecs=%(ENV_pod_uwsgi_stop_timeout)s
killasgroup=true |
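As an editorial aside, here is a minimal sketch of the reproduction described in the comment above: a process that ignores supervisord's stop signal. The file name and the idea of pairing it with a long stopwaitsecs are assumptions for illustration; this is not code from the thread.

# ignore_stop_signal.py -- hypothetical reproduction helper (not from the thread).
# Run it under a [program:x] section with a large stopwaitsecs, then issue
# "supervisorctl stop x". While supervisord waits for the process to exit, the
# commenter above reports supervisorctl and supervisord looping at 100% CPU.
import signal
import time

# Ignore SIGTERM, supervisord's default stopsignal, so the stop request can
# only be satisfied once stopwaitsecs expires (or a SIGKILL is sent).
signal.signal(signal.SIGTERM, signal.SIG_IGN)

while True:
    time.sleep(60)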
I have some guesses; I don't know if they are correct. Can you help take a look? @mnaberez

Some cases of supervisord at 100% CPU on Linux may be caused by the reuse of file descriptors. supervisorctl uses a socket to communicate with supervisord, and the socket file descriptor in supervisord is usually the lowest-numbered file descriptor not currently open in the process. If supervisorctl asks to restart a process, the new process's log file may be opened with a reused file descriptor number. In the loop, poll then returns quickly (because there really is data to process on that descriptor) and CPU usage increases to 100% (busy wait). (A small illustrative sketch of this file-descriptor-reuse effect is included after the traces below.)

I get this a lot in supervisor 3.4.0. Maybe the master branch also has this problem. Here are some traces:

~ supervisord --version
3.4.0
~ cat /etc/supervisord.d/subprocess.ini
[program:subprocess]
command=bash -c "while true; do date; ls -lh /proc/self/fd; sleep 1; done"
autostart=true
autorestart=unexpected
startsecs=0
exitcodes=0
redirect_stderr=true
stdout_logfile_maxbytes=128KB
stdout_logfile_backups=0
stderr_logfile_maxbytes=128KB
stderr_logfile_backups=0
...
~ ls -lh /proc/`supervisorctl pid`/fd
total 0
lrwx------ 1 root root 64 Jul 20 17:36 0 -> /dev/null
l-wx------ 1 root root 64 Jul 20 17:36 1 -> pipe:[64823090]
l-wx------ 1 root root 64 Jul 20 17:37 13 -> pipe:[244841693]
lr-x------ 1 root root 64 Jul 20 18:07 18 -> pipe:[244841694]
l-wx------ 1 root root 64 Jul 20 17:36 2 -> pipe:[64823091]
l-wx------ 1 root root 64 Jul 20 17:37 3 -> /var/tmp/log/supervisor/supervisord.log
lrwx------ 1 root root 64 Jul 20 17:37 4 -> socket:[64899115]
lr-x------ 1 root root 64 Jul 20 17:37 7 -> /dev/urandom
l-wx------ 1 root root 64 Jul 20 17:37 9 -> /tmp/subprocess-stdout---supervisor-ChnhnK.log
...
~ strace -Ttt -f -p `supervisorctl pid`
strace: Process 96368 attached
strace: [ Process PID=96368 runs in x32 mode. ]
strace: [ Process PID=96368 runs in 64 bit mode. ]
17:53:54.026947 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000063>
17:53:54.027209 wait4(-1, 0x7ffc085cdd44, WNOHANG, NULL) = 0 <0.000050>
17:53:54.027413 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000040>
17:53:54.027504 wait4(-1, 0x7ffc085cdd44, WNOHANG, NULL) = 0 <0.000013>
17:53:54.027585 poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}], 3, 1000) = 1 ([{fd=9, revents=POLLIN}]) <0.000011>

supervisord's CPU usage increasing to 100% usually occurs after |
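To make the theory above concrete, here is a small standalone sketch (an editorial illustration, not supervisord code) of how a stale poll registration can turn into a busy loop once its file descriptor number is reused by a regular file; the temporary-file setup and names are assumptions.

# fd_reuse_demo.py -- illustrative sketch only.
import os
import select
import tempfile

r, w = os.pipe()                       # pretend r is a child process's stdout pipe
poller = select.poll()
poller.register(r, select.POLLIN | select.POLLPRI | select.POLLHUP)

# Simulate losing track of the descriptor: the pipe end is closed (e.g. the
# child is restarted) but the poller registration is never removed.
os.close(r)

# On Linux, the next open() reuses the lowest free descriptor number, so the
# new log file usually gets the same number the pipe had.
log_fd, log_path = tempfile.mkstemp()
print("pipe fd was %d, new log fd is %d" % (r, log_fd))

for _ in range(5):
    # The stale registration now refers to a regular file, which always reports
    # POLLIN, so poll() returns immediately instead of sleeping for the timeout.
    # (If the number had not been reused, poll() would instead report POLLNVAL
    # immediately -- either way, a main loop built on this spins at 100% CPU.)
    print(poller.poll(1000))           # 1000 ms timeout is never reached

os.close(w)
os.close(log_fd)
os.unlink(log_path)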
Evidence that the FDs are getting confused: I found this in a stderr logfile of one of our processes.
The process was running, but not able to write to stdout - in its logging output, I saw: |
A pull request has been submitted to fix the high CPU usage: #1581 If you are experiencing this issue and are able to try the patch, please do so and send your feedback. |
Thanks! We saw no issues since I deployed this to our customers in April.

It seems like it's lost track of its FDs, probably during log rotation.

lseek(5, 0, SEEK_CUR) = -1 EBADF (Bad file descriptor)

$ sudo ls -l /proc/26111/fd/5
Here's an (other?) error that I found in a stderr logfile for an application running under supervisor:

Exception ignored in: <function tail_f_producer.__del__ at 0x7f92f5918f70>

This is with the patch from [2a93d6b]. To be clear, I think this is a separate problem from the one that was fixed (and I've got no reason to think that the patch caused this issue). Maybe this issue should be closed and a new one created for other FD issues...

Edit: I also observed this on an instance with v4.2.5 without the patch applied. |
I'm observing a process currently using 100% CPU doing poll(), with (I checked) the patch applied. It has currently used ~27hr of CPU time.

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=33, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}, {fd=60, events=POLLIN|POLLPRI|POLLHUP}, {fd=63, events=POLLIN|POLLPRI|POLLHUP}, {fd=66, events=POLLIN|POLLPRI|POLLHUP}, {fd=69, events=POLLIN|POLLPRI|POLLHUP}, {fd=71, events=POLLIN|POLLPRI|POLLHUP}, {fd=57, events=POLLIN|POLLPRI|POLLHUP}, {fd=34, events=POLLIN|POLLPRI|POLLHUP}, {fd=37, events=POLLIN|POLLPRI|POLLHUP}, {fd=52, events=POLLIN|POLLPRI|POLLHUP}, {fd=51, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=49, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=14, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=29, events=POLLIN|POLLPRI|POLLHUP}], 27, 1000) = 1 ([{fd=29, revents=POLLNVAL}])

There's nothing in the supervisor logfile. |
Did you resolve this Python exception error? I'm also facing the same issue. |
On Wed, Sep 13, 2023 at 11:12:23AM -0700, Jagadeesh B wrote:
> Did you resolve this Python exception error? I'm also facing the same issue.

I didn't resolve it, no.

> ***@***.*** cachelogic]$ supervisord --version
> 4.2.5

Is that with the patch applied (or git HEAD) or unpatched 4.2.5?
|
I was experiencing this issue with Supervisor, and to address it, I attempted to upgrade Supervisor from version 4.2.2 to the latest version (4.2.5) using the command:

pip3 install --upgrade supervisor

However, after upgrading to the latest version, the issue still persists. I checked the issues section and found suggestions that upgrading Supervisor would resolve it, but unfortunately, it didn't work in my case. I'm not sure whether this version has the patch applied or not. If you have any insights or suggestions on how to resolve this issue, please feel free to share. |
yum install, version is 3.4.0 |
What do you mean when you say "yum install..." ? Did you mean to say that you hit the issue with that version or ?? |
Thanks @mnaberez for the recommendation. I had been seeing the issue intermittently, where CPU usage spikes to 100%, and I've been able to replicate it after a few attempts of concurrent

Here are my observations, pre and post patch:
Hopefully this helps, as I understand there is some demand for a new version including this change - ref #1635.

Details of the patch:

index 0a4f3e6..2265db9 100755
--- a/supervisor/supervisord.py
+++ b/supervisor/supervisord.py
@@ -222,6 +222,14 @@ class Supervisor:
raise
except:
combined_map[fd].handle_error()
+ else:
+ # if the fd is not in combined_map, we should unregister it. otherwise,
+ # it will be polled every time, which may cause 100% cpu usage
+ self.options.logger.warn('unexpected read event from fd %r' % fd)
+ try:
+ self.options.poller.unregister_readable(fd)
+ except:
+ pass
for fd in w:
if fd in combined_map:
@@ -237,6 +245,12 @@ class Supervisor:
raise
except:
combined_map[fd].handle_error()
+ else:
+ self.options.logger.warn('unexpected write event from fd %r' % fd)
+ try:
+ self.options.poller.unregister_writable(fd)
+ except:
+ pass
for group in pgroups:
group.transition() |
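For readers who do not want to parse the diff above, the guard it adds boils down to the following condensed, hypothetical restatement (not the literal supervisord code; the dispatcher and poller method names are taken from the diff and from supervisord's main loop, and the function name and parameters here are mine):

def dispatch_readables(readable_fds, combined_map, poller, logger):
    # combined_map maps registered fds to their dispatchers; fds reported by
    # poll() that are no longer in it are stale registrations.
    for fd in readable_fds:
        if fd in combined_map:
            combined_map[fd].handle_read_event()
        else:
            # A stale fd (e.g. left over from a rotated log or restarted child)
            # would otherwise be reported on every poll() call and spin the
            # loop, so drop it from the poller.
            logger.warn('unexpected read event from fd %r' % fd)
            try:
                poller.unregister_readable(fd)
            except Exception:
                pass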
Hello, |
More evidence of FD confusion. Edit: I should've added that this instance does not have the patch applied. |
|
We were using supervisor 3.3.0 on Ubuntu 14.04 LTS.
Recently, on some of our nodes on AWS, we have spotted very high CPU usage from supervisord. To get around it, we have to reload supervisord, but it may happen again within a day.
Reading the strace output, we saw excessive calls to both 'gettimeofday' and 'poll', so we chose to downgrade supervisor to 3.2.3. I see there was #581, but I think it's irrelevant here; our wild guess is that this is caused by the new poll implementation introduced in 3.3.0 (and maybe triggered by simultaneous log output?)...
Thanks in advance!