
Investigate ability to detect out-of-memory-related restart looping #1318

Open
gkc opened this issue Apr 24, 2023 · 28 comments
Assignees
Labels
enhancement New feature or request

Comments

@gkc
Contributor

gkc commented Apr 24, 2023

Is your feature request related to a problem? Please describe.

We need a way to detect out-of-memory-related restart looping

Describe the solution you'd like

Have a tool which listens for events such as when the docker swarm manager has killed a container. Such a tool could then check if the container was killed due to container memory usage exceeding its cap, and could also check if it was previously killed within N (e.g. 10) minutes also due to container memory usage exceeding its cap

See https://docs.docker.com/engine/reference/commandline/events/
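
The detection logic above can be sketched as a small filter over the JSON stream that `docker events --format '{{json .}}'` emits. This is a minimal sketch, not a tested implementation: the field names (`Type`, `status`, `id`, `time`) are assumptions based on the docker events JSON output and should be verified against the Docker documentation for the engine version in use.

```python
OOM_WINDOW_SECONDS = 600  # the "N (e.g. 10) minutes" from the description above

def detect_oom_loops(events, window=OOM_WINDOW_SECONDS):
    """Yield the id of any container that is OOM-killed twice within
    `window` seconds.

    `events` is an iterable of dicts shaped like the JSON lines emitted
    by `docker events --format '{{json .}}'` (field names are assumptions
    based on that format).
    """
    last_oom = {}  # container id -> unix time of its most recent oom event
    for ev in events:
        if ev.get("Type") == "container" and ev.get("status") == "oom":
            cid = ev.get("id")
            t = ev.get("time", 0)
            prev = last_oom.get(cid)
            if prev is not None and t - prev <= window:
                yield cid  # second OOM kill inside the window: likely looping
            last_oom[cid] = t
```

A hypothetical wiring would be `docker events --filter event=oom --format '{{json .}}' | python3 detector.py`, with the script calling `json.loads` on each stdin line and feeding the resulting dicts to `detect_oom_loops`.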

Describe alternatives you've considered

No response

Additional context

Linked to #1303

@athandle
Contributor

1st incident - logs from @kumarnarendra701; @sitaram-kalluri is checking
root@swarm0002-01:~# docker service logs -f uc0c4qz75h98k830v3nimbf1d
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:20.817282|AtSecondaryServer|currentAtSign : @foolishgemini1
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:20.829918|HiveBase|commit_log_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:20.896658|HiveBase|access_log_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:27.270485|HiveBase|notifications_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |

@kumarnarendra701
Collaborator

2nd incident - @snowthe18raw (5835)

Logs -

O|2023-05-01 12:37:56.179259|AtSecondaryServer|currentAtSign : @snowthe18raw 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | INFO|2023-05-01 12:37:56.209010|HiveBase|commit_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | INFO|2023-05-01 12:37:56.240765|HiveBase|access_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | INFO|2023-05-01 12:38:04.473256|HiveBase|notifications_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | INFO|2023-05-01 12:38:53.763544|AtSecondaryServer|currentAtSign : @snowthe18raw 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | INFO|2023-05-01 12:38:53.774617|HiveBase|commit_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | INFO|2023-05-01 12:38:53.790753|HiveBase|access_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | 
^C

@athandle
Contributor

athandle commented Jul 9, 2023

@sitaram-kalluri @murali-shris can you add your comments on this ticket?

@gkc
Contributor Author

gkc commented Jul 9, 2023

@athandle : @sitaram-kalluri has made good progress in profiling memory usage on #1303 and has PR #1428 which reduces memory consumption during startup and in steady state

However, this ticket is about detecting out-of-memory restarts at the swarm level. See "Describe the solution you'd like" in this ticket's description above

@kumarnarendra701
Collaborator

Moving to the next sprint

@cpswan
Member

cpswan commented Oct 2, 2023

Moving to PR72. @kumarnarendra701 can you please make sure this one's on your list for this sprint.

@kumarnarendra701
Collaborator

@cpswan @gkc - Can we use this swarmprom to monitor our swarm clusters? Because if we use Docker events, we'd have to set up our custom script on each of our swarm nodes. Please share your thoughts and let me know if you see any other monitoring tools.

https://dockerswarm.rocks/swarmprom/

@cpswan
Member

cpswan commented Oct 9, 2023

Nice find @kumarnarendra701! Swarmprom looks really nice.

Let's get it set up on staging and see how we get on with it (and whether it can solve the problem we're looking at here).

@kumarnarendra701
Collaborator

@cpswan - I set up Swarmprom in our staging environment, and I'm currently exploring the UI to learn more about it. Moving this to the next sprint for further work.

@kumarnarendra701
Collaborator

@cpswan - Didn't get a chance to work on this tool; moving it to the next sprint.

@kumarnarendra701
Collaborator

@cpswan - Moving to the next sprint. I'm seeing the issue in Swarmprom UI. I'll update further progress on the ticket.

@kumarnarendra701
Collaborator

@cpswan
Quick update on Swarmprom - I'm currently experiencing some trouble as I'm unable to view all swarm nodes and their services on Swarmprom UI. Finding a solution has been quite challenging as there is very little documentation available on the internet. However, I'm actively working on resolving this issue and will keep you updated on the progress of the setup.
cc: @athandle

@kumarnarendra701
Collaborator

@cpswan - I'm seeing issues with the Swarmprom setup, as development on its repository has stopped. I've therefore found another tool and am exploring Portainer.

(screenshot attached)
cc: @athandle

@kumarnarendra701
Collaborator

@cpswan - Portainer UI setup is complete; I'm facing some issues with agent connectivity and am working on them.

@cpswan
Member

cpswan commented Jan 2, 2024

@cconstab I know that you tried Portainer a while ago, so it would be good to get your feedback on it.

@kumarnarendra701
Collaborator

@cpswan - I used Portainer in my staging Swarm cluster, but I noticed it's mainly for managing Docker Swarm itself and doesn't focus much on monitoring. Also, it offers little visibility into stacks created outside of Portainer.

@cconstab
Member

cconstab commented Jan 2, 2024

I found it worked OK in small setups like my home lab, but it did not scale well to our setup. It used to, at the least, become laggy and unreliable. I also had security concerns.

My take in the end was: use the CLI, and if we need tools, look elsewhere.

The Portainer team also caught the "k8s" bug pretty badly, and that started to pull the project away from Swarm mode.

This was 2 years back, so things may well have changed.

@cpswan
Member

cpswan commented Jan 8, 2024

Bumping to PR78 so that @kumarnarendra701 can continue. I've suggested:

If Portainer isn't suitable then maybe go back to swarmprom and let's see what it might take to get it up to scratch.

@kumarnarendra701
Collaborator

@cpswan - I tried to set up Swarmprom on a staging cluster, and while all services seem to be working fine, the Swarm node dashboard shows only one node rather than the whole cluster. I've tried to find a solution, but it's proving very difficult to debug given the limited documentation available online.
Active service:

(screenshot attached)

Swarm UI:

(screenshot attached)

Setup information:
Server: staging0001-01
Dir: /root/swarmprom
Command to start swarmprom: ADMIN_USER=atadmin ADMIN_PASSWORD=**** SLACK_URL=https://hooks.slack.com/services/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 SLACK_CHANNEL=devops-alerts SLACK_USER=alertmanager docker stack deploy -c docker-compose.yml mon

Can you please quickly review this and let me know if you notice any issues with the setup?
cc: @athandle

@cpswan
Member

cpswan commented Jan 17, 2024

@kumarnarendra701 looks like the mon_dockerd-exporter containers are unable to send their data:

...
17/Jan/2024:14:18:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:17 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:47 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:19:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused

My fault finding process:

  • You're only seeing data from one host in the cluster, which I expect is the 01 manager. Hypothesis: It can talk to itself, but the other exporters can't talk to it?
  • Take a look at the exporter services with docker service ps mon_dockerd-exporter
ID             NAME                                             IMAGE                       NODE                                 DESIRED STATE   CURRENT STATE        ERROR     PORTS
j9l9814758ti   mon_dockerd-exporter.f7fctkgsxqyzqbz2qvivpvmc2   stefanprodan/caddy:latest   staging0001-03.us-central1-c.c.development-305719.internal   Running         Running 6 days ago
q3d5yyzdkxcv   mon_dockerd-exporter.frqgu019hd0jzohhvpsgf6s6v   stefanprodan/caddy:latest   staging0001-04.us-central1-a.c.development-305719.internal   Running         Running 6 days ago
pn83y3720y3r   mon_dockerd-exporter.hdyr1xdcacg9ahrsvec1n5jp6   stefanprodan/caddy:latest   staging0001-01                                 Running         Running 6 days ago
1ff406n2eqca   mon_dockerd-exporter.ipumfp1ioq0ue6vmcw6y70he2   stefanprodan/caddy:latest   staging0001-06.us-central1-c.c.development-305719.internal   Running         Running 6 days ago
vxj7ti0mvuth   mon_dockerd-exporter.njs9res7cc75ny27qo9ixtsgy   stefanprodan/caddy:latest   staging0001-05.us-central1-b.c.development-305719.internal   Running         Running 6 days ago
ug2ri84zlfm6   mon_dockerd-exporter.pt2qgrl1usnb3iqbbal8v96h2   stefanprodan/caddy:latest   staging0001-02                                 Running         Running 6 days ago
  • Pick a node (I went with staging0001-04) and see what's going on there with docker logs mon_dockerd-exporter.frqgu019hd0jzohhvpsgf6s6v.q3d5yyzdkxcvqykdfr0kcqkqw, which yields the log snippet above.

I'd call out that 172.18.x addresses aren't in the LAN range for that Swarm.

@kumarnarendra701
Collaborator

kumarnarendra701 commented Jan 19, 2024

@cpswan - Thanks for your input. I tried running "swarmprom" in the secondary Docker network, but it failed. Although I can ping the IP from the container, I cannot connect to port 9323.

Errors -

19/Jan/2024:13:33:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:33:47 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:17 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
staging0001-04 ~ # docker exec -it fe83602d5257 sh
/www # ping 172.18.0.1
PING 172.18.0.1 (172.18.0.1): 56 data bytes
64 bytes from 172.18.0.1: seq=0 ttl=64 time=0.239 ms
64 bytes from 172.18.0.1: seq=1 ttl=64 time=0.111 ms

64 bytes from 172.18.0.1: seq=2 ttl=64 time=0.126 ms
64 bytes from 172.18.0.1: seq=3 ttl=64 time=0.131 ms
64 bytes from 172.18.0.1: seq=4 ttl=64 time=0.111 ms
^C
--- 172.18.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.111/0.143/0.239 ms
/www # 
/www # 
/www # 
/www # telnet 172.18.0.1 9323
telnet: can't connect to remote host (172.18.0.1): Connection refused
/www # 
/www # 
/www # exit
staging0001-04 ~ # 
staging0001-04 ~ # 
staging0001-04 ~ # ping 172.18.0.1
PING 172.18.0.1 (172.18.0.1) 56(84) bytes of data.
64 bytes from 172.18.0.1: icmp_seq=1 ttl=64 time=0.192 ms
64 bytes from 172.18.0.1: icmp_seq=2 ttl=64 time=0.051 ms
64 bytes from 172.18.0.1: icmp_seq=3 ttl=64 time=0.071 ms
^C
--- 172.18.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.051/0.104/0.192/0.062 ms

The IP it's trying to connect to belongs to the Docker gateway bridge network:

docker_gwbridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
        inet6 fe80::42:82ff:fe2f:5115  prefixlen 64  scopeid 0x20<link>
        ether 02:42:82:2f:51:15  txqueuelen 0  (Ethernet)
        RX packets 1126064  bytes 308309940 (294.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1351476  bytes 134159319 (127.9 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

cc: @athandle

@athandle
Contributor

Reduced SP and moved to next sprint

@cpswan
Member

cpswan commented May 13, 2024

@kumarnarendra701 can you please try to get back into this and see if you can resolve the network issues.

@kumarnarendra701
Collaborator

@cpswan - I've tried using Swarmprom several times, but it looks like the repository was archived 4 years ago and there are very few blog posts about it. It seems like we might need to consider using other monitoring tools, but most tools are designed for Kubernetes with very few options for Docker swarm monitoring. If you know of any tools that can monitor a swarm cluster, please suggest them so that I can start implementing them.
cc: @gkc

@gkc
Contributor Author

gkc commented May 22, 2024

I've started running docker events to a log on each swarm, I'll take a look at the output tomorrow

@gkc gkc assigned gkc and unassigned cpswan, cconstab and kumarnarendra701 Jun 24, 2024
@gkc
Contributor Author

gkc commented Jun 24, 2024

I did look at the output and all interesting events are being logged. There weren't any "too much memory being used" restarts when last I looked after a couple of days; I will look again at the weekend

@gkc
Contributor Author

gkc commented Jul 8, 2024

docker events has indeed been reporting container die messages which include the exit code - i.e. docker events produces enough information to allow creation of a script which listens to and acts on the event stream, as described in the original description of this issue.
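
A minimal sketch of such a listener's per-event test, assuming the die event carries the exit code as a string attribute (which matches the docker events JSON output, though the schema should be verified for the engine version in use). Exit code 137 means the container was killed by SIGKILL, which is the usual OOM-kill signature but can have other causes:

```python
import json

def is_oom_die(event_json_line):
    """Return True if a `docker events` JSON line describes a container
    dying with exit code 137 (SIGKILL, the usual OOM-kill signature).

    Field names are assumptions based on the docker events JSON output.
    """
    ev = json.loads(event_json_line)
    if ev.get("Type") != "container" or ev.get("status") != "die":
        return False
    attrs = ev.get("Actor", {}).get("Attributes", {})
    return attrs.get("exitCode") == "137"
```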

@gkc
Contributor Author

gkc commented Jul 22, 2024

I will create a script during this sprint and do some testing via my atServer to verify it
