
Investigate ability to detect out-of-memory-related restart looping #1318

Open
gkc opened this issue Apr 24, 2023 · 28 comments
Assignees
Labels
enhancement New feature or request

Comments

@gkc
Contributor

gkc commented Apr 24, 2023

Is your feature request related to a problem? Please describe.

We need a way to detect out-of-memory-related restart looping

Describe the solution you'd like

Have a tool which listens for events such as when the docker swarm manager has killed a container. Such a tool could then check if the container was killed due to container memory usage exceeding its cap, and could also check if it was previously killed within N (e.g. 10) minutes also due to container memory usage exceeding its cap

See https://docs.docker.com/engine/reference/commandline/events/
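
The detection logic above can be sketched as a small filter over the JSON stream that `docker events --format '{{json .}}'` emits. This is a minimal sketch, not a tested implementation: the field names (`Type`, `status`, `id`, `time`) are assumptions based on the docker events JSON output and should be verified against the Docker documentation for the engine version in use.

```python
OOM_WINDOW_SECONDS = 600  # the "N (e.g. 10) minutes" from the description above

def detect_oom_loops(events, window=OOM_WINDOW_SECONDS):
    """Yield the id of any container that is OOM-killed twice within
    `window` seconds.

    `events` is an iterable of dicts shaped like the JSON lines emitted
    by `docker events --format '{{json .}}'` (field names are assumptions
    based on that format).
    """
    last_oom = {}  # container id -> unix time of its most recent oom event
    for ev in events:
        if ev.get("Type") == "container" and ev.get("status") == "oom":
            cid = ev.get("id")
            t = ev.get("time", 0)
            prev = last_oom.get(cid)
            if prev is not None and t - prev <= window:
                yield cid  # second OOM kill inside the window: likely looping
            last_oom[cid] = t
```

A hypothetical wiring would be `docker events --filter event=oom --format '{{json .}}' | python3 detector.py`, with the script calling `json.loads` on each stdin line and feeding the resulting dicts to `detect_oom_loops`.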

Describe alternatives you've considered

No response

Additional context

Linked to #1303

@athandle
Contributor

1st incident - logs from @kumarnarendra701; @sitaram-kalluri is checking
root@swarm0002-01:~# docker service logs -f uc0c4qz75h98k830v3nimbf1d
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:20.817282|AtSecondaryServer|currentAtSign : @foolishgemini1
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:20.829918|HiveBase|commit_log_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:20.896658|HiveBase|access_log_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal | INFO|2023-04-25 12:32:27.270485|HiveBase|notifications_106a8370ed3426f079bb4b1239aa00ee682c830a9898eadeea1a84b6d7fedd2a initialized successfully
5ee98e0b-b85b-57dd-88f8-b5fa2a4ae8b3_secondary.1.uc0c4qz75h98@swarm0002-07.us-central1-a.c.secondaries.internal |

@kumarnarendra701
Collaborator

2nd incident - @snowthe18raw (5835)

Logs -

O|2023-05-01 12:37:56.179259|AtSecondaryServer|currentAtSign : @snowthe18raw 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | INFO|2023-05-01 12:37:56.209010|HiveBase|commit_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | INFO|2023-05-01 12:37:56.240765|HiveBase|access_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | INFO|2023-05-01 12:38:04.473256|HiveBase|notifications_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.xaoe45b8qput@swarm0002-20.us-central1-b.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | INFO|2023-05-01 12:38:53.763544|AtSecondaryServer|currentAtSign : @snowthe18raw 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | INFO|2023-05-01 12:38:53.774617|HiveBase|commit_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | INFO|2023-05-01 12:38:53.790753|HiveBase|access_log_f259ac132bcb4ea5ebbba9b8202c4c96067b0047c38fd5be48bc0dc7ea4c6b7f initialized successfully 
3719a964-2f4c-5fb8-8108-5a0eb96e4fc0_secondary.1.vdtujfc7slcm@swarm0002-21.us-central1-c.c.secondaries.internal    | 
^C

@athandle
Contributor

athandle commented Jul 9, 2023

@sitaram-kalluri @murali-shris can you add your comments on this ticket?

@gkc
Contributor Author

gkc commented Jul 9, 2023

@athandle : @sitaram-kalluri has made good progress in profiling memory usage on #1303 and has PR #1428 which reduces memory consumption during startup and in steady state

However, this ticket is about detecting out-of-memory restarts at the swarm level. See "Describe the solution you'd like" in this ticket's description above

@kumarnarendra701
Collaborator

Moving to the next sprint

@cpswan
Member

cpswan commented Oct 2, 2023

Moving to PR72. @kumarnarendra701 can you please make sure this one's on your list for this sprint.

@kumarnarendra701
Collaborator

@cpswan @gkc - Can we use this swarmprom to monitor our swarm clusters? Because if we use Docker events, we'd have to set up our custom script on each of our swarm nodes. Please share your thoughts and let me know if you see any other monitoring tools.

https://dockerswarm.rocks/swarmprom/

@cpswan
Member

cpswan commented Oct 9, 2023

Nice find @kumarnarendra701! Swarmprom looks really nice.

Let's get it set up on staging and see how we get on with it (and whether it can solve the problem we're looking at here).

@kumarnarendra701
Collaborator

@cpswan - I set up Swarmprom in our staging environment, and I'm currently exploring the UI to learn more about it. Moving this to the next sprint for further work.

@kumarnarendra701
Collaborator

@cpswan - Didn't get a chance to work on this tool; moving it to the next sprint.

@kumarnarendra701
Collaborator

@cpswan - Moving to the next sprint. I'm seeing the issue in Swarmprom UI. I'll update further progress on the ticket.

@kumarnarendra701
Collaborator

@cpswan
Quick update on Swarmprom - I'm currently experiencing some trouble as I'm unable to view all swarm nodes and their services on Swarmprom UI. Finding a solution has been quite challenging as there is very little documentation available on the internet. However, I'm actively working on resolving this issue and will keep you updated on the progress of the setup.
cc: @athandle

@kumarnarendra701
Collaborator

@cpswan - I'm seeing issues with the Swarmprom setup, as development on its repository has stopped. I've therefore found another tool and am exploring Portainer.

(screenshot attached)
cc: @athandle

@kumarnarendra701
Collaborator

@cpswan - Portainer UI setup is complete; I'm facing some issues with agent connectivity and am working on them.

@cpswan
Member

cpswan commented Jan 2, 2024

@cconstab I know that you tried Portainer a while ago, so it would be good to get your feedback on it.

@kumarnarendra701
Collaborator

@cpswan - I used Portainer in my staging Swarm cluster, but I noticed it's mainly for managing Docker Swarm itself and doesn't focus much on monitoring. Also, it offers little visibility into stacks created outside of Portainer.

@cconstab
Member

cconstab commented Jan 2, 2024

I found it worked OK in small setups like my home lab, but it did not scale well to our setup. It used to, at the least, become laggy and unreliable. I also had security concerns.

My take in the end was: use the CLI, and if we need tools, look elsewhere.

The Portainer team also caught the "k8s" bug pretty badly, and that started to pull the project away from Swarm mode.

This was 2 years back, so things may well have changed.

@cpswan
Member

cpswan commented Jan 8, 2024

Bumping to PR78 so that @kumarnarendra701 can continue. I've suggested:

If Portainer isn't suitable then maybe go back to swarmprom and let's see what it might take to get it up to scratch.

@kumarnarendra701
Collaborator

@cpswan - I tried to set up Swarmprom on a staging cluster, and while all services seem to be working fine, the Swarm node dashboard shows only one node rather than the whole cluster. I've tried to find a solution, but it's proving very difficult to debug given the limited documentation available online.
Active service:

(screenshot attached)

Swarm UI:

(screenshot attached)

Setup information:
Server: staging0001-01
Dir: /root/swarmprom
Command to start swarmprom: ADMIN_USER=atadmin ADMIN_PASSWORD=**** SLACK_URL=https://hooks.slack.com/services/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 SLACK_CHANNEL=devops-alerts SLACK_USER=alertmanager docker stack deploy -c docker-compose.yml mon

Can you please quickly review this and let me know if you notice any issues with the setup?
cc: @athandle

@cpswan
Member

cpswan commented Jan 17, 2024

@kumarnarendra701 looks like the mon_dockerd-exporter containers are unable to send their data:

...
17/Jan/2024:14:18:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:17 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:18:47 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
17/Jan/2024:14:19:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused

My fault finding process:

  • You're only seeing data from one host in the cluster, which I expect is the 01 manager. Hypothesis: It can talk to itself, but the other exporters can't talk to it?
  • Take a look at the exporter services with docker service ps mon_dockerd-exporter
ID             NAME                                             IMAGE                       NODE                                 DESIRED STATE   CURRENT STATE        ERROR     PORTS
j9l9814758ti   mon_dockerd-exporter.f7fctkgsxqyzqbz2qvivpvmc2   stefanprodan/caddy:latest   staging0001-03.us-central1-c.c.development-305719.internal   Running         Running 6 days ago
q3d5yyzdkxcv   mon_dockerd-exporter.frqgu019hd0jzohhvpsgf6s6v   stefanprodan/caddy:latest   staging0001-04.us-central1-a.c.development-305719.internal   Running         Running 6 days ago
pn83y3720y3r   mon_dockerd-exporter.hdyr1xdcacg9ahrsvec1n5jp6   stefanprodan/caddy:latest   staging0001-01                                 Running         Running 6 days ago
1ff406n2eqca   mon_dockerd-exporter.ipumfp1ioq0ue6vmcw6y70he2   stefanprodan/caddy:latest   staging0001-06.us-central1-c.c.development-305719.internal   Running         Running 6 days ago
vxj7ti0mvuth   mon_dockerd-exporter.njs9res7cc75ny27qo9ixtsgy   stefanprodan/caddy:latest   staging0001-05.us-central1-b.c.development-305719.internal   Running         Running 6 days ago
ug2ri84zlfm6   mon_dockerd-exporter.pt2qgrl1usnb3iqbbal8v96h2   stefanprodan/caddy:latest   staging0001-02                                 Running         Running 6 days ago
  • Pick a node (I went with staging0001-04) and see what's going on there with docker logs mon_dockerd-exporter.frqgu019hd0jzohhvpsgf6s6v.q3d5yyzdkxcvqykdfr0kcqkqw, which yields the log snippet above.

I'd call out that 172.18.x addresses aren't in the LAN range for that Swarm.

@kumarnarendra701
Collaborator

kumarnarendra701 commented Jan 19, 2024

@cpswan - Thanks for your input. I tried running "swarmprom" in the secondary Docker network, but it failed. Although I can ping the IP from the container, I cannot connect to port 9323.

Errors -

19/Jan/2024:13:33:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:33:47 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:02 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:17 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
19/Jan/2024:13:34:32 +0000 [ERROR 502 /metrics] dial tcp 172.18.0.1:9323: getsockopt: connection refused
staging0001-04 ~ # docker exec -it fe83602d5257 sh
/www # ping 172.18.0.1
PING 172.18.0.1 (172.18.0.1): 56 data bytes
64 bytes from 172.18.0.1: seq=0 ttl=64 time=0.239 ms
64 bytes from 172.18.0.1: seq=1 ttl=64 time=0.111 ms

64 bytes from 172.18.0.1: seq=2 ttl=64 time=0.126 ms
64 bytes from 172.18.0.1: seq=3 ttl=64 time=0.131 ms
64 bytes from 172.18.0.1: seq=4 ttl=64 time=0.111 ms
^C
--- 172.18.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.111/0.143/0.239 ms
/www # 
/www # 
/www # 
/www # telnet 172.18.0.1 9323
telnet: can't connect to remote host (172.18.0.1): Connection refused
/www # 
/www # 
/www # exit
staging0001-04 ~ # 
staging0001-04 ~ # 
staging0001-04 ~ # ping 172.18.0.1
PING 172.18.0.1 (172.18.0.1) 56(84) bytes of data.
64 bytes from 172.18.0.1: icmp_seq=1 ttl=64 time=0.192 ms
64 bytes from 172.18.0.1: icmp_seq=2 ttl=64 time=0.051 ms
64 bytes from 172.18.0.1: icmp_seq=3 ttl=64 time=0.071 ms
^C
--- 172.18.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.051/0.104/0.192/0.062 ms

The IP it's trying to connect to belongs to the Docker gateway bridge network:

docker_gwbridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
        inet6 fe80::42:82ff:fe2f:5115  prefixlen 64  scopeid 0x20<link>
        ether 02:42:82:2f:51:15  txqueuelen 0  (Ethernet)
        RX packets 1126064  bytes 308309940 (294.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1351476  bytes 134159319 (127.9 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

cc: @athandle

@athandle
Contributor

Reduced SP and moved to next sprint

@cpswan
Member

cpswan commented May 13, 2024

@kumarnarendra701 can you please try to get back into this and see if you can resolve the network issues.

@kumarnarendra701
Collaborator

@cpswan - I've tried using Swarmprom several times, but it looks like the repository was archived 4 years ago and there are very few blog posts about it. It seems like we might need to consider using other monitoring tools, but most tools are designed for Kubernetes with very few options for Docker swarm monitoring. If you know of any tools that can monitor a swarm cluster, please suggest them so that I can start implementing them.
cc: @gkc

@gkc
Contributor Author

gkc commented May 22, 2024

I've started running docker events to a log on each swarm, I'll take a look at the output tomorrow

@gkc gkc assigned gkc and unassigned cpswan, cconstab and kumarnarendra701 Jun 24, 2024
@gkc
Contributor Author

gkc commented Jun 24, 2024

I did look at the output and all interesting events are being logged. There weren't any "too much memory being used" restarts when last I looked after a couple of days; I will look again at the weekend

@gkc
Contributor Author

gkc commented Jul 8, 2024

docker events has indeed been reporting container die messages which include the exit code - i.e. docker events produces enough information to allow creation of a script which listens to and acts on the event stream, as described in the original description of this issue.
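
A minimal sketch of such a listener's per-event test, assuming the die event carries the exit code as a string attribute (which matches the docker events JSON output, though the schema should be verified for the engine version in use). Exit code 137 means the container was killed by SIGKILL, which is the usual OOM-kill signature but can have other causes:

```python
import json

def is_oom_die(event_json_line):
    """Return True if a `docker events` JSON line describes a container
    dying with exit code 137 (SIGKILL, the usual OOM-kill signature).

    Field names are assumptions based on the docker events JSON output.
    """
    ev = json.loads(event_json_line)
    if ev.get("Type") != "container" or ev.get("status") != "die":
        return False
    attrs = ev.get("Actor", {}).get("Attributes", {})
    return attrs.get("exitCode") == "137"
```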

@gkc
Contributor Author

gkc commented Jul 22, 2024

I will create a script during this sprint and do some testing via my atServer to verify it
