
Since update to 0.1.2, services sporadically returning a 404 page not found in plain text #125

Open
paxw-panevo opened this issue Apr 10, 2024 · 8 comments

Comments

@paxw-panevo
[screenshot: services sporadically returning a plain-text "404 page not found"]

...which makes us think it's an issue with the traefik/proxy stack.

Force-redeploying the socket-proxy seems to make things stable for a while, so we're thinking it could be an issue with the latest socket-proxy.

We use:

  • traefik == v2.9.6
  • docker swarm, docker engine == 26.0.0

Let us know if there's any other info needed.

@pedrobaeza
Member

Maybe it's an incompatibility with IPv6 in your stack. Try disabling it with the environment variable DISABLE_IPV6=1.
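A minimal sketch of applying that on a Swarm service (the service name here is an assumption; use the one from your stack):

```sh
# Assumed service name "socket-proxy"; adjust to match your stack.
docker service update --env-add DISABLE_IPV6=1 socket-proxy
```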

@paxw-panevo
Author

Thanks for the prompt response, @pedrobaeza. We'll attempt this with our staging environment.

Just for our info, and to help me develop an intuition for when similar problems arise: why would an IPv6 incompatibility cause a sporadic issue? Sometimes the pages load successfully, sometimes the proxy fails to find the service and returns a "404 Not Found". Things also look stable for hours after redeploying the socket-proxy, until the intermittent issue happens again. I understand this question may be too vague and that the answer likely depends on other components in the infrastructure.

We appreciate the library and your team's work!

@pedrobaeza
Member

Sorry, I can't say. It's just an intuition, knowing that not everything is prepared for IPv6.

@yfhyou

yfhyou commented Apr 20, 2024

Does the error go away on its own, or do you have to redeploy? What does docker logs dockerproxy say at the end when this happens? Is the dockerproxy container still running when it happens?
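A quick way to check those things from the host (the container name here is an assumption):

```sh
# Assumed container name "dockerproxy"; adjust to match your deployment.
docker ps --filter name=dockerproxy                                 # still running?
docker inspect -f '{{.State.Status}} exit={{.State.ExitCode}}' dockerproxy
docker logs --tail 50 dockerproxy                                   # last log lines
```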

I have a similar issue and found that dockerproxy just seems to stop for an unknown reason. The end of my log looks like this (the "Process xxxxx exited" warnings at the beginning go on the entire time in the background, even when it is working):

.
.
.
[WARNING] 110/065746 (1) : Process 14242 exited with code 0 (Exit)
[WARNING] 110/065847 (1) : Process 14251 exited with code 0 (Exit)
[WARNING] 110/065947 (1) : Process 14259 exited with code 0 (Exit)
Stopping backend dockerbackend in 0 ms.
Stopping backend docker-events in 0 ms.
Stopping frontend dockerfrontend in 0 ms.
Proxy dockerbackend stopped (cumulated conns: FE: 0, BE: 106015).
Proxy docker-events stopped (cumulated conns: FE: 0, BE: 1).
Proxy dockerfrontend stopped (cumulated conns: FE: 180, BE: 0).
[WARNING] 110/070000 (1) : Exiting Master process...
[WARNING] 110/070000 (11) : Stopping backend dockerbackend in 0 ms.
[WARNING] 110/070000 (11) : Stopping backend docker-events in 0 ms.
[WARNING] 110/070000 (11) : Stopping frontend dockerfrontend in 0 ms.
[WARNING] 110/070000 (11) : Stopping frontend GLOBAL in 0 ms.
[WARNING] 110/070000 (11) : Proxy dockerbackend stopped (cumulated conns: FE: 0, BE: 106015).
[WARNING] 110/070000 (11) : Proxy docker-events stopped (cumulated conns: FE: 0, BE: 1).
[WARNING] 110/070000 (11) : Proxy dockerfrontend stopped (cumulated conns: FE: 180, BE: 0).
[WARNING] 110/070000 (11) : Proxy GLOBAL stopped (cumulated conns: FE: 0, BE: 0).
[WARNING] 110/070000 (1) : Current worker #1 (11) exited with code 0 (Exit)
[WARNING] 110/070000 (1) : All workers exited. Exiting... (0)

In the past I have even forked this repo and made a similar setup using the 'lts' or 2.8.x release of HAProxy, with the same results. From what I have found, ONLY HAProxy 1.9 avoids this unexplained exit after a random amount of time.

@paxw-panevo
Author

paxw-panevo commented Apr 22, 2024

We had similar-looking logs:

[screenshot: similar HAProxy shutdown messages in our proxy logs]

To make our services stable, we had to downgrade docker-socket-proxy to 0.1.1.
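For anyone needing the same workaround, a sketch of pinning the older tag on Swarm (the image reference and service name are assumptions; adjust to your stack):

```sh
# Assumed image reference and service name; adjust to your stack.
docker service update --image tecnativa/docker-socket-proxy:0.1.1 socket-proxy
```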

@pedrobaeza
Member

Can you maybe propose a PR to update to HAProxy 1.9?

@yfhyou

yfhyou commented Apr 22, 2024

This was just changed, so a new PR to downgrade might be confusing. Perhaps @stumpylog saw this in testing?

@polarathene

You should share more context about your environment. Knowing the version of Docker and what system you run it on (kernel version, if Linux) is helpful context for maintainers.
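A sketch of the commands that gather those details:

```sh
docker version --format '{{.Server.Version}}'                    # Docker Engine version
docker info --format '{{.OperatingSystem}} / kernel {{.KernelVersion}}'
uname -r                                                         # kernel version on the host
```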

Suspected cause (soft limit for FDs above 1024)

An issue I've observed in the past with software failing or regressing in containers has been high file descriptor limits. Shell into the container and run ulimit -Sn and ulimit -Hn; the value may be around a million (2^20), which should be OK, but if it's closer to a billion (2^30) that is often a problem.

Technically ulimit -Sn should report 1024, and any software that needs a higher soft limit would raise it at run time, up to the hard limit (ulimit -Hn). I would not be surprised if the soft limit were above 1024 and the software calls select(), which can fail when FDs exceed that limit (hence why the software itself should decide when it is safe to raise the soft limit). For reference, Nginx manages this limit selectively, while Caddy is Go based, where the limit is implicitly raised and dropped when appropriate.
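A sketch of checking both limits from the host (the container name is an assumption):

```sh
# Assumed container name "dockerproxy"; prints the soft and hard FD limits
# as seen from inside the container.
docker exec dockerproxy sh -c 'echo "soft: $(ulimit -Sn)  hard: $(ulimit -Hn)"'
```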


HAProxy has select() calls

UPDATE: Yes, HAProxy is guilty of this too; for a permalink I'll link this. Suppose the number of allocated FDs exceeds 1024 (these can be connections, stdin/stdout/stderr, etc.; a basic container can open 30+ FDs at startup, while real workloads can easily begin with hundreds). select() will, AFAIK, break, as it was not designed to handle this scenario (it's a legacy call, and software should use better calls when available).

If that is actually the cause, it would explain why you're getting sporadic failures. HAProxy 1.9 was released in Dec 2018, while the 2.2 currently used is from July 2020, with plans to upgrade to HAProxy 3.0 (May 2024) once a fix is backported. It's quite possible something related to that code changed since 1.9. Looking over the history of the file I permalinked since then, a few commits stand out, potentially this FD cache removal, which may have made the issue less likely to occur for you?

I invested quite a bit of time in this topic to get those defaults fixed in Docker. It should have improved as of Docker v25, but IIRC there is a related fix in an internal dependency (containerd) that is waiting on a 2.0 release. So if you are running Docker v25 or newer and still have the soft limit issue, it should be resolved in a release later this year or sometime next year. For now you'll need to be aware of this caveat, which can cause anything from failures like you're experiencing with HAProxy to excessive memory usage or performance regressions/stalling (software that daemonizes with naive approaches closes FDs during init; some take minutes, others over an hour).


Advice to verify

If you run the container, check the soft limit, and find it exceeds 1024, try running the container with the soft limit explicitly set to 1024 (as mentioned here). See if you still experience random failures; I have a hunch that would resolve it.
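A sketch of that test with docker run (the hard limit value is only an example):

```sh
# Sanity check that the flag takes effect: soft limit 1024, hard limit left
# high so software that knows how to raise it still can.
docker run --rm --ulimit nofile=1024:1048576 alpine sh -c 'ulimit -Sn; ulimit -Hn'

# Then start the socket-proxy container with the same --ulimit flag (or the
# equivalent "ulimits" key in a compose file) and watch whether the 404s return.
```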
