Presence change storm putting increased load on server #16843
Comments
Following up a bit: overnight the problem subsided for a few hours, possibly because clients were disconnected entirely (machines shut down), but it started again early this morning, with 2 clients showing the same symptoms. Over the course of the morning this has grown to 6 clients, at around 35 requests a second, in what looks like a GET-then-OPTIONS cycle. |
Disabling presence on the homeserver seems to have removed the problem (not that surprising). I am wondering how it began, though: nothing has changed on the homeserver, which has been running for years, and all of a sudden a multitude of different clients started flooding presence state changes. |
Hello, I've had exactly the same problem. After investigation, I found the same thing: clients spam the server with When I noticed the problem, the clients had Element 1.55 and the web clients 1.53. Synapse was at 1.93. After updating the clients to the latest version and Synapse to 1.99, I still had the same problem. There were no configuration changes before or during the problem, apart from enabling/disabling presence. I have around thirty active users and it's becoming a problem very, very quickly. Installation Method: Debian package |
I have the same problem. Extreme CPU usage in the presence and generic-sync workers. It gradually started appearing on the 22.01.2024 at around 13:00 CET with version 1.98.0 (with no changes to the server since the restart into 1.98.0 on the 22.12.2023). During evening/night time the graphs looked quiet and as usual. The CPU usage went up again suddenly on the next day (23.) during office hours (and continued after an upgrade to 1.99.0). I also assume that it could be caused by certain clients or client versions. |
The symptoms look like matrix-org/synapse#16057, which did not reappear for months until now. |
We have this problem as well. Synapse Version: 1.99.0 |
Here as well, same config: Synapse 1.99.0 (matrixdotorg/synapse). The rogue client is on OS X, v1.11.56-rc.0 |
@gc-holzapfel How did you identify the client? I am just interested, since I looked at client versions and usernames in log snippets of good and bad minutes, but could not find a pattern. It always seemed as if all of them were involved in the ~10x request count during bad minutes. |
I had a look at the Synapse/Docker logs and there is (so far) only one misbehaving client, although we have other users with the same OS and client that do not cause problems. |
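For anyone trying to narrow down a misbehaving client from logs in the same way, counting sync requests per client per minute is one option. This is only a minimal sketch: the log line format and the function names `sync_requests_per_minute` and `noisy_clients` are assumptions made for illustration, not Synapse's actual log format or API, so the regex would need adapting to your reverse proxy or Synapse logs.

```python
import re
from collections import Counter

# Hypothetical, simplified access-log format (NOT Synapse's real format):
#   2024-01-23T10:31:07 203.0.113.5 GET /_matrix/client/v3/sync?set_presence=online
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\s+"
    r"(?P<ip>\S+)\s+(?P<method>[A-Z]+)\s+(?P<path>\S+)"
)


def sync_requests_per_minute(lines):
    """Count /sync requests per (client IP, minute) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if "/_matrix/client/v3/sync" not in match.group("path"):
            continue
        counts[(match.group("ip"), match.group("minute"))] += 1
    return counts


def noisy_clients(lines, threshold):
    """Return the sorted IPs that exceeded `threshold` sync requests in any minute."""
    counts = sync_requests_per_minute(lines)
    return sorted({ip for (ip, _minute), n in counts.items() if n > threshold})
```

Grouping by IP has the same weakness as IP-based rate limiting: several users behind one NAT look like a single client, so grouping by user ID or access token (if your logs contain them) is likely more useful.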
I'm hitting the same issue with multiple clients on : I confirm that disabling presence via:
Solves the issue. |
I had the same issue on the same date: January 23, 2024, around 10:30 a.m. Synapse Version: 1.100 |
I have three separate environments (test with a Synapse monolith, preprod with Synapse workers, prod with Synapse workers) and I still see the problem. It doesn't matter whether I update Synapse or Element: as soon as I re-enable presence, we DDoS ourselves. Would it be possible, even if it's a synapse bug, to limit the number of API Presence is a feature we'd really like to keep active! |
Still HARD flooding the browser and the server |
In Synapse 1.98 this configuration doesn't seem to work anymore, so the flooding issue is still present and cannot be bypassed... |
Same problem |
On 1.101.0 (~ 150 MAU, no workers) and disabling presence is working for me. However it is a disaster for me having to disable a very basic and important feature of synapse with no fix in sight while new releases (not fixing the problem) keep flying in. I'm a small fish, but I'm wondering what the guys with the big servers are doing to help with the issue. |
@clokep Do you have insight on this considering your work on matrix-org/synapse#16544 ? |
No. Maybe there's a misconfiguration? Multiple workers fighting or something. I haven't read in depth what's going on though -- I'm no longer working at Element. |
Actually this is really simple to reproduce, just did it in 2 min, you just have to :
Whatever the issue is, it shows a lack of programming security at multiple layers :
We will try to dirty patch it for now and put the patch here if this is relevant. |
Interesting. Does this make it a client issue more than a server one, I wonder? I'm not really sure who triggers the syncing, whether it's the client misbehaving or whether the server should actively try to stop this behaviour. |
For me it's both. |
That would make the most sense, yes, and in fact, it would make the most sense that the server isn't DDoSable through its official APIs. |
I think the client properly sends its own status, then just polls and waits for an answer as classic polling would do (even if, as you also mentioned, a polling flood could actually be identified client-side and prevented or delayed). But yeah, here the server has the main role: processing the conflicting presence requests and properly controlling the polling system. Aside from polling flood control, a simple fix in the presence feature may be:
So it would stay in the same defined order BUSY > ONLINE > ... I think this is approximately the hotfix we are trying to apply rn, we'll keep you updated about that cc. @raphaelbadawi |
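To illustrate the idea of merging per-device states in a fixed priority order, here is a minimal sketch. It is not the actual patch discussed in this thread and not Synapse code: the names `PRIORITY`, `merge_presence` and `apply_update` are hypothetical, and only "BUSY > ONLINE" comes from the comment above; the position of the remaining states is an assumption.

```python
# Assumed priority order, highest first. The thread only states
# "BUSY > ONLINE > ..."; the rest of the order is a guess for illustration.
PRIORITY = ["busy", "online", "unavailable", "offline"]


def merge_presence(device_states):
    """Return the highest-priority state among all devices of one user."""
    if not device_states:
        return "offline"
    # PRIORITY.index raises ValueError for an unknown state, which is intended.
    return min(device_states.values(), key=PRIORITY.index)


def apply_update(device_states, device_id, new_state, last_broadcast):
    """Record one device's state and report whether the merged state changed.

    Returns (merged_state, changed). A caller would only notify other
    users when `changed` is True, so conflicting devices cannot make the
    user-level presence flip back and forth.
    """
    device_states[device_id] = new_state
    merged = merge_presence(device_states)
    return merged, merged != last_broadcast
```

With this approach an idle device reporting `unavailable` does not override another device that is `online`, so no new presence update is produced for it.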
Hello, I've made a little patch (tested on 1.98 and the current master branch). For users like us who may have force-enabled multi-tab sessions and find themselves with inconsistent states within the same device, this solves the "blinking" between online and idle states which flooded the server and sometimes thread-blocked the client.
|
What does this mean? |
We patched matrix-react-sdk so we can have multi-tab sessions. This is why we had this flood: if one tab was awake and another was inactive, it kept syncing online->inactive->online->inactive under the same device ID. The previous fix avoided the flood among different devices, but not in this particular case. |
What does "multi-tab sessions" mean? Does it mean you're logging into multiple different accounts? This sounds like it might be a client issue then if the data isn't properly segmented between the accounts or something like that. |
Confirmed: it (finally 🥳) passed all QA tests, with different presence states properly merged and no longer flooding the polling! @clokep I think @raphaelbadawi is saying that he has multiple tabs (on the same account) sending different presence states, as if there were multiple clients, and then a third one in another browser that got flooded; this is the test case I use. Thank you, it finally passed the tests for production! The next step would be to confirm how this could have been introduced in the recently changed state merger mechanism. |
Yes this was my use case, same user id, same token, but on multiple tabs. |
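The multi-tab case (several tabs reporting different states under one device ID) can be sketched as a per-device damper that accepts upgrades immediately but delays downgrades. This is a hypothetical illustration, not the patch from this thread: `DeviceDamper`, `RANK` and the 30-second hold period are all assumptions.

```python
import time

# Assumed ranking of states for this sketch; higher means "more present".
RANK = {"offline": 0, "unavailable": 1, "online": 2}


class DeviceDamper:
    """Suppress online/idle flapping reported under a single device ID."""

    def __init__(self, hold_seconds=30.0, clock=time.monotonic):
        self.hold = hold_seconds
        self.clock = clock
        self.state = "offline"
        # Time at which the current state was last set or re-confirmed.
        self.last_confirmed = clock()

    def report(self, new_state):
        """Handle one report; return True if the effective state changed."""
        now = self.clock()
        if RANK[new_state] >= RANK[self.state]:
            # Same or higher state: accept at once and refresh the timer.
            changed = new_state != self.state
            self.state = new_state
            self.last_confirmed = now
            return changed
        # Lower state: accept only if no tab has re-confirmed the current
        # state within the hold period.
        if now - self.last_confirmed >= self.hold:
            self.state = new_state
            self.last_confirmed = now
            return True
        return False
```

With two tabs alternating between `online` and `unavailable` every second, only the first `online` report produces a change; the device drops to `unavailable` once no tab has reported `online` for the hold period.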
@agrimpard Could you try it and give us feedback? It just went live in production today and seems to work! cc. @Ronchonchon45 @OneBlue @gc-holzapfel @plui29989 @FlyveHest |
Fix : element-hq#16843 Patch from : element-hq#16843 (comment)
I've just tested the patch. I started by stopping Matrix, I applied the patch on I then stopped Matrix again, removed the patch on I should have started by reactivating the presences without the patch to see a real change. I wonder if the patch hasn't done some sort of reset on the presences and so, even without the patch, I'm out of the presence bug. For those who want to test, you have the source |
I applied the patch on Monday (25th). Everything was fine until this morning at 10 AM, when the problem came back. |
Hello @Ronchonchon45. How do you reproduce the issue on your side? For me it happened when logged in as the same user in several tabs at the same time. |
I don't know how the problem came back. |
What do you have in the actual logs? Is it a flood of user presence state updates (if a user's state blinks rapidly between two states, it may be related to multi-tab use) or is it something else? |
Still an issue on v1.118.0 |
Fixed it temporarily by closing all browser tabs |
A workaround could be to rate-limit it on the reverse proxy. I did this similarly for #16987. For example, using nginx this could look something like the following (this probably also limits sync requests):
http {
    map $query_string $map_query_param_matrix_client_sync_set_presence {
        default "";
        ~(^|&)set_presence= $binary_remote_addr;
    }
    map $request_method $map_request_method_matrix_client_sync_set_presence {
        default "";
        GET $map_query_param_matrix_client_sync_set_presence;
    }
    limit_req_zone $map_request_method_matrix_client_sync_set_presence zone=matrix_client_sync_set_presence:10m rate=2r/s;
    server {
        location ~ ^/_matrix/client/v3/sync {
            limit_req zone=matrix_client_sync_set_presence burst=4 nodelay;
            proxy_pass http://$workers;
            proxy_read_timeout 600s;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $remote_addr;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_http_version 1.1;
            client_max_body_size 125M;
            access_log /var/log/nginx/matrix._matrix.log vhost_combined_tls_worker_withreq;
        }
    }
} |
This is still a problem with v1.120.2. It is not always a problem, but it has been frequent recently: usually on Mondays, and this week every day, though not all day long. So I still think this is client related, but since we have basically no control over the clients it must be fixed on the server side. How can we debug/fix this on the server side? Today I tracked down all Element Web/Desktop users/computers with versions older than |
For me things have luckily calmed down since July and stayed pretty stable since then. I've considered rate-limiting by IP on the reverse proxy, but this immediately becomes an issue as soon as multiple users are connected via NAT (VPN or WiFi). So the rate-limiting, if implemented correctly, has to happen server side and on a per-authenticated-user basis. |
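A per-user limit of the kind described above could be sketched as a token bucket keyed by user ID rather than by IP. This is a minimal illustration and not how Synapse or any PR in this thread implements it: `PerUserRateLimiter` and its parameters are hypothetical, and the defaults of 2 requests per second with a burst of 4 simply mirror the nginx example.

```python
import time


class PerUserRateLimiter:
    """Token bucket per authenticated user (hypothetical sketch)."""

    def __init__(self, rate_per_second=2.0, burst=4.0, clock=time.monotonic):
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # user_id -> (tokens, time of last update)

    def allow(self, user_id):
        """Return True if this user's presence update may be processed."""
        now = self.clock()
        tokens, last = self._buckets.get(user_id, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[user_id] = (tokens - 1.0, now)
            return True
        self._buckets[user_id] = (tokens, now)
        return False
```

Because the bucket is keyed by user ID, several users behind the same NAT or VPN do not consume each other's allowance, which is the problem with limiting by `$binary_remote_addr` on the proxy.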
I agree. I implemented an |
I created PR #18000, which I tested today on our production server with success. A detailed report is in #18000 (comment). It would be great if somebody else could test it. |
Description
Six hours ago, my private Synapse server suddenly started putting 5 times the usual load on my machine.
I have narrowed it down to what seems to be a handful of clients, based on IP, that send GET and OPTIONS requests for around 150 presence changes every second.
Users are running either the latest version of Element Web, or official clients on phones or desktop.
Searching the issues, this might be somewhat related: #16705
Steps to reproduce
I don't know. The flood started 6 hours ago with no change in either the server or the clients (from what I can tell) and has been running since.
Restarting the server did nothing, the presence changes continued as soon as I restarted it.
Also tried waiting a minute before starting, same issue.
Homeserver
matrix.gladblad.dk
Synapse Version
1.99.0
Installation Method
Docker (matrixdotorg/synapse)
Database
SQLite
Workers
Single process
Platform
Ubuntu 20, in docker
Configuration
No response
Relevant log output
Anything else that would be useful to know?
No response