feat: concurrent connections limiter #751
Conversation
Can we deploy this to one of the dev-* environments?
@paulswartz I have confirmed that this is working and properly establishing a connection to memcache in dev-blue: https://mbta.splunkcloud.com/en-US/app/search/search?q=search%20index%3D%22api-dev-blue-application%22%20ApiWeb.Plugs.RateLimiterConcurrent&display.page.search.mode=verbose&dispatch.sample_ratio=1&workload_pool=standard_perf&earliest=-30m%40m&latest=now&sid=1707509690.6290337
@nlwstein hmm, something looks weird about that log. For the first two requests, "concurrent" + "limit" == "remaining", but for the last request, "concurrent" + "limit" != "remaining". That makes me think we either didn't count a request, or one expired from the lock without actually being disconnected.
@paulswartz I don't think that concurrent field is coming from my code. I'm not familiar with it and suspect it comes from another package. My log statement begins with the module name:
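(The actual log call isn't shown in this thread; as a purely hypothetical illustration of a module-prefixed entry using the standard Elixir `Logger`, it would look something like the following sketch. The module and field names are assumptions, not the PR's code.)

```elixir
defmodule RateLimiterLogSketch do
  # Hypothetical illustration only, not this PR's actual log statement:
  # prefixing the entry with __MODULE__ makes it easy to distinguish from
  # the request tracker's separate "concurrent" metadata when searching Splunk.
  require Logger

  def log(current, limit) do
    Logger.info("#{__MODULE__} event=check_concurrent current=#{current} limit=#{limit}")
  end
end
```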
That's right, the …
Putting in a temporary review so that it drops off the reminder list. Open issue: the existing concurrency logger does not always agree with the new concurrency limiter, and we're not sure why.
I tried to reproduce this locally and could not make it happen. The counters stayed in sync, even showing the same 'ghost' requests after putting my laptop to sleep for the night 😂

My guess as to what happened is that a bunch of requests were opened and closed simultaneously in a short period of time, and when that log fired there was a race between the request tracker updating its logging metadata and this cleanup filter, which runs right before that particular log message is emitted. The limiter also relies on a (configurable) heartbeat, so it's possible that a connection failed a heartbeat check, reducing the count of monitored processes for this feature before the request tracker's callback fired. I also redeployed to dev-blue.

Maybe once we deploy this we should write a Splunk query to determine how frequently this happens? Ideally, it's just the heartbeat mechanism doing its thing 🙏
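For illustration, a TTL-plus-heartbeat lock along these lines would behave the way described above. This is a hypothetical sketch, not this PR's implementation; the module name, the state shape, and the Memcachex `Memcache.set/4` call with a `:ttl` option are assumptions. If a connection misses a heartbeat, its key expires and the distributed count drops before the request tracker's own callback runs, which would produce exactly this kind of brief disagreement.

```elixir
defmodule ConcurrentLimiterSketch.Heartbeat do
  # Hypothetical sketch: each active streaming connection periodically
  # refreshes a short-TTL lock key, so a connection that dies without
  # cleaning up stops being counted once the TTL lapses.
  use GenServer

  @interval :timer.seconds(30)
  @ttl 90

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    schedule()
    {:ok, Map.new(opts)}
  end

  @impl true
  def handle_info(:heartbeat, %{memcache: pid, lock_key: key} = state) do
    # Refresh this connection's lock so it survives the next heartbeat window,
    # but expires on its own if the owning process has died.
    Memcache.set(pid, key, "1", ttl: @ttl)
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :heartbeat, @interval)
end
```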
I see it happening a lot in Splunk?
empty review pending more investigation
Had a chat with Firas in Slack. Summary:
- merge this as is
- continue to investigate the discrepancy between the original counter and the distributed counter
- if we're unable to make progress on the accuracy of the distributed counter by 2024-03-31, we'll roll it back and discuss other approaches
Summary of changes
Asana Ticket: 🍎 Reject API requests when there are too many concurrent requests
This provides a new rate limiter that is mainly useful for clamping down on too many active streaming connections. It is also able to track and control long-running "static" requests (for example, trying to load all of the predictions at once). It is configurable on a per-user basis in the admin panel (`-1` to disable); the larger of the user config and the base config is used. The static concurrency locks are released in real time, whereas the streaming ones depend on the hibernate loop cycle, so there can be a latency of under a minute when releasing those. The limits I have added to the config file are guidance for testing, not necessarily for production.
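Roughly, the check works like the sketch below. This is a simplified illustration, not the exact code in this PR: the module name, the conn assigns, the treatment of `-1`, and the Memcachex calls are assumptions.

```elixir
# Simplified sketch of a concurrent-connections check backed by memcached.
defmodule ConcurrentLimiterSketch.Plug do
  import Plug.Conn

  @base_limit 5

  def init(opts), do: opts

  def call(conn, _opts) do
    user_limit = conn.assigns[:user_concurrent_limit] || @base_limit

    if user_limit == -1 do
      # One reading of "-1 to disable": skip the check entirely for this user.
      conn
    else
      # Otherwise the larger of the per-user and base limits applies.
      check_concurrent(conn, max(user_limit, @base_limit))
    end
  end

  defp check_concurrent(conn, limit) do
    key = "concurrent:" <> (conn.assigns[:api_user_id] || "anonymous")

    # :memcache is assumed to be a named Memcachex connection,
    # e.g. Memcache.start_link([], name: :memcache).
    {:ok, current} = Memcache.incr(:memcache, key, by: 1, default: 1)

    if current > limit do
      # Give the slot back and reject the request.
      Memcache.decr(:memcache, key, by: 1)

      conn
      |> send_resp(429, "Too many concurrent requests")
      |> halt()
    else
      # Static requests release their slot when the response goes out;
      # streaming connections are instead released by the hibernate loop.
      register_before_send(conn, fn conn ->
        Memcache.decr(:memcache, key, by: 1)
        conn
      end)
    end
  end
end
```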
Opinion: I am not sure if `memcached` is the ideal tool for this, but I used it because I don't want to introduce new infrastructure if possible, and it was already in use for a similar job. Changing out the storage mechanism in the future should be relatively easy. The whole rate-limiting system seems like a good use case for Redis (or perhaps there's something new that's even better), primarily because: