feat: concurrent connections limiter #751
Conversation
Can we deploy this to one of the dev-* environments?
@paulswartz I have confirmed that this is working and properly establishing a connection to memcache in dev-blue: https://mbta.splunkcloud.com/en-US/app/search/search?q=search%20index%3D%22api-dev-blue-application%22%20ApiWeb.Plugs.RateLimiterConcurrent&display.page.search.mode=verbose&dispatch.sample_ratio=1&workload_pool=standard_perf&earliest=-30m%40m&latest=now&sid=1707509690.6290337
@nlwstein hmm, something looks weird about that log. For the first two requests, "concurrent" + "limit" == "remaining", but for the last request, "concurrent" + "limit" != "remaining". That makes me think we either didn't count a request, or one expired from the lock without actually being disconnected.
@paulswartz I don't think that concurrent field is coming from my code. I'm not familiar with it and suspect it comes from another package. My log statement begins with the module name:
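(The actual log call isn't shown in this thread; as a purely hypothetical illustration of a module-prefixed entry using the standard Elixir `Logger`, it would look something like the following sketch. The module and field names are assumptions, not the PR's code.)

```elixir
defmodule RateLimiterLogSketch do
  # Hypothetical illustration only, not this PR's actual log statement:
  # prefixing the entry with __MODULE__ makes it easy to distinguish from
  # the request tracker's separate "concurrent" metadata when searching Splunk.
  require Logger

  def log(current, limit) do
    Logger.info("#{__MODULE__} event=check_concurrent current=#{current} limit=#{limit}")
  end
end
```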
That's right, the …
Putting in a temporary review so that it drops off the reminder list. Open issue: the existing concurrency logger does not always agree with the new concurrency limiter, and we're not sure why.
I tried to reproduce this locally and could not make it happen. The counters stayed in sync, even showing the same 'ghost' requests after putting my laptop to sleep for the night 😂

My guess as to what happened is that a bunch of requests were opened and closed simultaneously in a short period of time, and when that log fired there was a race between the request tracker updating its logging metadata and this cleanup filter, which runs right before that particular log message is emitted. The limiter also relies on a (configurable) heartbeat, so it's possible that a connection failed a heartbeat check, reducing the count of monitored processes for this feature before the request tracker's callback fired. I also redeployed to dev-blue.

Maybe once we deploy this we should write a Splunk query to determine how frequently this happens? Ideally, it's just the heartbeat mechanism doing its thing 🙏
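For illustration, a TTL-plus-heartbeat lock along these lines would behave the way described above. This is a hypothetical sketch, not this PR's implementation; the module name, the state shape, and the Memcachex `Memcache.set/4` call with a `:ttl` option are assumptions. If a connection misses a heartbeat, its key expires and the distributed count drops before the request tracker's own callback runs, which would produce exactly this kind of brief disagreement.

```elixir
defmodule ConcurrentLimiterSketch.Heartbeat do
  # Hypothetical sketch: each active streaming connection periodically
  # refreshes a short-TTL lock key, so a connection that dies without
  # cleaning up stops being counted once the TTL lapses.
  use GenServer

  @interval :timer.seconds(30)
  @ttl 90

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    schedule()
    {:ok, Map.new(opts)}
  end

  @impl true
  def handle_info(:heartbeat, %{memcache: pid, lock_key: key} = state) do
    # Refresh this connection's lock so it survives the next heartbeat window,
    # but expires on its own if the owning process has died.
    Memcache.set(pid, key, "1", ttl: @ttl)
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :heartbeat, @interval)
end
```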
I see it happening a lot in Splunk?
empty review pending more investigation
Had a chat with Firas in Slack. Summary:
- merge this as is
- continue to investigate the discrepancy between the original counter and the distributed counter
- if we're unable to make progress on the accuracy of the distributed counter by 2024-03-31, we'll roll it back and discuss other approaches
Summary of changes
Asana Ticket: 🍎 Reject API requests when there are too many concurrent requests
This provides a new rate limiter that is mainly useful for clamping down on too many active streaming connections. It is also able to track and control long-running "static" requests (for example, trying to load all of the predictions at once). It is configurable on a per-user basis in the admin panel (`-1` to disable); the larger of the user config and the base config is used. The static concurrency locks are released in real time, whereas the streaming ones depend on the hibernate loop cycle, so there can be a latency of under a minute when releasing those. The limits I have added to the config file are guidance for testing, not necessarily for production.
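Roughly, the check works like the sketch below. This is a simplified illustration, not the exact code in this PR: the module name, the conn assigns, the treatment of `-1`, and the Memcachex calls are assumptions.

```elixir
# Simplified sketch of a concurrent-connections check backed by memcached.
defmodule ConcurrentLimiterSketch.Plug do
  import Plug.Conn

  @base_limit 5

  def init(opts), do: opts

  def call(conn, _opts) do
    user_limit = conn.assigns[:user_concurrent_limit] || @base_limit

    if user_limit == -1 do
      # One reading of "-1 to disable": skip the check entirely for this user.
      conn
    else
      # Otherwise the larger of the per-user and base limits applies.
      check_concurrent(conn, max(user_limit, @base_limit))
    end
  end

  defp check_concurrent(conn, limit) do
    key = "concurrent:" <> (conn.assigns[:api_user_id] || "anonymous")

    # :memcache is assumed to be a named Memcachex connection,
    # e.g. Memcache.start_link([], name: :memcache).
    {:ok, current} = Memcache.incr(:memcache, key, by: 1, default: 1)

    if current > limit do
      # Give the slot back and reject the request.
      Memcache.decr(:memcache, key, by: 1)

      conn
      |> send_resp(429, "Too many concurrent requests")
      |> halt()
    else
      # Static requests release their slot when the response goes out;
      # streaming connections are instead released by the hibernate loop.
      register_before_send(conn, fn conn ->
        Memcache.decr(:memcache, key, by: 1)
        conn
      end)
    end
  end
end
```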
Opinion: I am not sure if `memcached` is the ideal tool for this, but I used it because I don't want to introduce new infrastructure if possible, and it was already in use for a similar job. Changing out the storage mechanism in the future should be relatively easy. The whole rate-limiting system seems like a good use case for Redis (or perhaps there's something new that's even better), primarily because: