ref(hc): Use session tied (not transactional) serialization for outboxes #57877
Conversation
Codecov Report
@@ Coverage Diff @@
## master #57877 +/- ##
==========================================
+ Coverage 79.05% 79.15% +0.10%
==========================================
Files 5131 5131
Lines 223055 224983 +1928
Branches 37574 38126 +552
==========================================
+ Hits 176330 178085 +1755
- Misses 41083 41221 +138
- Partials 5642 5677 +35
                cursor.execute("SELECT pg_advisory_unlock(%s)", [shard_lock_id])
        except Exception:
            # If something strange is going on with our connection, force it closed to prevent holding the lock.
            connections[using].close()
This is safe due to the behavior of django "connection wrappers": when you close a connection explicitly, the next usage will reopen the connection. https://github.com/django/django/blob/fc62e17778dad9eab9e507d90d85a33d415f64a7/django/db/backends/base/base.py#L271
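That reconnect-on-next-use behavior can be illustrated with a minimal stand-in for Django's wrapper (a hypothetical `LazyConnection` class for illustration only, not Sentry or Django code):

```python
import sqlite3

class LazyConnection:
    """Minimal stand-in for Django's BaseDatabaseWrapper behavior:
    closing explicitly is safe because the next cursor() call reopens."""

    def __init__(self, connect):
        self._connect = connect   # factory producing real DB connections
        self.connection = None    # underlying handle; None when closed

    def close(self):
        if self.connection is not None:
            self.connection.close()
            self.connection = None

    def cursor(self):
        # Mirrors the ensure-connection step: transparently reopen after a close.
        if self.connection is None:
            self.connection = self._connect()
        return self.connection.cursor()

conn = LazyConnection(lambda: sqlite3.connect(":memory:"))
conn.cursor().execute("SELECT 1")
conn.close()  # e.g. forced closed after a failed unlock
row = conn.cursor().execute("SELECT 1").fetchone()  # reopened on next use
```

Closing also drops any session-scoped advisory locks on the Postgres side, which is exactly why forcing the connection closed is the right failure response here.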
@@ -524,28 +521,45 @@ def save(self, **kwds: Any) -> None:  # type: ignore[override]
        metrics.incr("outbox.saved", 1, tags=tags)
        super().save(**kwds)

    def lock_id(self, attrs: Iterable[str]) -> int:
        # 64 bit integer that roughly encodes a unique, serializable lock identifier
        return mmh3.hash64(".".join(str(getattr(self, attr)) for attr in attrs))[0]
Postgres advisory locks are keyed by 64-bit integers... unfortunately they aren't scoped in any other way. We hash the sharding values down to 64 bits to produce one. That means collisions are possible, but that's not awful: a collision only means that two distinct shards end up blocking and competing against each other, and it is highly unlikely in practice. It's the price to pay for using session-backed rather than transaction-backed locks.
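The shape of that hashing can be sketched with the stdlib (a hedged illustration: the actual code uses `mmh3.hash64`, and `blake2b` here is only a stand-in producing the same signed 64-bit shape):

```python
import hashlib

def lock_id(*shard_values) -> int:
    """Collapse the shard-identifying values into one signed 64-bit key,
    the only keyspace Postgres advisory locks offer. The real code uses
    mmh3.hash64; blake2b is a stdlib stand-in for illustration."""
    joined = ".".join(str(v) for v in shard_values)
    digest = hashlib.blake2b(joined.encode(), digest_size=8).digest()
    unsigned = int.from_bytes(digest, "big")
    # Fold into the signed 64-bit range Postgres expects.
    return unsigned - 2**64 if unsigned >= 2**63 else unsigned
```

Any well-mixed 64-bit hash works here, since the only cost of a collision is two shards contending on the same lock.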
        try:
            with connections[using].cursor() as cursor:
                if flush_all:
In the flush_all case, we do not want to block -- we try to obtain the shard lock, but will happily skip the shard if it is already being processed elsewhere.
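The block-versus-skip decision maps onto the two Postgres calls; a `threading.Lock` analogy (an illustration of the control flow, not the actual Postgres path) makes the difference concrete:

```python
import threading

# Stands in for Postgres' shared advisory lock space.
shard_locks: dict[int, threading.Lock] = {}

def acquire_shard(shard_lock_id: int, flush_all: bool) -> bool:
    """flush_all=True mirrors pg_try_advisory_lock: skip a busy shard.
    flush_all=False mirrors pg_advisory_lock: wait for the shard."""
    lock = shard_locks.setdefault(shard_lock_id, threading.Lock())
    if flush_all:
        return lock.acquire(blocking=False)  # happily skip if held elsewhere
    lock.acquire()  # block until the shard frees up
    return True
```

In the real code the return value drives `obtained_lock`, so a skipped shard simply falls through without producing anything.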
                if obtained_lock:
                    next_shard_row: OutboxBase | None
                    next_shard_row = self.selected_messages_in_shard(
The same query as above, but now behind the separate lock acquisition.
                    if not cursor.fetchone()[0]:
                        obtained_lock = False
                else:
                    cursor.execute("SELECT pg_advisory_lock(%s)", [shard_lock_id])
We won't wait here indefinitely: in production we'll hit statement/query timeouts. That shouldn't matter much, as we'll still end up in the finally block, which releases the lock and closes the connection if we failed to acquire the lock due to a statement timeout.
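That cleanup path can be sketched with a fake cursor simulating the statement timeout (all names here are hypothetical stand-ins, not Sentry code):

```python
class TimeoutCursor:
    """Stand-in cursor whose blocking lock acquisition hits a statement timeout."""
    def execute(self, sql, params=None):
        if "pg_advisory_lock(" in sql:
            raise RuntimeError("canceling statement due to statement timeout")

events = []

def flush_shard(cursor, shard_lock_id, close_connection):
    try:
        cursor.execute("SELECT pg_advisory_lock(%s)", [shard_lock_id])
        events.append("processed")
    finally:
        # Runs even when the lock wait timed out: try to release, and if even
        # that fails, close the connection so the session lock cannot leak.
        try:
            cursor.execute("SELECT pg_advisory_unlock(%s)", [shard_lock_id])
            events.append("unlocked")
        except Exception:
            close_connection()
```

The key property is that every exit path either unlocks explicitly or drops the connection, which releases session-scoped locks on the Postgres side.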
                if flush_all:
                    cursor.execute("SELECT pg_try_advisory_lock(%s)", [shard_lock_id])
                    if not cursor.fetchone()[0]:
                        obtained_lock = False
Do we want a metric so we have visibility into any spin waits we have on these locks?
Sure, can add some metrics around this work
            finally:
                try:
                    with connections[using].cursor() as cursor:
                        cursor.execute("SELECT pg_advisory_unlock(%s)", [shard_lock_id])
Would this fail when a lock cannot be acquired because of a query timeout?
Yes. It is treated as a query that does not terminate until the lock is acquired, meaning the standard query timeout applies.
            try:
                next_shard_row = (
                    self.selected_messages_in_shard(latest_shard_row=latest_shard_row)
                    .select_for_update(nowait=flush_all)
Previously this would have prevented multiple workers from attempting to operate on the same rows. If we end up with multiple workers processing outboxes I don't think session level locking will prevent multiple workers from handling the same outbox messages.
While outbox deliveries are scheduled every minute reducing the chances of overlapping workers, couldn't we have a backlog form (due to an outage elsewhere), and have competitive consumers race each other?
Kind of. To be honest, since all rows belong to the same shard here, any lock on any of these rows effectively locks the shard (because rows are only deleted in the same transaction as the locking).
Now, none of that is necessary, because the locking is explicit on the hash, and the deletion of rows need not be in a transaction since the lock is known to be held at that point anyway.
In practice, the serialization here is strong and there won't be any case of multiple workers acting on the same rows.
To be clear, "session based" does not mean the lock isn't shared across workers -- it only means that the lock is released automatically if the connection is dropped. That's it.
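A toy model of that distinction (an analogy only, not how Postgres is implemented): the lock registry is global and visible to every worker, while ownership is tied to a session's lifetime:

```python
class AdvisoryLockRegistry:
    """Toy model of session-scoped advisory locks: one shared registry seen
    by every worker, each lock owned by a session and auto-released when
    that session (connection) goes away."""

    def __init__(self):
        self.holders: dict[int, str] = {}  # lock key -> owning session

    def try_lock(self, session: str, key: int) -> bool:
        if key in self.holders:
            return self.holders[key] == session
        self.holders[key] = session
        return True

    def close_session(self, session: str) -> None:
        # Dropping the connection releases every lock the session held.
        for key in [k for k, s in self.holders.items() if s == session]:
            del self.holders[key]
```

So two workers on separate connections still serialize against each other; session scope only changes when the lock is released, not who can see it.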
Also, this is already tested in test_outbox, where I used a separate thread to force a separate connection and validate the serialization when both access the same shard.
> To be clear, "session based" does not mean it isn't shared across workers -- it only means that the lock is released automatically if the connection is dropped. that's it.
👍 I had confused "session based" with transaction-scoped locks, but they are different lock scopes.
Force-pushed from 25e1d69 to 22d4ada
Thanks for the walkthrough!
Good to go, once we have the metrics.
Suspect Issues: This pull request was deployed and Sentry observed the following issues:
PR reverted: ecea429
Pass 2 on #57877. Added better handling for an operational error on lock timeout (something lost in the original PR), and also fixed the outbox contention issues around auth identity.
We want to avoid using an actual transaction around the outbox signal processing so that nested signal invocations can complete transactions with securely stored (committed) results. The previous implementation used watermarks under the assumption that cross-silo interactions (which generate nested outboxes) would run in a separate process -- however, in self-hosted and monolith deployments this is not the case. This approach still allows serialization (locking of entries by shard) but does not create a transaction that might prevent commits by inner signal handlers or nested invocations. The benefit of using Postgres here instead of Redis is the session-bound behavior: if pods roll over during a deploy, for instance, the graceful connection release (performed by the OS when the process's open files are cleaned up) tells Postgres to release the lock nearly instantly.
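Under those constraints, the drain loop has roughly this shape (a hedged sketch of the pattern, not the Sentry implementation; `fetch_next_row` and `deliver` are illustrative stand-ins):

```python
def drain_shard(cursor, shard_lock_id, fetch_next_row, deliver):
    """Serialize on the session-scoped advisory lock, process rows with no
    enclosing transaction, and always release the lock on the way out."""
    cursor.execute("SELECT pg_advisory_lock(%s)", [shard_lock_id])
    try:
        while (row := fetch_next_row()) is not None:
            # No wrapping transaction: nested signal handlers can commit
            # their own transactions while we hold only the advisory lock.
            deliver(row)
    finally:
        cursor.execute("SELECT pg_advisory_unlock(%s)", [shard_lock_id])

class RecordingCursor:
    """Fake cursor that records the SQL issued, for demonstration."""
    def __init__(self):
        self.sql = []
    def execute(self, query, params=None):
        self.sql.append(query)

rows = iter([1, 2, 3])
delivered = []
cur = RecordingCursor()
drain_shard(cur, 5, lambda: next(rows, None), delivered.append)
```

If the process dies mid-loop, no explicit unlock ever runs, but the connection teardown releases the session lock anyway, which is the whole point of preferring session scope here.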
under the assumption that cross silo interactions (which generate nested outboxes) would be in a separate process -- however, self hosted and monolith this is not the case.This approach still allows serialization (locking of entries by shard) but does not create a transaction which might prevent commits by inner signal handlers or nested invocations. The benefit of using postgres here instead of redis is the inclusion of session-bound behavior -- if pods rollover during a deploy, for instance, the graceful connection release (which happens by the OS when the process clears the files opened) will inform postgres to nearly instantly release said lock.