-
-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(seer grouping): Store Seer metadata on grouphashes during ingest #77956
base: master
Are you sure you want to change the base?
feat(seer grouping): Store Seer metadata on grouphashes during ingest #77956
Conversation
This PR has a migration; here is the generated SQL for --
-- Add field seer_date_sent to grouphashmetadata
--
ALTER TABLE "sentry_grouphashmetadata" ADD COLUMN "seer_date_sent" timestamp with time zone NULL;
--
-- Add field seer_event_sent to grouphashmetadata
--
ALTER TABLE "sentry_grouphashmetadata" ADD COLUMN "seer_event_sent" varchar(32) NULL;
--
-- Add field seer_grouphash_sent to grouphashmetadata
--
ALTER TABLE "sentry_grouphashmetadata" ADD COLUMN "seer_grouphash_sent_id" bigint NULL;
--
-- Add field seer_match_distance to grouphashmetadata
--
ALTER TABLE "sentry_grouphashmetadata" ADD COLUMN "seer_match_distance" double precision NULL;
--
-- Add field seer_matched_grouphash to grouphashmetadata
--
ALTER TABLE "sentry_grouphashmetadata" ADD COLUMN "seer_matched_grouphash_id" bigint NULL;
--
-- Add field seer_model to grouphashmetadata
--
ALTER TABLE "sentry_grouphashmetadata" ADD COLUMN "seer_model" varchar NULL;
ALTER TABLE "sentry_grouphashmetadata" ADD CONSTRAINT "sentry_grouphashmeta_seer_grouphash_sent__831da653_fk_sentry_gr" FOREIGN KEY ("seer_grouphash_sent_id") REFERENCES "sentry_grouphash" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "sentry_grouphashmetadata" VALIDATE CONSTRAINT "sentry_grouphashmeta_seer_grouphash_sent__831da653_fk_sentry_gr";
CREATE INDEX CONCURRENTLY "sentry_grouphashmetadata_seer_grouphash_sent_id_831da653" ON "sentry_grouphashmetadata" ("seer_grouphash_sent_id");
ALTER TABLE "sentry_grouphashmetadata" ADD CONSTRAINT "sentry_grouphashmeta_seer_matched_groupha_c92b0107_fk_sentry_gr" FOREIGN KEY ("seer_matched_grouphash_id") REFERENCES "sentry_grouphash" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "sentry_grouphashmetadata" VALIDATE CONSTRAINT "sentry_grouphashmeta_seer_matched_groupha_c92b0107_fk_sentry_gr";
CREATE INDEX CONCURRENTLY "sentry_grouphashmetadata_seer_matched_grouphash_id_c92b0107" ON "sentry_grouphashmetadata" ("seer_matched_grouphash_id"); |
❌ 11 Tests Failed:
View the full list of 3 ❄️ flaky tests
To view individual test run time comparison to the main branch, go to the Test Analytics Dashboard |
4b444cd
to
32565d6
Compare
This comment was marked as off-topic.
This comment was marked as off-topic.
# that because of merging/unmerging, the sent GroupHash and this metadata's GroupHash (if not | ||
# one and the same) aren't guaranteed to forever point to the same group (though they will when | ||
# this field is written). | ||
seer_grouphash_sent = FlexibleForeignKey( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would prefer not to keep track of this on GroupHashMetadata
. -- we can simply infer this between the relations between the grouphash and group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see note above on merging/unmerging, i'll think further on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small note: for when moving out of draft, it would be preferred to isolate the model changes in a singular PR
This adds the following Seer information to grouphash metadata during ingest:
GroupHash
record)Given that a single event may generate multiple grouphashes, and only one hash value actually gets sent, this also adds a
seer_grouphash_sent
field, which holds a reference to theGroupHash
corresponding to the sent hash. This not only lets us know which value we should find in the Seer database, it also lets us avoid storing all of the above information redundantly - we can store it only on theGroupHash
which gets sent, and give the "sibling"GroupHash
es access to the information, should they need it, via that link.For example, if an event generates
GroupHash
recordsgrouphashA
andgrouphashB
, andgrouphashA
is sent:grouphashA
's metadatagrouphashA
andgrouphashB
will link togrouphashA
as the representative which was sentmetadata.seer_grouphash_sent
on eithergrouphashA
orgrouphashB
.grouphashA.seer_grouphash_sent.metadata.seer_date_sent
andgrouphashB.seer_grouphash_sent.metadata.seer_date_sent
.Because all of this linking makes for a kind of clunky API, at the point at which we actually go to use this information (in grouping info, as a check during future backfills if/when we update our Seer model, etc), we'll probably want to add appropriate properties and/or helper methods to
Group
orGroupHash
, depending on the application. Until we get to that point, though, it made sense to me to wait rather than guess (potentially incorrectly) what we'll need.Notes:
As a result of this change, we're no longer storing Seer results in event data. This is better - before, to see the Seer results you had to find the first event to generate a given hash (which is the first event in a group in cases where Seer doesn't find a match, but is some random event among a group's full event list in cases where Seer does match to an existing group). Now, you can find Seer resuls from any event in a group, via the group's
GroupHash
records. (It did mean I had to pull out some temporary logs I had added to group creation, but it's unclear if they're still necessary. If the issue which prompted them comes up again, I'll add them back in a different way.)As agreed offline, I'm updating the
GroupHashMetadata
records which are created elsewhere - rather than waiting to create them until the Seer results are available (or until we decide not to call Seer) - because it's easier to reason about and because the cardinality here is low, given that more than 99% of events match to an existing group and therefore never hit this code. If, after theGroupHashMetadata
MVP is done, we decide we need to optimize this, we can do so before GA.Not done here: Updating the backfill code to store Seer results in grouphash metadata rather than on the group. This should be done before the next time we run a backfill, but given that the current backfill is already half-completed, having them in one place for new groups and a different (single) place for backfilled groups - while not ideal - seemed better than having them in one place for new groups and sometimes the same place but sometimes a different place for backfilled groups.