Update queries for workflow counts #71

Merged (23 commits) on Oct 11, 2024

yuenmichelle1 (Collaborator) commented Oct 8, 2024

When querying from a project's stats, the FE calls our /classifications?workflow_id endpoint all at once, which is hammering the API and the Timescale db. 🔨

This, together with our not using compression or data retention policies and our use of Real Time Aggregates on the continuous aggregate we query workflow data from, makes these queries slow. (This was evident when we turned off Real Time Aggregation: we saw faster query times and lower CPU utilization.)

Because we are limited in resources and we still want Real Time Aggregation to be possible, we do the following for querying workflow classification counts ONLY:

  • Create a new HourlyClassificationCountByWorkflow real-time aggregate
  • Set a data retention policy for this new hourly workflow classification count aggregate (this should limit the amount of data the query planner has to sift through when dealing with real-time aggregates)
  • Turn off real-time aggregation for the DailyClassificationCount
  • When querying workflow classification counts that include the current date's counts, query the current date's counts via the HourlyClassificationCountByWorkflow continuous aggregate, and query the materialized DailyClassificationCountsByWorkflow for everything before that date
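A minimal sketch of how the date range in the last step might be split between the two aggregates. The method name `split_count_ranges` and the returned hash keys are hypothetical, chosen for illustration only, and are not taken from this PR's code. It assumes dates are in UTC, as the app and db are.

```ruby
require 'date'

# Hypothetical helper (not from the PR): splits a requested date range into
# the part answered by the materialized daily aggregate and the part answered
# by the hourly real-time aggregate.
def split_count_ranges(start_date, end_date, today: Time.now.utc.to_date)
  # Past counts come from the daily aggregate, capped at yesterday so that
  # the current date's rows are never read from both sources.
  past_end = [end_date, today - 1].min
  daily = start_date <= past_end ? (start_date..past_end) : nil

  # The current date's counts come from the hourly aggregate, and only when
  # the requested range actually covers the current date.
  hourly = (start_date <= today && end_date >= today) ? (today..today) : nil

  { daily: daily, hourly: hourly }
end
```

Capping the daily range at the day before the current date is what keeps the two sources from overlapping, regardless of how far the daily aggregate has been refreshed.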

Some details:

  • To avoid double counting when a workflow counts query also hits the real-time aggregate, we scope the query against our materialized continuous aggregate to everything up to the day before the current date. Double counting can happen when the materialized view gets refreshed to include the current date's counts, either via a manual refresh or via the cagg watermark looking too far ahead (this can happen, but it is very rare and not a fault on our end).
  • timezones are not too much of an issue here, since our Rails app is in UTC and our db time is in UTC
  • I mentioned the complexity in the BE/Ops call (on 9/26/24), but this detail seemed to confuse Cliff and Zach: when the current date is the start of a new bucket period (period options being day, week, month, or year), we need to append the current date's counts as a new entry with the proper start date of the new period. Otherwise, we simply add the current date's counts to the newest bucket, if there is a newest bucket entry. More detail in the next bullet.
  • In include_today_to_scoped in count_classifications, we take care of the following cases:
    - for a given workflow, there are no past or current classifications => returns the empty past classifications counts
    - for a given workflow, there are past classifications but no current classifications => returns past classifications counts
    - for a given workflow, there are no past classifications but there are current classifications (i.e. a new workflow) => returns todays_classifications with the proper period bucket start date
    - for a given workflow, there are both past classifications and current classifications:
      A) if the current date is part of the most recent past classifications bucket, then add today's counts to that bucket
      B) if not, then append today's counts as a new entry with the proper period bucket start date
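The cases above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `period_start` and `merge_today_into_buckets` are hypothetical names, past counts are modelled as an ascending array of `{ period:, count: }` hashes, and weekly buckets are assumed to start on Monday.

```ruby
require 'date'

# Hypothetical helper: start date of the bucket that `date` falls into.
# Assumes weekly buckets start on Monday.
def period_start(date, period)
  case period
  when 'day'   then date
  when 'week'  then date - ((date.wday + 6) % 7)
  when 'month' then Date.new(date.year, date.month, 1)
  when 'year'  then Date.new(date.year, 1, 1)
  else raise ArgumentError, "unknown period: #{period}"
  end
end

# Hypothetical sketch of merging today's count into past bucketed counts.
# past_counts: array of { period: Date, count: Integer }, sorted ascending.
def merge_today_into_buckets(past_counts, todays_count, period, today)
  # No current classifications: return past counts as-is (possibly empty).
  return past_counts if todays_count.zero?

  bucket_start = period_start(today, period)
  latest = past_counts.last

  if latest && latest[:period] == bucket_start
    # Case A: today belongs to the newest existing bucket, so add to it.
    past_counts[0...-1] + [latest.merge(count: latest[:count] + todays_count)]
  else
    # Case B, or a new workflow with no past counts: append a new bucket
    # that carries the proper period start date.
    past_counts + [{ period: bucket_start, count: todays_count }]
  end
end
```

Returning a new array instead of mutating `past_counts` keeps the sketch free of side effects; whether the real query object does the same is not stated in this PR description.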

app/queries/count_classifications.rb (4 resolved review threads)
db/schema.rb (1 resolved review thread)
@yuenmichelle1 marked this pull request as ready for review October 8, 2024 04:47
@yuenmichelle1 merged commit cef6509 into main Oct 11, 2024
4 checks passed
@yuenmichelle1 deleted the update-queries-for-workflow-counts branch October 18, 2024 18:05