Update queries for workflow counts #71

yuenmichelle1 · 2024-10-08T04:46:07Z

The way the FE calls the /classifications?workflow_id endpoint when querying from a project's stats (i.e. for all workflows at once) is hammering the API and Timescale db. 🔨

This ^ (and our non-usage of compression or data retention policies), plus our usage of Real Time Aggregates on the continuous aggregate we query workflow data from, makes these queries slow. (This is evident when we turn off Real Time Aggregation: we see faster query times and lower CPU utilization.)

Because we are limited in resources and still want Real Time Aggregation to be possible, we do the following for querying workflow classification counts ONLY:

- Create a new HourlyClassificationCountByWorkflow Real Time Aggregate
- Set a data retention policy for this new hourly workflow classification count aggregate (this should limit the amount of data the query planner has to sift through when dealing with real time aggregates)
- Turn off real time aggregation for the DailyClassificationCount
- When querying workflow classification counts that include the current date's counts, we query the current date's counts via the HourlyClassificationCountByWorkflow continuous aggregate, and we query the materialized DailyClassificationCountsByWorkflow for everything before that date
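The split between the two aggregates can be sketched in plain Ruby. This is only an illustration of the idea, not the code in this PR: `split_count_ranges`, its arguments, and the returned hash keys are hypothetical names chosen for the example.

```ruby
require 'date'

# Hypothetical helper (not from the PR): decides which part of a requested
# date window is served by the materialized-only daily aggregate and whether
# the real-time hourly aggregate must be queried for today's counts.
def split_count_ranges(start_date, end_date, today)
  yesterday = today - 1
  # The daily aggregate is only ever queried up to the day before today.
  materialized_end = [end_date, yesterday].min

  {
    # nil when no part of the window lies before today
    daily_range: start_date <= materialized_end ? (start_date..materialized_end) : nil,
    # query the hourly aggregate only when the window covers today
    include_today: start_date <= today && end_date >= today
  }
end
```

A window that ends before today never touches the hourly aggregate, and a window that starts today never touches the daily one.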

Some details:

- To avoid double counting when a workflow count query includes the real time aggregate, we scope the query against our materialized continuous aggregate to everything up to the day before the current date. Double counting can happen when the materialized view gets refreshed to include the current date's counts, either via a manual refresh or when the cagg watermark looks too far ahead (this can happen, but it is very rare and not a fault on our end).
- Timezones are not much of an issue here, since both our Rails app and our db time are in UTC.
- The complexity was mentioned in the BE/Ops call (on 9/26/24), but this detail seemed to confuse Cliff and Zach: when the current date is the start of a new bucket period (period options being day, week, month, or year), we need to append the current date's counts as a new entry with the proper start date of the new period. Otherwise, we simply add the current date's counts to the newest bucket, if there is a newest bucket entry. More detail in the next bullet.
- In include_today_to_scoped in count_classifications, we take care of the following cases:
  - for a given workflow, there are no classifications past or current => returns the empty past classifications counts
  - for a given workflow, there are past classifications but no current classifications => returns past classifications counts
  - for a given workflow, there are no past classifications but there are current classifications (i.e. a new workflow) => returns todays_classifications with proper period bucket start date
  - for a given workflow, there are past classifications and current classifications:
    A) if today falls within the most recent past classifications bucket, then add today's counts to that bucket
    B) if not, then append today's counts as a new entry with the proper period bucket start date
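The four cases above can be sketched as follows. This is a minimal illustration, not the actual `include_today_to_scoped` implementation: `bucket_start`, `add_today_counts`, and the `{ period:, count: }` hash shape are hypothetical, past counts are assumed to be sorted oldest to newest, and weeks are assumed to start on Monday.

```ruby
require 'date'

# Hypothetical helper: start date of the bucket containing `date`.
# Assumption: weekly buckets start on Monday.
def bucket_start(date, period)
  case period
  when 'day' then date
  when 'week' then date - ((date.wday + 6) % 7)
  when 'month' then Date.new(date.year, date.month, 1)
  when 'year' then Date.new(date.year, 1, 1)
  else raise ArgumentError, "unknown period: #{period}"
  end
end

# Hypothetical merge of today's count into past bucketed counts.
# past_counts: array of { period: Date, count: Integer }, oldest first.
def add_today_counts(past_counts, today_count, today, period)
  # No current classifications: return past counts unchanged (possibly empty).
  return past_counts if today_count.zero?

  start = bucket_start(today, period)
  newest = past_counts.last

  if newest && newest[:period] == start
    # Case A: today belongs to the newest existing bucket, so add to it.
    past_counts[0...-1] + [{ period: start, count: newest[:count] + today_count }]
  else
    # Case B, or a brand new workflow: append a new bucket for today.
    past_counts + [{ period: start, count: today_count }]
  end
end
```

Returning a new array rather than mutating `past_counts` keeps the sketch free of side effects, which makes each case easy to check in isolation.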

app/queries/count_classifications.rb

db/migrate/20240926225916_create_hourly_workflow_classification_count.rb

db/schema.rb

yuenmichelle1 added 20 commits September 27, 2024 08:43

Adding Hourly Workflow Counts Realtime CAgg and change DailyWorkflowCount to Materialized Only View

3e691d5

initial go on using hourly classifications for workflows

4f1e82a

remove print statement

73f8249

Update count_classifications.rb

77cb262

Update count_classifications.rb

2e9ec6c

remove unused var

435992a

taking care of blank case/ no entry found case

1485084

remove logs

b4eb0b4

add frames for test

3d0831c

adding testing for cases when end_date is before and after current day

4168da8

add tests for testing period and change eriod format to match that of data pull

77a14c3

adding tests for the case when there are classifications from previous day

b1d4a66

update comment on migration

d9d5e39

update hound comments

4ae1eee

update db.rake with new caggs

6ee4f68

Update hourly_workflow_classification_count.rb

789b555

Update db.rake

db35f9b

remove redundant returns

967974b

add frozen string literal true

7ba19d4

rubocop fix hound

b751b38

hound bot reviewed Oct 8, 2024

yuenmichelle1 marked this pull request as ready for review October 8, 2024 04:47

yuenmichelle1 added 3 commits October 8, 2024 00:07

adding comment on spec

d932fdf

rename spec to note adding counts

862dd9b

update migrations to be reversible

3bc6875

yuenmichelle1 requested a review from Tooyosi October 9, 2024 13:54

yuenmichelle1 merged commit cef6509 into main Oct 11, 2024
4 checks passed

yuenmichelle1 deleted the update-queries-for-workflow-counts branch October 18, 2024 18:05
