Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up aggregate data #440

Merged
merged 3 commits into from
Oct 2, 2024
Merged

Speed up aggregate data #440

merged 3 commits into from
Oct 2, 2024

Conversation

M-Kusumgar
Copy link
Contributor

@M-Kusumgar M-Kusumgar commented Sep 6, 2024

This branch tries to (and ultimately fails) speed up the review inputs metadata endpoint in Hintr. Profiling the code revealed these Naomi functions were the slow parts. After optimising these they are many times faster on the test data and the MWI data however where we ultimately fail is DRC for which there is only a slight speed up because the majority of the time for it is spent calculating the other plot types like art_new_total, art_new_adult_m, etc. We have a solution for this that will probably get done in a PR after the plot refactor. For now, the main changes in this PR:

  • drop_geometry arg to aggregate functions to skip unnecessary work (these functions were trying to drop geometry after it had already been dropped so wasting a lot of time reading the files)
  • small test fix, we should be checking check3 (defined in the lines just above the diff) and not check1
  • new algorithm for aggregating up data, it is just a for loop that works level by level to aggregate up the data
  • same new algorithm is also applied to calculating missing_map

…s metadata endpoint in Hintr. Profiling the code revealed these Naomi functions were the slow parts. After optimising these they are many times faster on the test data and the MWI data however where we ultimately fail is DRC for which there is only a slight speed up because the majority of the time for it is spent calculating the other plot types like art_new_total, art_new_adult_m, etc. We have a solution for this that will probably get done in a PR after the plot refactor. For now, the main changes in this PR:

* drop_geometry arg to aggregate functions to skip unnecessary work (these functions were trying to drop geometry after it had already been dropped so wasting a lot of time reading the files)
* small test fix, we should be checking `check3` (defined in the lines just above the diff) and not `check1`
* new algorithm for aggregating up data, it is just a for loop that works level by level to aggregate up the data
* same new algorithm is also applied to calculating `missing_map`
@M-Kusumgar M-Kusumgar changed the title This branch tries to (and ultimately fails) speed up the review input… Speed up aggregate data Sep 9, 2024
@r-ash
Copy link
Contributor

r-ash commented Sep 10, 2024

Hintr PR hivtools/hintr#524

Copy link
Contributor

@r-ash r-ash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, and hintr tests are passing with this branch so looking good. I think you can ignore that devl branch failure for some reason it is running really slowly but I don't think raising any real issues.

Really need to sort out this duckdb install stuff, not sure why it is not working from binary. (Maybe we'll have to run these tests inside docker again if I can't get that install working)

##'
##' @return Aggregated ART data containing columns area_id, area_name,
##' area_level, area_level_label, parent_area_id, sex, age_group, time_period,
##' year, quarter,calendar_quarter and art_current
##'
##' @export
aggregate_art <- function(art, shape) {
aggregate_art <- function(art, shape, drop_geometry = TRUE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love passing this through as an arg, does this definitely make it faster? Ran a quick benchmark and it looks like dropping the geometry itself is very fast.

mwi_file <- "tests/testthat/testdata/malawi.geojson"
mwi_sf <- sf::read_sf(mwi_file)
bench::mark(sf::read_sf(mwi_file), sf::read_sf(mwi_file) |> sf::st_drop_geometry(), sf::st_drop_geometry(mwi_sf), check = FALSE)
# A tibble: 3 × 13
  expression                                       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory             time       gc      
  <bch:expr>                                  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>             <list>     <list>  
1 sf::read_sf(mwi_file)                         37.9ms   39.6ms      25.2     1.1MB     0       13     0      516ms <NULL> <Rprofmem>         <bench_tm> <tibble>
2 sf::st_drop_geometry(sf::read_sf(mwi_file))     42ms     44ms      22.7     1.1MB     0       12     0      529ms <NULL> <Rprofmem>         <bench_tm> <tibble>
3 sf::st_drop_geometry(mwi_sf)                    46µs   56.1µs   14715.       656B     2.14  6874     1      467ms <NULL> <Rprofmem [2 × 3]> <bench_tm> <tibble>
``

> these functions were trying to drop geometry after it had already been dropped so wasting a lot of time reading the files

Do you have a profile that shows that? Code looked like if we passed in an sf object it wouldn't be reading anything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it does make a big difference to skip the drop_geometry I wonder if we could instead check if the areas has a geometry or not, and not attempt to remove it if it doesn't have it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah okay looking at it properly, we were passing in shape to this function not areas (areas is result of reading the file and geometry dropped) so i can remove this arg and pass in areas only

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just added a check for if its a table then no point reading it again

@@ -152,7 +152,7 @@ test_that("ART data can be aggregated when avalible at different admin levels",
dplyr::group_by(area_level_label) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop")

expect_equal(check1$n, c(10, 10, 6))
expect_equal(check3$n, c(10, 10, 6))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot!

@M-Kusumgar M-Kusumgar requested a review from r-ash September 12, 2024 13:04
Copy link
Contributor

@r-ash r-ash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I'd like @rtesra to check the data manipulation before we merge but all fine from my end.

@M-Kusumgar M-Kusumgar merged commit 3598dd6 into master Oct 2, 2024
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants