Could the intervals be extended to month and/or month-year? #14

Open
Lextuga007 opened this issue Sep 17, 2021 · 11 comments

Comments

@Lextuga007

I want to give patientcounter a try with smoking prevalence data by team or ward. I have information over many years, so the best way to 'count' the open people in a team or ward is by referrals by month-year. Patientcounter only goes to day - is that right?

@johnmackintosh
Owner

Hi @Lextuga007 - I've only just seen this, not sure why I wasn't notified before.

As far as I know, if an interval works with cut, it should work here - this is the guidance for cut.POSIXct:

[image: cut.POSIXct guidance]
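Since the image is not reproduced here, a minimal base R sketch of the point may help. The `breaks` argument of `cut.POSIXct` accepts interval specifications such as "day", "week", "month", "quarter" or "year". The `admitted` vector below is made-up illustration data, not anything from the package.

```r
# Hypothetical admission date-times, declared as UTC to avoid local-timezone surprises
admitted <- as.POSIXct(
  c("2021-01-15 10:00:00", "2021-01-31 23:59:00",
    "2021-02-01 00:00:00", "2021-03-10 08:30:00"),
  tz = "UTC"
)

# One factor level per calendar month, labelled by the start of that month
by_month <- cut(admitted, breaks = "month")
month_counts <- table(by_month)
print(month_counts)

# The same call with "year" puts all four values into a single level
by_year <- cut(admitted, breaks = "year")
print(table(by_year))
```

Whether a given interval string behaves the same way inside patientcounter is a separate question that this sketch does not test.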

I'd be happy to take a look if you have some trial data you could share (offline)?

@will-ball

Hey @johnmackintosh did you guys end up finding out if this worked? I'm potentially going to be doing a count of folks added before but not removed from a register on a specific date over multiple years. It appears that specifying "year" would be fine - how would I go about setting the day & month to check at?

@johnmackintosh
Owner

@will-ball I never got round to looking into this in detail. In reference to @Lextuga007's comment, the package doesn't necessarily only go to day level, but it does expect date-times rather than dates.
It was created because of the need for hourly or even finer-grained counts.

If you use the individual level, the function returns a row per individual per interval, including the original start and end datetimes, plus the interval's base date and hour - which you can use to filter results to a specific date and time.

Alternatively, maybe you could use data.table's rolling joins?

https://www.gormanalysis.com/blog/r-data-table-rolling-joins/

https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html

If you have some fake data to play around with, I'd be happy to take a look at all the options.

@will-ball

will-ball commented Dec 14, 2022

Thanks for getting back to me @johnmackintosh

I've not encountered rolling joins before so will take a look, thanks for flagging. I've got a toy dataset to illustrate:

# Simple Example
library(tidyverse)
library(lubridate)
library(truncnorm)

n_people <- 1000

start_date <- as_date("2012-01-01")
end_date <- as_date("2021-12-31")

set.seed(20221214)

data <- as_tibble(
  list(
    id = sample(1:n_people, replace = TRUE),
    added = start_date + sample.int(end_date - start_date, n_people))) %>% 
  mutate(
    removed = added + rtruncnorm(n_people, mean = 30, sd = 15, a = 1, b = 1000),
    days = added %--% removed %/% days(1))

From data which essentially looks like this, I'd like to count how many people are 'registered' on the 31st July each year. I don't think it should complicate anything but the same person can be added/removed multiple times. I will have a play myself but if you get bored and want to take a look let me know.

@johnmackintosh
Owner

See if this gives you what you need, @will-ball:

library(tidyverse)
library(lubridate)
library(truncnorm)

library(patientcounter)

n_people <- 1000

start_date <- as_date("2012-01-01")
end_date <- as_date("2021-12-31")

set.seed(20221214)

data <- as_tibble(
  list(
    id = sample(1:n_people, replace = TRUE),
    added = start_date + sample.int(end_date - start_date, n_people))) %>% 
  mutate(
    removed = added + rtruncnorm(n_people, mean = 30, sd = 15, a = 1, b = 1000),
    days = added %--% removed %/% days(1))


data2 <- data %>% 
  mutate(added  = as.POSIXct(added), 
         removed = as.POSIXct(removed))

results <- interval_census(data2, 
                           identifier = 'id', 
                           admit = "added", 
                           discharge = "removed", 
                           time_unit = '1 day', 
                           results = 'patient')

results[lubridate::month(base_date)== 7 & lubridate::day(base_date) == 31] %>% 
  arrange(.,id, added)

@johnmackintosh
Owner

results[lubridate::month(base_date)== 7 & lubridate::day(base_date) == 31,.N, .(base_date)]

will give you tallies for each cutoff date
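As a cross-check on tallies like these, the same kind of count can be sketched in base R without any package. The `register` data frame below is hypothetical, and the boundary rule (registered on the cutoff if `added <= cutoff` and `removed > cutoff`) is an assumption that may not match how interval_census treats end points.

```r
# Hypothetical toy register: one row per spell; the same id can appear more than once
register <- data.frame(
  id      = c(1, 2, 3, 1),
  added   = as.Date(c("2012-07-01", "2012-08-15", "2013-07-20", "2013-07-30")),
  removed = as.Date(c("2012-08-10", "2012-09-01", "2013-08-05", "2013-08-02"))
)

# One cutoff per year: 31 July
cutoffs <- as.Date(sprintf("%d-07-31", 2012:2013))

# Count distinct people with a spell that is open on each cutoff date
n_registered <- vapply(seq_along(cutoffs), function(i) {
  d <- cutoffs[i]
  open_on_d <- register$added <= d & register$removed > d  # assumed boundary rule
  length(unique(register$id[open_on_d]))
}, integer(1))

tally <- data.frame(cutoff = cutoffs, n = n_registered)
print(tally)
```

Counting distinct ids rather than rows means a person with two overlapping spells is only counted once per cutoff date.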

@will-ball

That works perfectly thanks 😄

@johnmackintosh
Owner

Nice one @will-ball
Not sure I've been any use to @Lextuga007 yet so will leave this open for now

@Lextuga007
Author

Yes, it does look like "year" is supported, as the time_unit parameter feeds into {lubridate} functions. However, when I run a smaller example for years, something strange happens when an end date is already "floored":

library(dplyr)
library(patientcounter)

df <- tibble::tribble(
  ~id,  ~start_date,    ~end_date, ~smoking_status,
   5L, "2024-08-01", NA, "smoker",
   1L, "2019-01-01", "2020-01-01",        "smoker",
   2L, "2019-01-02", "2020-01-02",    "non-smoker",
   3L, "2019-01-03", "2022-01-01",        "smoker",
   4L, "2019-01-04", NA,    "non-smoker"
  ) |> 
  mutate(start_date = as.POSIXct(start_date),
         end_date = as.POSIXct(end_date))
  
results <- interval_census(df, 
                           identifier = 'id', 
                           admit = "start_date", 
                           discharge = "end_date", 
                           time_unit = 'year', 
                           results = 'patient') |> 
  arrange(id)

id 1 should get 2019 and 2020, but because its end date is on 1 January 2020, 2020 doesn't show. I'm guessing, but is this related to the date-times, with the time tipping it back to 2019-12-31? The same happens with id 3, which should get 2019, 2020, 2021 and 2022, but 2022 is dropped.

@johnmackintosh
Owner

Hmm, I wonder if that is timezone related.
I haven't tried your code yet, but I've encountered issues with the changeover between BST and GMT when UTC has not been explicitly declared.
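To illustrate the kind of shift meant here, a small base R sketch follows. It assumes the "Europe/London" timezone is available on the system, and it is not a diagnosis of the example above: midnight on a summer date in local time falls on the previous day in UTC, while a winter date does not move.

```r
# Midnight local time during BST (UTC+1)
summer <- as.POSIXct("2020-07-01", tz = "Europe/London")
summer_utc <- format(summer, tz = "UTC", usetz = TRUE)
print(summer_utc)   # "2020-06-30 23:00:00 UTC"

# Midnight local time during GMT (UTC+0): no shift
winter <- as.POSIXct("2020-01-01", tz = "Europe/London")
winter_utc <- format(winter, "%Y-%m-%d", tz = "UTC")
print(winter_utc)   # "2020-01-01"

# Declaring UTC up front keeps the calendar date stable
explicit <- as.POSIXct("2020-07-01", tz = "UTC")
explicit_utc <- format(explicit, "%Y-%m-%d", tz = "UTC")
print(explicit_utc) # "2020-07-01"
```

Under these assumptions a 1 January date would not shift, so a BST/GMT changeover alone would not explain a dropped year for end dates on 1 January. The session timezone in the report above is not known, though.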

I don't have much bandwidth to look into this at present.

Another possible influencing factor is my use of "within" as the method used with foverlaps.
I was thinking about making that a parameter in the main function so that folk can use whatever method suits them best.
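For readers unfamiliar with the `type` argument of `data.table::foverlaps`, here is a minimal sketch of how "within" differs from "any". The tables `x` and `y` and their column names are made up for illustration, and this does not reproduce how patientcounter builds its intervals internally.

```r
library(data.table)

# Hypothetical query intervals (x) and a single reference interval (y)
x <- data.table(id = 1:2, start = c(1L, 5L), end = c(10L, 8L))
y <- data.table(lo = 5L, hi = 10L)
setkey(y, lo, hi)  # foverlaps requires y to be keyed on its interval columns

# "any": x's interval overlaps y's interval at all
any_match <- foverlaps(x, y, by.x = c("start", "end"), by.y = c("lo", "hi"),
                       type = "any", nomatch = 0L)

# "within": x's interval lies entirely inside y's interval
within_match <- foverlaps(x, y, by.x = c("start", "end"), by.y = c("lo", "hi"),
                          type = "within", nomatch = 0L)

print(any_match$id)     # ids 1 and 2
print(within_match$id)  # id 2 only
```

The choice of type therefore changes which rows are matched to an interval, which is why exposing it as a parameter could change results at the edges.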

Will try and get that sorted soon.

@Lextuga007
Author

Tom Jemmett (https://github.com/tomjemmett) wrote this code, which I've adapted for the data I used. It's made me realise that what I need to count is not really a census: for something like prevalence, I don't want to subtract people who leave.

df |> 
  tidyr::pivot_longer(-c(id, smoking_status), 
                      values_to = "date") |>
  dplyr::mutate(n = ifelse(name == "start_date", 1, -1)) |>
  tidyr::replace_na(list(date = lubridate::today())) |> 
  dplyr::mutate(date = lubridate::floor_date(date, "year")) |> 
  dplyr::arrange(date, smoking_status) |>
  dplyr::mutate(c = cumsum(n),
                .by = smoking_status) |> 
  dplyr::select(-name, -id, -n) |>  
  dplyr::slice_tail(n = 1, by = c(date, smoking_status)) |> 
  tidyr::complete(date = seq(min(date), max(date), by = "year")) |> 
  tidyr::fill(c(c, smoking_status)) |>
  tidyr::replace_na(list(c = 0))

I think for prevalence I'd need to drop the generating of -1 for an exit.
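A rough base R sketch of that idea, counting entries only and never subtracting exits, might look like this. It uses the start dates from the small example earlier in the thread, ignores smoking_status for brevity, and the `ever_started` column name is made up for illustration.

```r
# Start dates from the small example above (exits are deliberately ignored)
starts <- as.Date(c("2024-08-01", "2019-01-01", "2019-01-02",
                    "2019-01-03", "2019-01-04"))

start_year <- as.integer(format(starts, "%Y"))
all_years  <- seq(min(start_year), max(start_year))

# New entries per year, including years with no entries
new_per_year <- tabulate(match(start_year, all_years), nbins = length(all_years))

# Running total of people who have ever started, by year
cumulative <- data.frame(year = all_years, ever_started = cumsum(new_per_year))
print(cumulative)
```

Whether a running total like this is the right definition of prevalence for the smoking data is a judgement call that the sketch does not settle.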

3 participants