Best practice for col names when combining datasets #13

Open
prockenschaub opened this issue Jun 6, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@prockenschaub
Collaborator

Problem

When combining databases (say MIMIC III and eICU), the names of the ID variables and the time variable depend on the order in which sources are passed to load_concepts. See the following reprex inspired by the quick start guide:

library(ricu)

src <- c("mimic_demo", "eicu_demo")

load_concepts("alb", src, verbose = FALSE)
#> # A `ts_tbl`: 6,657 ✖ 4
#> # Id vars:    `source`, `icustay_id`
#> # Units:      `alb` [g/dL]
#> # Index var:  `charttime` (1 hours)
#>       source     icustay_id charttime   alb
#>       <chr>           <int> <drtn>    <dbl>
#>     1 eicu_demo      141765  -2 hours   3.7
#>     2 eicu_demo      144815  -3 hours   4.2
#>     3 eicu_demo      144815   8 hours   3.6
#>     4 eicu_demo      145427  -6 hours   3.7
#>     5 eicu_demo      147307  -6 hours   3.5
#>     …
#> 6,653 mimic_demo     298685 130 hours   1.9
#> 6,654 mimic_demo     298685 154 hours   2
#> 6,655 mimic_demo     298685 203 hours   2
#> 6,656 mimic_demo     298685 272 hours   2.2
#> 6,657 mimic_demo     298685 299 hours   2.5

load_concepts("alb", rev(src), verbose = FALSE)
#> # A `ts_tbl`: 6,657 ✖ 4
#> # Id vars:    `source`, `patientunitstayid`
#> # Units:      `alb` [g/dL]
#> # Index var:  `labresultoffset` (1 hours)
#>       source     patientunitstayid labresultoffset   alb
#>       <chr>                  <int> <drtn>          <dbl>
#>     1 eicu_demo             141765  -2 hours         3.7
#>     2 eicu_demo             144815  -3 hours         4.2
#>     3 eicu_demo             144815   8 hours         3.6
#>     4 eicu_demo             145427  -6 hours         3.7
#>     5 eicu_demo             147307  -6 hours         3.5
#>     …
#> 6,653 mimic_demo            298685 130 hours         1.9
#> 6,654 mimic_demo            298685 154 hours         2
#> 6,655 mimic_demo            298685 203 hours         2
#> 6,656 mimic_demo            298685 272 hours         2.2
#> 6,657 mimic_demo            298685 299 hours         2.5
#> # … with 6,647 more rows

As you can see, although the information is exactly the same, the names of the ID and index columns depend on the order of src. This prevents me, for example, from simply appending results loaded from two different databases:

bind_rows(
  load_concepts("alb", "mimic_demo", verbose = FALSE),
  load_concepts("alb", "eicu_demo", verbose = FALSE)
)

#> # A `ts_tbl`: 6,657 ✖ 5
#> # Id var:     `icustay_id`
#> # Index var:  `charttime` (1 hours)
#>       icustay_id charttime   alb patientunitstayid labresultoffset
#>            <int> <drtn>    <dbl>             <int> <drtn>
#>     1         NA  NA hours   3.4           3352333   2 hours
#>     2         NA  NA hours   3.3           3352333  11 hours
#>     3         NA  NA hours   3.1           3352333  36 hours
#>     4         NA  NA hours   3.4           3353113 -36 hours
#>     5         NA  NA hours   3.6           3353113  10 hours
#>     …
#> 6,653     201006   0 hours   2.4                NA  NA hours
#> 6,654     203766 -18 hours   2                  NA  NA hours
#> 6,655     203766   4 hours   1.7                NA  NA hours
#> 6,656     204132   7 hours   3.6                NA  NA hours
#> 6,657     204201   9 hours   2.3                NA  NA hours
#> # … with 6,647 more rows

Question

Am I missing something obvious here, or am I supposed to do something differently? I did find the helper functions id_vars and index_var, which let me recover the current names, but this seems cumbersome. It also does not let me merge on a specific ID level (e.g. admissions) without remembering what that column was called in the first database I passed to load_concepts.
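
A minimal sketch of the renaming workaround described above. It uses plain data frames in place of the `ts_tbl` objects, so it runs without ricu or the demo data. The helper `harmonise_meta()` and the target names `stay_id` and `time` are hypothetical, not part of ricu. With real `ts_tbl` objects, the current column names would presumably come from `id_vars()` and `index_var()` rather than being typed by hand.

```r
# Toy stand-ins for the two load_concepts() results shown above.
# Column names and values are copied from the printed output; plain
# numeric hours replace the duration (<drtn>) columns.
mimic <- data.frame(
  icustay_id = c(201006L, 203766L),
  charttime  = c(0, -18),
  alb        = c(2.4, 2.0)
)
eicu <- data.frame(
  patientunitstayid = c(3352333L, 3352333L),
  labresultoffset   = c(2, 11),
  alb               = c(3.4, 3.3)
)

# Hypothetical helper: rename the ID and index columns to fixed,
# source-independent names and record where the rows came from.
harmonise_meta <- function(x, id, index, source) {
  names(x)[names(x) == id]    <- "stay_id"
  names(x)[names(x) == index] <- "time"
  data.frame(source = source, x, stringsAsFactors = FALSE)
}

combined <- rbind(
  harmonise_meta(mimic, "icustay_id", "charttime", "mimic_demo"),
  harmonise_meta(eicu, "patientunitstayid", "labresultoffset", "eicu_demo")
)
print(combined)
```

Because both pieces share the same column names after renaming, the rows stack without the `NA` padding seen in the `bind_rows()` output above.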

What was the reasoning behind this design choice, and would it be more practical to rename the columns directly to patient, hadm, and icustay, as returned e.g. by as_id_cfg(mimic_demo)?

@dplecko
Member

dplecko commented Dec 29, 2023

This behavior is by design (currently). If I understand correctly, the suggestion would be to have:

  • id_var would take its name from the passed id_type (e.g. icustay_id for icustay, hosp_id for hosp)
  • index_var would likewise depend only on id_type (e.g. an index_var called "icu_time" for icustay or "hosp_time" for hosp)

In this way, whenever load_concepts is invoked, meta_vars values would not depend on the data source from which the data is loaded.
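
The proposed naming scheme could be sketched as a simple lookup from id_type to column names. This is only an illustration of the idea: `meta_names()` is a hypothetical function, not part of ricu, and the names it returns are the examples given in the list above.

```r
# Hypothetical lookup: meta variable names depend only on id_type,
# never on the data source. Names follow the examples in this comment.
meta_names <- function(id_type = c("icustay", "hosp")) {
  id_type <- match.arg(id_type)
  switch(id_type,
    icustay = list(id_var = "icustay_id", index_var = "icu_time"),
    hosp    = list(id_var = "hosp_id",    index_var = "hosp_time")
  )
}

str(meta_names("icustay"))
str(meta_names("hosp"))
```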

I discussed this with @nbenn, and I think this may be a good suggestion. We should perhaps allow for this in the next version of ricu.

@prockenschaub prockenschaub added the enhancement New feature or request label Apr 12, 2024