Best practice for col names when combining datasets #13

prockenschaub · 2022-06-06T10:17:30Z

Problem

When combining databases (say MIMIC-III and eICU), the names of the ID variables and the time variable depend on the order in which the sources are passed to load_concepts. See the following reprex, inspired by the quick start guide:

library(ricu)

src <- c("mimic_demo", "eicu_demo")

load_concepts("alb", src, verbose = FALSE)
#> # A `ts_tbl`: 6,657 ✖ 4
#> # Id vars:    `source`, `icustay_id`
#> # Units:      `alb` [g/dL]
#> # Index var:  `charttime` (1 hours)
#>       source     icustay_id charttime   alb
#>       <chr>           <int> <drtn>    <dbl>
#>     1 eicu_demo      141765  -2 hours   3.7
#>     2 eicu_demo      144815  -3 hours   4.2
#>     3 eicu_demo      144815   8 hours   3.6
#>     4 eicu_demo      145427  -6 hours   3.7
#>     5 eicu_demo      147307  -6 hours   3.5
#>     …
#> 6,653 mimic_demo     298685 130 hours   1.9
#> 6,654 mimic_demo     298685 154 hours   2
#> 6,655 mimic_demo     298685 203 hours   2
#> 6,656 mimic_demo     298685 272 hours   2.2
#> 6,657 mimic_demo     298685 299 hours   2.5

load_concepts("alb", rev(src), verbose = FALSE)
#> # A `ts_tbl`: 6,657 ✖ 4
#> # Id vars:    `source`, `patientunitstayid`
#> # Units:      `alb` [g/dL]
#> # Index var:  `labresultoffset` (1 hours)
#>       source     patientunitstayid labresultoffset   alb
#>       <chr>                  <int> <drtn>          <dbl>
#>     1 eicu_demo             141765  -2 hours         3.7
#>     2 eicu_demo             144815  -3 hours         4.2
#>     3 eicu_demo             144815   8 hours         3.6
#>     4 eicu_demo             145427  -6 hours         3.7
#>     5 eicu_demo             147307  -6 hours         3.5
#>     …
#> 6,653 mimic_demo            298685 130 hours         1.9
#> 6,654 mimic_demo            298685 154 hours         2
#> 6,655 mimic_demo            298685 203 hours         2
#> 6,656 mimic_demo            298685 272 hours         2.2
#> 6,657 mimic_demo            298685 299 hours         2.5
#> # … with 6,647 more rows

As you can see, although the information is exactly the same, the column names depend on the order of src. This prevents me, for example, from simply appending concepts loaded from different databases:

bind_rows(
  load_concepts("alb", "mimic_demo", verbose = FALSE),
  load_concepts("alb", "eicu_demo", verbose = FALSE)
)

#> # A `ts_tbl`: 6,657 ✖ 5
#> # Id var:     `icustay_id`
#> # Index var:  `charttime` (1 hours)
#>       icustay_id charttime   alb patientunitstayid labresultoffset
#>            <int> <drtn>    <dbl>             <int> <drtn>
#>     1         NA  NA hours   3.4           3352333   2 hours
#>     2         NA  NA hours   3.3           3352333  11 hours
#>     3         NA  NA hours   3.1           3352333  36 hours
#>     4         NA  NA hours   3.4           3353113 -36 hours
#>     5         NA  NA hours   3.6           3353113  10 hours
#>     …
#> 6,653     201006   0 hours   2.4                NA  NA hours
#> 6,654     203766 -18 hours   2                  NA  NA hours
#> 6,655     203766   4 hours   1.7                NA  NA hours
#> 6,656     204132   7 hours   3.6                NA  NA hours
#> 6,657     204201   9 hours   2.3                NA  NA hours
#> # … with 6,647 more rows

Question

Am I missing something obvious here, and am I supposed to do something differently? I did find the helper functions id_vars and index_var, which let me recover the names, but this seems cumbersome. It also does not allow me to merge on a specific ID level only (e.g. admissions) without remembering what this column was called in the first database I passed to load_concepts.

What was the reasoning underlying this design choice and would it be more practical to rename them directly to patient, hadm, and icustay, as returned e.g. by as_id_cfg(mimic_demo)?
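For illustration, the workaround described above (recover the old names, rename, then stack) might look roughly like the sketch below. It uses plain data frames as toy stand-ins for the two load_concepts results, and `standardise_meta`, `stay_id`, and `time` are hypothetical names chosen for this example, not part of ricu. With real ricu tables, the old names could presumably be looked up with id_vars and index_var rather than typed by hand.

```r
# Hypothetical helper (not part of ricu): rename the ID and time columns of a
# table to source-independent names so that tables can be stacked row-wise.
standardise_meta <- function(tbl, id_col, index_col,
                             new_id = "stay_id", new_index = "time") {
  out <- as.data.frame(tbl)
  names(out)[names(out) == id_col] <- new_id
  names(out)[names(out) == index_col] <- new_index
  out
}

# Toy stand-ins for the per-source results (values copied from the output above).
mimic <- data.frame(
  icustay_id = c(201006L, 203766L),
  charttime  = as.difftime(c(0, -18), units = "hours"),
  alb        = c(2.4, 2.0)
)
eicu <- data.frame(
  patientunitstayid = c(3352333L, 3353113L),
  labresultoffset   = as.difftime(c(2, -36), units = "hours"),
  alb               = c(3.4, 3.4)
)

# Stack the two sources once the meta columns share the same names.
combined <- rbind(
  cbind(source = "mimic_demo",
        standardise_meta(mimic, "icustay_id", "charttime")),
  cbind(source = "eicu_demo",
        standardise_meta(eicu, "patientunitstayid", "labresultoffset"))
)
```

This keeps a source column so that stay IDs from different databases cannot be confused, but it drops the ts_tbl class and its metadata, which is part of why the approach feels cumbersome.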


dplecko · 2023-12-29T17:47:01Z

This behavior is by design (currently). If I understand correctly, the suggestion would be to have:

id_var would take the value based on the passed id_type (such as icustay_id/hosp_id for icustay/hosp)
index_var value would depend only on id_type (such as an index_var called "icu_time" for icustay or "hosp_time" for hosp)

In this way, whenever load_concepts is invoked, meta_vars values would not depend on the data source from which the data is loaded.
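As a rough sketch of the proposal, the mapping from id_type to column names could be a simple lookup. The function name `meta_names` is hypothetical and nothing like it is confirmed to exist in ricu; the column names are the ones suggested in this comment.

```r
# Hypothetical lookup (not part of ricu): column names that depend only on the
# requested id_type, never on the data source.
meta_names <- function(id_type = c("icustay", "hosp")) {
  id_type <- match.arg(id_type)
  switch(
    id_type,
    icustay = list(id_var = "icustay_id", index_var = "icu_time"),
    hosp    = list(id_var = "hosp_id",    index_var = "hosp_time")
  )
}

icu_names  <- meta_names("icustay")
hosp_names <- meta_names("hosp")
```

Under this scheme, a table loaded at the icustay level would carry `icustay_id` and `icu_time` whether it came from MIMIC or eICU, so the order of sources would no longer matter.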

I discussed this with @nbenn, and I think this may be a good suggestion. We should perhaps allow for this in the next version of ricu.

prockenschaub added the enhancement (New feature or request) label on Apr 12, 2024
