ENH: make aggregate allow for arbitrary functions from the .json #39

prockenschaub · 2023-08-07T18:20:12Z

Problem

concept-dict.json allows to specify a standard aggregation for concepts. However, this currently only works for functions that are known to dt_gforce, as aggregation functions specified as strings are directly passed on to dt_gforce (see also #36).

Solution

We could simply check for any function that is known to dt_gforce and pass those on. If we get another function, we try to parse it.

The text was updated successfully, but these errors were encountered:

nbenn · 2023-09-28T19:15:28Z

One thing to keep in mind here is performance. If you have an R function which is applied on every group (of which there can be quite many in these datasets when looking at 1e4-1e5 patients and hourly data), the milliseconds to call functions in R add up considerably (one trick of data.table here is to not call these functions to R, but use the C API iirc). At some point I did play around with this, but from what I remember, arbitrary R functions were barely usable (loading a couple of concepts would suddenly take ~30 mins or more).

mlondschien · 2023-09-29T05:57:23Z

Note that

load_concepts("caregiver", src = "mimic", aggregate = unique)

as suggested by @prockenschaub here works and finishes in very reasonable time.

prockenschaub · 2023-09-29T06:04:28Z

Ultimately, I guess it depends on the size of the table, number of groups, and complexity of the function. However, I don't think the fact that it won't work for some cases is a good argument against allowing arbitrary functions. It is a good point, though, and will need to be considered by the person creating the concept. If it finishes in reasonable time, fine. If it takes too long, another option needs to be found. By forwarding dt_gforce functions to dt_gforce and calling any other function through R, we allow for both without changing anything for the worse.

nbenn · 2023-09-29T07:44:59Z

@mlondschien This might work well for id_tbl concepts, but for typical hourly data we have a factor ~100 of invocations between "static" and "time-varying" concepts and going to higher time resolution, this gets worse.

I'm not against this, I just wanted to point out why I did not do something that is a pretty obvious thing one could do: I was worried about user experience. Someone will put in their own function for a time-varying concept and then things stop working without it being clear why. We could add a warning if a non dt_gforce-optimized function is chosen (and with rlang we could even make it so that such a warning is only showed once per session as to not be too annoying). Then this concern would mostly go away and we'd have the option to use this in cases where it is useful.

@prockenschaub As side remark: above, when saying non dt_gforce-optimized functions were barely usable for larger time-varying concepts, I had simple functions in mind (calling a base R primitive function like sum() might already be problematic). As soon as you have a function with a bit more involved logic implemented in R, things become completely unusable. Running some quick experiments to see what I mean should be straightforward. If my memory proves inaccurate, all the better.

mlondschien · 2023-09-29T07:49:11Z

My current version of caregiver looks like this:

"caregiver": {
        "category": "misc",
        "aggregate": "unique",
        "sources": {
            "mimic": [
                {
                    "table": "chartevents",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "datetimeevents",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "inputevents_cv",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "inputevents_mv",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "noteevents",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "outputevents",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "procedureevents_mv",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                }
            ],
            "miiv": [
                {
                    "table": "chartevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "datetimeevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "ingredientevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "inputevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "outputevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "procedureevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                }
            ]
        }
    },

I load this as follows

caregiver = ricu::load_concepts("caregiver", src, aggregate = unique)

It's not super fast, but it works.

prockenschaub added the enhancement New feature or request label Apr 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: make aggregate allow for arbitrary functions from the .json #39

ENH: make aggregate allow for arbitrary functions from the .json #39

prockenschaub commented Aug 7, 2023

nbenn commented Sep 28, 2023

mlondschien commented Sep 29, 2023

prockenschaub commented Sep 29, 2023

nbenn commented Sep 29, 2023

mlondschien commented Sep 29, 2023

ENH: make aggregate allow for arbitrary functions from the .json #39

ENH: make aggregate allow for arbitrary functions from the .json #39

Comments

prockenschaub commented Aug 7, 2023

Problem

Solution

nbenn commented Sep 28, 2023

mlondschien commented Sep 29, 2023

prockenschaub commented Sep 29, 2023

nbenn commented Sep 29, 2023

mlondschien commented Sep 29, 2023