Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: make aggregate allow for arbitrary functions from the .json #39

Open
prockenschaub opened this issue Aug 7, 2023 · 5 comments
Open
Labels
enhancement New feature or request

Comments

@prockenschaub
Copy link
Collaborator

Problem

concept-dict.json allows to specify a standard aggregation for concepts. However, this currently only works for functions that are known to dt_gforce, as aggregation functions specified as strings are directly passed on to dt_gforce (see also #36).

Solution

We could simply check for any function that is known to dt_gforce and pass those on. If we get another function, we try to parse it.

@nbenn
Copy link
Member

nbenn commented Sep 28, 2023

One thing to keep in mind here is performance. If you have an R function which is applied on every group (of which there can be quite many in these datasets when looking at 1e4-1e5 patients and hourly data), the milliseconds to call functions in R add up considerably (one trick of data.table here is to not call these functions to R, but use the C API iirc). At some point I did play around with this, but from what I remember, arbitrary R functions were barely usable (loading a couple of concepts would suddenly take ~30 mins or more).

@mlondschien
Copy link
Contributor

Note that

load_concepts("caregiver", src = "mimic", aggregate = unique)

as suggested by @prockenschaub here works and finishes in very reasonable time.

@prockenschaub
Copy link
Collaborator Author

Ultimately, I guess it depends on the size of the table, number of groups, and complexity of the function. However, I don't think the fact that it won't work for some cases is a good argument against allowing arbitrary functions. It is a good point, though, and will need to be considered by the person creating the concept. If it finishes in reasonable time, fine. If it takes too long, another option needs to be found. By forwarding dt_gforce functions to dt_gforce and calling any other function through R, we allow for both without changing anything for the worse.

@nbenn
Copy link
Member

nbenn commented Sep 29, 2023

@mlondschien This might work well for id_tbl concepts, but for typical hourly data we have a factor ~100 of invocations between "static" and "time-varying" concepts and going to higher time resolution, this gets worse.

I'm not against this, I just wanted to point out why I did not do something that is a pretty obvious thing one could do: I was worried about user experience. Someone will put in their own function for a time-varying concept and then things stop working without it being clear why. We could add a warning if a non dt_gforce-optimized function is chosen (and with rlang we could even make it so that such a warning is only showed once per session as to not be too annoying). Then this concern would mostly go away and we'd have the option to use this in cases where it is useful.

@prockenschaub As side remark: above, when saying non dt_gforce-optimized functions were barely usable for larger time-varying concepts, I had simple functions in mind (calling a base R primitive function like sum() might already be problematic). As soon as you have a function with a bit more involved logic implemented in R, things become completely unusable. Running some quick experiments to see what I mean should be straightforward. If my memory proves inaccurate, all the better.

@mlondschien
Copy link
Contributor

My current version of caregiver looks like this:

"caregiver": {
        "category": "misc",
        "aggregate": "unique",
        "sources": {
            "mimic": [
                {
                    "table": "chartevents",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "datetimeevents",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "inputevents_cv",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "inputevents_mv",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "noteevents",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "outputevents",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "procedureevents_mv",
                    "val_var": "cgid",
                    "class": "col_itm",
                    "target": "ts_tbl"
                }
            ],
            "miiv": [
                {
                    "table": "chartevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "datetimeevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "ingredientevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "inputevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "outputevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                },
                {
                    "table": "procedureevents",
                    "val_var": "caregiver_id",
                    "class": "col_itm",
                    "target": "ts_tbl"
                }
            ]
        }
    },

I load this as follows

caregiver = ricu::load_concepts("caregiver", src, aggregate = unique)

It's not super fast, but it works.

@prockenschaub prockenschaub added the enhancement New feature or request label Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants