---
title: "Developer Guide"
---
<br/>
## Developer guide
### Object classes
`bupaR` defines two main object classes: `eventlog` and `activitylog`. Both are special types of `data.frame` objects. In addition, there is the overarching object class `log`, which is used by functions for which the distinction between the two classes is not relevant. It only serves as a higher-level classification of `eventlog` and `activitylog` objects and cannot stand on its own: an object cannot have just the class `log`, it must have one of the subclasses as well.
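The relation between `log` and its two subclasses can be pictured with plain S3 classes. The sketch below does not use `bupaR` itself: the toy objects, their class vectors and the `n_rows()` generic are assumptions made for illustration, and the class vector of a real log may contain further classes.

```{r}
# Toy objects: data frames tagged with a subclass and the shared parent class
# (assumed layout, for illustration only)
toy_eventlog <- structure(
  data.frame(patient = c(1, 1, 2), handling = c("a", "b", "a")),
  class = c("eventlog", "log", "data.frame")
)
toy_activitylog <- structure(
  data.frame(patient = c(1, 2), handling = c("a", "a")),
  class = c("activitylog", "log", "data.frame")
)

# Hypothetical generic with a single method for the parent class "log"
n_rows <- function(x) UseMethod("n_rows")
n_rows.log <- function(x) nrow(x)

# Both subclasses reach the same method through "log"
n_rows(toy_eventlog)
n_rows(toy_activitylog)

# An object is a log because it is an eventlog or an activitylog, never both
inherits(toy_eventlog, "log")
inherits(toy_eventlog, "activitylog")
```

A function written once for `log` in this way serves both subclasses, while a subclass-specific method can still override it where the two differ.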
The defining characteristics of a `log` are stored in regular variables, whose names can be retrieved with the `mapping()` function.
```{r}
mapping(patients)
mapping(patients_act)
```
Note that `eventlog` and `activitylog` have some mapping elements in common:
- case identifier
- activity identifier
- resource identifier
Other mapping elements differ slightly:
- the activity instance identifier only exists for `eventlog`. For `activitylog`, each row is an activity instance by definition.
- the lifecycle identifier for `eventlog` consists of a single column. For `activitylog`, it consists of multiple columns, stored under the `timestamps` mapping element. (At least the start and complete statuses are required, although they can contain `NA` values.)
Note that there are two classes for the mapping, one for `eventlog` and one for `activitylog`. (Note also that the `eventlog_mapping` class has a dedicated `print()` method, while the `activitylog` mapping does not (yet) and prints as a regular list.)
Individual mapping variables can be obtained with the dedicated id functions. They work both on the logs themselves and on the mappings.
```{r}
activity_id(patients)
activity_id(patients_act)
mapping_event <- mapping(patients)
mapping_act <- mapping(patients_act)
activity_id(mapping_event)
activity_id(mapping_act)
```
During data manipulation, it can happen (or be necessary) that the log is converted to a regular `data.frame` for some operations. If the final output of the function should once again be a `log` object (and not a visual or summary table), the stored mapping can be used to restore the log. This can be done using `re_map()`.
```{r}
patients_df <- as.data.frame(patients)
class(patients_df)
patients_log <- re_map(patients_df, mapping_event)
class(patients_log)
```
`re_map()` recognizes the class of the mapping, and thus works for both `activitylog` and `eventlog` mappings. It always restores the original type: if the mapping originates from an `activitylog` object, the result is once again an `activitylog` object. It can never be used to convert an `activitylog` to an `eventlog`, or vice versa.
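A function that needs to leave the `log` class temporarily could follow the pattern below. This is a minimal sketch: `add_row_number()` is a hypothetical helper, not part of `bupaR`, and the example assumes that the `patients` log comes from the `eventdataR` package.

```{r}
library(bupaR)
library(eventdataR)  # assumed source of the patients log

# Hypothetical helper: store the mapping, work on a plain data.frame,
# then restore the log with re_map()
add_row_number <- function(log) {
  log_mapping <- mapping(log)       # remember how the log is defined
  df <- as.data.frame(log)          # drop to a regular data.frame
  df$row_nr <- seq_len(nrow(df))    # any plain data.frame operation
  re_map(df, log_mapping)           # restore the original log class
}

class(add_row_number(patients))
```

Because the mapping is captured from the input, the same helper should return an `activitylog` when it is given one.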
While `re_map()` is exported by `bupaR`, it is primarily meant for internal use. It is useful to end users only in more advanced use cases.
Note that functions that are not exported can always be accessed with the `:::` operator instead of the `::` operator. For instance, we can use the non-exported `activity_id_()` function outside of `bupaR` as follows:
```{r}
bupaR:::activity_id_(patients)
```
While you should typically not need these functions outside of `bupaR`, except perhaps for developing or testing code interactively, we will use the `:::` notation in this manual whenever we refer to internal functions.
`activity_id_()` is a variant of `activity_id()`. Instead of returning a character string, it returns a symbol. This symbol is useful when you want to use the mapping variable while programming.
For example, suppose you want to filter the `patients` log to keep only the case where `patient == 1`. You don't know that the case identifier is called "patient", so you use the `case_id()` function to look it up.
The following will not work.
```{r}
patients %>%
  filter(case_id(patients) == 1)
```
This is equivalent to the following, which will not work either.
```{r}
patients %>%
  filter("patient" == 1)
```
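Why neither attempt works can be seen without `bupaR`. The sketch below uses a toy data frame (hypothetical data, not the `patients` log) to contrast a character string with a symbol.

```{r}
# Toy data frame standing in for a log (hypothetical data)
toy <- data.frame(patient = c(1, 2, 3))

id_chr <- "patient"         # a character string, like the result of case_id()
id_sym <- as.name(id_chr)   # the same name as a symbol

# The string is compared as-is: 1 is coerced to "1", and "patient" is not "1"
id_chr == 1

# The symbol is looked up as a column when evaluated inside the data
eval(id_sym, toy) == 1
```

The first comparison yields a single `FALSE` that has nothing to do with the values in the column, while the second one compares the column values themselves.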
In order to successfully do this, we could use the symbol:
```{r}
patients %>%
  filter(!!bupaR:::case_id_(patients) == 1)
```
More on symbols and `!!`: https://adv-r.hadley.nz/quasiquotation.html
Alternatively, the following notation works as well.
```{r}
patients %>%
  filter(.data[[case_id(patients)]] == 1)
```
Here, `.data` is a special construct, a _pronoun_, that can be used in `dplyr` functions. More information here: https://adv-r.hadley.nz/quasiquotation.html
In `bupaR`, the latter notation is preferred. It has the advantage that it can be used in scripts both inside and outside `bupaR` (whereas the `!!` notation relies on internal functions and thus only works outside `bupaR` with the `bupaR:::` prefix). It is also slightly easier to understand than the workings of `!!`.
That said, the use of `case_id_()` and `symbol(case_id())` is still widespread in `bupaR`; the goal is to phase it out.
### dplyr verbs
The following dplyr verbs have received methods for activity logs and event logs.
- `filter()`
- `group_by()`
- `arrange()`
- `mutate()`
- `select()`
They will all return a proper log, i.e. there is no risk of losing the defined mapping.
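A quick way to check this is to chain a few verbs and inspect the class of the result. The sketch below assumes that the `patients` log comes from the `eventdataR` package; the `checked` column is a made-up name used only for illustration.

```{r}
library(bupaR)
library(eventdataR)  # assumed source of the patients log

patients %>%
  filter(.data[[case_id(patients)]] == 1) %>%
  mutate(checked = TRUE) %>%   # checked is a hypothetical column
  class()
```

If the methods behave as described, the class vector still contains `eventlog` after both steps.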
Special attention has to be given to the following:
#### Select
A conventional `select()` does not ensure that the log keeps the variables it needs to be considered a log. The `select()` methods for logs therefore keep the listed variables __and__ the variables that define the event log.
The following code returns an `eventlog` object with the attribute _oligurie_, as well as the six variables needed to define the event log (plus the `.order` variable, see further).
```{r}
sepsis %>%
  select(oligurie)
```
This behavior can be turned off by setting `force_df = TRUE`. In that case, `select()` works just like a traditional `select()`, and the result is a `data.frame`, no longer an `eventlog`.
```{r}
sepsis %>%
  select(oligurie, force_df = TRUE)
```
Because the variables of the mapping are always kept by default, you can select just the _event log mapping_ by calling `select()` without arguments.
```{r}
sepsis %>%
  select()
```
If you want to select only specific `eventlog` classifiers, you can use `select_ids()`. Because you would typically not select all ids (otherwise you could use `select()`), this will by default turn your object into a `data.frame`.
```{r}
sepsis %>%
  bupaR::select_ids(activity_id, case_id)
```
Note how the different classifiers are specified: using the `_id()` functions, but without the brackets, and not as character strings.
#### Group by
While `group_by()` is defined for logs, a function is only "compatible" with grouped logs once it has dedicated methods for them. Some utility functions for this do exist (see further).
There are some shortcuts for typical groupings when programming in `bupaR`:
- `group_by_case()`
- `group_by_activity()`
- `group_by_activity_instance()`
- `group_by_resource()`
- `group_by_resource_activity()`
```{r}
patients %>%
  group_by_case()
```
is equivalent to
```{r}
patients %>%
  group_by(.data[[case_id(patients)]])
```
Shortcuts are not provided for every relevant combination of groupings; only the common resource-activity combination has one. For other combinations, the internal `group_by_ids()` accepts any combination of `_id()` functions. For example:
```{r}
patients %>%
  bupaR:::group_by_ids(activity_id, case_id)
```
Note that the notation is analogous to `select_ids()`: specify the id functions, without quotation marks or brackets.