Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pass-by-value vs. pass-by-reference #50

Open
assignUser opened this issue Sep 24, 2020 · 14 comments
Open

pass-by-value vs. pass-by-reference #50

assignUser opened this issue Sep 24, 2020 · 14 comments

Comments

@assignUser
Copy link
Collaborator

I was just looking through the code and I noticed something changed that probably shouldn't have. Whenever we pass a data.table as an argument to a function, it is imperative to copy the data table to another data.table. So you might see something like:

foo <- function(dtName, ...) {
   dT <- copy(dtName)
   .
   .
}

This addresses a weird R problem (and maybe there is a better work around, but this was the only way I could figure out) where the original data.table gets changed by the function call even if that was not the intention. For example:

library(data.table)

dd <- data.table(x = c(1, 2, 3), y = c(3, 4, 5))
dd
#>    x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5

# doesn't do what you think it does

chk <- function(d) {
  d[, z := c(9, 10, 11)]
  d[]
}

chk(dd)
#>    x y  z
#> 1: 1 3  9
#> 2: 2 4 10
#> 3: 3 5 11
dd
#>    x y  z
#> 1: 1 3  9
#> 2: 2 4 10
#> 3: 3 5 11

# this does

dd <- data.table(x = c(1, 2, 3), y = c(3, 4, 5))
dd
#>    x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5

chk2 <- function(d) {
  d2 <- copy(d)
  d2[, z := c(9, 10, 11)]
  d2[]
}

chk2(dd)
#>    x y  z
#> 1: 1 3  9
#> 2: 2 4 10
#> 3: 3 5 11
dd
#>    x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5

Originally posted by @kgoldfeld in #49 (comment)

@assignUser
Copy link
Collaborator Author

@kgoldfeld That is intended behavior by data.table as I understand. The terminology is pass-by-value vs. pass-by-reference. R almost exclusively works via pass-by-value, which means that arguments to functions are copied when they are passed on. This avoids side-effects like you describe here but also lowers performance due to the copying and increased memory usage. See this Stackoverflow answer for more general information.

As data.table is specifically made for huge data sets they implement changes via reference (e.g. reordering columns etc.) because they don't have to copy stuff around.

I did not remove calls to dt <- copy(dt) as far as I know. Do you want to do this more consistently throughout simstudy?

@assignUser assignUser added this to the 0.2.0 milestone Sep 24, 2020
@kgoldfeld
Copy link
Owner

I did see in genOrdCat that you removed dt <- copy(dtName) and you continue to work with dtName throughout, not dt (since it does not exist anymore). Was that your intention?

@assignUser
Copy link
Collaborator Author

I can blame you for that 😝 because I just reused the body of genCorOrdCat where you don't seem to do the copy thing:

genCorOrdCat <- function(dtName, idname = "id", adjVar = NULL, baseprobs,

But what ever the case let us double check all functions and make it the default to copy the incoming data.table. We could add an option (e.g. environment variable) to deactivate that behavior for people who work with very large simulations. That way we could preserve the speed gains but make it a conscious choice.

@kgoldfeld
Copy link
Owner

You are so right - I just looked at the release version. Not sure how I made that mistake, but I did. And then I accused you of it - doubly bad!

I'm not sure it is necessary to deactivate that behavior - since I generally test all the functions generating a million observations to see if things get too bogged down. Just seems like that would add complexity for the user. Now, if people started complaining, then I guess we could do it.

@assignUser
Copy link
Collaborator Author

I had a quick look and at the moment many but not all functions copy their input data.table. I would argue for creating one consistent behavior across all simstudy functions, either with or without copying. Consistency makes for a good user experience 😸

@kgoldfeld
Copy link
Owner

I totally agree - we should copy in all cases where it is possible the user will be creating a new data.table as a result of the call to the function are their intention is to change the original data set.

@assignUser
Copy link
Collaborator Author

ok I will look into it

@assignUser assignUser removed this from the 0.2.0 milestone Oct 5, 2020
@assignUser
Copy link
Collaborator Author

I will think about this some more. For now lets post pone it after 0.2.0.

assignUser added a commit that referenced this issue Oct 5, 2020
@kgoldfeld
Copy link
Owner

I agree - probably not causing too many problems. But, I am sure we will hear about it if it is.

@assignUser
Copy link
Collaborator Author

I restored the old behaviour of genOrdCat in the trtAssign pull request :) after that is merged I will start the rhub checks.

@assignUser
Copy link
Collaborator Author

I still think we should default to one behavior for all function but maybe it would make sense to add a parameter option to turn the copying off/on. I think we should adress this after #79 so we can make sure that performance is not addressed negatively.

@kgoldfeld
Copy link
Owner

I would want it to default to copying to a new data.table, but I am not totally convinced that anyone would ever want to turn it off (and allow changes to the data.table being passed as an argument, which just seems like weird behavior).

@assignUser
Copy link
Collaborator Author

I'm with you on the default as this is (mostly) the simstudy default anyway.

Well the in-place changes are the default behavior for data.table for performance reasons. So maybe for very large datasets this could be helpful? It would at least allow an option for heavy dt users to keep the default behavior of dt they are used to?

@kgoldfeld
Copy link
Owner

I think an option would be fine for those folks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants