Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use R-expression for formula arguments #75

Open
assignUser opened this issue Oct 21, 2020 · 11 comments
Open

Use R-expression for formula arguments #75

assignUser opened this issue Oct 21, 2020 · 11 comments
Labels
breaking change feature feature request or enhancement

Comments

@assignUser
Copy link
Collaborator

I guess a more R-way (which is a matter of taste) would be to specify formula = -2+age*0.1, i.e., as an R expression, see ?deparse
and ?substitute. The same with varname and dist. See base R functions transform() and subset() for inspiration.

As brought up by gagolews in openjournals/joss-reviews/issues/2763 we should think about this long term. It is a breaking change so not a small issue.

@gagolews
Copy link

Good luck! 👍

@assignUser
Copy link
Collaborator Author

@kgoldfeld would changing the order of the definition table columns cause problems in you workflow or do you think for other reasons it is a bad idea? I would like to change from:
varname formula variance dist link to
varname dist formula param1 param2
this does not have technical reasons, I just think this order is more intuitive with the renamed variance/link.

@kgoldfeld
Copy link
Owner

I think that would would work though I was also thinking of possibly this:

varname dist param1 param2 link

where param1 replaces formula and param2 replaces variance. link is not really a parameter of a distribution, but the relationship that defines the right and left hand sides of the equation. The obvious flaw is with the the new trtAssign distribution, link doesn't make so much sense. But, I think it is more confusing to have logit be a value for a parameter. I coud live with the link being identity (and not really needed to be specified) or nonbalanced for the *trtAssign dist`.

We could also consider the R specification for the distribution and link (e.g. in glm) to be combined, so that we eliminate the link field. So, dist = binary(logit) or dist = binary() - in the case where we don't change the name of binary to binomial. Or dist = trtAssign() or dist = trtAssign(balanced) vs. dist = trtAssign(nonbalanced). If no link is provided, glm also allows you to specify without parentheses, so dist = trtAssign.

@assignUser
Copy link
Collaborator Author

Hm. Why not go all out then and do param1 param2 param3 to make it distribution agnostic?
The problem with what you write in your second paragraph is backwards compatibility, which would be hard to uphold I think. At least with the way I have it planned currently... I'll think about it

@kgoldfeld
Copy link
Owner

I think I prefer sticking with link instead of param3, for the reasons stated earlier. I guess the confusing thing is that one can think of a parameter as a statistical parameter (ad I do) or as an argument to a function (as I think you might are). Maybe we can come up with a more generic term, but less generic than arg1, arg2, and arg3. Maybe thing1, thing2, and thing3 :).

@assignUser
Copy link
Collaborator Author

I see where you are coming from... and again it shows that this is true:

There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton

arg brg crg xD

@assignUser
Copy link
Collaborator Author

As a reference here my current poc https://gist.github.com/assignUser/af104a19c2b6819043bd7e290117e5cd
Generation is not finished yet but defining stuff works for now. At the moment I feel that it is a bit clunky with the generics and having to construct an object that is used just once... Maybe I can think of something smoother.

We want to keep the definition tables though right? Otherwise having different params for each dist wouldn't really be a problem, e.g. with a list instead of a dt.

@kgoldfeld
Copy link
Owner

The table was literally the underlying motivation for the whole package, so that might be too much for me to give up.

@assignUser
Copy link
Collaborator Author

I though so and didn't want to get rid of the idea of documenting the data generation process, just using another form factor. But I am not sure what that would look like...

@kgoldfeld
Copy link
Owner

How about adding another field - distArg or dist_arg or arg or arg1, so we would have

varname dist param1 param2 link arg1

So, balanced would be a value for arg1 in the trtAssign dist. And this would add a little flexibility in allowing for alternative specifications for some distributions - say the gamma distribution, which now uses a mean/dispersion specification, but the traditional way is using rate/scale. We could use arg1 to allow users to indicate what version they are using (in the case of the gamma dist, though this is also possible for the beta distribution as well). Also, in the categorical distribution, the alternative categories now specified in variance could be moved to arg1. I understand that all of this has backwards mapping compatibility issues. But just a thought.

@assignUser
Copy link
Collaborator Author

I published a first sketch of how this could work with using S3 methods here: https://github.com/kgoldfeld/simstudy/blob/issue-75-no-more-strings/R/no_strings.R
I currently think that it is pretty cumbersome internally and it might restrict us in our options. I have some other Ideas I am going to test soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change feature feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants