Correlated data from multiple different distributions #103

Open
tombisho opened this issue Jul 30, 2021 · 4 comments

@tombisho

Thank you for this excellent package.

I have a dataset which consists of 5 continuous variables and 5 categorical variables. I can generate a correlation matrix for this data set*, along with means/SDs of the continuous and counts for the categorical variables.

At the moment it looks like I can build 2 different simstudy datasets, one using the correlations between the continuous variables, their means and SDs, and another using the same technique for the categorical variables. However, I don't see how I can make use of the correlations between the continuous and categorical variables to generate a complete dataset that

It may be that I am not using simstudy correctly, in which case I would appreciate any advice on how I can do what I have described above.

*forgive my stats naivety if this is not a valid thing to do

@kgoldfeld
Owner

Thanks for your note - it would be helpful if you shared the code that you are currently using.

@tombisho
Author

tombisho commented Aug 2, 2021

Yes you are right, sorry for not following the guidance! Here is something that might help illustrate:

library("simstudy")

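# Keep only the continuous columns of mtcars
cont_data <- mtcars[, !(names(mtcars) %in% c("cyl", "vs", "am", "gear", "carb"))]
cols <- colnames(cont_data)
corrs <- cor(cont_data)          # correlation matrix of the continuous columns
means <- colMeans(cont_data)
sds <- apply(cont_data, 2, sd)

# Simulate 40 rows of correlated normal data with these means, SDs, and correlations
dd <- genCorData(n = 40, mu = means, sigma = sds, corMatrix = corrs, cnames = cols)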

I can use simstudy to build a dataset that captures the properties of, and relationships between, the continuous variables. But now I am stuck as to how I would apply this to the categorical and binary columns. It feels like I need to specify everything in one go to capture the relationships between all the variables, but I don't know how I can do this with mixed distribution types.

Any thoughts would be greatly appreciated.

@kgoldfeld
Owner

simstudy can accommodate generating correlated data from different distributions using the function genCorFlex (see here). However, the distributions are currently limited to "binary", "poisson", "gamma", "normal", and "uniform" distributions. There is currently also functionality to generate correlated ordinal (categorical) data using genOrdCat, but this has not been integrated with other types of distributions.
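
As a rough illustration of the genCorFlex approach described above, here is a minimal sketch that mixes a normal, a binary, and a Poisson variable. The choice of mtcars columns, the distribution assigned to each, and the value rho = 0.4 are assumptions made for this example, not something taken from the thread. The specified correlation may not be reproduced exactly in the generated non-normal variables, so it is worth checking the result.

```r
library(simstudy)

set.seed(103)

# Marginal definitions based on mtcars; which distribution suits which
# column is an assumption made for illustration only.
def <- defData(varname = "mpg", formula = mean(mtcars$mpg),
               variance = var(mtcars$mpg), dist = "normal")
def <- defData(def, varname = "am", formula = mean(mtcars$am), dist = "binary")
def <- defData(def, varname = "carb", formula = mean(mtcars$carb), dist = "poisson")

# rho = 0.4 with a compound-symmetry structure is an arbitrary choice
dd <- genCorFlex(n = 500, defs = def, rho = 0.4, corstr = "cs")

head(dd)

# Inspect the correlations actually observed in the simulated data
cor(as.data.frame(dd)[, c("mpg", "am", "carb")])
```

Comparing the observed correlations with the ones you asked for is a quick way to see how far the mixed distributions pull them from the target.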

@tombisho
Author

tombisho commented Aug 2, 2021

OK that is great, thank you - I missed genCorFlex in the vignettes. Are there plans to add ordinal data to genCorFlex? I guess in the meantime one could convert the ordinal variables to binaries?
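
One way the binary conversion mentioned above might look, sketched in base R under the assumption that cyl is treated as an ordinal variable with levels 4 < 6 < 8: each level above the lowest becomes a cumulative "at least this level" indicator, and the indicator means could then serve as probabilities for "binary" definitions. The variable names are hypothetical. Note that binaries simulated this way are not guaranteed to be mutually consistent (a row could be "at least 8" without being "at least 6"), so this is only a rough workaround.

```r
# Cumulative threshold indicators for the ordinal variable cyl (4 < 6 < 8)
cyl_ge6 <- as.integer(mtcars$cyl >= 6)
cyl_ge8 <- as.integer(mtcars$cyl >= 8)

# Observed proportions, usable as the formula for dist = "binary" definitions
p_ge6 <- mean(cyl_ge6)
p_ge8 <- mean(cyl_ge8)

c(p_ge6 = p_ge6, p_ge8 = p_ge8)
```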

@assignUser assignUser added this to the Simstudy 1.0(?) milestone Oct 31, 2021