
Study/file size #58

Open
iqis opened this issue Jul 18, 2019 · 10 comments
Labels: question (Further information is requested)

Comments

@iqis (Contributor) commented Jul 18, 2019

We want psyphr to work on a normal laptop, which nowadays has somewhere between 4 and 12 GB of usable memory; R should normally not use more than half of that. Currently read_study() reads everything at once, so a really big study could be a problem.

If this turns out to be a problem, there are at least two ways to mitigate it (a sketch of the first follows the list):

  • Construct a promise in lieu of reading the data eagerly, so that the data is read from disk only as needed.
  • Read the study incrementally and cache the resulting R object to disk.
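
A minimal sketch of the promise idea in base R, assuming one file per subject; `lazy_study` and the use of `readxl::read_xlsx` as the reader are placeholders, not psyphr's actual API:

```r
# Each file becomes a promise in an environment; delayedAssign() defers
# the read, so a file is only loaded from disk the first time its entry
# is touched.
lazy_study <- function(files) {
  env <- new.env()
  for (f in files) {
    local({
      path <- f  # capture this iteration's path for the promise
      delayedAssign(basename(path),
                    readxl::read_xlsx(path),  # placeholder reader
                    assign.env = env)
    })
  }
  env
}

# Nothing is read here; each file loads on first access:
# study <- lazy_study(list.files("study_dir", full.names = TRUE))
# head(study[["subject_01.xlsx"]])
```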

What is the likely total size of a study? I'm looking for a figure at about the 80th percentile, and I hope it will be small enough.

wendtke changed the title from "How big can the size of studies go?" to "How big can the size of studies get?" (Jul 18, 2019)
@wendtke (Owner) commented Jul 18, 2019

@MalloryJfeldman would have more insight on this, but I can tell you that the HRV and EDA output data from our pilot study (N = 67 individuals) add up to about 2–3 GB. Child HRV output data (N = 43) add up to less than 1 GB.

For comparison, the last study I managed had 150 families (nested data for parent-child pairs) at two time points for physio. Not all families had two parents involved, but you can see how the data multiply.

And I am only talking about physio output data. Many more GB would be added if a user started bringing in other data types (e.g., surveys, observational codes) after the psyphr_study was aggregated.

@iqis (Contributor, Author) commented Jul 18, 2019

Wow, looks like I need to do some extra thinking.

@iqis (Contributor, Author) commented Jul 18, 2019

Right now, as I try to figure out the best approach, I need to know some common characteristics of downstream analyses; some detailed use cases would help. For example, what are some frequently used statistical models? Is modeling usually done for each individual subject, or across some kind of group-level summary?

@wendtke (Owner) commented Jul 20, 2019

@wendtke to:

  • compare SQLite and filehash (see the sketch after this list)
  • review @iqis's branch for the stash/pointer approach
  • develop questions for team meeting 2019-07-22
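
For reference, a toy side-by-side of the two stores under comparison. The package APIs shown (filehash and DBI/RSQLite) are real, but the key, table name, and data here are made up, and how psyphr would wrap either one is exactly the open question:

```r
# filehash: a key-value store of serialized R objects on disk;
# objects are read back only when fetched.
filehash::dbCreate("study_db")
db <- filehash::dbInit("study_db")
filehash::dbInsert(db, "subject_01", data.frame(hr = c(72, 75, 71)))
filehash::dbFetch(db, "subject_01")

# RSQLite: rectangular data lives in SQLite tables; subsets can be
# pulled with SQL instead of loading whole objects into memory.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "study.sqlite")
dbWriteTable(con, "hrv", data.frame(subject = "s01", mean_hr = 72))
dbGetQuery(con, "SELECT * FROM hrv WHERE subject = 's01'")
dbDisconnect(con)
```

The rough trade-off: filehash keeps whole R objects and is simpler to bolt on, while SQLite allows partial, queryable reads but forces the data into tables.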

wendtke changed the title from "How big can the size of studies get?" to "Study/file size" (Jul 20, 2019)
@wendtke (Owner) commented Jul 20, 2019

> @MalloryJfeldman would have more insight on this, but I can tell you that the HRV and EDA output data from our pilot study (N = 67 individuals) add up to about 2–3 GB. […]

I think I miscalculated; see here for HRV output data for 67 individuals read and wrangled in R.

wendtke added the "bug" and "question" labels (Jul 20, 2019)
@iqis (Contributor, Author) commented Jul 20, 2019 via email

@wendtke (Owner) commented Jul 20, 2019

> Maybe you were referring to raw ECG signals? That could make more sense.

Maybe. I thought I was checking the properties of only the output files. Oh well.

@wendtke (Owner) commented Jul 22, 2019

@geanders Do you have any thoughts on rolling our own solution vs. filehash vs. SQLite (via RSQLite) as the underlying data management system for large studies within psyphr?

@MalloryJfeldman commented
Hey, sorry I'm coming to this late. Our studies can generate close to ~2 GB in output files. Like I said, we never actually ran our experience sampling data through the proprietary software, so I don't have a good sense of what that might look like (I think this study is not very representative, but I suspect that if we did run our experience sampling data through Mindware, we would generate closer to 5–6 GB of output).

In general, it's fairly typical to generate output files across 2–5 channels per person for sessions that last between 1 and 4 hours. So that's 2–5 output files per person, each containing summaries of physio data from 1–4 hours of recording. I'd say a typical sample is between 50 and 150 subjects, although people are pushing for more these days. For within-subject analyses these numbers can be lower.
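
A quick back-of-envelope on the file counts those ranges imply; the numbers are just the figures from the comment above:

```r
# 2-5 output files (one per channel) per person, 50-150 subjects:
subjects <- c(low = 50, high = 150)
channels <- c(low = 2,  high = 5)
subjects * channels  # roughly 100 to 750 output files per study
#>  low high
#>  100  750
```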

iqis mentioned this issue (Jul 24, 2019)
@iqis (Contributor, Author) commented Jul 29, 2019

Looking at "Mallory Pilot 1" here: out of 600+ MB of raw data comes only about 1 MB of .xlsx workbooks.

I know we're only dealing with workbooks at the moment, but it makes me wonder: following the above ratio, would 2 GB of output come from roughly 1.2 TB of input? Wow, that's massive!
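
A quick sanity check of that arithmetic, using the ~600:1 raw-to-output size ratio observed above:

```r
raw_per_output_mb <- 600 / 1  # ~600 MB of raw data per ~1 MB of .xlsx output
output_gb <- 2
output_gb * raw_per_output_mb / 1024  # TB of raw input implied
#> [1] 1.171875                        # i.e. roughly 1.2 TB
```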

iqis removed the "bug" label (Aug 25, 2019)