
Study/file size #58

Open
iqis opened this issue Jul 18, 2019 · 10 comments
Labels: question (Further information is requested)

Comments

@iqis (Contributor) commented Jul 18, 2019

We want psyphr to work on a normal laptop, which nowadays has somewhere between 4 and 12 GB of usable memory; R should normally not use more than half of that. Currently read_study() reads everything at once, so a really big study could be a problem.

If this turns out to be a problem, there are at least two ways to mitigate it (a sketch of the first follows the list):

  • Construct a promise in lieu of reading the data eagerly, so that the data is read from disk only as needed.
  • Read the study incrementally and cache the resulting R object to disk.
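
A minimal sketch of the promise idea in base R, assuming one file per subject; `lazy_study` and the use of `readxl::read_xlsx` as the reader are placeholders, not psyphr's actual API:

```r
# Each file becomes a promise in an environment; delayedAssign() defers
# the read, so a file is only loaded from disk the first time its entry
# is touched.
lazy_study <- function(files) {
  env <- new.env()
  for (f in files) {
    local({
      path <- f  # capture this iteration's path for the promise
      delayedAssign(basename(path),
                    readxl::read_xlsx(path),  # placeholder reader
                    assign.env = env)
    })
  }
  env
}

# Nothing is read here; each file loads on first access:
# study <- lazy_study(list.files("study_dir", full.names = TRUE))
# head(study[["subject_01.xlsx"]])
```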

What is the likely total size of a study? I'm looking for a figure at about the 80th percentile, and I hope it will be small enough.

wendtke changed the title from "How big can the size of studies go?" to "How big can the size of studies get?" (Jul 18, 2019)
@wendtke (Owner) commented Jul 18, 2019

@MalloryJfeldman would have more insight on this, but I can tell you that the HRV and EDA output data from our pilot study (N = 67 individuals) add up to about 2–3 GB. Child HRV output data (N = 43) add up to less than 1 GB.

For comparison, the last study I managed had 150 families (nested data for parent-child pairs) at two time points for physio. Not all families had two parents involved, but you can see how the data multiply.

And I am only talking about physio output data. Many more GB would be added if a user started bringing in other data types (e.g., surveys, observational codes) after the psyphr_study was aggregated.

@iqis (Contributor, Author) commented Jul 18, 2019

Wow, looks like I need to do some extra thinking.

@iqis (Contributor, Author) commented Jul 18, 2019

Right now, as I try to figure out the best approach, I need to know some common characteristics of downstream analyses; some detailed use cases would help. For example, what are some frequently used statistical models? Is modeling usually done for each individual subject, or across some kind of group-level summary?

@wendtke (Owner) commented Jul 20, 2019

@wendtke to:

  • compare SQLite and filehash (see the sketch after this list)
  • review @iqis's branch for the stash/pointer approach
  • develop questions for team meeting 2019-07-22
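
For reference, a toy side-by-side of the two stores under comparison. The package APIs shown (filehash and DBI/RSQLite) are real, but the key, table name, and data here are made up, and how psyphr would wrap either one is exactly the open question:

```r
# filehash: a key-value store of serialized R objects on disk;
# objects are read back only when fetched.
filehash::dbCreate("study_db")
db <- filehash::dbInit("study_db")
filehash::dbInsert(db, "subject_01", data.frame(hr = c(72, 75, 71)))
filehash::dbFetch(db, "subject_01")

# RSQLite: rectangular data lives in SQLite tables; subsets can be
# pulled with SQL instead of loading whole objects into memory.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "study.sqlite")
dbWriteTable(con, "hrv", data.frame(subject = "s01", mean_hr = 72))
dbGetQuery(con, "SELECT * FROM hrv WHERE subject = 's01'")
dbDisconnect(con)
```

The rough trade-off: filehash keeps whole R objects and is simpler to bolt on, while SQLite allows partial, queryable reads but forces the data into tables.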

wendtke changed the title from "How big can the size of studies get?" to "Study/file size" (Jul 20, 2019)
@wendtke (Owner) commented Jul 20, 2019

> @MalloryJfeldman would have more insight on this, but I can tell you that the HRV and EDA output data from our pilot study (N = 67 individuals) add up to about 2–3 GB. […]

I think I miscalculated; see here for HRV output data for 67 individuals read and wrangled in R.

wendtke added the "bug" and "question" labels (Jul 20, 2019)
@iqis (Contributor, Author) commented Jul 20, 2019 via email

@wendtke (Owner) commented Jul 20, 2019

> Maybe you were referring to raw ECG signals? That could make more sense.

Maybe. I thought I was checking the properties of only the output files. Oh well.

@wendtke (Owner) commented Jul 22, 2019

@geanders Do you have any thoughts on rolling our own solution vs. filehash vs. SQLite (via RSQLite) as the underlying data management system for large studies within psyphr?

@MalloryJfeldman commented
Hey, sorry I'm coming to this late. Our studies can generate close to ~2 GB in output files. Like I said, we never actually ran our experience sampling data through the proprietary software, so I don't have a good sense of what that might look like (I think this study is not very representative, but I suspect that if we did run our experience sampling data through Mindware, we would generate closer to 5–6 GB of output).

In general, it's fairly typical to generate output files across 2–5 channels per person for sessions that last between 1 and 4 hours. So that's 2–5 output files per person, each containing summaries of physio data from 1–4 hours of recording. I'd say a typical sample is between 50 and 150 subjects, although people are pushing for more these days. For within-subject analyses these numbers can be lower.
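
A quick back-of-envelope on the file counts those ranges imply; the numbers are just the figures from the comment above:

```r
# 2-5 output files (one per channel) per person, 50-150 subjects:
subjects <- c(low = 50, high = 150)
channels <- c(low = 2,  high = 5)
subjects * channels  # roughly 100 to 750 output files per study
#>  low high
#>  100  750
```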

iqis mentioned this issue (Jul 24, 2019)
@iqis (Contributor, Author) commented Jul 29, 2019

Looking at "Mallory Pilot 1" here: out of 600+ MB of raw data comes only about 1 MB of .xlsx workbooks.

I know we're only dealing with workbooks at the moment, but it makes me wonder: following the above ratio, would 2 GB of output come from roughly 1.2 TB of input? Wow, that's massive!
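
A quick sanity check of that arithmetic, using the ~600:1 raw-to-output size ratio observed above:

```r
raw_per_output_mb <- 600 / 1  # ~600 MB of raw data per ~1 MB of .xlsx output
output_gb <- 2
output_gb * raw_per_output_mb / 1024  # TB of raw input implied
#> [1] 1.171875                        # i.e. roughly 1.2 TB
```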

iqis removed the "bug" label (Aug 25, 2019)