Refactoring of statistical inference code #2269
base: dev
Classes responsible for statistical inference have undergone some restructuring in order to reduce unnecessary memory reallocation.

- Primary GLM functors are no longer either const or thread-safe. Instead, they pre-allocate space for requisite intermediate data as class members. For multi-threading, instead of using a shared pointer to such classes across threads, duplicate instances of the relevant customised GLM testing class are generated from a pointer to the base class via a virtual function.
- Information relevant for each GLM test class that is common across threads is stored in a dedicated shared class and accessed via a std::shared_ptr.
- TestVariableHeteroscedastic no longer inherits from TestVariableHomoscedastic; instead, both inherit from TestVariableBase, which provides functions and member variables common to both cases.
- Transformation functions to Z-statistics have been altered slightly to better reflect cases where the effective degrees of freedom are continuous or discrete; the lookup function class is not utilised in the former, since the concept is not applicable.
- In cases where the raw statistics are not of interest and only the Z-statistics are necessary, storage of the intermediate pre-transform statistics is moved from the GLM test classes to the permutation testing classes.
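A minimal sketch of the duplication scheme described above, assuming a virtual `clone()` function and a plain struct for the shared data. All names here (`SharedData`, `TestBase`, `TestFixed`, `clone`, `compute`) are hypothetical and are not the actual class or function names in the code base:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical: read-only information common to all threads.
struct SharedData {
  std::vector<double> design;  // illustrative stand-in for design matrix data
};

class TestBase {
 public:
  explicit TestBase(std::shared_ptr<const SharedData> shared)
      : shared_(std::move(shared)) {}
  virtual ~TestBase() = default;
  // Non-const: implementations write into pre-allocated scratch members.
  virtual double compute(double input) = 0;
  // Each thread obtains its own instance from a pointer to the base class.
  virtual std::unique_ptr<TestBase> clone() const = 0;

 protected:
  std::shared_ptr<const SharedData> shared_;
};

class TestFixed : public TestBase {
 public:
  explicit TestFixed(std::shared_ptr<const SharedData> shared)
      : TestBase(std::move(shared)), scratch_(shared_->design.size()) {}

  double compute(double input) override {
    double sum = 0.0;
    for (std::size_t i = 0; i != scratch_.size(); ++i) {
      scratch_[i] = shared_->design[i] * input;  // scratch reused, never resized
      sum += scratch_[i];
    }
    return sum;
  }

  std::unique_ptr<TestBase> clone() const override {
    // The copy owns its own scratch_ but points at the same SharedData.
    return std::make_unique<TestFixed>(*this);
  }

 private:
  std::vector<double> scratch_;  // per-instance, hence per-thread
};
```

Because each clone owns its scratch storage, `compute()` can be non-const without any locking, while the shared data are never copied.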
In cases where the design matrix varies per element tested (whether due to the presence of non-finite values, or the addition of one or more element-wise design matrix columns), rather than resizing the relevant intermediate matrix data, do an initial allocation of the maximal required size of all such data, and instead utilise Eigen block capabilities. Hopefully this will reduce persistent memory re-allocation.
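The idea can be sketched without Eigen as follows. The `Workspace` class and its member names are hypothetical; in Eigen the equivalent of "the leading `n` entries" would be a block expression such as `head(n)` or `topRows(n)` on a matrix allocated once at full size:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical workspace: allocated once at the maximal size, then reused for
// every element; only the leading `n` entries are valid for a given element.
class Workspace {
 public:
  explicit Workspace(std::size_t max_rows) : buffer_(max_rows, 0.0) {}

  // Copy the finite values of `element` into the head of the buffer and return
  // how many were kept. Requires element.size() <= the allocated size.
  std::size_t load_finite(const std::vector<double>& element) {
    std::size_t n = 0;
    for (double value : element)
      if (std::isfinite(value)) buffer_[n++] = value;
    return n;
  }

  // Operate on the leading block only; entries past `n` are leftovers from
  // previously tested elements and must not be read.
  double sum_of_head(std::size_t n) const {
    double sum = 0.0;
    for (std::size_t i = 0; i != n; ++i) sum += buffer_[i];
    return sum;
  }

  const double* data() const { return buffer_.data(); }

 private:
  std::vector<double> buffer_;
};
```

The storage address never changes between elements, which is the property the pre-allocation is intended to provide.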
Consideration in light of #2294: Whether or not the Currently, the current features are calculated for all input data, regardless of the contents of any user-specified mask:
Commands for statistical inference now provide a new command-line option -posthoc. This facilitates providing a mask that influences only those elements that will contribute to statistical inference (i.e. can contribute to the null distribution(s), and p-values will be calculated). This post-hoc inference mask differs from a standard processing mask in that elements included in the latter but absent from the former will still have test statistics calculated, and will still contribute to statistical enhancement, but the resulting enhanced statistics are discarded. This facilitates performing post hoc analysis following an F-test to infer specific effects that may have contributed to the observation of statistical significance within the F-test.
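A simplified sketch of how such an inference mask could act, assuming the enhanced statistics have already been computed for every element in the processing mask. The function names and the p-value formula (fraction of null maxima at or above the observed value) are illustrative assumptions, not the code in this PR:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Contribution of one shuffle to the null distribution: the largest enhanced
// statistic among elements inside the inference mask. Elements outside the
// mask were still computed and enhanced, but are discarded here.
double masked_max(const std::vector<double>& enhanced,
                  const std::vector<bool>& inference_mask) {
  double result = -std::numeric_limits<double>::infinity();
  for (std::size_t i = 0; i != enhanced.size(); ++i)
    if (inference_mask[i]) result = std::max(result, enhanced[i]);
  return result;
}

// p-values are calculated only inside the inference mask; all other elements
// are left as NaN rather than being computed and masked afterwards.
std::vector<double> fwe_pvalues(const std::vector<double>& observed,
                                const std::vector<double>& null_maxima,
                                const std::vector<bool>& inference_mask) {
  std::vector<double> p(observed.size(),
                        std::numeric_limits<double>::quiet_NaN());
  for (std::size_t i = 0; i != observed.size(); ++i) {
    if (!inference_mask[i]) continue;
    std::size_t count = 0;
    for (double m : null_maxima)
      if (m >= observed[i]) ++count;
    p[i] = static_cast<double>(count) / static_cast<double>(null_maxima.size());
  }
  return p;
}
```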
- Contrast matrix is no longer a compulsory positional argument, but is instead specified via command-line option -ttests.
- Command-line option -ftests is removed, and replaced with -ftest, which can be specified multiple times (once for each F-test). Each input provides the contrast matrix for a single F-test, as opposed to the old interface where each row specified a single F-test and each column specified whether or not a particular row from the t-test matrix is included in that F-test.
- Option -fonly is no longer required.
Includes pulling updated binary test data for vectorstats. Includes some minor fixes to said interface change.
Error introduced in 6ba0133.
Follow-up to previous comment: Over and above Edit: Constraining test statistic calculation in this fashion would also intrinsically solve #1366.
Do not rely on individual commands masking FWE-corrected p-values after the relevant function; only perform the actual calculations for those elements within the mask. Also some minor re-arrangement, including moving the FWE function to the PermTest namespace.
Conflicts: core/math/stats/glm.cpp core/math/stats/glm.h
Add NaN-filling of matrix data expected to be unused when compiled in debugging mode.
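As a rough illustration, the unused tail of a pre-allocated buffer can be poisoned in debug builds so that any accidental read propagates NaN into the results. The helper name `poison_unused` is hypothetical, and keying the behaviour on the standard `NDEBUG` macro is an assumption about how a debug build is detected:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: in debug builds, overwrite everything past the first
// `used` entries with NaN; in release builds, do nothing.
inline void poison_unused(std::vector<double>& buffer, std::size_t used) {
#ifndef NDEBUG
  for (std::size_t i = used; i < buffer.size(); ++i)
    buffer[i] = std::numeric_limits<double>::quiet_NaN();
#else
  (void)buffer;
  (void)used;
#endif
}
```

The entries within the used range are left untouched, so correct code produces identical results in both build modes.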
Code was erroneously using full matrices rather than blocks relevant to the finite values from that element, which would include leftover values from previously tested elements.
This is next in the list of flushing out stale PRs. I put a bit of effort in here hoping that I might be able to put in a bit of a sprint and complete After generating a pretty exhaustive test suite I found one small glitch (b674712), and have now verified that the results generated on this branch are identical to Testing code for comparing PR code to `master`
I still want to have a bit more of a look at memory usage before merging. What triggered me to have a go at this initially is that just running Also I think I can revert #2259. Rather than having multiple GLM class instances being spawned from an instance of the base class, I should just do something more like what other commands already do, which is to have separate
- Change native storage type of shuffling matrices from default_type to int8_t, only casting to default_type on usage.
- Reduce size of multi-threading queue buffer where shuffling matrices are stored.
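A minimal sketch of the compact storage, assuming that `default_type` is `double` and that shuffling matrix entries are only ever -1, 0 or +1 (permutations and sign-flips). The `ShuffleMatrix` struct and `apply_shuffle` function are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using default_type = double;  // assumption: floating-point type used for calculations

// Hypothetical compact shuffling matrix, stored row-major as size x size
// entries of one byte each.
struct ShuffleMatrix {
  std::size_t size;
  std::vector<std::int8_t> entries;
};

// Apply the shuffle to a data vector, casting each entry to default_type only
// at the point of use.
std::vector<default_type> apply_shuffle(const ShuffleMatrix& shuffle,
                                        const std::vector<default_type>& data) {
  std::vector<default_type> result(shuffle.size, 0.0);
  for (std::size_t row = 0; row != shuffle.size; ++row)
    for (std::size_t col = 0; col != shuffle.size; ++col)
      result[row] +=
          static_cast<default_type>(shuffle.entries[row * shuffle.size + col]) *
          data[col];
  return result;
}
```

With these types each queued matrix occupies one eighth of the memory it would as `double`.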
The penny just dropped as to why it was reported that Prior to the complete re-write that was #1543, there were a number of things that were different about the GLM test. Ignoring all of the new features, the key differences are I think:
What this means is that it's likely that with a reasonably small amount of code modification I can resolve the excessive memory utilisation that was used as a primary justification for development and publication of an alternative statistical software package.
Data can be promoted to double precision where required for calculations. In most instances (mrclusterstats, fixelcfestats), data were already being imported from disk using single-precision, but then stored in a double-precision matrix for compatibility with subsequent linear algebra.
…ce_new_features Conflicts: cmd/connectomestats.cpp cmd/fixelcfestats.cpp cmd/mrclusterstats.cpp cmd/vectorstats.cpp core/math/stats/fwe.cpp core/math/stats/glm.cpp core/math/stats/typedefs.h testing/binaries/data testing/binaries/tests/fixelcfestats
- Create new group "statistical inference" in documentation. Pages on mitigating brain cropping in DWI, and computing percentage effect relative to controls, have been moved into this group, since they are equally applicable to voxel-based analysis. This new group may also be the landing destination for future new pages on e.g. explaining the content of the files produced by statistical inference commands, or explaining the operation of the GLM.
- Create new documentation page in this statistical inference group, explaining the operation of the -posthoc option, both pragmatically (i.e. how to do a robust post hoc analysis) and conceptually (i.e. how it differs from the -mask option). This supersedes a Description paragraph that had been added to some command help pages, with that Description now providing a link to the online documentation.
- Regenerate command documentation to reflect this change, as well as prior alterations to the interface for specifying t-tests and F-tests.
While -mask and -posthoc remain as the user command-line interface names, within the code internally these two similar-but-different entities are now referred to as the "processing mask" and the "inference mask".
New statistical inference features
I would like to get this into
In d0b3afa, for statistical inference commands the size of the shuffling matrix queue is set based on the number of processing threads, to reduce memory usage. This however fails to execute in the case -nthreads 1. Rather than identifying the blocking condition with the Queue class, this change simply ensures that the size of the queue is always at least 2 items.
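The workaround amounts to clamping the queue size from below. In this sketch the helper name is hypothetical, and taking the thread count itself as the base size is an assumption; the commit only states that the size is based on the number of threads:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: queue capacity follows the thread count (assumed 1:1
// here) but is never allowed to drop below 2 items.
inline std::size_t shuffle_queue_capacity(std::size_t num_threads) {
  return std::max<std::size_t>(2, num_threads);
}
```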
Posting draft PR as a reminder to myself to redo the requisite tests and merge at some point. Also provides a nice link to have third parties test the code on their data.
Result of hunting for performance issues in `fixelcfestats` trying to do actual experiments of my own. Don't think it's actually yielded anything, but it's nevertheless a good cleanup of #1543. Main changes are to pre-allocate Eigen matrix variables and re-use them across permutations (and also across elements in the case of the variable-GLM-per-element classes), and to have a better classification of what information is shared across threads and what is local. Includes fix in #2260.
Was what spawned #2259, but as can be seen in the `src/stats/permtest.cpp` changes, those changes aren't actually necessary to pull off the changes here, specifically because the multi-threading back-end copy-constructs the `Stats::PermTest::Processor`/`Preprocessor` classes, not `Math::Stats::GLMBase`.