
Refactoring of statistical inference code #2269

Draft · wants to merge 23 commits into base: dev

Conversation

Lestropie (Member)

Posting draft PR as a reminder to myself to redo the requisite tests and merge at some point. Also provides a nice link for third parties to test the code on their data.

This is the result of hunting for performance issues in fixelcfestats while trying to do actual experiments of my own. I don't think the hunt actually yielded anything, but it's nevertheless a good cleanup of #1543. The main changes are to pre-allocate Eigen matrix variables and re-use them across permutations (and also across elements in the case of the variable-GLM-per-element classes), and to better classify which information is shared across threads and which is thread-local.

Includes fix in #2260.

This work is what spawned #2259; but as can be seen in the src/stats/permtest.cpp changes, those changes aren't actually necessary to pull off the changes here, specifically because the multi-threading back-end copy-constructs the Stats::PermTest::Processor / Preprocessor classes, not Math::Stats::GLMBase.

Classes responsible for statistical inference have undergone some restructuring in order to reduce unnecessary memory reallocation.
- Primary GLM functors are no longer either const or thread-safe. Instead, they pre-allocate space for requisite intermediate data as class members. For multi-threading, instead of using a shared pointer to such classes across threads, duplicate instances of the relevant customised GLM testing class are generated from a pointer to the base class via a virtual function (see the sketch after this list).
- Information relevant to each GLM test class that is common across threads is stored in a dedicated shared class and accessed via a std::shared_ptr.
- TestVariableHeteroscedastic no longer inherits from TestVariableHomoscedastic; instead, both inherit from TestVariableBase, which provides functions and member variables common to both cases.
- Transformation functions to Z-statistics have been altered slightly to better reflect cases where the effective degrees of freedom are continuous or discrete; the lookup function class is not utilised in the former case, since the concept is not applicable.
- In cases where the raw statistics are not of interest and only the Z-statistics are necessary, storage of the intermediate pre-transform statistics is moved from the GLM test classes to the permutation testing classes.
- In cases where the design matrix varies per element tested (whether due to the presence of non-finite values, or the addition of one or more element-wise design matrix columns), rather than resizing the relevant intermediate matrix data, an initial allocation is made at the maximal required size of all such data, and Eigen block capabilities are utilised instead. Hopefully this will reduce persistent memory re-allocation.
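As a minimal sketch of this pattern (assuming Eigen; class and member names here are hypothetical, not the actual MRtrix3 classes): each thread obtains its own functor via a virtual clone function, shared read-only inputs live behind a std::shared_ptr, and intermediate matrices are allocated once and re-used across permutations.

```cpp
// Hypothetical sketch of the restructuring described above; names do not
// correspond to the actual MRtrix3 code.
#include <Eigen/Dense>
#include <memory>

using matrix_type = Eigen::MatrixXd;

// Read-only information common to all threads
struct SharedData {
  matrix_type measurements;   // rows = inputs (subjects), cols = elements
  matrix_type design;         // fixed design matrix
};

class TestBase {
 public:
  TestBase(const std::shared_ptr<const SharedData>& shared) : shared(shared) {}
  virtual ~TestBase() = default;
  // Per-thread duplication via virtual function, replacing a shared const instance
  virtual std::unique_ptr<TestBase> clone() const = 0;
  // Non-const: functors mutate their pre-allocated scratch members
  virtual void operator()(const matrix_type& shuffling_matrix, matrix_type& output) = 0;
 protected:
  std::shared_ptr<const SharedData> shared;
};

class TestFixedHomoscedastic : public TestBase {
 public:
  TestFixedHomoscedastic(const std::shared_ptr<const SharedData>& shared)
    : TestBase(shared),
      // Pre-allocate intermediates once, at the maximal required size
      Sy(shared->measurements.rows(), shared->measurements.cols()),
      betas(shared->design.cols(), shared->measurements.cols()) {}
  std::unique_ptr<TestBase> clone() const override {
    return std::make_unique<TestFixedHomoscedastic>(shared);
  }
  void operator()(const matrix_type& shuffling_matrix, matrix_type& output) override {
    // Re-use pre-allocated members across permutations; a variable-design
    // class would additionally address sub-regions via e.g. Sy.topRows(n)
    // rather than resizing
    Sy.noalias() = shuffling_matrix * shared->measurements;
    // ... solve the model and write test statistics into output ...
    (void)output;
  }
 private:
  matrix_type Sy, betas;   // thread-local scratch, allocated exactly once
};
```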
@Lestropie (Member Author)

Consideration in light of #2294: whether the all_stats() function should operate on all input data, or only on those data included within a user-specified processing mask.

Currently, the following features are calculated for all input data, regardless of the contents of any user-specified mask:

  • Element-wise design matrix condition number (if relevant);
  • Beta coefficients;
  • Absolute effect size and standardized effect size (for t-tests);
  • Standard deviation(s).

Commands for statistical inference now provide a new command-line option -posthoc. This facilitates providing a mask that influences only those elements that will contribute to statistical inference (i.e. that can contribute to the null distribution(s) and for which p-values will be calculated). This post hoc inference mask differs from a standard processing mask in that elements included in the latter but absent from the former will still have test statistics calculated, and will still contribute to statistical enhancement, but the resulting enhanced statistics are discarded (see the sketch below). This facilitates performing post hoc analysis following an F-test, to infer specific effects that may have contributed to the observation of statistical significance within the F-test.
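A sketch of the intended semantics, in hypothetical pseudocode (none of these names correspond to the actual implementation): test statistic calculation and enhancement span the processing mask, while only inference-mask elements contribute to the null distribution(s) and receive p-values.

```cpp
// Hypothetical illustration of processing mask vs. inference (-posthoc) mask
// semantics; names do not correspond to the actual MRtrix3 code.
#include <cstddef>
#include <vector>

double compute_test_statistic(std::size_t) { return 0.0; }               // stand-in
std::vector<double> enhance(const std::vector<double>& s) { return s; }  // stand-in

struct NullDistribution {
  void update(double value) { (void)value; /* e.g. record shuffle-wise maximum */ }
};

void process_shuffle(const std::vector<bool>& processing_mask,
                     const std::vector<bool>& inference_mask,
                     std::vector<double>& stats,
                     NullDistribution& null_dist) {
  // Test statistics are computed for all processing-mask elements
  for (std::size_t i = 0; i != stats.size(); ++i)
    stats[i] = processing_mask[i] ? compute_test_statistic(i) : 0.0;
  // Statistical enhancement sees the whole processing mask...
  const std::vector<double> enhanced = enhance(stats);
  // ...but enhanced statistics outside the inference mask are discarded:
  // only inference-mask elements contribute to the null distribution(s)
  for (std::size_t i = 0; i != enhanced.size(); ++i)
    if (inference_mask[i])
      null_dist.update(enhanced[i]);
}
```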
- Contrast matrix is no longer a compulsory positional argument, but is instead specified via command-line option -ttests.
- Command-line option -ftests is removed and replaced with -ftest, which can be specified multiple times (once for each F-test). Each input provides the contrast matrix for a single F-test, as opposed to the old interface, where each row specified a single F-test and each column specified whether or not a particular row from the t-test contrast matrix was included in that F-test (see the example after this list).
- Option -fonly is no longer required.
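As an illustrative example of the interface change (matrix values are arbitrary, not taken from the PR): suppose the t-test contrast matrix contains two rows, [1 0] and [0 1]. An F-test combining both effects would previously have been encoded as a row of the -ftests matrix flagging those rows for inclusion; under the new interface, the file supplied to -ftest is the contrast matrix of that F-test itself:

```
# Old interface: -ftests matrix; one row per F-test, where each column flags
# whether the corresponding row of the t-test contrast matrix is included
1 1

# New interface: one file per -ftest instance, containing the
# contrast matrix for that F-test directly
1 0
0 1
```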
Includes pulling updated binary test data for vectorstats.
Includes some minor fixes to said interface change.
@Lestropie (Member Author)

Lestropie commented Oct 4, 2021

Follow-up to previous comment: over and above all_stats(), within null distribution generation, test statistics are being computed for all elements, including those that will not contribute to statistical enhancement. This may result in slow execution if only a small subset of elements is included within the mask.

Edit: Constraining test statistic calculation in this fashion would also intrinsically solve #1366.

Do not rely on individual commands masking FWE-corrected p-values after the relevant function; only perform the actual calculations for those elements within the mask.

Also some minor re-arrangement, including moving the FWE function to the PermTest namespace.
Conflicts:
	core/math/stats/glm.cpp
	core/math/stats/glm.h
Add NaN-filling of matrix data expected to be unused when compiled in debugging mode.
Code was erroneously using full matrices rather than blocks relevant to the finite values from that element, which would include leftover values from previously tested elements.
@Lestropie (Member Author)

Lestropie commented Aug 23, 2023

This is next in the list of flushing out stale PRs. I put a bit of effort in here hoping that I might be able to put in a bit of a sprint and complete 3.1.0 prior to the workshop, but that's looking unlikely at this point.

After generating a pretty exhaustive test suite I found one small glitch (b674712), and have now verified that the results generated on this branch are identical to master.

Testing code for comparing PR code to `master`:

```python
#!/usr/bin/python3
#
# Test for equivalent outcomes between old and new GLM implementations
# Generate fixed set of permutations for use
# Generate examples to test the four classes
# Run both master and inference_refactor versions
# TODO Monitor memory usage of both implementations
# Compile both versions with assertions enabled
# Compare outputs of two implementations


import numpy as np
import os
import subprocess
import sys
import time

NUM_INPUTS = 20
NUM_ELEMENTS = 1000
NUM_DUPLICATE_ELEMENTS = 100
NUM_FACTORS = 2
NUM_NANS = 10
NUM_PERMUTATIONS = 1000
MULTITHREAD = False

SOFTWARE_VERSIONS = { 'master': '/home/unimelb.edu.au/robertes/src/master',
                      'inference_refactor': '/home/unimelb.edu.au/robertes/src/mrtrix3' }

# First element: true = variable design matrix; false = fixed design matrix
# Second element: true = heteroscedastic; false = homoscedastic
# Third element: true = nans present in input data; false = clean, no nans present
# Fourth element: true = nans present in extra factors; false = clean, no nans present
GLM_VERSIONS = [ (False, False, False, False),
                 (False, True,  False, False),
                 (True,  False, False, False),
                 (True,  False, False, True),
                 (True,  False, True,  False),
                 (True,  True,  False, False),
                 (True,  True,  False, True),
                 (True,  True,  True,  False) ]

SCRATCH_DIR = 'test_glm_scratch'
SINGLE_PERMUTATION_FILE = 'single_permutation.txt'
PERMUTATIONS_FILE = 'permutations.txt'
INPUTS_CLEAN_FILE = 'inputs_clean.txt'
INPUTS_DIRTY_FILE = 'inputs_dirty.txt'
FACTORS_CLEAN_FILE = 'factors_clean.txt'
FACTORS_DIRTY_FILE = 'factors_dirty.txt'
DESIGN_FILE = 'design.txt'
CONTRAST_FIXED_FILE = 'contrast_fixed.txt'
CONTRAST_VARIABLE_FILE = 'contrast_variable.txt'
VARIANCE_GROUPS_FILE = 'variance_groups.txt'

os.makedirs(SCRATCH_DIR)
inputcleanlist = []
inputdirtylist = []
factorcleanlist = []
factordirtylist = []
sys.stderr.write('Generating synthetic exemplar input data... ')
sys.stderr.flush()
for index in range(0, NUM_INPUTS):
    inputcleanname = 'inputclean%04i.txt' % index
    inputcleanpath = os.path.join(SCRATCH_DIR, inputcleanname)
    inputcleandata = np.random.normal(size=NUM_ELEMENTS)
    inputcleandata = np.concatenate((inputcleandata, inputcleandata[0:NUM_DUPLICATE_ELEMENTS]))
    np.savetxt(inputcleanpath, inputcleandata, fmt='%.15f')
    inputcleanlist.append(inputcleanname)
    inputdirtyname = 'inputdirty%04i.txt' % index
    inputdirtypath = os.path.join(SCRATCH_DIR, inputdirtyname)
    inputdirtydata = inputcleandata[0:NUM_ELEMENTS].copy()  # copy, to avoid mutating the clean data through a view
    inputdirtydata[np.random.randint(0, NUM_ELEMENTS, NUM_NANS)] = float('NaN')
    inputdirtydata = np.concatenate((inputdirtydata, inputdirtydata[0:NUM_DUPLICATE_ELEMENTS]))
    np.savetxt(inputdirtypath, inputdirtydata, fmt='%.15f')
    inputdirtylist.append(inputdirtyname)  # file name only, consistent with the other three lists
    factorcleanname = 'factorclean%04i.txt' % index
    factorcleanpath = os.path.join(SCRATCH_DIR, factorcleanname)
    factorcleandata = np.random.normal(size=NUM_ELEMENTS)
    factorcleandata = np.concatenate((factorcleandata, factorcleandata[0:NUM_DUPLICATE_ELEMENTS]))
    np.savetxt(factorcleanpath, factorcleandata, fmt='%.15f')
    factorcleanlist.append(factorcleanname)
    factordirtyname = 'factordirty%04i.txt' % index
    factordirtypath = os.path.join(SCRATCH_DIR, factordirtyname)
    factordirtydata = factorcleandata[0:NUM_ELEMENTS].copy()  # copy, to avoid mutating the clean data through a view
    factordirtydata[np.random.randint(0, NUM_ELEMENTS, NUM_NANS)] = float('NaN')
    factordirtydata = np.concatenate((factordirtydata, factordirtydata[0:NUM_DUPLICATE_ELEMENTS]))
    np.savetxt(factordirtypath, factordirtydata, fmt='%.15f')
    factordirtylist.append(factordirtyname)
with open(os.path.join(SCRATCH_DIR, INPUTS_CLEAN_FILE), 'w') as f:
    f.write('\n'.join(inputcleanlist) + '\n')
with open(os.path.join(SCRATCH_DIR, INPUTS_DIRTY_FILE), 'w') as f:
    f.write('\n'.join(inputdirtylist) + '\n')
with open(os.path.join(SCRATCH_DIR, FACTORS_CLEAN_FILE), 'w') as f:
    f.write('\n'.join(factorcleanlist) + '\n')
with open(os.path.join(SCRATCH_DIR, FACTORS_DIRTY_FILE), 'w') as f:
    f.write('\n'.join(factordirtylist) + '\n')

single_permutation = np.arange(0, NUM_INPUTS)
permutations = np.transpose(np.tile(single_permutation, (NUM_PERMUTATIONS, 1)))
for col in range(1, NUM_PERMUTATIONS):
    np.random.shuffle(permutations[:,col])
np.savetxt(os.path.join(SCRATCH_DIR, SINGLE_PERMUTATION_FILE), single_permutation, fmt='%i')
np.savetxt(os.path.join(SCRATCH_DIR, PERMUTATIONS_FILE), permutations, fmt='%i')

sys.stderr.write('Done.\n')
sys.stderr.flush()

design = np.random.normal(size=(NUM_INPUTS, NUM_FACTORS))
np.savetxt(os.path.join(SCRATCH_DIR, DESIGN_FILE), design, fmt='%.15f')

contrast_fixed = np.zeros((2, NUM_FACTORS), dtype=int)
contrast_fixed[0, 0] = 1
contrast_fixed[1, NUM_FACTORS-1] = 1
np.savetxt(os.path.join(SCRATCH_DIR, CONTRAST_FIXED_FILE), contrast_fixed, fmt='%i')

contrast_variable = np.zeros((2, NUM_FACTORS+1), dtype=int)
contrast_variable[0, 0] = 1
contrast_variable[1, NUM_FACTORS] = 1
np.savetxt(os.path.join(SCRATCH_DIR, CONTRAST_VARIABLE_FILE), contrast_variable, fmt='%i')

variance_groups = [1] * int(NUM_INPUTS/2) + [2] * int(NUM_INPUTS/2)
np.savetxt(os.path.join(SCRATCH_DIR, VARIANCE_GROUPS_FILE), variance_groups, fmt='%i')



for glm_version in GLM_VERSIONS:
    is_variable = glm_version[0]
    is_heteroscedastic = glm_version[1]
    data_are_dirty = glm_version[2]
    factors_are_dirty = glm_version[3]
    output_dirs = []
    for software_key, software_root in SOFTWARE_VERSIONS.items():
        output_dir = os.path.join(SCRATCH_DIR,
                                  ('variable' if is_variable else 'fixed')
                                  + '_'
                                  + ('heteroscedastic' if is_heteroscedastic else 'homoscedastic')
                                  + '_'
                                  + ('dirtydata' if data_are_dirty else 'cleandata')
                                  + '_'
                                  + (('dirtyfactors_' if factors_are_dirty else 'cleanfactors_') if is_variable else '')
                                  + software_key)
        os.makedirs(output_dir)
        sys.stderr.write('Running ' + output_dir + '\n')
        cmd = [os.path.join(software_root, 'bin', 'vectorstats'),
               os.path.join(SCRATCH_DIR, (INPUTS_DIRTY_FILE if data_are_dirty else INPUTS_CLEAN_FILE)),
               os.path.join(SCRATCH_DIR, DESIGN_FILE),
               os.path.join(SCRATCH_DIR, (CONTRAST_VARIABLE_FILE if is_variable else CONTRAST_FIXED_FILE)),
               output_dir + '/',
               '-permutations',
               os.path.join(SCRATCH_DIR, PERMUTATIONS_FILE)]
        if is_variable:
            cmd.extend(['-column', os.path.join(SCRATCH_DIR, (FACTORS_DIRTY_FILE if factors_are_dirty else FACTORS_CLEAN_FILE))])
        if is_heteroscedastic:
            cmd.extend(['-variance', os.path.join(SCRATCH_DIR, VARIANCE_GROUPS_FILE)])
        if not MULTITHREAD:
            cmd.extend(['-nthreads', '0'])
        tic = time.perf_counter()
        subprocess.run(cmd)
        sys.stderr.write('Completed ' + output_dir + ' in ' + str(time.perf_counter()-tic) + '\n')
        output_dirs.append(output_dir)
    # Compare outputs of two versions
    # Two software versions have different names for uncorrected p-value files;
    #   requires a local modification of master to address
    filelist = sorted(os.listdir(output_dirs[0]))
    if sorted(os.listdir(output_dirs[1])) != filelist:
        sys.stderr.write('Inconsistent output directory contents: ' + str(output_dirs) + '\n')
    else:
        for filename in filelist:
            data_one = np.loadtxt(os.path.join(output_dirs[0], filename), delimiter=',')
            data_two = np.loadtxt(os.path.join(output_dirs[1], filename), delimiter=',')
            if not np.array_equiv(data_one, data_two):
                sys.stderr.write('Inconsistent output contents for ' + filename + ' in ' + str(output_dirs) + '\n')
```

I still want to have a bit more of a look at memory usage before merging. What triggered me to have a go at this initially is that, just running top, I could see the RAM usage jumping up and down, which I presumed was the memory being returned and then re-allocated between permutations. However, elsewhere it was reported that memory usage was increasing linearly over time. I'll try using the tools they used there and see if I can reproduce.

Also, I think I can revert #2259. Rather than having multiple GLM class instances being spawned from an instance of the base class, I should just do something more like what other commands already do, which is to have separate run_queue() calls depending on the derived class type (see the sketch after the task list below).


  • Modify code in all_stats() function to use Eigen block functionality in the same way as it is used in the GLM classes
  • Objectively test memory consumption of this implementation against master and compare to external reporting
  • Remove derived class cloning from base class
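A hypothetical sketch of that dispatch-at-the-call-site alternative (the class names beyond those mentioned above, and the stand-in for the threading back-end, are illustrative only, not the MRtrix3 API):

```cpp
// Hypothetical sketch of dispatching on the concrete GLM test type at the
// call site instead of cloning derived instances via the base class.
#include <iostream>

struct TestFixedHomoscedastic      { void operator()() const { std::cout << "fixed/homo\n"; } };
struct TestFixedHeteroscedastic    { void operator()() const { std::cout << "fixed/hetero\n"; } };
struct TestVariableHomoscedastic   { void operator()() const { std::cout << "variable/homo\n"; } };
struct TestVariableHeteroscedastic { void operator()() const { std::cout << "variable/hetero\n"; } };

// Stand-in for the threading back-end, which copy-constructs the functor
// per thread; no virtual clone() is needed once the concrete type is known
template <class TestType>
void run_queue_with(const TestType& prototype) {
  TestType thread_local_copy(prototype);
  thread_local_copy();
}

void dispatch(bool variable_design, bool heteroscedastic) {
  if (variable_design) {
    if (heteroscedastic) run_queue_with(TestVariableHeteroscedastic());
    else                 run_queue_with(TestVariableHomoscedastic());
  } else {
    if (heteroscedastic) run_queue_with(TestFixedHeteroscedastic());
    else                 run_queue_with(TestFixedHomoscedastic());
  }
}

int main() { dispatch(true, false); }
```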

@Lestropie (Member Author)

Lestropie commented Aug 30, 2023

  • Reduce size of multi-threading queue used when running permutations
    (while the Shuffler class is clever and stores permutations & sign-flips in a native compressed form, it writes the full shuffling matrix to the queue; if the number of inputs is large then each of those matrices is large, and if the multi-threading queue buffers many of them then that will chew up a lot of memory unnecessarily)

  • Change shuffling matrix to be natively of type int8_t, and cast to default_type only when required

- Change native storage type of shuffling matrices from default_type to int8_t, casting to default_type only on usage (see the sketch below).
- Reduce size of multi-threading queue buffer where shuffling matrices are stored.
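A minimal sketch of the storage change (assuming Eigen; the type aliases and function name are illustrative, not the actual MRtrix3 code): shuffling matrices hold only -1, 0 or +1, so int8_t storage is lossless, and promotion to floating point happens only at the point of use.

```cpp
// Illustrative sketch of int8_t shuffling-matrix storage with on-demand
// promotion to floating point; names do not match the actual MRtrix3 code.
#include <Eigen/Dense>
#include <cstdint>

using default_type = double;
using matrix_type = Eigen::Matrix<default_type, Eigen::Dynamic, Eigen::Dynamic>;
using shuffle_matrix_type = Eigen::Matrix<int8_t, Eigen::Dynamic, Eigen::Dynamic>;

matrix_type apply_shuffle(const shuffle_matrix_type& shuffle, const matrix_type& data) {
  // Cast to default_type only when the matrix product is evaluated,
  // rather than storing the (potentially large) matrix in double precision
  return shuffle.cast<default_type>() * data;
}
```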
@Lestropie (Member Author)

Lestropie commented Sep 1, 2023

  • Use blocking for solving fixed design matrix case

  • Store input data as single-precision; cast to double-precision only when utilised


The penny just dropped as to why it was reported that fixelcfestats consumes so much memory when the number of inputs is large. And now I'm having another feeling sick moment like I did with this work.

Prior to the complete re-write that was #1543, there were a number of things that were different about the GLM test. Ignoring all of the new features, the key differences are, I think:

  1. Use of Freedman-Lane rather than a basic data shuffle.
    This may require a bunch more memory to perform the requisite operations. For the fixed design matrix case, model partitioning should only have to occur once, but there are additional operations relating to the way that model residuals are shuffled that may utilise additional memory.
  2. Matrix block operations.
    As shown in 3.0_RC3 code here, the input data were broken into blocks of a fixed size, and the model inversion was performed on one block at a time. However, when I eventually completed the mammoth work that was GLM enhancements & surrounding code #1543, I at some point concluded that it wasn't worth the effort, since things were working as they were. Therefore, what's happening currently (3.0.4 code here) is that an entire duplicate of the input data, with shuffling applied, is being generated. This might be tolerable for a typically-sized dataset, but for ~1000 inputs this is too much to be allocating per thread.
    Notably, in situations where the design matrix is variable per element, whether due to the specification of element-wise design matrix factors or due to the presence of non-finite values in the data, this doesn't apply: the model is by necessity inverted on an element-by-element basis.

What this means is that it's likely that with a reasonably small amount of code modification I can resolve the excessive memory utilisation that was used as a primary justification for development and publication of an alternative statistical software package.

Data can be promoted to double precision where required for calculations. In most instances (mrclusterstats, fixelcfestats), data were already being imported from disk using single precision, but then stored in a double-precision matrix for compatibility with subsequent linear algebra (see the sketch below).
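A minimal sketch combining the two ideas above (assuming Eigen; the block size and all names are illustrative, not the MRtrix3 implementation): raw data are stored in single precision, and the fixed-design model is solved one block of elements at a time, promoting each block to double precision only for the linear algebra.

```cpp
// Illustrative sketch of blocked solving with single-precision storage;
// the block size and names are arbitrary, not the MRtrix3 implementation.
#include <Eigen/Dense>
#include <algorithm>

using matrix_type = Eigen::MatrixXd;
using storage_matrix_type = Eigen::MatrixXf;   // raw data kept in single precision

matrix_type solve_blocked(const matrix_type& pinvM,            // pseudoinverse of design matrix
                          const storage_matrix_type& data) {   // rows = subjects, cols = elements
  const Eigen::Index block_size = 1024;                        // arbitrary choice
  matrix_type betas(pinvM.rows(), data.cols());
  for (Eigen::Index start = 0; start < data.cols(); start += block_size) {
    const Eigen::Index n = std::min(block_size, data.cols() - start);
    // Promote only this block to double precision for the linear algebra,
    // rather than duplicating the whole (shuffled) data matrix per thread
    betas.middleCols(start, n).noalias() =
        pinvM * data.middleCols(start, n).cast<double>();
  }
  return betas;
}
```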
…ce_new_features

Conflicts:
	cmd/connectomestats.cpp
	cmd/fixelcfestats.cpp
	cmd/mrclusterstats.cpp
	cmd/vectorstats.cpp
	core/math/stats/fwe.cpp
	core/math/stats/glm.cpp
	core/math/stats/typedefs.h
	testing/binaries/data
	testing/binaries/tests/fixelcfestats
- Create new group "statistical inference" in documentation. Pages on mitigating brain cropping in DWI, and computing percentage effect relative to controls, have been moved into this group, since they are equally applicable to voxel-based analysis. This new group may also be the landing destination for future new pages on, e.g., explaining the content of the files produced by statistical inference commands, or explaining the operation of the GLM.
- Create new documentation page in this statistical inference group, explaining the operation of the -posthoc option, both pragmatically (i.e. how to do a robust post hoc analysis) and conceptually (i.e. how it differs from the -mask option). This supersedes a Description paragraph that had been added to some command help pages, with that Description now providing a link to the online documentation.
- Regenerate command documentation to reflect this change, as well as prior alterations to the interface for specifying t-tests and F-tests.
While -mask and -posthoc remain as the user command-line interface names, within the code internally these two similar-but-different entities are now referred to as the "processing mask" and the "inference mask".
@Lestropie (Member Author)

I would like to get this into 3.1.0, given that there have now been multiple complaints of fixelcfestats chewing up more memory than it needs to.

  • Keep 97c1baa, which changed storage of the input data to single precision, but make the default type double precision. While this doubles storage space, it's only one instance of the raw data used across all threads, and changing to single precision comes with (from memory) a 10-15% performance penalty due to having to cast to double every time the data are used. By leaving the code there, adventurous individuals can set it to single precision and re-compile if absolutely necessary (see the sketch below).
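A sketch of how such a compile-time switch might look (hypothetical typedef names, not the actual MRtrix3 code):

```cpp
#include <Eigen/Dense>

// Hypothetical compile-time switch for raw input data storage precision.
// double is the default; changing this to float and re-compiling would halve
// raw-data memory at (from memory) a ~10-15% runtime cost from repeated
// promotion to double precision
using input_value_type = double;   // change to float to halve raw-data memory
using input_matrix_type = Eigen::Matrix<input_value_type, Eigen::Dynamic, Eigen::Dynamic>;
```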

@Lestropie (Member Author)

  • There's a large spike in memory usage during the "Calculating basic properties of default permutation" phase. Would be good to block that up to avoid large allocations.

In d0b3afa, for statistical inference commands the size of the shuffling matrix queue is set based on the number of processing threads, in order to reduce memory usage. This however fails to execute in the case of -nthreads 1. Rather than identifying the blocking condition within the Queue class, this change simply ensures that the size of the queue is always at least 2 items (see the sketch below).
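A one-line illustration of the clamp described (hypothetical function and variable names):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical: the queue must buffer at least 2 items, otherwise a run
// with -nthreads 1 could stall on an undersized queue
std::size_t shuffle_queue_size(std::size_t num_threads) {
  return std::max<std::size_t>(2, num_threads);
}
```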