This module contains a few helpful and flexible data models which can be easily extended for other applications. Here, I focused on making some (very) specific validators focusing on MRI data. These can be used to test DICOM files, SITK-readable images or numpy-loadable arrays.
The project should be readily instalable with poetry
(recommended) by running poetry install
.
Three fundamental abstract classes are used here: Check
, which focuses on checking whether specific conditions are true given an input data, DataValidator
, which focuses on validating a specific type of data, and DataModel
which focuses on validating a sample (composed by multiple types of data).
A Check
is a generic class that can be used as the base for data checks. These data checks, after being defined (check = Check(target="hello")
), can then be called with a given input argument (check("hello)
) and the output (True
) tells us whether the check has passed (True
if the check has passed, False
otherwise).
The check above considers only simple comparison - this is hardly useful when we want to perform more complicated comparisons. For example, we may want to know if our data is within a given range. To do this, we can define novel dataclasses from Test
that perform exactly these comparisons, by redefining the unpack
and compare
methods:
from imdm import Test
@dataclass
class CheckRange(Test):
target: Tuple[Union[int,float],Union[int,float]]
def __post_init__(self):
name = self.get_name()
self._success_msg = f"Target shape {name} contains input values"
self._fail_msg = f"Target shape {name} does not contain input values"
def unpack(self, x: Any) -> Any:
return np.min(x),np.max(x)
def compare(self, unpacked_x: Any) -> bool:
within_range = True
if self.target[0] is not None:
if unpacked_x[0] < self.target[0]:
within_range = False
if self.target[1] is not None:
if unpacked_x[1] > self.target[1]:
within_range = False
return within_range
By redefining our unpack
and compare
methods, we can ensure that the correct checks are performed. These methods are then executed in the __call__
method of the Test
abstract class:
def __call__(self, x: Any) -> bool:
"""Performs the comparison using the input data and defines the
``self.msg``.
Args:
x (Any): input data.
Returns:
bool: whether the comparison was successful or not.
"""
if self.target is None:
return self.target
unpacked_x = self.unpack(x)
result = self.compare(unpacked_x)
if result == True:
self.msg = self._success_msg
if result == False:
self.msg = self._fail_msg
return result
A msg
attribute is defined with either _success_msg
or _fail_msg
, depending on whether the result is True
or False
. This is helpful in case verbosity is important for your applications.
Defining a DataValidator
is relatively easy:
from imdm import DataValidator
data_validator = DataValidator(type=str,length=11,shape=None,range=None)
and running data validations is just as easy:
output = data_validator.validate("test_string")
print(output)
>>> {"type":True,"length":True,"shape":None,"range":None}
The DataValidator
method automatically checks for type
, length
, shape
and range
(if specified). If necessary, users can also add their own methods. For example, if you would to check whether a given path exists:
import os
data_validator.add_check(key="path",check_fn=os.path.exists,data_stage="raw")
output = data_validator.validate("test_string")
print(output)
>>> {"type":True,"length":True,"shape":None,"range":None,"path":False}
Easy! All arguments are relatively clear, but data_stage
is somewhat more ellusive; for this reason I introduce here the concept of three data stages:
raw
- the input exactly as it is. This is useful to check whether a file exists.preprocessed_data
- if apreprocess_fn
is specified in theDataValidator
constructor, checks can be applied to these functions. For instance, thetype
check is automatically ran on thepreprocessed_data
stage.value_data
- some files (SITK-readable files, for instance) require some non-obvious wrangling before one can actually use their values asnumpy
arrays, which is the assumed format for checking therange
. This function (value_fn
) is applied to the output ofpreprocess_fn
.
If no preprocess_fn
or value_fn
are supplied, then preprocessed_data
and value_fn
will be identical to the input data.
A DataModel
is simply a structure of DataValidators
, i.e.
from imdm import DataValidator,DataModel
data_model = DataModel(structures={
"a":DataValidator(type=str,length=11,shape=None,range=None),
"b":DataValidator(type=int,length=None,shape=None,range=[-10,10])
})
This data_model
can then be applied to any given data input that follows a structure similar to data_model.structures
.
An easy-to-use data validators have been implemented specifically for image data (ImageFile
). I work with images, so these were especially useful for me.
Additionally, since I work with a lot of MRI data, specific methods for MRI data were also implemented (DicomFile
and SitkFile
). A more generic method for numpy
files has also been (NumpyFile
).
pprint
is a simple function that allows you to more easily inspect the output of DataValidator
and DataModel
. It comes with colours!
Tests for the data validators and models are available in imdm/data_models/testing
. Test images were collected from: