-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Generic design support using formulaic
#328
Conversation
…d compatibiliy, but throw deprecation warning
…ey are no longer used
…ds without its unpicklable attributes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really prefer this over the old appraoch, many thanks for moving this forward @BorisMuzellec!
|
||
|
||
@dataclass | ||
class FactorMetadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also quite a bunch of test cases for the _formulaic.py
file and the LinearModelBase
in pertpy. Would be great if you could also port them here!
@property | ||
def variables(self): | ||
"""Get the names of the variables used in the model definition.""" | ||
try: | ||
return self.obsm["design_matrix"].model_spec.variables_by_source["data"] | ||
except AttributeError: | ||
raise ValueError( | ||
"""Retrieving variables is only possible if the model was initialized | ||
using a formula.""" | ||
) from None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this stuff could really become a Mixin as you suggested, then we can more easily reuse it across pyDESeq2 and pertpy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's simpler for perpty
than yes I can put this in a Mixin.
There's a difference though, because here the design is stored in .obsm["design_matrix"]
as opposed to .design
in pertpy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure yet what's the best solution. Maybe we just leave it as is now, and when I'll look into refactoring pertpy I can propose a PR with changes to pyDESeq2 if required.
pydeseq2/ds.py
Outdated
def cond(self, **kwargs): | ||
""" | ||
Get a contrast vector representing a specific condition. | ||
|
||
Parameters | ||
---------- | ||
**kwargs | ||
Column/value pairs. | ||
|
||
Returns | ||
------- | ||
ndarray | ||
A contrast vector that aligns to the columns of the design matrix. | ||
""" | ||
cond_dict = kwargs | ||
if not set(cond_dict.keys()).issubset(self.dds.variables): | ||
raise ValueError( | ||
"""You specified a variable that is not part of the model. Available | ||
variables: """ | ||
+ ",".join(self.dds.variables) | ||
) | ||
new_ref_idx = self.LFC.columns.get_loc(f"{factor}_{ref}_vs_{old_ref}") | ||
self.contrast_vector[new_alternative_idx] = 1 | ||
self.contrast_vector[new_ref_idx] = -1 | ||
for var in self.dds.variables: | ||
if var in cond_dict: | ||
self.dds._check_category(var, cond_dict[var]) | ||
else: | ||
cond_dict[var] = self.dds._get_default_value(var) | ||
df = pd.DataFrame([kwargs]) | ||
return self.dds.obsm["design_matrix"].model_spec.get_model_matrix(df).iloc[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on you if you want to adopt the .cond()
syntax for building contrasts (which was originally devised by @const-ae in glmGamPoi).
A lot of the code in the _formulaic.py
module is just around finding the baseline level for each condition such that this works nicely in the case of interaction terms.
In case you were just to support [column, baseline, treatment]
and numpy array contrasts, you could probably come up with a way simpler solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion on this, whatever offers the most flexibility is best.
At first I tried simplifying the code in _formulaic.py
to keep only what I need (mainly retrieving levels for a given factor + whether it has numerical or categorical type), but I ended up keeping everything because I didn't find a straightforward simplification.
A lot of the code in the _formulaic.py module is just around finding the baseline level for each condition such that this works nicely in the case of interaction terms.
How would you define a contrast to test interaction terms using .cond()
? Right now I don't see how to do it without using a numerical factor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's consider a design ~ disease * timepoint
disease | timepoint |
---|---|
healthy | T0 |
healthy | T1 |
diseased | T0 |
diseased | T1 |
which gives us the following coefficients:
Intercept
, diseased
, T1
, T1:diseased
Then you could test:
diseased vs. healthy
contrast = dds.cond(disease="diseased") - dds.cond(disease="healthy")
T1 vs. T0
contrast = dds.cond(timepoint="T1") - dds.cond(timepoint="T0")
Interaction T1:diseased
contrast = (
(dds.cond(timepoint="T1", disease="diseased") - dds.cond(timepoint="T0", disease="diseased")) -
(dds.cond(timepoint="T1", disease="healthy") - dds.cond(timepoint="T0", disease="healthy"))
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admit there's no suitable documentation for this in pertpy. But in principle, using this "DSL", it should be possible to specify arbitrary contrasts.
…cal constrasts have incorrect shapes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
pydeseq2/dds.py
Outdated
# Also check continuous factors | ||
if self.continuous_factors is not None: | ||
self.continuous_factors = replace_underscores(self.continuous_factors) | ||
assert isinstance(self.design, (str, pd.DataFrame)) or isinstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: if this is meant as a user-facing error message, it should probably be a ValueError instead of an assertion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this great PR, I m convinced :) Just one thing that is not clear for me is what should be put in the LFC shrinkage: is it really a column of the design ? For me it should be LFC @ contrast, but maybe this is not relavant for this PR, I just spotted it since you made the modification.
Finally, could you add a comment on what the Materializer is supposed to do ? (i.e., just one line of comment on what a materializer is :))
Thanks again !
@@ -278,11 +282,11 @@ | |||
# LFC shrinkage. This is implemented by the :meth:`lfc_shrink() <DeseqStats.lfc_shrink>` | |||
# method. | |||
|
|||
stat_res.lfc_shrink(coeff="condition_B_vs_A") | |||
ds.lfc_shrink(coeff="condition[T.B]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment to explain what this means ?
…n with invalid type
Yes: LFC shrinkage performs MAP estimation with a prior on a given LFC coefficient (i.e. column) that the user must specify. In principle, I guess it would be possible to do the same thing with contrast @ LFC, I'm just not sure what it would mean. (Also, would the prior be amenable to linear combination?) |
Thanks for your reviews @grst @umarteauowkin ! I'm merging this :) |
Reference Issue or PRs
Closes #181
Closes #213
Closes #309
Closes #272
Closes #202
Closes #184
Closes #125
Will unblock scverse/pertpy#610
What does your PR implement? Be specific.
This PR implements support for general design matrices thanks to formulaic, and using utils from pertpy.
DeseqDataSet
DeseqDataSet
using thedesign
argument, either in the form of a string representing aformulaic
formula (e.g."~condition + treatment"
,"~condition + condition:treatment"
,"~condition + exp(cofactor)"
...), or an ndarray directly corresponding to a design matrix.design_factors
is still supported but throws aDeprecationWarning
continuous_factors
is deprecated, as continuous type inference is handled byformulaic
ref_level
is deprecatedDeseqDataSet
is no longer picklable. Ato_picklable_anndata()
method was added to allow users to pickle results for later use.DeseqStats
["treatment", "test", "control"]
), or directly in the form of a contrast vector (a numpy array).lfc_shrink
no longer supports a defaultcoef
argumentBREAKING CHANGE: python 3.9 is no longer supported.
TODO:
Failure seems to be due to the fact that it's not possible to pickle classes with decorated functions.Solved usingto_picklable_anndata
._formulaic.py
tests from pertpy