Proposing a standardized way to create theory objects #308
---
To make this discussion a bit more concrete, I have mocked up a YAML-based spec, using jsonargparse to instantiate classes. A model spec might look like this:

```yaml
two-point:
  weak_lensing:
    global_systematics:
      baryons:
        class_path: firecrown.likelihood.gauss_family.statistic.source.weak_lensing.SelectField
        init_args:
          field: delta_matter_baryon
      IA:
        class_path: firecrown.likelihood.gauss_family.statistic.source.weak_lensing.LinearAlignmentSystematic
    per_bin_systematics:
      photoz_shift:
        class_path: firecrown.likelihood.gauss_family.statistic.source.weak_lensing.PhotoZShift
  number_counts:
    per_bin_systematics:
      galaxy_bias:
        class_path: firecrown.likelihood.gauss_family.statistic.source.number_counts.PTNonLinearBiasSystematic
modeling_tools:
  pt_calculator:
    class_path: pyccl.nl_pt.EulerianPTCalculator
    init_args:
      with_NC: True
      with_IA: True
      log10k_min: -4
      log10k_max: 2
      nk_per_decade: 20
```

Loading would look something like:

```python
modeling_tools, cfg_w_classes = read_config(config_str)
wl_sources, nc_sources = build_sources(
    n_source=4, n_lens=5, cfg_w_classes=cfg_w_classes
)
```

from which the statistics can be built. The implementation can be found here: https://gist.github.com/tilmantroester/e983f8d8bf933132d23789201881f458, which also includes specifying cluster counts.

The main advantage I see for this setup is that it separates the specification of the model (e.g., there is intrinsic alignment and a photo-z shift) from the definition of the statistics (e.g., there are 5 lens bins and 4 source bins), and it abstracts away the creation of the underlying source objects. The advantage of using jsonargparse is that it allows instantiating arbitrary classes, so if new systematics or statistics are added to firecrown or elsewhere, they don't need to be explicitly accounted for in the model factory functionality. It also allows using things like omegaconf, which gives the option of powerful interpolations within the config, in case such functionality becomes important in the future.

This would have to be complemented by a similar spec for the statistics, covering things like the number of bins, bin combinations, scale cuts, etc. Some thought will be required there to see how far the definition of a statistic can be separated from the data.

The goal here is to find a way to specify firecrown models at a high level and independently of the data, in a form that is understood and used consistently by the different projects that use firecrown, such as TXPipe, TJPCov, augur, blinding, etc.: @fjaviersanchez @jablazek @marcpaterno @vitenti @elisachisari @carlosggarcia @felipeaoli @arthurmloureiro @joezuntz

Suggestions and discussion are welcome!
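To make the `class_path`/`init_args` convention concrete, here is a minimal, hypothetical stand-in for what jsonargparse does when instantiating such an entry (the real library also validates the arguments against the class signature; the `instantiate` helper below is illustrative and not part of the gist):

```python
import importlib

def instantiate(spec: dict):
    """Build an object from a {class_path, init_args} mapping."""
    module_name, class_name = spec["class_path"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.get("init_args", {}))

# e.g. the pt_calculator entry from the YAML above:
pt_calc = instantiate({
    "class_path": "pyccl.nl_pt.EulerianPTCalculator",
    "init_args": {"with_NC": True, "with_IA": True,
                  "log10k_min": -4, "log10k_max": 2, "nk_per_decade": 20},
})
```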
---
This discussion began on a thread in the desc-mcp channel. We have transcribed it here to complete the discussion. This “entry” in the discussion contains the set of messages on the thread.

@vitenti : We want to propose a standardized way to create all the objects needed to calculate theoretical predictions. In Firecrown we call these objects "models". We want to make sure that the same models used in Firecrown can be used when making mock catalogs with Augur, and that the same predictions are used when calculating covariances with TJPCov. Each of Firecrown, Augur, and TJPCov currently writes these separately, with different testing and verification. We want to have factory functions that create these model objects, so that all of the client codes can use those factory functions to create the models. Any insights about this are welcome!
@tilmantroester : This sounds great. Is the idea to provide a standardised way to serialise firecrown likelihood specifications so they can be shared between downstream tools such as augur, tjpcov, or txpipe?
@jablazek : I agree this is a good idea. Also will be relevant for the blind analysis pipeline
@elisachisari : I think this would be great. Is this related to [Agnes Ferte’s] effort?
@vitenti : Yes, that's the idea. Right now, part of the specification is in the metadata of the SACC file, and the rest is in the likelihood factory function. But we're thinking of splitting it into two parts: one for the theoretical predictions (with all the systematics included) and the other specifically for the likelihood.
This way, tools like Augur, TJPCov, and others can use the theoretical specification directly without messing around with a dummy SACC file.
We're still in the design phase (working on those requirements specifications), and we welcome everyone's input. Currently, we're thinking about a structure where theory factory functions produce a set of objects, maybe collected into a "ModelingTools", and make it easy to compute predictions.
All the actual parameter values will be stored in a separate file, maybe in YAML format. This means the factory function and the YAML file can be shared as a pair among different projects.
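A hedged sketch of that factory-function pattern, assuming the `ModelingTools` class from current firecrown; the factory name, its lack of arguments, and the parameter file are hypothetical:

```python
import yaml
import pyccl.nl_pt
from firecrown.modeling_tools import ModelingTools

def make_modeling_tools() -> ModelingTools:
    """Hypothetical shared factory: firecrown, Augur, and TJPCov would
    all call this instead of each assembling the theory objects itself."""
    pt_calc = pyccl.nl_pt.EulerianPTCalculator(
        with_NC=True, with_IA=True,
        log10k_min=-4, log10k_max=2, nk_per_decade=20,
    )
    return ModelingTools(pt_calculator=pt_calc)

# The actual parameter *values* live in a separate file:
with open("parameters.yaml") as f:  # hypothetical file name
    values = yaml.safe_load(f)
```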
@tilmantroester : In terms of initialising classes from yaml files, it might be worth looking at the implementations in libraries such as pytorch lightning (which uses jsonargparse and omegaconf under the hood IIRC).
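For concreteness, a minimal sketch of that jsonargparse pattern; `Systematic` and `PhotoZShift` here are placeholder classes, not firecrown types:

```python
from jsonargparse import ArgumentParser

class Systematic:  # placeholder base class
    pass

class PhotoZShift(Systematic):  # placeholder subclass
    def __init__(self, delta_z: float = 0.0):
        self.delta_z = delta_z

parser = ArgumentParser()
# Accept any subclass of Systematic under the "systematic" key.
parser.add_subclass_arguments(Systematic, "systematic")

# class_path resolves via import; __main__ works when run as a script.
cfg = parser.parse_string("""
systematic:
  class_path: __main__.PhotoZShift
  init_args:
    delta_z: 0.01
""")
init = parser.instantiate_classes(cfg)  # builds the PhotoZShift instance
assert init.systematic.delta_z == 0.01
```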
@marcpaterno : We will modify one of the examples (probably a cluster counts example) to reflect this discussion, and then ask for a discussion of the proposed solution.
@aferte : Thanks for the tag, yes, this is very similar to what I was looking into: I had started looking at the SACC source code to see where we could keep track of the model specifications; it looked possible, but I haven't pursued it. I think it would be great to have a chat about it, because I am not sure I see the advantages of having this info in a separate file rather than in SACC files, as that is what we will always be using (I don't see why we would need a dummy SACC file).
And the other thing I am meant to look into is the new data registry, i.e., registering e.g. data vector files at NERSC, but I think that is complementary: whatever we choose for storing model specifications, we can register on the data registry (as far as I understand).
Overall a meeting about our approach to keep modeling information would be very useful and I’d be happy to contribute!
@tilmantroester : I think it makes sense to have the model specification separate from the data: we probably want to analyse the same data with different models, analyse different data with the same model, or just make predictions with the model without any data. Keeping the model specification in a human-readable format (e.g. yaml or python) also has its advantages.
That being said, having the option to bundle model specifications with data in SACC would be a useful feature for the final data products.
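One possible shape for that bundling, sketched with sacc's top-level `metadata` dict; the file names are hypothetical, and storing the spec as a YAML string under a metadata key is an assumption:

```python
import sacc

# Attach a human-readable model spec to an existing SACC data file.
s = sacc.Sacc.load_fits("data_vector.fits")   # hypothetical file
with open("model.yaml") as f:                 # hypothetical file
    s.metadata["model_spec"] = f.read()
s.save_fits("data_vector_with_model.fits", overwrite=True)
```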
@jablazek : what is meant by "final data products" here?
@jablazek : for things like synthetic data vectors, and covariances. But I agree that it would be good to have this info available without needing a full SACC file. The complication is metadata like n(z), which is needed to generate a model prediction and is something that must always come with a measured data vector.
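For reference, this kind of metadata already travels with a SACC file through its tracers, which is why a measured data vector cannot be fully detached from it (file and tracer names hypothetical):

```python
import sacc

s = sacc.Sacc.load_fits("data_vector.fits")  # hypothetical file
tracer = s.get_tracer("src0")                # an NZTracer for one source bin
z, nz = tracer.z, tracer.nz                  # the n(z) a model prediction needs
```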
@tilmantroester :
> what is meant by “final data products” here?

I'm imagining the data products needed to reproduce the results of a paper, e.g. a single file with the data and model specification to reproduce something like the contour plots.
@jablazek : ah, I see. So model metadata (including perhaps prior ranges).