Proposing a standardized way to create theory objects #308
---
To make this discussion a bit more concrete, I have mocked up a YAML-based spec, using jsonargparse to instantiate classes. A model spec might look like this:

```yaml
two-point:
  weak_lensing:
    global_systematics:
      baryons:
        class_path: firecrown.likelihood.gauss_family.statistic.source.weak_lensing.SelectField
        init_args:
          field: delta_matter_baryon
      IA:
        class_path: firecrown.likelihood.gauss_family.statistic.source.weak_lensing.LinearAlignmentSystematic
    per_bin_systematics:
      photoz_shift:
        class_path: firecrown.likelihood.gauss_family.statistic.source.weak_lensing.PhotoZShift
  number_counts:
    per_bin_systematics:
      galaxy_bias:
        class_path: firecrown.likelihood.gauss_family.statistic.source.number_counts.PTNonLinearBiasSystematic
modeling_tools:
  pt_calculator:
    class_path: pyccl.nl_pt.EulerianPTCalculator
    init_args:
      with_NC: True
      with_IA: True
      log10k_min: -4
      log10k_max: 2
      nk_per_decade: 20
```

Loading would look something like:

```python
modeling_tools, cfg_w_classes = read_config(config_str)
wl_sources, nc_sources = build_sources(
    n_source=4, n_lens=5, cfg_w_classes=cfg_w_classes
)
```

from which the statistics can be built. The implementation can be found here: https://gist.github.com/tilmantroester/e983f8d8bf933132d23789201881f458, which also includes specifying cluster counts.

The main advantage I see for this setup is that it separates the specification of the model (e.g., there is intrinsic alignment and a photo-z shift) from the definition of the statistics (e.g., there are 5 lens bins and 4 source bins), and it abstracts away the creation of the underlying source objects. The advantage of using jsonargparse is that it allows instantiating arbitrary classes, so if new systematics or statistics are added to firecrown or elsewhere, they don't need to be explicitly accounted for in the model factory functionality. It also allows using things like omegaconf, which gives the option of powerful interpolations within the config, in case such functionality becomes important in the future.

This would have to be complemented by a similar spec for the statistics, covering things like the number of bins, bin combinations, scale cuts, etc. Some thought will be required there to see how far the definition of a statistic can be separated from the data.

The goal here is to find a way to specify firecrown models at a high level and independently of the data, in a form that is understood and used consistently by the different projects that use firecrown, such as TXPipe, TJPCov, augur, blinding, etc.: @fjaviersanchez @jablazek @marcpaterno @vitenti @elisachisari @carlosggarcia @felipeaoli @arthurmloureiro @joezuntz

Suggestions and discussion are welcome!
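To make the `class_path`/`init_args` convention concrete, here is a minimal, hypothetical stand-in for what jsonargparse does when instantiating such an entry (the real library also validates the arguments against the class signature; the `instantiate` helper below is illustrative and not part of the gist):

```python
import importlib

def instantiate(spec: dict):
    """Build an object from a {class_path, init_args} mapping."""
    module_name, class_name = spec["class_path"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.get("init_args", {}))

# e.g. the pt_calculator entry from the YAML above:
pt_calc = instantiate({
    "class_path": "pyccl.nl_pt.EulerianPTCalculator",
    "init_args": {"with_NC": True, "with_IA": True,
                  "log10k_min": -4, "log10k_max": 2, "nk_per_decade": 20},
})
```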
---
This discussion began on a thread in the desc-mcp channel. We have transcribed it here to complete the discussion. This “entry” in the discussion contains the set of messages on the thread.

@vitenti : We want to propose a standardized way to create all the objects needed to calculate theoretical predictions. In Firecrown we call these objects "models". We want to make sure that the same models used in Firecrown can be used when making mock catalogs with Augur, and that the same predictions are used when calculating covariances with TJPCov. Each of Firecrown, Augur, and TJPCov currently writes these separately, with different testing and verification. We want to have factory functions that create these model objects, so that all of the client codes can use those factory functions to create the models. Any insights about this are welcome!
@tilmantroester : This sounds great. Is the idea to provide a standardised way to serialise firecrown likelihood specifications so they can be shared between downstream tools such as augur, tjpcov, or txpipe?
@jablazek : I agree this is a good idea. Also will be relevant for the blind analysis pipeline
@elisachisari : I think this would be great. Is this related to [Agnes Ferte’s] effort?
@vitenti : Yes, that's the idea. Right now, part of the specification is in the metadata of the SACC file, and the rest is in the likelihood factory function. But we're thinking of splitting it into two parts: one for the theoretical predictions (with all the systematics included) and the other specifically for the likelihood.
This way, tools like Augur, TJPCov, and others can use the theoretical specification directly without messing around with a dummy SACC file.
We're still in the design phase (working on those requirements specifications), and we welcome everyone's input. Currently, we're thinking about a structure where theory factory functions produce a set of objects, maybe collected into a "ModelingTools", and make it easy to compute predictions.
All the actual parameter values will be stored in a separate file, maybe in YAML format. This means the factory function and the YAML file can be shared as a pair among different projects.
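A hedged sketch of that factory-function pattern, assuming the `ModelingTools` class from current firecrown; the factory name, its lack of arguments, and the parameter file are hypothetical:

```python
import yaml
import pyccl.nl_pt
from firecrown.modeling_tools import ModelingTools

def make_modeling_tools() -> ModelingTools:
    """Hypothetical shared factory: firecrown, Augur, and TJPCov would
    all call this instead of each assembling the theory objects itself."""
    pt_calc = pyccl.nl_pt.EulerianPTCalculator(
        with_NC=True, with_IA=True,
        log10k_min=-4, log10k_max=2, nk_per_decade=20,
    )
    return ModelingTools(pt_calculator=pt_calc)

# The actual parameter *values* live in a separate file:
with open("parameters.yaml") as f:  # hypothetical file name
    values = yaml.safe_load(f)
```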
@tilmantroester : In terms of initialising classes from yaml files, it might be worth looking at the implementations in libraries such as pytorch lightning (which uses jsonargparse and omegaconf under the hood IIRC).
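For concreteness, a minimal sketch of that jsonargparse pattern; `Systematic` and `PhotoZShift` here are placeholder classes, not firecrown types:

```python
from jsonargparse import ArgumentParser

class Systematic:  # placeholder base class
    pass

class PhotoZShift(Systematic):  # placeholder subclass
    def __init__(self, delta_z: float = 0.0):
        self.delta_z = delta_z

parser = ArgumentParser()
# Accept any subclass of Systematic under the "systematic" key.
parser.add_subclass_arguments(Systematic, "systematic")

# class_path resolves via import; __main__ works when run as a script.
cfg = parser.parse_string("""
systematic:
  class_path: __main__.PhotoZShift
  init_args:
    delta_z: 0.01
""")
init = parser.instantiate_classes(cfg)  # builds the PhotoZShift instance
assert init.systematic.delta_z == 0.01
```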
@marcpaterno : We will modify one of the examples (probably a cluster counts example) to reflect this discussion, and then ask for a discussion of the proposed solution.
@aferte : Thanks for the tag, yes, this is very similar to what I was looking into: I had started looking at the SACC source code to see where we could keep track of the model specifications; it looked possible, but I haven't pursued it. I think it would be great to have a chat about it, because I am not sure I see the advantages of having this info in a separate file rather than in SACC files, as that is what we will always be using (I don't see why we would need a dummy SACC file).
And the other thing I am meant to look into is the new data registry, i.e., registering e.g. data vector files at NERSC, but I think that is complementary: whatever we choose for storing model specifications, we can register on the data registry (as far as I understand).
Overall a meeting about our approach to keep modeling information would be very useful and I’d be happy to contribute!
@tilmantroester : I think it makes sense to have the model specification separate from the data: we probably want to analyse the same data with different models, analyse different data with the same model, or just make predictions with the model without any data. Keeping the model specification in a human-readable format (e.g. yaml or python) also has its advantages.
That being said, having the option to bundle model specifications with data in SACC would be a useful feature for the final data products.
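One possible shape for that bundling, sketched with sacc's top-level `metadata` dict; the file names are hypothetical, and storing the spec as a YAML string under a metadata key is an assumption:

```python
import sacc

# Attach a human-readable model spec to an existing SACC data file.
s = sacc.Sacc.load_fits("data_vector.fits")   # hypothetical file
with open("model.yaml") as f:                 # hypothetical file
    s.metadata["model_spec"] = f.read()
s.save_fits("data_vector_with_model.fits", overwrite=True)
```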
@jablazek : what is meant by "final data products" here?
@jablazek : for things like synthetic data vectors, and covariances. But I agree that it would be good to have this info available without needing a full SACC file. The complication is metadata like n(z), which is needed to generate a model prediction and is something that must always come with a measured data vector.
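For reference, this kind of metadata already travels with a SACC file through its tracers, which is why a measured data vector cannot be fully detached from it (file and tracer names hypothetical):

```python
import sacc

s = sacc.Sacc.load_fits("data_vector.fits")  # hypothetical file
tracer = s.get_tracer("src0")                # an NZTracer for one source bin
z, nz = tracer.z, tracer.nz                  # the n(z) a model prediction needs
```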
@tilmantroester :
> what is meant by “final data products” here?

I'm imagining the data products needed to reproduce the results of a paper, e.g. a single file with the data and model specification to reproduce something like the contour plots.
@jablazek : ah, I see. So model metadata (including perhaps prior ranges).