
Replace cache namedtuple with explicit struct #2217

Closed · Tracked by #1980 · Fixed by #2296
Sbozzolo opened this issue Oct 9, 2023 · 11 comments
@Sbozzolo (Member) commented Oct 9, 2023

Currently, the integrator cache is a complex heterogeneous NamedTuple. The cache is partially flat, partially nested. For instance, precomputed_quantities(Y, atmos) is unpacked into the cache, but simulation is not. The details are not super easy to follow given the mix of splatting, unpacking, and merging that occurs when building the cache.

Here are some of the fields in the cache:

    is_init
    simulation
    spaces
    atmos
    comms_ctx
    sfc_setup
    test
    moisture_model
    model_config
    Yₜ
    limiter
    ᶜΦ
    ᶠgradᵥ_ᶜΦ
    ᶜρ_ref
    ᶜp_ref
    ᶜT
    ᶜf
    ∂ᶜK∂ᶠu₃_data
    params
    energy_upwinding
    tracer_upwinding
    density_upwinding
    edmfx_upwinding
    do_dss
    ghost_buffer
    net_energy_flux_toa
    net_energy_flux_sfc
    env_thermo_quad
    ᶜspecific
    ᶜu
    ᶠu³
    ᶜK
    ᶜts
    ᶜp
    ᶜh_tot
    sfc_conditions
    ᶠtemp_scalar
    ᶜtemp_scalar
    ᶜtemp_scalar_2
    temp_data_level
    temp_data_level_2
    temp_data_level_3
    ᶜtemp_CT3
    ᶠtemp_CT3
    ᶠtemp_CT12
    ᶠtemp_CT12ʲs
    ᶠtemp_C123
    ᶜtemp_UVWxUVW
    sfc_temp_C3
    ᶜ∇²u
    ᶜ∇²specific_energy
    ᶜ∇²specific_tracers
    hyperdiffusion_ghost_buffer
    ᶜ∇²uʲs
    center_space
    radiation_model
    rayleigh_sponge_cache
    viscous_sponge_cache
    precipitation_cache
    subsidence_cache
    large_scale_advection_cache
    edmf_coriolis_cache
    forcing_cache
    radiation_cache
    non_orographic_gravity_wave_cache
    orographic_gravity_wave_cache
    edmfx_nh_pressure_cache
    Δt
    turbconv_cache

Some of the cache items are always added (e.g., non_orographic_gravity_wave_cache); others are added conditionally. Some fields in the cache are directly controlled by flags in parsed_args (e.g. use_reference_state, test_dycore_consistency).

The cache also contains information that is not related to the model (e.g., output_dir), information that is available elsewhere (e.g., Δt), or information that is redundant (model_config = atmos.model_config).

Some values are hardcoded in the computation of the cache (e.g., T_ref = FT(255)), others are added with possibly fragile checks (e.g., ᶜf is set by checking if ᶜcoord is a LatLongZPoint, and otherwise set using f_plane_coriolis_frequency).
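
To make the pattern concrete, here is a minimal sketch (with stand-in fields and values, not the actual ClimaAtmos code) of how a NamedTuple cache like this ends up partially flat, partially nested:

```julia
# Minimal sketch of the current pattern: some sub-caches are splatted
# (flattened) into the top level, others are kept nested, and extra
# entries are merged in afterwards.
precomputed_quantities(Y, atmos) = (; ᶜT = copy(Y), ᶜp = copy(Y))

default_cache(Y, atmos) = (;
    simulation = (; output_dir = "output", job_id = "test"),  # stays nested
    precomputed_quantities(Y, atmos)...,                      # gets flattened
)

cache = merge(default_cache(zeros(3), nothing), (; Δt = 600.0))
# cache.ᶜT works directly, but cache.simulation.output_dir needs an extra hop
```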

@Sbozzolo (Member, Author) commented Oct 9, 2023

This is where I started my experiment to turn the cache into a struct a little while ago. (I did everything quickly and poorly; my goal was to check the compilation time)

@simonbyrne (Member) commented:
Maybe we should start by trying to identify stuff which is not needed, or can be accessed from somewhere else (e.g. comms_ctx)

@Sbozzolo (Member, Author) commented Oct 10, 2023

> Maybe we should start by trying to identify stuff which is not needed, or can be accessed from somewhere else (e.g. comms_ctx)

At the least the following are trivially redundant:

  • do_dss (obtained from the space)
  • moisture_model, radiation_model, turbconv_model, ls_adv, forcing_type, model_config, precip_model, (from the atmos model)
  • comms_ctx (from the space)
  • Δt (from the integrator)

Another one that should probably be removed is simulation, which contains comms_ctx, is_debugging_tc, output_dir, restart, job_id, dt, start_date, t_end: information that we can pass directly to the integrator/diagnostics.
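
As a rough sketch of what removing these redundancies could look like (the accessor names and the ClimaComms call here are assumptions, not the verified API):

```julia
import ClimaComms

# Fetch values from the objects that already own them, instead of caching them:
comms_ctx(space) = ClimaComms.context(space)  # assumed ClimaComms/ClimaCore method
model_config(atmos) = atmos.model_config      # redundant cache entry
timestep(integrator) = integrator.dt          # Δt lives on the integrator
```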

@charleskawczynski (Member) commented:

> This is where I started my experiment to turn the cache into a struct a little while ago. (I did everything quickly and poorly; my goal was to check the compilation time)

Can you make the struct concretely typed and confirm that it still compiles quickly? Maybe that was the main runtime performance issue?
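
For reference, making a struct "concretely typed" usually means parametrizing it on its field types; a generic sketch (not the actual ClimaAtmos layout):

```julia
# Each field gets its own type parameter, so every instance is fully concrete
# for the compiler; `struct Cache; ᶜT; ᶜp; end` would leave the fields as Any.
struct Cache{T1, T2}
    ᶜT::T1
    ᶜp::T2
end
```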

@charleskawczynski (Member) commented:

This issue is entangled with two important design tradeoffs:

  • Computing variables on the fly vs. using a cache.
  • Using a “scratch”-like cache, where the same temporary field is used to compute more than one intermediate quantity.

In my opinion the first bullet has a trade-off:

  • on the fly: less storage is needed, less stateful, may lessen this latency issue, but requires recomputation
  • cached: recomputation is avoided, but more stateful and more storage is required

Also, every cached variable can be thought of as having some sort of efficiency. A good example of a high-efficiency cache is the thermo state: two fields are needed, but we can compute many variables from it. Another way to put this is: adding a cache could increase or decrease the number of heap reads/writes. I think that this is a good quantitative metric we can use to decide whether a variable should be cached or computed on the fly, if we want a balanced solution.
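
As an illustration of the second bullet, a scratch-style cache reuses one preallocated buffer for several intermediate quantities (the field name and values here are made up):

```julia
# One temporary buffer serves multiple intermediate computations in sequence.
scratch = (; ᶜtemp_scalar = zeros(100))

function tendencies(scratch)
    buf = scratch.ᶜtemp_scalar
    @. buf = 1.5        # first intermediate quantity fills the buffer...
    a = sum(buf)
    @. buf = buf^2      # ...then the same buffer is reused for another
    a + sum(buf)
end

tendencies(scratch)
```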

@Sbozzolo (Member, Author) commented:

> > This is where I started my experiment to turn the cache into a struct a little while ago. (I did everything quickly and poorly; my goal was to check the compilation time)
>
> Can you make the struct concretely typed and confirm that it still compiles quickly? Maybe that was the main runtime performance issue?

```julia
@time CA.get_integrator(CA.AtmosConfig())
```

```
Struct with concrete types:
66.282692 seconds (137.65 M allocations: 8.298 GiB, 3.16% gc time, 99.81% compilation time: <1% of which was recompilation)

Struct with no types:
51.040393 seconds (140.87 M allocations: 8.512 GiB, 4.08% gc time, 99.75% compilation time: <1% of which was recompilation)

Mutable fields:
42.463701 seconds (141.20 M allocations: 8.529 GiB, 4.92% gc time, 99.29% compilation time: <1% of which was recompilation)

NamedTuple:
77.731601 seconds (141.11 M allocations: 8.524 GiB, 3.20% gc time, 99.84% compilation time: <1% of which was recompilation)
```

So yes, there is a performance penalty in using concrete types, but fixing the root of the issue is still much faster.

> This issue is entangled with two important design tradeoffs:
>
> • Computing variables on the fly vs. using a cache.
> • Using a “scratch”-like cache, where the same temporary field is used to compute more than one intermediate quantity.

When it comes to design, I also think that this is a good moment to ensure that we make the cache composable and extensible. I believe that this was the original intent with the default_cache and the additional_cache, and it is mostly already implemented. For some of the fields in the additional_cache, *_cache functions are defined with dispatches over the value of the respective entry in atmos_model. It is not fully implemented, in that the default cache contains all sorts of stuff and not all the entries follow the pattern (e.g., the gravity waves).

If we were to just look at a clean design and ignore performance, we could have an AtmosCache struct with some default fields (that include the scratch space) and one subfield for each entry in the AtmosModel.

E.g.,

```julia
struct AtmosCache
    core
    temporary
    moisture_model
    precipitation_model
    ...
end
```

Different models would implement their own struct for what they need. E.g.,

```julia
abstract type AbstractCache end

struct DryModelCache <: AbstractCache
    var1
    var2
end

struct EquilMoistModelCache <: AbstractCache
    var1
    var2
    var3
    var4
end
```

This would mean enforcing the above-mentioned pattern and moving all the named tuples to structs.
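
A sketch of that dispatch pattern, reusing the cache structs above (the model types and constructor arguments here are stand-ins for the real AtmosModel components):

```julia
# Each model component gets its own *_cache method, dispatching on the
# component's type; the top-level cache constructor would then assemble
# AtmosCache from these parts.
struct DryModel end
struct EquilMoistModel end

moisture_model_cache(Y, ::DryModel) = DryModelCache(similar(Y), similar(Y))
moisture_model_cache(Y, ::EquilMoistModel) =
    EquilMoistModelCache(similar(Y), similar(Y), similar(Y), similar(Y))
```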

@charleskawczynski (Member) commented:

> So yes, there is a performance penalty in using concrete types, but fixing the root of the issue is still much faster.

👍🏻, it's good to know that the main issue is the ClimaCore field.

And I agree with the cache design points.

@Sbozzolo (Member, Author) commented Oct 11, 2023

Incidentally, the function get_cache has a non-negligible contribution to the latency. Even with the mutable workaround, it takes 30 seconds to infer/compile on my laptop (subsequent evaluations take less than 1 second).

Hopefully, cleaning up the cache will also reduce that time (which, with the mutable fix, can be 25% of the time to get the first integrator).

@Sbozzolo (Member, Author) commented:

30% of the compilation time for the cache goes to compiling the orographic gravity waves (mostly compute_OGW_info) and radiation (mostly RRTMGPI.RRTMGPModel).

@charleskawczynski (Member) commented:

Related: CliMA/RRTMGP.jl#391

@simonbyrne (Member) commented:

The radius we can get from the "global geometry" object (though, in the longer term, we shouldn't be computing the gradient based on lat/long).
