Introduce TimeVaryingInput2D #465

Closed

@Sbozzolo Sbozzolo commented Jan 23, 2024

Okay, this is a massive commit (and I feel bad about it).

This commit splits PrescribedDataStatic and PrescribedDataTemporal into multiple independent modules.

Before, PrescribedData was:

  • reading files
  • regridding
  • discovering dates available
  • interpolating
  • taking care of edge cases (e.g., what to do when the requested date is outside of the definition range)

This commit splits all the different functions into independent modules so that they can be more easily maintained and upgraded.

This will soon be needed to efficiently support reading and buffering reads for GPU runs.

In all of this, we are still using ClimaCoreTempestRemap and doing everything ahead of time. At the moment, we are not efficiently reusing any information across variables that share the same remapping weights, but this can be added in the future.

# Delete testing directory and files
rm(regrid_dir_static; recursive = true, force = true)
rm(regrid_dir_temporal; recursive = true, force = true)
# using Test
Member

forgot to uncomment?

@@ -87,17 +87,8 @@ tf = 50 * 86400;
Δt = 3600.0;

# Construct albedo parameter object using temporal map
# Use separate regridding directory for CPU and GPU runs to avoid race condition
device_suffix =
Member

likewise here


"""

Note: It is best to have one file per variable with all the temporal data.
Member

this may be a little inconvenient or add a step: all the reanalysis data is in one file to start with.

import ..FileReader: NCFileReader
import ..Regridder
import ..Regridder:
AbstractRegridder, TempestRegridder, regrid, AVAILABLE_REGRIDDERS
Member

I don't see anything in the style guide about all caps in names. Why is AVAILABLE_REGRIDDERS capitalized?

Member

ah, I see some languages use that for constants. It doesn't seem like Julia follows that notation: https://docs.julialang.org/en/v1/base/base/#const

Member Author

ALL_CAPS is mostly a way to tell the reader that the variable is a "global fixed parameter", with the emphasis being on "global". We don't have to do this, but I personally find it a useful additional qualifier.

(const in Julia doesn't even ensure that the value doesn't change.)
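
To illustrate the parenthetical above, here is a minimal sketch of one way a `const` global can still change: `const` fixes the binding, but a mutable value behind it can be modified in place. `EXAMPLE_REGRIDDERS` and `:AnotherRegridder` are hypothetical names used only for this illustration; they are not the contents of `AVAILABLE_REGRIDDERS` in the PR.

```julia
# `const` fixes the binding of a global name, not the contents of a mutable value.
# EXAMPLE_REGRIDDERS is a hypothetical name used only for this illustration.
const EXAMPLE_REGRIDDERS = [:TempestRegridder]

# Allowed: the vector is mutated in place, the binding itself is untouched.
push!(EXAMPLE_REGRIDDERS, :AnotherRegridder)

println(EXAMPLE_REGRIDDERS)
```

The ALL_CAPS convention therefore signals intent ("global, not meant to change") rather than something the language enforces for mutable containers.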

Note: It is best to have one file per variable with all the temporal data.

Assumptions:
- There is only one file with all the data in time
Member

this seems to contradict the above Note?

Member Author

It's probably poorly worded: instead of having multiple files with ranges of dates, it is best to have one single file with all the dates ("all the data (for a given variable) in time").

TSTART <: AbstractFloat,
DATES <: AbstractArray{<:Dates.DateTime},
DIMS,
TIMES <: AbstractArray{<:AbstractFloat},
Member

We discussed this last time, but TIMES <: AbstractArray{<:AbstractFloat} is the same as AbstractArray{<:AbstractFloat} in this context, right?

If so, I don't see that we need to have these types as parametric types.
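
As context for this question, the sketch below shows one difference between the two spellings that may be relevant. Both structs accept the same values, but only the parametric one has a concrete field type. `TimesAbstract` and `TimesParametric` are hypothetical stand-ins, not the struct from the PR.

```julia
# Field annotated directly with an abstract type: the field type is not concrete.
struct TimesAbstract
    times::AbstractArray{<:AbstractFloat}
end

# Same constraint expressed as a type parameter: the field type is fixed
# to the concrete array type at construction.
struct TimesParametric{TIMES <: AbstractArray{<:AbstractFloat}}
    times::TIMES
end

a = TimesAbstract([0.0, 1.0])
p = TimesParametric([0.0, 1.0])

abstract_is_concrete = isconcretetype(fieldtype(typeof(a), :times))
parametric_is_concrete = isconcretetype(fieldtype(typeof(p), :times))

println((abstract_is_concrete, parametric_is_concrete))
```

A concrete field type generally helps the compiler infer code that touches the field. Whether that matters for this struct is a judgment call for the authors.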

end

function next_time(data_handler::DataHandler, date::Dates.DateTime)
if date in data_handler.available_dates
Member

@kmdeck kmdeck Mar 5, 2024

maybe worth using the same code as is inside the function next_time(data_handler::DataHandler, time::AbstractFloat) for clarity (otherwise the reader wonders if next_date does something different than next_time)

Member

I second this! I think readability is more important than being concise here

Member Author

Addressed in the ClimaUtilities version.


snapshot = get!(data_handler._cached_regridded_fields, date) do
# Add here the arguments when we have more regridders
regrid_args = Dict(:TempestRegridder => (date,))
Member

I am confused why the regridding needs the date.

I am a little confused by the notation also. We get the value of data_handler._cached_regridded_fields corresponding to the date, but then what is that called in the do block, and where is it used in the do block? Is that what date then refers to (instead of the actual date)?

Member Author

TempestRegridder regrids everything ahead of time, saves everything into files, and then reads a file whose name is constructed from the date. That's why we need the date.

This block does the following: if date is a key in data_handler._cached_regridded_fields, then return the associated value. If date is not yet a key, then execute the do-block, store its return value under that key, and return it. So, we change data_handler._cached_regridded_fields to add a new date with the regridded value.
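
The `get!` behavior described above can be sketched in isolation. Everything here is illustrative: `cached_regrid!`, `cached_fields`, and the fake "regridded" vector are hypothetical stand-ins, and the counter exists only to show that the do-block runs once per key.

```julia
import Dates

# Hypothetical cache from date to "regridded" data.
cached_fields = Dict{Dates.DateTime, Vector{Float64}}()

# Counts how many times the expensive do-block actually runs.
n_regrids = Ref(0)

function cached_regrid!(cache, date)
    # `get!(f, dict, key)`: return dict[key] if present; otherwise call f(),
    # store the result under `key`, and return it. The do-block takes no arguments.
    get!(cache, date) do
        n_regrids[] += 1
        fill(Float64(Dates.day(date)), 3)  # stand-in for a real regridding call
    end
end

date = Dates.DateTime(2024, 3, 5)
first_call = cached_regrid!(cached_fields, date)   # runs the do-block
second_call = cached_regrid!(cached_fields, date)  # served from the cache

println(n_regrids[])
```

Note that `date` inside the do-block is simply the variable captured from the enclosing scope. `get!` does not pass anything into the block.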

snapshot = get!(data_handler._cached_regridded_fields, date) do
# Add here the arguments when we have more regridders
regrid_args = Dict(:TempestRegridder => (date,))
regrid(data_handler.regridder, regrid_args[regridder_type]...)
Member

Does this mean we need to regrid frame by frame, or is a future option to regrid batch by batch (many frames at once)? If regridding is matrix operations by a static weight matrix, we could process many snapshots at once?
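
A rough sketch of the batching idea raised here, assuming regridding is a linear map applied through a static weight matrix. That is an assumption for illustration, not a description of how TempestRegridder is invoked in this PR. The weights and data are made up.

```julia
# Hypothetical remapping weights: 3 source points -> 2 target points.
W = [0.5 0.5 0.0;
     0.0 0.5 0.5]

# Each column is one snapshot (frame) on the source grid.
snapshots = [1.0 10.0;
             3.0 30.0;
             5.0 50.0]

# Batch: a single matrix product regrids every frame at once.
regridded_batch = W * snapshots

# Frame by frame: one matrix-vector product per snapshot.
regridded_frames = [W * snapshots[:, k] for k in 1:size(snapshots, 2)]

println(regridded_batch)
```

Under that assumption the two approaches give the same numbers, and the batched form turns many small products into one larger one.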

date_idx0 =
[argmin(abs.(Dates.value(date_start) .- Dates.value.(all_dates[:])))]
# NOTE: We are hardcoding a few things here!
dimensions = dataset["lon"][:], dataset["lat"][:]
Member

is this not what dimensions::DIMS is for? If not, they could also pass in the dimension_names. Then we don't need to hardcode.


args = (file_info, file_states, sim_info)
_cached_reads = Dict{Dates.DateTime, Array}()
Member

in the future, this is where we would pre-allocate the space for the batch of data we would read in?

@@ -81,19 +81,54 @@ When passing single-site data
When a `times` and `vals` are passed, `times` have to be sorted and the two arrays have to
have the same length.

=======
When the input is a function, the signature of the function can be `func(time, args...;
Member

nice!!


For example:
```julia
CO2fromp(time, Y, p) = p.atmos.co2
Member

but maybe this isn't a good example? Is this for the analytic time varying input case?


Check if the given `time` is in the range of definition for `itp`.
"""
function Base.in(time, itp::InterpolatingTimeVaryingInput0D)
function Base.in(time, itp::InterpolatingTimeVaryingInput)
Member

@kmdeck kmdeck Mar 6, 2024

is this still being used?

Member Author

It is. It is defined at the beginning of this file:

InterpolatingTimeVaryingInput =
    Union{InterpolatingTimeVaryingInput0D, InterpolatingTimeVaryingInput2D}
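
As a small illustration of why a single method on the `Union` alias covers both types, consider the sketch below. The `Example*` types and their `range` field are hypothetical stand-ins, not the real `InterpolatingTimeVaryingInput0D`/`2D` structs.

```julia
# Hypothetical stand-ins for the 0D and 2D input types.
struct ExampleInput0D
    range::Tuple{Float64, Float64}
end

struct ExampleInput2D
    range::Tuple{Float64, Float64}
end

# One alias covering both types, mirroring the Union defined in the PR.
const ExampleInput = Union{ExampleInput0D, ExampleInput2D}

# A single method definition now applies to either type.
Base.in(time, itp::ExampleInput) = itp.range[1] <= time <= itp.range[2]

inside_0d = 1.5 in ExampleInput0D((0.0, 2.0))
inside_2d = 3.0 in ExampleInput2D((0.0, 2.0))

println((inside_0d, inside_2d))
```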

@Sbozzolo Sbozzolo force-pushed the gb/gpu_albedo branch 7 times, most recently from 49698ca to f9bdd2e Compare March 13, 2024 22:51
@Sbozzolo Sbozzolo marked this pull request as ready for review March 14, 2024 18:31
@Sbozzolo Sbozzolo self-assigned this Mar 14, 2024
@Sbozzolo Sbozzolo requested a review from juliasloan25 March 14, 2024 18:33
Member

@juliasloan25 juliasloan25 left a comment

Thank you for all this work Gabriele! I left many comments, and I have a few more here:

  • We're opening NetCDF files and keeping them open throughout the simulation, then closing at the end. What will happen if the simulation crashes?
  • There are some things left to add in future PRs that we should open issues for:
    • allow regridder-specific args: mono, regrid_dir, dates to regrid
    • regrid only simulation dates (related to dates arg in previous point)
    • add FileReaders cache
    • add DataHandling cache cleanup function/implement LRU cache
  • please update the NEWS.md file too!

@@ -1,6 +1,6 @@
# This file is machine-generated - editing it directly is not advised

julia_version = "1.10.1"
julia_version = "1.10.0"
Member

we can just update to 1.10.2 and use that across land/atmos/coupler now

uuid = "d414da3d-4745-48bb-8d80-42e94e092884"
version = "0.13.2"
version = "0.13.1"
Member

why is this downgraded?

uuid = "d414da3d-4745-48bb-8d80-42e94e092884"
version = "0.13.2"
version = "0.13.1"
Member

same here, packages are downgraded. I just updated everything today so you probably don't need to modify the manifests

- CPU/GPU communication can be a bottleneck

The `DataHandling` takes the divide and conquer approach: the various core tasks and
features and split into other independent modules (chiefly `FileReaders`, and `Regridders`).
Member

typo: "and" -> "are"

Member Author

Addressed in the ClimaUtilities version

- `available_times` (`available_dates`): to list all the `times` (`dates`) over which the
data is defined.
- `previous_time(time/date)` (`next_time(time/date)`): to obtain the time of the snapshot
before the given `time` or `date`. This can be used to compute the
Member

could change "before" to "before/after" to explain both cases of previous and next times

Member Author

Addressed in the ClimaUtilities version
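
For readers following the `previous_time`/`next_time` discussion, here is a minimal sketch of how such a pair could behave on a sorted list of dates. The semantics are an assumption for illustration: "previous" is taken as the latest available date at or before the input, and "next" as the first available date strictly after it. `previous_date` and `next_date` below are hypothetical helpers, not the functions in the PR.

```julia
import Dates

# Latest available date at or before `date` (assumes `available_dates` is sorted).
function previous_date(available_dates, date)
    idx = searchsortedlast(available_dates, date)
    idx == 0 && error("date is before the first available date")
    return available_dates[idx]
end

# First available date strictly after `date` (assumes `available_dates` is sorted).
function next_date(available_dates, date)
    idx = searchsortedlast(available_dates, date) + 1
    idx > length(available_dates) && error("date is on or after the last available date")
    return available_dates[idx]
end

dates = [Dates.DateTime(2000, m, 1) for m in 1:3]
mid = Dates.DateTime(2000, 1, 15)

println((previous_date(dates, mid), next_date(dates, mid)))
```

Writing both functions with the same lookup makes it clear that they differ only in which neighbor they return, which is the readability point raised in the review.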

Comment on lines 56 to 57
PATH = temporal_data_path
varname = "sp"
Member

these could be moved outside of the for loop for consistency with the previous test (but it doesn't really matter)

Member Author

Fixed in ClimaUtilities

@test ncreader.dimensions[1] == nc["lon"][:]
@test ncreader.dimensions[2] == nc["lat"][:]

close(ncreader)
Member

we could check that OPEN_NCFILES is updated correctly in the case without time too

Member Author

It wouldn't test anything that we are not testing in the case with time

end

@testset "Temporal TimeVaryingInput 1D" begin
@testset "InteprolatingTimeVaryingInput0D" begin
Member

typo: Inteprolating -> Interpolating

Member Author

Fixed in ClimaUtilities

@@ -102,3 +112,106 @@ end
end
end
end

@testset "InteprolatingTimeVaryingInput2D" begin
Member

typo: Inteprolating -> Interpolating

Member Author

Fixed in ClimaUtilities

TimeVaryingInputs.evaluate!(dest, input_nearest, target_time)

@test isequal(
Array(parent(dest)),
Member

why do we need to use Array here but not in the "on node" case?

Member Author

I added the Array there too

@juliasloan25 juliasloan25 added enhancement New feature or request breaking change and removed enhancement New feature or request labels Mar 15, 2024
Okay, this is a massive commit (and I feel bad about it).

This commit splits PrescribedDataStatic and PrescribedDataTemporal into
multiple independent modules.

Before, PrescribedData was:
- reading files
- regridding
- discovering dates available
- interpolating
- taking care of edge cases (e.g., what to do when the requested date
is outside of the definition range)

This commit splits all the different functions into independent
modules so that they can be more easily maintained and upgraded.

This commit also introduces InterpolationsRemapper, which uses
Interpolations.jl to do online and distributed remapping from files.

This will soon be needed to efficiently support reading and buffering
reads for GPU runs.
@Sbozzolo
Member Author

  • We're opening Netcdf files and keeping them open throughout the simulation, then closing at the end. What will happen if the simulation crashes?

Since all the files are opened in read-only mode, it shouldn't be too much of a problem if a simulation crashes. That said, we should start doing error handling and have a cleanup function.
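
One possible shape for such a cleanup, sketched with plain Julia file handles rather than the PR's NetCDF readers. `run_and_cleanup` is a hypothetical helper and the "crash" is simulated. The point is only that a `finally` block runs whether or not an error is thrown.

```julia
# Hypothetical helper: open a file read-only, "run" something that fails,
# and make sure the handle is closed regardless.
function run_and_cleanup(path)
    handle = open(path, "r")
    try
        error("simulated crash during the simulation")
    catch
        # Swallow the error here only so the demonstration can continue.
    finally
        close(handle)  # always executed, with or without an error
    end
    return handle
end

# Create a temporary file to stand in for an input data file.
path, io = mktemp()
write(io, "placeholder data")
close(io)

handle = run_and_cleanup(path)
println(isopen(handle))
```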

  • allow regridder-specific args: mono, regrid_dir, dates to regrid
  • regrid only simulation dates (related to dates arg in previous point)

Opened an issue in ClimaUtilities

  • add FileReaders cache

This is implemented.

  • add DataHandling cache cleanup function/implement LRU cache

Opened an issue in ClimaUtilities

@Sbozzolo
Member Author

Implemented in #560

@Sbozzolo Sbozzolo closed this Mar 28, 2024
3 participants