Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add callback that estimates walltime #2428

Merged
merged 1 commit into from
Dec 16, 2023
Merged

Conversation

charleskawczynski
Copy link
Member

@charleskawczynski charleskawczynski commented Dec 12, 2023

This PR adds a callback that logs some helpful runtime information:

┌ Info: Progress
│   simulation_time = "21 hours, 30 minutes"
│   n_steps_completed = 129
│   wall_time_per_step = "101 milliseconds, 824 microseconds"
│   wall_time_total = "2 minutes, 26 seconds"
│   wall_time_remaining = 133.4913111176602
│   wall_time_spent = "13 seconds, 135 milliseconds"
│   percent_complete = "9.0%"
│   sypd = 16.144
│   date_now = 2023-12-14T10:42:19.127
└   estimated_finish_date = 2023-12-14T10:44:32.618

The callback uses a doubling backoff (we log on step 2, 4, 8, 16, 32, ...) to provide frequent estimates near the beginning of simulations, but increasingly infrequent as time goes on to avoid bloating the log (I don't want people to have to scroll through a zillion info statements in the longruns to get to the end of the log).

@Sbozzolo
Copy link
Member

Thanks, this is very useful! I will review it tomorrow morning.

A very quick comment is that I find total a little misleading: at the beginning, I thought I meant "total elapsed time since the beginning"

@charleskawczynski
Copy link
Member Author

Ah, time_and_units looks wrong.. It'd be nice if we can do this more elegantly. Looking at DateTIme now..

@Sbozzolo
Copy link
Member

Ah, time_and_units looks wrong.. It'd be nice if we can do this more elegantly. Looking at DateTIme now..

Let me know when you want me to review the PR :)

@charleskawczynski
Copy link
Member Author

Let me know when you want me to review the PR :)

I think I have a nice solution, but maybe you can review the rest of the PR in the mean time (just ignore time_and_units for now). @Sbozzolo

@simonbyrne
Copy link
Member

simonbyrne commented Dec 13, 2023

I find the output kind of hard to parse: could you put the values next to the name? Or even on multiple lines (the macros do this automatically if you provide extra arguments

julia> @info "Wall time" var"per step"=100 total=200 remaining=300
┌ Info: Wall time
│   per step = 100
│   total = 200
└   remaining = 300

@simonbyrne
Copy link
Member

Ah, time_and_units looks wrong.. It'd be nice if we can do this more elegantly. Looking at DateTIme now..

canonicalize is helpful here:

julia> using Dates

julia> canonicalize(Second(10_000))
2 hours, 46 minutes, 40 seconds

@charleskawczynski
Copy link
Member Author

charleskawczynski commented Dec 13, 2023

canonicalize is helpful here:

julia> using Dates

julia> canonicalize(Second(10_000))
2 hours, 46 minutes, 40 seconds

Yeah but, for now (JuliaLang/julia#52519 😄), there's no way to convert from Real.

@Sbozzolo
Copy link
Member

Let me know when you want me to review the PR :)

I think I have a nice solution, but maybe you can review the rest of the PR in the mean time (just ignore time_and_units for now). @Sbozzolo

One general design comment before I look at the details is that maybe we can be a little bit more ambitious that just tracking the walltime. You added a TWE object to the cache, maybe we can turn this into a more general instrumentation object for ClimaAtmos runs. For example, the object might also track memory usage, and maybe optionally print more detailed/frequent stats to file. At this stage, it would just track walltime, but this would provide a natural place where to put extra instrumentation.

What do you think?

(For this PR, this would probably just amounts to renaming the struct)

(And I also agree with Simon that the output is hard to read.)

@charleskawczynski
Copy link
Member Author

Let's also add time spent

@simonbyrne
Copy link
Member

I think it would be good to have:

  • time elapsed (how long so far)
  • sypd estimate (can compute from wall time / time step)

@simonbyrne
Copy link
Member

ideally we would also want to exclude the first time step.

@charleskawczynski charleskawczynski force-pushed the ck/estimate_walltime branch 3 times, most recently from 675d990 to b1acc78 Compare December 13, 2023 20:38
@charleskawczynski
Copy link
Member Author

charleskawczynski commented Dec 13, 2023

Alright, I've

  • added time elapsed / spent
  • Added simulation time
  • Renamed n_steps to n_calls, which is a more accurate name
  • Added n_steps_completed
  • Avoided n_steps = 1 (since the cb is called during init, which messed up the timings)
  • Added a sypd estimate
  • Updated names a bit so that some things are a bit more clear, but there's probably still room for improvement.
  • Added Dates.now()
  • Added an estimated finish date - this should be useful because then we now won't need to know when the last progress log was printed!

it now looks like:

┌ Info: Progress
│   simulation_time = "21 hours, 30 minutes"
│   n_steps_completed = 129
│   wall_time_per_step = "101 milliseconds, 824 microseconds"
│   wall_time_total = "2 minutes, 26 seconds"
│   wall_time_remaining = 133.4913111176602
│   wall_time_spent = "13 seconds, 135 milliseconds"
│   percent_complete = "9.0%"
│   sypd = 16.144
│   date_now = 2023-12-14T10:42:19.127
└   estimated_finish_date = 2023-12-14T10:44:32.618

I think this is ready for review + bike shedding cc @simonbyrne, @Sbozzolo . I'll update the original post to include this snapshot.

@charleskawczynski charleskawczynski force-pushed the ck/estimate_walltime branch 9 times, most recently from a0dfbf3 to c4a1ae9 Compare December 14, 2023 18:27
Copy link
Member

@Sbozzolo Sbozzolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I have only minor cosmetic comments.

perf/flame.jl Outdated Show resolved Hide resolved
src/cache/cache.jl Outdated Show resolved Hide resolved
src/callbacks/callbacks.jl Outdated Show resolved Hide resolved
src/callbacks/callbacks.jl Outdated Show resolved Hide resolved
src/callbacks/callbacks.jl Outdated Show resolved Hide resolved
src/callbacks/callbacks.jl Show resolved Hide resolved
src/callbacks/callbacks.jl Outdated Show resolved Hide resolved
src/callbacks/callbacks.jl Outdated Show resolved Hide resolved
src/utils/utilities.jl Outdated Show resolved Hide resolved
src/utils/utilities.jl Outdated Show resolved Hide resolved
Copy link
Member

@Sbozzolo Sbozzolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! :)

Increase allocation limits

Bump allocation limits

Finish comment in flame graph

Fix doc string

Improve names, add docs

Improve names

Improve docs and names

Improve docs, qualify Period

Improve names

Add comment for eval function

Maintain log after 50%

Add warning for restarted simulations
@charleskawczynski charleskawczynski added this pull request to the merge queue Dec 15, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 15, 2023
@charleskawczynski charleskawczynski added this pull request to the merge queue Dec 16, 2023
Merged via the queue into main with commit 7320f1a Dec 16, 2023
9 of 10 checks passed
@charleskawczynski charleskawczynski deleted the ck/estimate_walltime branch December 16, 2023 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants