IO overwriting of monthly averages #2890

Open
mahf708 opened this issue Jul 5, 2024 · 6 comments
@mahf708
Contributor

mahf708 commented Jul 5, 2024

Another concerning issue in the EAMxx IO. Consider the following atm.log snippet:

Atmosphere step = 342143
  model start-of-step time = 2020-08-31 23:58:20

[EAMxx::output_manager] - Writing model-output:
[EAMxx::output_manager]      FILE: 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc
[EAMxx::scorpio_output] Writing variables to file
  file name: 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc

The result: at the end of August 2020 the model wrote to the file stamped 2020-06-01, so the existing monthly output in that file was overwritten. This happened twice in one run:

1ma_ne30pg2.AVERAGE.nmonths_x1.2019-08-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-09-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-10-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-11-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-12-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-01-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-01-01
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-04-01-00000.nc >>>>>>>>>>>>>>>
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-05-01-00000.nc 
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-06-01
                                                   >>>>>>>>>>>>>>> simulation ends
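
The mismatch in the atm.log snippet above (a step starting at 2020-08-31 23:58:20 writing to a file stamped 2020-06-01) can be checked for mechanically. Below is a minimal Python sketch, not part of EAMxx. It assumes that each `nmonths_x1` file is stamped with the first day of the month being averaged and is written during the last step of that month; that assumption is inferred from the file names in this issue and has not been confirmed against the EAMxx source. The function name `find_stale_writes` is hypothetical.

```python
import re
from datetime import datetime

# Patterns match only the atm.log lines quoted in this issue.
STEP_TIME = re.compile(
    r"model start-of-step time = (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)
OUT_FILE = re.compile(
    r"FILE: (\S+\.nmonths_x1\.(\d{4})-(\d{2})-\d{2}-\d{5}\.nc)"
)


def find_stale_writes(log_text):
    """Return (model_time, file_name) pairs where the month in the file
    stamp differs from the month of the latest start-of-step time.

    Assumption: a monthly file is stamped with the month it averages.
    """
    stale = []
    model_time = None
    for line in log_text.splitlines():
        m = STEP_TIME.search(line)
        if m:
            model_time = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            continue
        m = OUT_FILE.search(line)
        if m and model_time is not None:
            stamp = (int(m.group(2)), int(m.group(3)))
            if stamp != (model_time.year, model_time.month):
                stale.append((model_time, m.group(1)))
    return stale


if __name__ == "__main__":
    sample = """\
Atmosphere step = 342143
  model start-of-step time = 2020-08-31 23:58:20

[EAMxx::output_manager] - Writing model-output:
[EAMxx::output_manager]      FILE: 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc
"""
    for when, name in find_stale_writes(sample):
        print(f"{when}: wrote to {name}")
```

Run on the quoted snippet, this flags the single write to the 2020-06-01 file. If the stamping assumption is wrong for a given output stream, the comparison would need to be adjusted.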

See internal outputs https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/4334223933/EAMxx+ERFaer+production from a recent run using commit 29bdb81 on branch https://github.com/E3SM-Project/scream/tree/mahf708-ff-a73d48a

@mahf708 added the bug, severe bug, and I/O labels Jul 5, 2024
@crterai
Contributor

crterai commented Jul 5, 2024

I think this is the first time we've seen this, but checking with @ndkeen to see if he has seen something like this.
@AaronDonahue @bartgol: any ideas on what might be going on here? If there's a fix, we should make sure to get it into @brhillman's decadal run. We should also keep an eye on the averaged output in the decadal sim until we find the cause and solution.

@AaronDonahue
Contributor

@mahf708, can you share the YAML file for these outputs?

@mahf708
Contributor Author

mahf708 commented Jul 8, 2024

Here's the output yaml: https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/3969187877/1ma+ne30pg2.yaml, which is a verbatim copy of the outputs Ben is using (circa May 1) but with small additions.

@AaronDonahue
Contributor

thanks, I'll start working on this.

@AaronDonahue
Contributor

Does this happen w/ a restarted run?

@mahf708
Contributor Author

mahf708 commented Jul 8, 2024

> Does this happen w/ a restarted run?

We are unlikely to find a deterministic reproducer for this any time soon. This happened in two runs, on two separate occasions in each, so four times total. Here's roughly how it played out:

  • model fails with a system-side issue
  • model starts overwriting the monthly files the next time it tries to output them
  • model keeps doing that wacky stuff
  • model finally finishes a good submission (with no fail) and starts behaving normally

The wildest thing? It starts behaving normally.

The short answer: yes, this only happens in restarts. I think it is important to treat all four issues I have filed so far as one large issue (I suspect they are related).

Note in OP:

1ma_ne30pg2.AVERAGE.nmonths_x1.2019-08-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-09-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-10-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-11-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-12-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-01-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-01-01
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-04-01-00000.nc >>>>>>>>>>>>>>> 2 files gone, 1 misnamed
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-05-01-00000.nc 
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-06-01
                                                   >>>>>>>>>>>>>>> simulation ends; 2 files gone, 1 misnamed
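
The gaps in the listing above can also be found from the file names alone. The following is a minimal Python sketch under the same naming assumption (one file per month, stamped `YYYY-MM-01-00000`). The helper name `missing_months` and its `last_expected` parameter are hypothetical and are not part of any EAMxx tooling.

```python
import re

# Matches the date stamp at the end of the file names listed in this issue.
STAMP = re.compile(r"\.(\d{4})-(\d{2})-\d{2}-\d{5}\.nc$")


def missing_months(file_names, last_expected=None):
    """Return (year, month) tuples with no file between the earliest stamp
    and the latest one (or last_expected, if given)."""
    present = set()
    for name in file_names:
        m = STAMP.search(name.strip())
        if m:
            present.add((int(m.group(1)), int(m.group(2))))
    if not present:
        return []
    first = min(present)
    last = max(present)
    if last_expected is not None:
        last = max(last, last_expected)
    # A running month index makes consecutive months differ by exactly 1.
    start = first[0] * 12 + first[1] - 1
    end = last[0] * 12 + last[1] - 1
    expected = [(i // 12, i % 12 + 1) for i in range(start, end + 1)]
    return [ym for ym in expected if ym not in present]


if __name__ == "__main__":
    stamps = ["2019-08", "2019-09", "2019-10", "2019-11", "2019-12",
              "2020-01", "2020-04", "2020-05", "2020-06"]
    files = [f"1ma_ne30pg2.AVERAGE.nmonths_x1.{s}-01-00000.nc" for s in stamps]
    print(missing_months(files))
    # Include months up to the end of the simulation (August 2020).
    print(missing_months(files, last_expected=(2020, 8)))
```

For the listing above, the first call reports February and March 2020 as missing; passing `last_expected=(2020, 8)` also reports July and August 2020. A check like this only finds missing names. It cannot tell that a file such as 2020-01-01 now holds data from a later month.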
