Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ClimaUtilities.OutputPathGenerator to organize output directories #2606

Merged
merged 2 commits into from
Apr 3, 2024

Conversation

Sbozzolo
Copy link
Member

@Sbozzolo Sbozzolo commented Jan 30, 2024

When a simulation is run with the driver, remove existing output. This is helpful when dealing with diagnostics, given that NetCDF output is appended.

I personally think that this is a dangerous feature to have, and we should probably think of a safer solution for the long term.

See comment below about ClimaUtilities

@Sbozzolo Sbozzolo requested a review from szy21 January 30, 2024 00:41
Copy link
Member

@szy21 szy21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trontrytel What do you think?

@trontrytel
Copy link
Member

Would it be possible to automatically name the output directory with the current date and time? This will prevent us from trying to append to files that already exist. And would not require automatically delete everything?

If thats not a good idea then I'm fine with it. I keep changing folder names manually anyway to deal with the simulation trying to append

@Sbozzolo Sbozzolo force-pushed the gb/remove_existing_folders branch from 3fbed25 to 0a9a8fc Compare January 30, 2024 13:48
@Sbozzolo
Copy link
Member Author

Sbozzolo commented Jan 30, 2024

I changed it to move it to output_dir_YYYYMMDD_HHMM (with the current time). The latest simulation will always be in the same path, but the previous one will be "archived". The drawback is that it is up to you to keep your folder clean.

I think this is a much better solution, but you will have to do spring-cleaning every so often.

@Sbozzolo Sbozzolo force-pushed the gb/remove_existing_folders branch 2 times, most recently from 85d4b26 to 4890ce1 Compare January 30, 2024 16:00
@Sbozzolo Sbozzolo changed the title Remove pre-existing output directory Move pre-existing output directory Jan 30, 2024
@Sbozzolo
Copy link
Member Author

Sbozzolo commented Feb 2, 2024

Do we want to merge this?

@szy21
Copy link
Member

szy21 commented Feb 2, 2024

Sounds good to me, thanks! (In the long-term I have some other ideas about output file names, we can talk about that later)

@Sbozzolo Sbozzolo added this pull request to the merge queue Feb 2, 2024
Copy link
Member

@charleskawczynski charleskawczynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break our reproducibiity tests. Why is this needed exactly?

@charleskawczynski charleskawczynski removed this pull request from the merge queue due to a manual request Feb 2, 2024
@Sbozzolo Sbozzolo force-pushed the gb/remove_existing_folders branch 2 times, most recently from c99027a to bc87f56 Compare February 2, 2024 20:06
@Sbozzolo
Copy link
Member Author

Sbozzolo commented Feb 2, 2024

This will break our reproducibiity tests. Why is this needed exactly?

The issue that this PR is solving is this:

If you run the driver multiple times with the same configuration file, the same output folder is used. Data from different simulations is mixed, and NetCDF files from different simulations get appended to each others. For interactive work, this is really annoying.

With this PR, when you run the driver and an output folder already exists, then the preexisting folder gets moved to a different location and a new one is created.

I don't think that this would break the reproducibility tests: this PR changes the behavior of the driver exclusively when there is a preexisting output folder where you are running a new simulations.

I also don't think that this logic should be in /src. I think that the most convervative behvior is to error out if the output folder already exists: touching existing users' files is never a good idea.

@charleskawczynski
Copy link
Member

charleskawczynski commented Feb 2, 2024

I don't think that this would break the reproducibility tests: this PR changes the behavior of the driver exclusively when there is a preexisting output folder where you are running a new simulations.

Ok, this makes sense, and I agree that this shouldn't break reproducibility tests (right now). This workflow really won't work well for reproducibility tests for restarted simulations (#2604), though.

If you run the driver multiple times with the same configuration file, the same output folder is used. Data from different simulations is mixed, and NetCDF files from different simulations get appended to each others. For interactive work, this is really annoying.

Why is data from different simulations mixed? And why are NC files from different simulations appended? Can't we instead change that logic? This workflow is equally annoying to me: running a simulation 10x interactively will generate 10x the amount of data compared to just overwriting old files. Isn't that what the restart flag is for?

@Sbozzolo
Copy link
Member Author

Sbozzolo commented Feb 2, 2024

I don't think that this would break the reproducibility tests: this PR changes the behavior of the driver exclusively when there is a preexisting output folder where you are running a new simulations.

Ok, this makes sense, and I agree that this shouldn't break (most) reproducibility tests.

If you run the driver multiple times with the same configuration file, the same output folder is used. Data from different simulations is mixed, and NetCDF files from different simulations get appended to each others. For interactive work, this is really annoying.

Why is data from different simulations mixed? And why are NC files from different simulations appended? Can't we instead change that logic? This workflow is equally annoying to me: running a simulation 10x interactively will generate 10x the amount of data compared to just overwriting old files. Isn't that what the restart flag is for?

New data is appended because that's how the NetCDF writer works: if there's already a file (e.g., after the first time you output), you new data there. So, overwriting would mean cleaning the output folder before the new run. However, I strongly believe we should never erase existing information.

We can compromise and add an option to either overwrite or move existing folders, but the default should be to move.

@szy21
Copy link
Member

szy21 commented Feb 2, 2024

Maybe I can talk about my other idea on this: I think we can add simulation information to the output file name, especially after we update the time type. So the file name would be something like ts_1d_average_19790101-19791231.nc instead of ts_1d_average.nc. If restarting, the new file would be something like ts_1d_average_19800101-19801231.nc, so no overwriting. If the output file name is the same, I think it's ok to overwrite, as that's usually what the users want when rerunning a simulation. I was thinking we can do this later though.

@Sbozzolo
Copy link
Member Author

Sbozzolo commented Feb 5, 2024

Maybe I can talk about my other idea on this: I think we can add simulation information to the output file name, especially after we update the time type. So the file name would be something like ts_1d_average_19790101-19791231.nc instead of ts_1d_average.nc. If restarting, the new file would be something like ts_1d_average_19800101-19801231.nc, so no overwriting. If the output file name is the same, I think it's ok to overwrite, as that's usually what the users want when rerunning a simulation. I was thinking we can do this later though.

Adding the dates is a little tricky. should the filename always match the content (I think it should). So, should we rename the file every time we add new data? What happens if the run ends before the end (for a crash, or because it was gracefully exited)?

I think we can add start dates because it is always well defined and not problematic. We could also create maybe subfolders (e.g., one per start-date) for ease of navigation.

In any case, I stand by the good user-experience principle that a program should never remove users' files without explicit consent (but we can add a --overwrite flag).

@charleskawczynski
Copy link
Member

We discussed this today: @Sbozzolo is going to update the logic to move the folder to one with an appended counter, so that the output folder (after restarting) is easier to predict. This will allow us to test restarted simulations and know where the data will end up much more easily.

@Sbozzolo Sbozzolo force-pushed the gb/remove_existing_folders branch 3 times, most recently from f9971d6 to 7fcd68a Compare March 29, 2024 21:00
@Sbozzolo
Copy link
Member Author

Sbozzolo commented Mar 29, 2024

I implemented the logic to handle this in a module CliMA/ClimaUtilities.jl#28 (so that it can be used by Land and Coupler too).

The OutputPathGenerator module offers different styles for preparing the directory structure for the simulation output. The one I am using in ClimaAtmos is called ActiveLinkStyle and is approximately what is discussed in the previous message.

With the ActiveLinkStyle, files are saved in a subfolder of output_path, with the latest output always available in output_path/output-active.

Example:

Suppose we run a simulation with output_path abc. The first time the simulation is run, a subfolder abc/output-0000 is created and files are saved there. A link abc/output-active -> abc/output-0000 is created too so that the output path is predicatable (always abc/output-active). The second time the simulation is run, files are saved to abc/output-0001, and the link is moved to abc/output-active -> abc/output-0001. Once again, the output is predicatbly found in abc/output-active.

@Sbozzolo Sbozzolo force-pushed the gb/remove_existing_folders branch from 7fcd68a to f58942a Compare March 29, 2024 22:07
@Sbozzolo Sbozzolo changed the title Move pre-existing output directory Use ClimaUtilities.OutputPathGenerator to organize output directories Apr 1, 2024
@Sbozzolo Sbozzolo force-pushed the gb/remove_existing_folders branch from f58942a to 42b737f Compare April 1, 2024 15:19
@Sbozzolo Sbozzolo force-pushed the gb/remove_existing_folders branch 11 times, most recently from 807f9b6 to 8d75e66 Compare April 2, 2024 23:15
@Sbozzolo
Copy link
Member Author

Sbozzolo commented Apr 3, 2024

If there are no more comments, tomorrow I will merge the PR in ClimaUtilities and merge this too.

Copy link
Member

@szy21 szy21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

@charleskawczynski charleskawczynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me

@Sbozzolo Sbozzolo force-pushed the gb/remove_existing_folders branch from 8d75e66 to 27ef166 Compare April 3, 2024 16:13
@Sbozzolo Sbozzolo enabled auto-merge April 3, 2024 16:37
@Sbozzolo Sbozzolo disabled auto-merge April 3, 2024 16:41
Sbozzolo added 2 commits April 3, 2024 09:42
Instead of writing directly to output_path, we now write to
output_path/output_active, a link that points to
output_path/output_XXXX, with output_XXXX a counter that is increased
every time a simulation is launched with this output_path
@Sbozzolo Sbozzolo force-pushed the gb/remove_existing_folders branch from 27ef166 to b2b6c96 Compare April 3, 2024 16:42
@Sbozzolo Sbozzolo enabled auto-merge April 3, 2024 16:55
@Sbozzolo Sbozzolo added this pull request to the merge queue Apr 3, 2024
Merged via the queue into main with commit ba3cf37 Apr 3, 2024
10 of 11 checks passed
@Sbozzolo Sbozzolo deleted the gb/remove_existing_folders branch April 3, 2024 18:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants