Use ClimaUtilities.OutputPathGenerator to organize output directories #2606
@trontrytel What do you think?
Would it be possible to automatically name the output directory with the current date and time? This would prevent us from trying to append to files that already exist, and it would not require automatically deleting everything. If that's not a good idea, then I'm fine with it. I keep changing folder names manually anyway to deal with the simulation trying to append.
3fbed25
to
0a9a8fc
Compare
I changed it to move it to I think this is a much better solution, but you will have to do spring-cleaning every so often.
85d4b26
to
4890ce1
Compare
Do we want to merge this?
Sounds good to me, thanks! (In the long term I have some other ideas about output file names; we can talk about that later.)
This will break our reproducibility tests. Why is this needed exactly?
c99027a
to
bc87f56
Compare
The issue that this PR is solving is this: if you run the driver multiple times with the same configuration file, the same output folder is used. Data from different simulations is mixed, and NetCDF files from different simulations get appended to each other. For interactive work, this is really annoying. With this PR, when you run the driver and an output folder already exists, the preexisting folder gets moved to a different location and a new one is created. I don't think that this would break the reproducibility tests: this PR changes the behavior of the driver exclusively when there is a preexisting output folder where you are running a new simulation. I also don't think that this logic should be in
Ok, this makes sense, and I agree that this shouldn't break reproducibility tests (right now). This workflow really won't work well for reproducibility tests for restarted simulations (#2604), though.
Why is data from different simulations mixed? And why are NC files from different simulations appended? Can't we instead change that logic? This workflow is equally annoying to me: running a simulation 10x interactively will generate 10x the amount of data compared to just overwriting old files. Isn't that what the restart flag is for?
New data is appended because that's how the NetCDF writer works: if there's already a file (e.g., after the first time you output), you append new data there. So, overwriting would mean cleaning the output folder before the new run. However, I strongly believe we should never erase existing information. We can compromise and add an option to either overwrite or move existing folders, but the default should be to move.
Maybe I can talk about my other idea on this: I think we can add simulation information to the output file name, especially after we update the time type. So the file name would be something like
Adding the dates is a little tricky. Should the filename always match the content? (I think it should.) If so, should we rename the file every time we add new data? What happens if the run stops before its planned end (because of a crash, or because it was gracefully exited)? I think we can add start dates, because the start date is always well defined and not problematic. We could also create subfolders (e.g., one per start date) for ease of navigation. In any case, I stand by the good user-experience principle that a program should never remove users' files without explicit consent (but we can add a
We discussed this today: @Sbozzolo is going to update the logic to move the folder to one with an appended counter, so that the output folder (after restarting) is easier to predict. This will allow us to test restarted simulations and know where the data will end up much more easily.
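As a rough illustration of the "move to a folder with an appended counter" idea, here is a minimal Python sketch. It is not the code in this PR or in ClimaUtilities (which is Julia); the function names `move_existing_output` and `prepare_output_dir` and the `name_N` naming pattern are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def move_existing_output(output_path: Path) -> Optional[Path]:
    """Rename an existing output folder to `<name>_<N>` (hypothetical naming).

    N is the first counter for which no folder exists yet. Returns the new
    location, or None when there was nothing to move.
    """
    if not output_path.exists():
        return None
    counter = 0
    while True:
        backup = output_path.with_name(f"{output_path.name}_{counter}")
        if not backup.exists():
            break
        counter += 1
    output_path.rename(backup)
    return backup


def prepare_output_dir(output_path: Path) -> Optional[Path]:
    """Move any preexisting output aside, then create a fresh, empty folder."""
    moved = move_existing_output(output_path)
    output_path.mkdir(parents=True)
    return moved
```

Because the counter is deterministic, a test can predict where old data ends up without knowing when the previous run happened, which is the property a date-and-time suffix would not give.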
f9971d6
to
7fcd68a
Compare
I implemented the logic to handle this in a module CliMA/ClimaUtilities.jl#28 (so that it can be used by Land and Coupler too). The With the Example: Suppose we run a simulation with output_path
7fcd68a
to
f58942a
Compare
f58942a
to
42b737f
Compare
807f9b6
to
8d75e66
Compare
If there are no more comments, tomorrow I will merge the PR in ClimaUtilities and merge this too.
Thanks!
Looks fine to me
8d75e66
to
27ef166
Compare
Instead of writing directly to output_path, we now write to output_path/output_active, a link that points to output_path/output_XXXX, where XXXX is a counter that is increased every time a simulation is launched with this output_path.
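The layout described in this commit message can be sketched as follows. This is a hedged Python illustration, not the ClimaUtilities.OutputPathGenerator implementation or its API: the function name `generate_output_path`, the four-digit zero-padded counter, and the choice to start counting at 0000 are all assumptions read off the `output_XXXX` pattern.

```python
import os
import re
from pathlib import Path


def generate_output_path(output_path, prefix="output_", active="output_active"):
    """Create the next `output_XXXX` folder and point `output_active` at it.

    Hypothetical helper: returns the path of the link, which is where a
    simulation would write its files.
    """
    base = Path(output_path)
    base.mkdir(parents=True, exist_ok=True)

    # Collect the counters of existing folders named like output_0003.
    pattern = re.compile(rf"^{re.escape(prefix)}(\d{{4}})$")
    counters = [int(m.group(1)) for p in base.iterdir() if (m := pattern.match(p.name))]
    next_counter = max(counters, default=-1) + 1

    new_dir = base / f"{prefix}{next_counter:04d}"
    new_dir.mkdir()

    # Re-point the link; a relative target keeps the tree relocatable.
    link = base / active
    if link.is_symlink():
        link.unlink()
    link.symlink_to(new_dir.name, target_is_directory=True)
    return link
```

In this sketch nothing is ever deleted: each launch gets a fresh numbered folder, older runs stay where they were, and `output_active` always resolves to the most recent one. The sketch assumes a filesystem that supports symbolic links and does not handle the case where `output_active` already exists as a regular folder.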
27ef166
to
b2b6c96
Compare
When a simulation is run with the driver, remove existing output. This is helpful when dealing with diagnostics, given that NetCDF output is appended. I personally think that this is a dangerous feature to have, and we should probably think of a safer solution for the long term. See comment below about ClimaUtilities.