s04_stack2 bug fixes + mov_stack improvement #372
Conversation
Added:
To do:
re: doc: this documentation should work, but you'll need to set up a big test folder with some data (esp. for the examples to build). My idea (didn't have time yet) is to provide two big "pooch" payloads: (1) with only the raw test data and the recipe to build the same as (2), the environment having all the data processed for doing the examples etc.
Okay, I think it's working nicely now. Did quite a few different tests to check it works properly with gaps, processing jobs in stages rather than all together, etc. As said, no SVD component for now, as I'm unsure how it would work cleanly if not processing all data at the same time (i.e. to ensure data are processed equally). Functionality of the wiener filter right now is:
Users set three params in config, including wienerfilt (bool, False by default). Quick example showing final dv/v for one month of data at Ruapehu, 2d stacks. Still to do: documentation.
A couple of smaller changes, and started adding documentation (just in s04_stack2 for now). Ended up going down a rabbit hole regarding how much padding with data outside of the todo jobs actually reduces edge influence (even if you pad well beyond the width of the filter, you can still get subtle differences propagating to the middle of the 'image', despite neighbouring points not changing). Given that, I added a warning in the documentation that caution should be exercised if processing data in steps (i.e. not all at the same time). Below is an example where I tested processing dv/v after applying wiener to all data together, and to data in different stages (cutoffs indicated by dashed lines; so, for example, end of month is simulating reading in 1 day of new data each time). Similar, but some differences.
Demonstration that padding doesn't fully prevent values further away from the edge of the image slightly changing. Top row: I am adding 2 rows each time (of 1s), e.g. to reflect new CCF data coming in, and then applying a wiener filter of length (2, 2). You can see that the values corresponding to the original pattern have subtle differences even when adding four or six rows of constant value. So I'm not sure there is a 'perfect' way to do it, other than maybe having a fixed moving window where we apply the wiener filter, i.e. every N days, so that it is consistent if processing in different stages. But that would be pretty horrid for computation time, I imagine, so maybe just having the warning is best for cases where not processing all data together.
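The effect above can be reproduced with a minimal sketch (toy data, not MSNoise code): scipy.signal.wiener estimates the noise power globally from the whole image, so appending new rows changes filtered values even well away from the appended edge, despite the small (2, 2) window.

```python
import numpy as np
from scipy.signal import wiener

# Toy stand-in for a block of CCF amplitudes (rows = time, cols = lag).
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 6))

filtered = wiener(image, (2, 2))

# Append 2 and then 4 rows of ones, mimicking new CCF data arriving,
# re-filter, and keep only the rows covering the original image.
with_2 = wiener(np.vstack([image, np.ones((2, 6))]), (2, 2))[:8]
with_4 = wiener(np.vstack([image, np.ones((4, 6))]), (2, 2))[:8]

# Even the rows far from the appended edge are no longer identical,
# because the global noise estimate used by the filter has changed.
interior_shift = np.abs(filtered[:4] - with_2[:4]).max()
```

This is consistent with the warning above: padding beyond the filter width reduces, but does not eliminate, the dependence on data processed later.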
@asyates: does the filtering work like any other "smoothing" filter in msnoise, i.e. "not relying on information in the future" to define the values of day D (so only with [D-N, D] inclusive)?
At the moment, no... the only way I can see to do that would be to always apply the wiener filter to a fixed window length prior to stacking. E.g. if stacksize is 1d and correlation length is 1h, could apply the wiener filter to the twenty-four 1h functions prior to stacking. Need to look at the implementation again though... I remember thinking that would be quite painful to implement from the current code (also with regard to computation time). Will try to take a look again at the start of Dec. Worst case, I'll remove the wiener filter-related changes so we can commit the bug fixes for the stacking process.
OK, but so: let's imagine you do the computation daily; the wiener would apply up until that day, and when at D+1, it'd recompute the last N days with the new updated information (and it's still toggle-able, default "not apply", right)? This way, I can live with the PR as is. Just make sure the documentation at the top of the file is accurate/complete and it'll be ready to merge.
Yes, off by default. For daily computation, and a wiener filter length (wlen) of, say, 2 days, the wiener would apply the filter to that day plus a padding of 2*wlen... so 4 extra days, and discard the first half of the padding window before saving (so saving D and re-saving D-1 and D-2, and discarding D-3 and D-4). So when at D+1, D and D-1 would be re-saved. So a day only stops being overwritten when there is at least the length of the wiener filter between it and the new day.
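The save/discard rule described above can be sketched as a small helper (the function is illustrative, not from MSNoise): the filter sees the window [D - 2*wlen, D], but only the most recent wlen + 1 days are (re)saved.

```python
import datetime

def wiener_window(day: datetime.date, wlen_days: int):
    """Return (window_start, save_start) for the rule described above:
    filter [day - 2*wlen, day], save only [day - wlen, day]."""
    window_start = day - datetime.timedelta(days=2 * wlen_days)
    save_start = day - datetime.timedelta(days=wlen_days)
    return window_start, save_start

# For D = 2006-02-10 and wlen = 2 days: the filter sees 02-06 .. 02-10,
# but only 02-08, 02-09 and 02-10 are (re)saved; 02-06/02-07 are discarded.
ws, ss = wiener_window(datetime.date(2006, 2, 10), 2)
```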
I noticed there was a commit to s04_stack2 that I didn't account for; assuming that is why I have conflicts now. Will check/update.
Codecov Report. Attention: Patch coverage is
Additional details and impacted files:

@@ Coverage Diff @@
##           master     #372      +/-   ##
==========================================
+ Coverage   62.67%   63.07%   +0.40%
==========================================
  Files          43       44       +1
  Lines        7482     7601     +119
==========================================
+ Hits         4689     4794     +105
- Misses       2793     2807      +14
I hadn't actually considered that keep_all could potentially be False; I actually made the assumption that the keep_all toggle might eventually be removed, but I guess it makes sense to keep it for daily processing. In any case, I noticed that the current implementation of pulling in 1-day CCFs following the previous merge doesn't appear to be changing the mov_stack params to ('1D','1D') as it implies; the mov_rolling and mov_sample remain unchanged. Need to look at this closer and then, at the same time, see what happens with the wiener filter when pulling in 1-day CCFs...
So it turns out that while it was true the effective sample was '1D' (though with copied values corresponding to the original mov_sample), the original mov_rolling was still working more or less (so it was not ('1D','1D')). Have adjusted the warning, and made a few other changes. It looks like the error that required the 'construct' thing has a recent fix. I was finding that it was using way too much memory otherwise (my process was getting killed when using stack sizes of just 10 days), so I've removed it. Otherwise, this is my most recent test of results with keep_all on and off. Not so satisfied as to why there are quite significant amplitude differences, so need to look closer.
My dream indeed is to make keep_all=Y the default by design, removing it from the configurable parameters... but for laaaarge networks, it might create very large volumes of data, e.g. for anyone only willing to create a single REF per pair... not fully sure how to solve that (if possible at all).
I don't see anything wrong causing the amplitude differences, so I think it's expected behaviour that relates to differences between working with many smaller CCFs (corr_duration) vs fewer 1-day stacks (already averaged). Certainly the biggest amplitude difference between keep_all enabled or not is when the wiener filter is applied, which is not surprising as it completely changes the number of CCFs used in the wiener smoothing (as I currently define the smoothing window in time). E.g. for 1h CCFs vs 1-day CCFs, for a 2-day wiener smooth (on the datetime axis) the wiener filter length will be 48 vs 2 values respectively. So, I think... given keep_all = True and wienerfilt = False by default, and a user is unlikely to be swapping between keep_all = True and keep_all = False (rather sticking with one or the other), I am happy for this to be merged (it also fixes many other issues in s04_stack2.py). Of course, it could also be expected that the user would adjust the length of the wiener filter depending on whether keep_all = True or False is used (i.e. a longer window is needed if there are fewer CCFs).
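The 48-vs-2 arithmetic above follows directly from defining the window in time, assuming evenly spaced CCFs (the helper name here is illustrative, not an MSNoise function):

```python
import pandas as pd

def wiener_length_in_ccfs(wiener_mlen: str, ccf_spacing: str) -> int:
    """Number of CCFs spanned by a time-defined wiener smoothing window."""
    return int(pd.Timedelta(wiener_mlen) / pd.Timedelta(ccf_spacing))

# Same 2-day smooth, very different filter lengths:
hourly = wiener_length_in_ccfs("2D", "1h")  # keep_all = True, hourly CCFs
daily = wiener_length_in_ccfs("2D", "1D")   # keep_all = False, 1-day stacks
```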
E.g. if I normalize CCFs by maxabs, and set the wiener filter to a 2h length with corr_duration = 1800s (to approximately match a 5d length with 1-day CCFs), the amplitudes start looking closer with keep_all toggled on vs off. So I think all is okay; and as mentioned, wienerfilt is toggled off by default (and keep_all on by default), so it's unlikely any problems will be created (and changing params and getting varied results has always been a thing).
Not fully sure I understand how the two datasets are generated in the last plot, but if indeed the default stuff mimics what the past did, I'm good with this.
last call, good to merge ? If yes, I'm 🟢 lighting ! |
Difference in the last two datasets: both use cc_normalization = absmax (I can imagine this could reduce differences between keep_all = Y/N in some cases).
- blue scatter: keep_all = True, wiener_mlen = 2h -> with CCFs at a 1800s sampling interval (corr_duration), this means a wiener length of 4 CCFs
- orange scatter: keep_all = False, wiener_mlen = 5d -> with CCFs at a 1-day sampling interval, this means a wiener length of 5 CCFs
The idea was to compare while making changes to wiener_mlen based on whether keep_all was toggled. Originally I left wiener_mlen = 5d for both, which (for keep_all = Y) would mean having a length of 240 CCFs (of 1800s duration) for the filter... hence a more complicated comparison when it would only be 5 CCFs for keep_all = N.
Anyway, good to merge for me. The differences from past behaviour (if wienerfilt = N, as by default) are bug fixes and pulling in older data when needed for stacking at the edges of continuous segments (e.g. an extra length of mov_rolling, which is then discarded before saving).
Oki doki! Merging, and will do some tests one day if I have some time! It'd be good to have a new "Example" showcasing the effects of the filters etc., if you have time to work on that in the future! Thanks for this PR!
The following things are fixed with these changes (mentioned in issue #371):
E.g. for days = ['2006-02-01', '2006-02-02', '2006-02-03', '2006-02-10'] and a stack size (max(mov_rolling)) of two days, the days '2006-01-30', '2006-01-31', '2006-02-08', and '2006-02-09' would be added prior to calling get_results_all.
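That padding step can be sketched with plain set arithmetic (a hypothetical helper, not the actual MSNoise code, which works against the job/database layer): for each job day, the preceding stack-size days are added if not already scheduled.

```python
import datetime

def extra_padding_days(days, stack_size):
    """Days to add before each continuous segment so stacks at segment
    edges can pull in older CCFs (sketch of the logic described above)."""
    day_set = {datetime.date.fromisoformat(d) for d in days}
    before = {d - datetime.timedelta(days=k)
              for d in day_set for k in range(1, stack_size + 1)}
    return sorted(before - day_set)

added = extra_padding_days(
    ['2006-02-01', '2006-02-02', '2006-02-03', '2006-02-10'], 2)
# added covers the two days before each segment edge.
```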
i.e.
These additional days are then removed prior to saving via:
Will use this same branch to implement the wiener filter... but these changes/fixes are more fundamental.