
s04_stack2 bug fixes + mov_stack improvement #372

Merged
merged 21 commits into from
Nov 28, 2024

Conversation

asyates
Contributor

@asyates asyates commented Sep 27, 2024

The following issues (mentioned in issue #371) are fixed by these changes:

  • bugfix: reference function creation is no longer based only on Todo jobs; it now calls get_results_all using a date list generated from build_ref_datelist.
  • bugfix: the overwrite boolean is set to False in xr_save_ccf to avoid overwriting previous STACKS output with only the Todo jobs.
  • when computing moving stacks (-m), gaps in the day list (generated from the current jobs) are identified and filled with the dates necessary to correctly apply the rolling mean, taking the mov_stack size into account.

e.g. for days = ['2006-02-01', '2006-02-02', '2006-02-03', '2006-02-10'] and a stack size (max(mov_rolling)) of two days, the days '2006-01-30', '2006-01-31', '2006-02-08', and '2006-02-09' would be added prior to calling get_results_all.

i.e.:

    # Calculate the maximum mov_rolling value (in days)
    max_mov_rolling = max(pd.to_timedelta(mov_stack[0]).total_seconds() for mov_stack in mov_stacks)
    max_mov_rolling_days = max(1, int(np.ceil(max_mov_rolling / 86400)))  # int needed for range() below

    days = sorted(days)  # ISO date strings sort chronologically
    days = [datetime.datetime.strptime(day, '%Y-%m-%d') for day in days]
    day_diffs = np.diff(days)
    gaps = [i + 1 for i, diff in enumerate(day_diffs) if diff.days > 1]  # indices of days preceded by a gap
    gaps.insert(0, 0)  # index zero is also a 'gap' (previous data needed for stacking)

    all_days = list(days)
    added_days = []  # keep track of days included, for eventual removal before saving CCFs

    for gap_idx in gaps:
        start = days[gap_idx]
        # add the max_mov_rolling_days days preceding each gap
        for j in range(1, max_mov_rolling_days + 1):
            preceding_day = start - datetime.timedelta(days=j)
            if preceding_day not in all_days:
                all_days.append(preceding_day)
                added_days.append(preceding_day)

    added_dates = pd.to_datetime(added_days).values
    c = get_results_all(db, sta1, sta2, filterid, components, all_days, format="xarray")  # get CCFs needed for -m stacking

These additional days are then removed prior to saving via:

    mask = xx.times.dt.floor('D').isin(added_dates)
    xx_cleaned = xx.where(~mask, drop=True)  # remove days not associated with current jobs

Will use this same branch to implement the wiener filter... but these changes/fixes are more fundamental.

@asyates
Contributor Author

asyates commented Oct 2, 2024

Added:

  • the reference function no longer uses jobs, i.e. running msnoise cc stack -r will always compute the reference function regardless of whether STACK jobs are Todo or Done.

To do:

  • Add the wiener filter. For now I will do this without the SVD component and test, as I'm not sure how to implement the SVD decomposition cleanly in a way that all data is processed 'equally', i.e. the number of eigenvectors available could vary a lot depending on how many CCFs are being processed.
  • update documentation

@ThomasLecocq
Member

re: doc: this documentation should work - but you'll need to set up a big test folder with some data (esp. for the examples to build). My idea (didn't have time yet) is to provide two big "pooch" payloads: 1: with only the raw test data & the recipe to build the same as 2; 2: the environment having all the data processed for doing the examples etc.

@asyates
Contributor Author

asyates commented Oct 3, 2024

Okay, I think it's working nicely now. I did quite a few different tests to check it works properly with gaps, processing jobs in stages rather than all together, etc.

As said, no SVD component for now, as I'm unsure how it would work cleanly if not processing all data at the same time (i.e. to ensure data is processed equally).

The functionality of the wiener filter right now is:

  • if the wiener filter is applied, the current jobs are also padded with the previous/future 2*M of CCFs, where M is the smoothing duration in the datetime axis.
  • it checks for 'continuous' CCFs to apply the wiener filter to. Gaps shorter than M are 'ignored', i.e. the wiener filter will consider the CCFs adjacent. Otherwise, the wiener filter is applied separately to the different groups of adjacent stacks. Note, gaps are restored post-wiener (pre-stacking).
  • for saving, the first M duration of pre-job stacks (previously pulled in for the purpose of stacking) is removed, as it is at the 'edge' of the 'image', with fewer neighbouring points to use in the filter. The second M duration of pre-job stacks stays, however, and overwrites the previously saved stacks. The idea is that this previous data, if processing in real time for example, may not have had neighbouring (future) points previously... but can now be updated based on data that has come in (to be consistent with the other data processed).
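
The grouping of 'continuous' CCFs in the second point can be sketched as follows. This is a minimal illustration, not the PR's actual code; the helper `group_continuous` and its signature are hypothetical:

```python
from datetime import datetime, timedelta

def group_continuous(times, max_gap):
    """Split sorted datetimes into groups whose internal gaps are all
    <= max_gap; the wiener filter would then run per group (sketch)."""
    groups = [[times[0]]]
    for prev, cur in zip(times, times[1:]):
        if cur - prev > max_gap:
            groups.append([cur])    # gap longer than M: start a new group
        else:
            groups[-1].append(cur)  # gap ignored: CCFs treated as adjacent
    return groups

days = [datetime(2006, 2, d) for d in (1, 2, 3, 10)]
groups = group_continuous(days, max_gap=timedelta(days=2))
# two groups: 2006-02-01..03 together, 2006-02-10 alone
```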

Users set three params in config:

  • wienerfilt: bool, False by default
  • wiener_Mlen: str (timedelta), smoothing in the datetime axis
  • wiener_Nlen: str (timedelta), smoothing in the lag-time axis

Quick example showing the final dv/v for one month of data at Ruapehu, 2-day stacks:

[image: dv/v example, Ruapehu, 2-day stacks]

Still to do: documentation

@asyates
Contributor Author

asyates commented Oct 4, 2024

A couple of smaller changes, and started adding documentation (just in s04_stack2 for now).

Ended up going down a rabbit hole regarding how much padding with data outside of the Todo jobs actually reduces edge influence (even if you pad well beyond the width of the filter, subtle differences can still propagate to the middle of the 'image' despite the neighbouring points not changing).

Given that, I added a warning to the documentation that caution should be exercised if processing data in steps (i.e. not all at the same time).

Below is an example where I tested processing dv/v after applying the wiener filter to all data together, versus to data in different stages (cutoffs indicated by dashed lines; so, for example, the end of the month is simulating reading in 1 day of new data each time). Similar, but some differences.

[image: dv/v comparison, all data processed together vs in stages]

@asyates
Contributor Author

asyates commented Oct 4, 2024

Demonstration that padding doesn't fully prevent values further from the edge of the image changing slightly. In the top row I start by adding 2 rows (of 1s) each time, e.g. to reflect new CCF data coming in, and then apply a wiener filter of length (2, 2). You can see that the values corresponding to the original pattern have subtle differences even when adding four or six rows of constant value.

So I am not sure there is a 'perfect' way to do it, other than maybe having a fixed moving window where we apply the wiener filter, i.e. every N days, so that it is consistent when processing in different stages. But that would be pretty horrid for computation time I imagine, so maybe just having the warning is best for cases where not all data is processed together.

[image: padding demonstration with wiener filter of length (2, 2)]

@ThomasLecocq
Member

@asyates: does the filtering work like any other "smoothing" filter in msnoise, aka "not relying on information in the future" to define the values of day D (so only with (D-N -> D], inclusive)?

@asyates
Contributor Author

asyates commented Nov 21, 2024

At the moment, no... the only way I can see to do that would be to always apply the wiener filter to a fixed window length prior to stacking.

e.g. if the stack size is 1d and the correlation length is 1h, we could apply the wiener filter to the twenty-four 1h functions prior to stacking. I need to look at the implementation again though... I remember thinking that would be quite painful to implement from the current code (also with regard to computation time).

I will try to take a look at the start of Dec. Worst case, I'll remove the wiener-filter-related changes so we can commit the bug fixes for the stacking process.

@ThomasLecocq
Member

ok but so, let's imagine you do the computation daily: the wiener would apply until that day, and when at D+1 it'd recompute the last N days with the new updated information (and it's still toggle-able, default "not apply", right)? This way I can live with the PR as is; just make sure the documentation at the top of the file is accurate/complete & it'll be ready to merge.

@asyates
Contributor Author

asyates commented Nov 27, 2024

Yes, off by default.

For daily computation, and a wiener filter length (wlen) of say 2 days, the filter would be applied to that day + a padding of 2*wlen... so 4 extra days, and the first half of the padding window is discarded before saving (so saving D, re-saving D-1 and D-2, and discarding D-3 and D-4).

So when at D+1, D and D-1 would be saved.
When at D+2, D+1 and D would be saved.
etc.

So a day only stops being overwritten when there is at least the length of the wiener filter between it and the new day.
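
This save/discard rule can be illustrated with a small sketch. The helper name `padding_fate` is made up for illustration, assuming the padding is 2*wlen days of prior data as described:

```python
from datetime import date, timedelta

def padding_fate(current_day, wlen_days):
    """Return (discarded, resaved) padding days for one daily run:
    pad with 2*wlen prior days, discard the farthest wlen (edge-affected),
    re-save the nearest wlen, overwriting the previous output."""
    pad = [current_day - timedelta(days=j) for j in range(2 * wlen_days, 0, -1)]
    return pad[:wlen_days], pad[wlen_days:]

discarded, resaved = padding_fate(date(2024, 6, 10), wlen_days=2)
# with D = 2024-06-10: discard D-4 and D-3, re-save D-2 and D-1
```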

@asyates
Contributor Author

asyates commented Nov 27, 2024

I noticed there was a commit to s04_stack2 that I didn't account for; assuming that is why I have conflicts now. Will check/update.


codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 87.34940% with 21 lines in your changes missing coverage. Please review.

Project coverage is 63.07%. Comparing base (ab464c9) to head (12af69c).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
msnoise/s04_stack2.py 89.34% 13 Missing ⚠️
msnoise/api.py 0.00% 6 Missing ⚠️
msnoise/wiener.py 93.93% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #372      +/-   ##
==========================================
+ Coverage   62.67%   63.07%   +0.40%     
==========================================
  Files          43       44       +1     
  Lines        7482     7601     +119     
==========================================
+ Hits         4689     4794     +105     
- Misses       2793     2807      +14     


@asyates
Contributor Author

asyates commented Nov 27, 2024

I hadn't actually considered that keep_all could potentially be False; I actually assumed the keep_all toggle might eventually be removed, but I guess it makes sense to keep it for daily processing.

In any case, I noticed that the current implementation of pulling in 1-day CCFs following the previous merge doesn't appear to change the mov_stack params to ('1D','1D') as it implies; the mov_rolling and mov_sample remain unchanged.

I need to look at this closer and then, at the same time, see what happens with the wiener filter when pulling in 1-day CCFs...

@asyates
Contributor Author

asyates commented Nov 28, 2024

So it turns out that while it was true the effective sample was '1D' (though with copied values corresponding to the original mov_sample), the original mov_rolling was still working more or less (so it was not '1D','1D'). I have adjusted the warning, and made a few other changes.

It looks like the error that required the 'construct' thing has a recent fix. I was finding that it was using way too much memory otherwise (my process was getting killed when using stack sizes of just 10 days). So I've removed it.

Otherwise, this is my most recent test of results with keep_all on and off. Not so satisfied as to why there are quite significant amplitude differences, so I need to look closer.

[image: results comparison with keep_all on vs off]

@ThomasLecocq
Member

My dream indeed is to make keep_all=Y the default by design, indeed removing it from the configurable parameters... but for laaaarge networks it might create very large volumes of data, e.g. for anyone only willing to create a single REF per pair... not fully sure how to solve that (if possible at all).

@asyates
Contributor Author

asyates commented Nov 28, 2024

I don't see anything wrong causing the amplitude differences, so I think it's expected behaviour relating to the differences between working with many smaller CCFs (corr_duration) versus fewer 1-day stacks (already averaged).

Certainly the biggest amplitude difference between keep_all enabled or not is when the wiener filter is applied, which is not surprising as it completely changes the number of CCFs used in the wiener smoothing (as I currently define the smoothing based on time).

e.g. for 1h CCFs vs 1-day CCFs, for a 2-day wiener smooth (in the datetime axis), the wiener filter length will be 48 vs 2 values respectively.
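
The arithmetic behind this: with the smoothing window defined by a duration, the effective filter length in samples is the window divided by the CCF sampling interval. This is an illustrative sketch only; the function name is made up:

```python
from datetime import timedelta

def wiener_len_samples(mlen, sampling):
    """Effective number of CCFs inside the datetime-axis smoothing window."""
    return int(mlen / sampling)

wiener_len_samples(timedelta(days=2), timedelta(hours=1))  # 48 (keep_all = Y, 1h CCFs)
wiener_len_samples(timedelta(days=2), timedelta(days=1))   # 2  (keep_all = N, 1-day stacks)
```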

So, given keep_all = True and wienerfilt = False by default, and that a user is unlikely to be swapping between keep_all = True and keep_all = False (rather sticking with one or the other), I think I am happy for this to be merged (it also fixes many other issues in s04_stack2.py).

Of course, it could also be expected that the user would adjust the length of the wiener filter depending on whether keep_all = True or False (i.e. a longer window is needed if there are fewer CCFs).

@asyates
Contributor Author

asyates commented Nov 28, 2024

e.g. if I normalize CCFs by maxabs, and set the wiener filter to 2h length with corr_duration = 1800s (to approximately match a 5d length with 1-day CCFs), the amplitudes start looking closer with keep_all toggled on vs off.

So I think all is okay; and as mentioned, the wiener filter is toggled off by default (and keep_all on by default), so it is unlikely any problems will be created (and changing params and getting varied results has always been a thing).

[image: normalized results comparison, keep_all on vs off]

@ThomasLecocq
Member

not fully sure I understand how the two datasets are generated in the last plot, but if indeed the default stuff mimics what the past did, I'm good with this.

@ThomasLecocq
Member

last call, good to merge ? If yes, I'm 🟢 lighting !

@asyates
Contributor Author

asyates commented Nov 28, 2024

Difference in the last two datasets:

Both use cc_normalization = absmax (I can imagine this could reduce differences between keep_all = Y/N in some cases).

blue scatter: keep_all = True, wiener_mlen = 2h -> with CCFs at a 1800s sampling interval (corr_duration), this means a wiener length of 4 CCFs.

orange scatter: keep_all = False, wiener_mlen = 5d -> with CCFs at a 1-day sampling interval, this means a wiener length of 5 CCFs.

The idea was to compare while changing wiener_mlen based on whether keep_all was toggled. Originally I left wiener_mlen = 5d for both, which (for keep_all = Y) would mean a length of 240 CCFs (of 1800s duration) for the filter... hence a more complicated comparison when it would only be 5 CCFs for keep_all = N.

@asyates
Contributor Author

asyates commented Nov 28, 2024

Anyway, good to merge for me. The differences from the past (if wienerfilt = N, as by default) are the bug fixes and pulling in older data when needed for stacking at the edges of continuous segments (e.g. an extra length of mov_rolling, which is then discarded before saving).

@ThomasLecocq
Member

Oki doki ! Merging & will do some tests one day if I have some time! It'd be good to have a new "Example" showcasing the effects of the filters etc, if you have time to work on that in the future! Thanks for this PR !

@ThomasLecocq ThomasLecocq merged commit ec5e555 into ROBelgium:master Nov 28, 2024
11 checks passed