
s04_stack2 bug fixes + mov_stack improvement #372

Merged
merged 21 commits into from
Nov 28, 2024

Conversation

asyates
Contributor

@asyates asyates commented Sep 27, 2024

The following issues (mentioned in issue #371) are fixed by these changes:

  • bugfix: reference function creation is no longer based only on Todo jobs; it now calls get_results_all using a date list generated from build_ref_datelist.
  • bugfix: the overwrite boolean is set to False in xr_save_ccf to avoid overwriting previous STACKS output with only the Todo jobs.
  • when computing moving stacks (-m), gaps in the day list (generated from the current jobs) are identified and filled with the dates necessary to correctly apply the rolling mean, taking the mov_stack size into account.

e.g. for days = ['2006-02-01', '2006-02-02', '2006-02-03', '2006-02-10'] and a stack size (max(mov_rolling)) of two days, the days '2006-01-30', '2006-01-31', '2006-02-08', and '2006-02-09' would be added prior to calling get_results_all.

i.e.:

    # Calculate the maximum mov_rolling value (in days)
    max_mov_rolling = max(pd.to_timedelta(mov_stack[0]).total_seconds() for mov_stack in mov_stacks)
    max_mov_rolling_days = max(1, int(np.ceil(max_mov_rolling / 86400)))  # int needed for range() below

    days = sorted(days)  # ISO date strings sort chronologically
    days = [datetime.datetime.strptime(day, '%Y-%m-%d') for day in days]
    day_diffs = np.diff(days)
    gaps = [i + 1 for i, diff in enumerate(day_diffs) if diff.days > 1]  # indices of days preceded by a gap
    gaps.insert(0, 0)  # index zero is also a 'gap' (previous data needed for stacking)

    all_days = list(days)
    added_days = []  # keep track of days included, for eventual removal before saving CCFs

    for gap_idx in gaps:
        start = days[gap_idx]
        # add the max_mov_rolling_days days preceding each gap
        for j in range(1, max_mov_rolling_days + 1):
            preceding_day = start - datetime.timedelta(days=j)
            if preceding_day not in all_days:
                all_days.append(preceding_day)
                added_days.append(preceding_day)

    added_dates = pd.to_datetime(added_days).values
    c = get_results_all(db, sta1, sta2, filterid, components, all_days, format="xarray")  # get CCFs needed for -m stacking

These additional days are then removed prior to saving via:

    mask = xx.times.dt.floor('D').isin(added_dates)
    xx_cleaned = xx.where(~mask, drop=True)  # remove days not associated with current jobs

Will use this same branch to implement the wiener filter... but these changes/fixes are more fundamental.

@asyates
Contributor Author

asyates commented Oct 2, 2024

Added:

  • the reference function no longer uses jobs, i.e. running msnoise cc stack -r will always compute the reference function regardless of whether STACK jobs are Todo or Done.

To do:

  • Add the wiener filter. For now I will do this without the SVD component and test, as I'm not sure how to implement the SVD decomposition cleanly in a way that all data is processed 'equally', i.e. the number of eigenvectors available could vary a lot depending on how many CCFs are being processed.
  • update documentation

@ThomasLecocq
Member

re: doc: this documentation should work - but you'll need to set up a big test folder with some data (esp. for the examples to build). My idea (didn't have time yet) is to provide two big "pooch" payloads: 1: with only the raw test data & the recipe to build the same as 2; 2: the environment having all the data processed for doing the examples etc.

@asyates
Contributor Author

asyates commented Oct 3, 2024

Okay, I think it's working nicely now. I did quite a few different tests to check it works properly with gaps, processing jobs in stages rather than all together, etc.

As said, no SVD component for now, as I'm unsure how it would work cleanly if not processing all data at the same time (i.e. to ensure data is processed equally).

The functionality of the wiener filter right now is:

  • if the wiener filter is applied, the current jobs are also padded with the previous/future 2*M of CCFs, where M is the smoothing duration in the datetime axis.
  • it checks for 'continuous' CCFs to apply the wiener filter to. Gaps shorter than M are 'ignored', i.e. the wiener filter will consider the CCFs adjacent. Otherwise, the wiener filter is applied separately to the different groups of adjacent stacks. Note, gaps are restored post-wiener (pre-stacking).
  • for saving, the first M duration of pre-job stacks (previously pulled in for the purpose of stacking) is removed, as it is at the 'edge' of the 'image', with fewer neighbouring points to use in the filter. The second M duration of pre-job stacks stays, however, and overwrites the previously saved stacks. The idea is that this previous data, if processing in real time for example, may not have had neighbouring (future) points previously... but can now be updated based on data that has come in (to be consistent with the other data processed).
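
The grouping of 'continuous' CCFs in the second point can be sketched as follows. This is a minimal illustration, not the PR's actual code; the helper `group_continuous` and its signature are hypothetical:

```python
from datetime import datetime, timedelta

def group_continuous(times, max_gap):
    """Split sorted datetimes into groups whose internal gaps are all
    <= max_gap; the wiener filter would then run per group (sketch)."""
    groups = [[times[0]]]
    for prev, cur in zip(times, times[1:]):
        if cur - prev > max_gap:
            groups.append([cur])    # gap longer than M: start a new group
        else:
            groups[-1].append(cur)  # gap ignored: CCFs treated as adjacent
    return groups

days = [datetime(2006, 2, d) for d in (1, 2, 3, 10)]
groups = group_continuous(days, max_gap=timedelta(days=2))
# two groups: 2006-02-01..03 together, 2006-02-10 alone
```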

Users set three params in config:

  • wienerfilt: bool, False by default
  • wiener_Mlen: str (timedelta), smoothing in the datetime axis
  • wiener_Nlen: str (timedelta), smoothing in the lag-time axis

Quick example showing the final dv/v for one month of data at Ruapehu, 2-day stacks:

[image: dv/v example, Ruapehu, 2-day stacks]

Still to do: documentation

@asyates
Contributor Author

asyates commented Oct 4, 2024

A couple of smaller changes, and started adding documentation (just in s04_stack2 for now).

Ended up going down a rabbit hole regarding how much padding with data outside of the Todo jobs actually reduces edge influence (even if you pad well beyond the width of the filter, subtle differences can still propagate to the middle of the 'image' despite the neighbouring points not changing).

Given that, I added a warning to the documentation that caution should be exercised if processing data in steps (i.e. not all at the same time).

Below is an example where I tested processing dv/v after applying the wiener filter to all data together, versus to data in different stages (cutoffs indicated by dashed lines; so, for example, the end of the month is simulating reading in 1 day of new data each time). Similar, but some differences.

[image: dv/v comparison, all data processed together vs in stages]

@asyates
Contributor Author

asyates commented Oct 4, 2024

Demonstration that padding doesn't fully prevent values further from the edge of the image changing slightly. In the top row I start by adding 2 rows (of 1s) each time, e.g. to reflect new CCF data coming in, and then apply a wiener filter of length (2, 2). You can see that the values corresponding to the original pattern have subtle differences even when adding four or six rows of constant value.

So I am not sure there is a 'perfect' way to do it, other than maybe having a fixed moving window where we apply the wiener filter, i.e. every N days, so that it is consistent when processing in different stages. But that would be pretty horrid for computation time I imagine, so maybe just having the warning is best for cases where not all data is processed together.

[image: padding demonstration with wiener filter of length (2, 2)]

@ThomasLecocq
Member

@asyates: does the filtering work like any other "smoothing" filter in msnoise, aka "not relying on information in the future" to define the values of day D (so only with (D-N -> D], inclusive)?

@asyates
Contributor Author

asyates commented Nov 21, 2024

At the moment, no... the only way I can see to do that would be to always apply the wiener filter to a fixed window length prior to stacking.

e.g. if the stack size is 1d and the correlation length is 1h, we could apply the wiener filter to the twenty-four 1h functions prior to stacking. I need to look at the implementation again though... I remember thinking that would be quite painful to implement from the current code (also with regard to computation time).

I will try to take a look at the start of Dec. Worst case, I'll remove the wiener-filter-related changes so we can commit the bug fixes for the stacking process.

@ThomasLecocq
Member

ok but so, let's imagine you do the computation daily: the wiener would apply until that day, and when at D+1 it'd recompute the last N days with the new updated information (and it's still toggle-able, default "not apply", right)? This way I can live with the PR as is; just make sure the documentation at the top of the file is accurate/complete & it'll be ready to merge.

@asyates
Contributor Author

asyates commented Nov 27, 2024

Yes, off by default.

For daily computation, and a wiener filter length (wlen) of say 2 days, the filter would be applied to that day + a padding of 2*wlen... so 4 extra days, and the first half of the padding window is discarded before saving (so saving D, re-saving D-1 and D-2, and discarding D-3 and D-4).

So when at D+1, D and D-1 would be saved.
When at D+2, D+1 and D would be saved.
etc.

So a day only stops being overwritten when there is at least the length of the wiener filter between it and the new day.
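
This save/discard rule can be illustrated with a small sketch. The helper name `padding_fate` is made up for illustration, assuming the padding is 2*wlen days of prior data as described:

```python
from datetime import date, timedelta

def padding_fate(current_day, wlen_days):
    """Return (discarded, resaved) padding days for one daily run:
    pad with 2*wlen prior days, discard the farthest wlen (edge-affected),
    re-save the nearest wlen, overwriting the previous output."""
    pad = [current_day - timedelta(days=j) for j in range(2 * wlen_days, 0, -1)]
    return pad[:wlen_days], pad[wlen_days:]

discarded, resaved = padding_fate(date(2024, 6, 10), wlen_days=2)
# with D = 2024-06-10: discard D-4 and D-3, re-save D-2 and D-1
```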

@asyates
Contributor Author

asyates commented Nov 27, 2024

I noticed there was a commit to s04_stack2 that I didn't account for; assuming that is why I have conflicts now. Will check/update.


codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 87.34940% with 21 lines in your changes missing coverage. Please review.

Project coverage is 63.07%. Comparing base (ab464c9) to head (12af69c).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
msnoise/s04_stack2.py 89.34% 13 Missing ⚠️
msnoise/api.py 0.00% 6 Missing ⚠️
msnoise/wiener.py 93.93% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #372      +/-   ##
==========================================
+ Coverage   62.67%   63.07%   +0.40%     
==========================================
  Files          43       44       +1     
  Lines        7482     7601     +119     
==========================================
+ Hits         4689     4794     +105     
- Misses       2793     2807      +14     


@asyates
Contributor Author

asyates commented Nov 27, 2024

I hadn't actually considered that keep_all could potentially be False; I actually assumed the keep_all toggle might eventually be removed, but I guess it makes sense to keep it for daily processing.

In any case, I noticed that the current implementation of pulling in 1-day CCFs following the previous merge doesn't appear to change the mov_stack params to ('1D','1D') as it implies; the mov_rolling and mov_sample remain unchanged.

I need to look at this closer and then, at the same time, see what happens with the wiener filter when pulling in 1-day CCFs...

@asyates
Contributor Author

asyates commented Nov 28, 2024

So it turns out that while it was true the effective sample was '1D' (though with copied values corresponding to the original mov_sample), the original mov_rolling was still working more or less (so it was not '1D','1D'). I have adjusted the warning, and made a few other changes.

It looks like the error that required the 'construct' thing has a recent fix. I was finding that it was using way too much memory otherwise (my process was getting killed when using stack sizes of just 10 days). So I've removed it.

Otherwise, this is my most recent test of results with keep_all on and off. Not so satisfied as to why there are quite significant amplitude differences, so I need to look closer.

[image: results comparison with keep_all on vs off]

@ThomasLecocq
Member

My dream indeed is to make keep_all=Y the default by design, indeed removing it from the configurable parameters... but for laaaarge networks it might create very large volumes of data, e.g. for anyone only willing to create a single REF per pair... not fully sure how to solve that (if possible at all).

@asyates
Contributor Author

asyates commented Nov 28, 2024

I don't see anything wrong causing the amplitude differences, so I think it's expected behaviour relating to the differences between working with many smaller CCFs (corr_duration) versus fewer 1-day stacks (already averaged).

Certainly the biggest amplitude difference between keep_all enabled or not is when the wiener filter is applied, which is not surprising as it completely changes the number of CCFs used in the wiener smoothing (as I currently define the smoothing based on time).

e.g. for 1h CCFs vs 1-day CCFs, for a 2-day wiener smooth (in the datetime axis), the wiener filter length will be 48 vs 2 values respectively.
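
The arithmetic behind this: with the smoothing window defined by a duration, the effective filter length in samples is the window divided by the CCF sampling interval. This is an illustrative sketch only; the function name is made up:

```python
from datetime import timedelta

def wiener_len_samples(mlen, sampling):
    """Effective number of CCFs inside the datetime-axis smoothing window."""
    return int(mlen / sampling)

wiener_len_samples(timedelta(days=2), timedelta(hours=1))  # 48 (keep_all = Y, 1h CCFs)
wiener_len_samples(timedelta(days=2), timedelta(days=1))   # 2  (keep_all = N, 1-day stacks)
```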

So, given keep_all = True and wienerfilt = False by default, and that a user is unlikely to be swapping between keep_all = True and keep_all = False (rather sticking with one or the other), I think I am happy for this to be merged (it also fixes many other issues in s04_stack2.py).

Of course, it could also be expected that the user would adjust the length of the wiener filter depending on whether keep_all = True or False (i.e. a longer window is needed if there are fewer CCFs).

@asyates
Contributor Author

asyates commented Nov 28, 2024

e.g. if I normalize CCFs by maxabs, and set the wiener filter to 2h length with corr_duration = 1800s (to approximately match a 5d length with 1-day CCFs), the amplitudes start looking closer with keep_all toggled on vs off.

So I think all is okay; and as mentioned, the wiener filter is toggled off by default (and keep_all on by default), so it is unlikely any problems will be created (and changing params and getting varied results has always been a thing).

[image: normalized results comparison, keep_all on vs off]

@ThomasLecocq
Member

not fully sure I understand how the two datasets are generated in the last plot, but if indeed the default stuff mimics what the past did, I'm good with this.

@ThomasLecocq
Member

last call, good to merge ? If yes, I'm 🟢 lighting !

@asyates
Contributor Author

asyates commented Nov 28, 2024

Difference in the last two datasets:

Both use cc_normalization = absmax (I can imagine this could reduce differences between keep_all = Y/N in some cases).

blue scatter: keep_all = True, wiener_mlen = 2h -> with CCFs at a 1800s sampling interval (corr_duration), this means a wiener length of 4 CCFs.

orange scatter: keep_all = False, wiener_mlen = 5d -> with CCFs at a 1-day sampling interval, this means a wiener length of 5 CCFs.

The idea was to compare while changing wiener_mlen based on whether keep_all was toggled. Originally I left wiener_mlen = 5d for both, which (for keep_all = Y) would mean a length of 240 CCFs (of 1800s duration) for the filter... hence a more complicated comparison when it would only be 5 CCFs for keep_all = N.

@asyates
Contributor Author

asyates commented Nov 28, 2024

Anyway, good to merge for me. The differences from the past (if wienerfilt = N, as by default) are the bug fixes and pulling in older data when needed for stacking at the edges of continuous segments (e.g. an extra length of mov_rolling, which is then discarded before saving).

@ThomasLecocq
Member

Oki doki ! Merging & will do some tests one day if I have some time! It'd be good to have a new "Example" showcasing the effects of the filters etc, if you have time to work on that in the future! Thanks for this PR !

@ThomasLecocq ThomasLecocq merged commit ec5e555 into ROBelgium:master Nov 28, 2024
11 checks passed