
QC on GPS data + removing dlr/ulr for bad t_rad + recalculating RH from averaged vapour pressures #241

Merged
22 commits merged into main on May 13, 2024

Conversation

Member
@BaptisteVandecrux commented May 7, 2024

Persistence QC is done on both raw and tx data, and extended to rh and wspd

ds = persistence_qc(ds) # Flag and remove persistence outliers
# if ds.attrs['format'] == 'TX':
# # TODO: The configuration should be provided explicitly
# outlier_detector = ThresholdBasedOutlierDetector.default()
# ds = outlier_detector.filter_data(ds) # Flag and remove percentile outliers

"rh": {"max_diff": 0.0001, "period": 2}, # gets special handling to allow constant 100%
"wspd": {"max_diff": 0.0001, "period": 6},

QC on GPS data (#238)

Done through

  • a persistence QC on gps_lat, gps_lon

  • a persistence QC on gps_alt

    'gps_lat_lon':{"max_diff": 0.000001, "period": 12}, # gets special handling to remove simultaneously constant gps_lat and gps_lon
    'gps_alt':{"max_diff": 0.0001, "period": 12},

    If the variable 'gps_lat_lon' is given, only the times when both gps_lat and gps_lon are static are removed:
    if (v in df) | (v == 'gps_lat_lon'):
        if v != 'gps_lat_lon':
            mask = find_persistent_regions(df[v], period, max_diff)
            if 'rh' in v:
                mask = mask & (df[v] < 99)
            n_masked = mask.sum()
            n_samples = len(mask)
            logger.info(
                f"Applying persistent QC in {v}. Filtering {n_masked}/{n_samples} samples"
            )
            # setting outliers to NaN
            df.loc[mask, v] = np.nan
        else:
            mask = (find_persistent_regions(df['gps_lon'], period, max_diff)
                    & find_persistent_regions(df['gps_lat'], period, max_diff))
            n_masked = mask.sum()
            n_samples = len(mask)
            logger.info(
                f"Applying persistent QC in {v}. Filtering {n_masked}/{n_samples} samples"
            )
            # setting outliers to NaN

  • removing all the elevations that deviate from an elevation baseline (gap-filled monthly median) by more than 100 m

  • removing gps_lat and gps_lon when gps_alt is missing or has been removed by the steps above

    # filtering gps_lat, gps_lon and gps_alt based on the difference to a baseline elevation
    # right now baseline elevation is gap-filled monthly median elevation
    baseline_elevation = (ds.gps_alt.to_series().resample('M').median()
                          .reindex(ds.time.to_series().index, method='nearest')
                          .ffill().bfill())
    mask = (np.abs(ds.gps_alt - baseline_elevation) < 100) & ds.gps_alt.notnull()
    ds[['gps_alt', 'gps_lon', 'gps_lat']] = ds[['gps_alt', 'gps_lon', 'gps_lat']].where(mask)
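
As a rough illustration of the baseline filter (a minimal sketch on toy values, not real station data), the same lines applied to a short synthetic series behave as follows:

    # Toy sketch of the baseline-elevation filter above (synthetic values, hypothetical station):
    # a gps_alt spike deviating by more than 100 m from the monthly-median baseline is masked,
    # and gps_lat/gps_lon are dropped together with it.
    import numpy as np
    import pandas as pd
    import xarray as xr

    time = pd.date_range('2024-05-01', periods=5, freq='D')
    ds = xr.Dataset({'gps_alt': ('time', [300., 302., 950., 301., np.nan]),
                     'gps_lat': ('time', [61.18] * 5),
                     'gps_lon': ('time', [-46.82] * 5)},
                    coords={'time': time})

    baseline_elevation = (ds.gps_alt.to_series().resample('M').median()
                          .reindex(ds.time.to_series().index, method='nearest')
                          .ffill().bfill())
    mask = (np.abs(ds.gps_alt - baseline_elevation) < 100) & ds.gps_alt.notnull()
    print(ds[['gps_alt', 'gps_lon', 'gps_lat']].where(mask).gps_alt.values)
    # -> [300. 302.  nan 301.  nan]: the 950 m spike and the missing sample are removed,
    #    and gps_lat/gps_lon are NaN at the same timestamps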

See for example the figure below: all the pink points have missing or bad gps_alt, and the baseline elevation is the black dashed line.
[Figure QAS_U_0: GPS altitude at QAS_U with the baseline elevation]

Removing dlr and ulr when t_rad is bad or not available (#240)

  • static t_rad are identified and removed using persistence QC
    't_rad':{"max_diff": 0.0001, "period": 2},
  • once QC has been done (either manually or with persistence_qc), dlr and ulr are kept only where t_rad is non-null
    # removing dlr and ulr that are missing t_rad
    # this is done now because t_rad can be filtered either manually or with persistence
    ds['dlr'] = ds.dlr.where(ds.t_rad.notnull())
    ds['ulr'] = ds.ulr.where(ds.t_rad.notnull())

Re-calculating rh from time-average vapour pressures (#193)

RH is defined as the ratio between partial vapour pressure and saturation vapour pressure. When calculating hourly, daily or monthly averages, the ratio of the mean vapour pressures should be taken instead of the mean of the ratios (as currently done).

for var in ['rh_u', 'rh_l']:
    lvl = var.split('_')[1]
    if var in df_d.columns:
        if ('t_' + lvl in ds_h.keys()):
            es_wtr, es_cor = calculateSaturationVaporPressure(ds_h['t_' + lvl])
            p_vap = ds_h[var] / 100 * es_wtr
            df_d[var] = (p_vap.to_series().resample(t).mean()
                         / es_wtr.to_series().resample(t).mean()) * 100
            df_d[var + '_cor'] = (p_vap.to_series().resample(t).mean()
                                  / es_cor.to_series().resample(t).mean()) * 100

def calculateSaturationVaporPressure(t, T_0=273.15, T_100=373.15, es_0=6.1071,
                                     es_100=1013.246, eps=0.622):
    '''Calculate saturation vapour pressure over water and over water/ice

    Parameters
    ----------
    t : xarray.DataArray
        Air temperature (C)
    T_0 : float
        Melting point temperature in Kelvin. Default is 273.15.
    T_100 : float
        Steam point temperature in Kelvin. Default is 373.15.
    es_0 : float
        Saturation vapour pressure at the melting point (hPa)
    es_100 : float
        Saturation vapour pressure at the steam point (hPa)

    Returns
    -------
    es_wtr : xarray.DataArray
        Saturation vapour pressure over water (hPa)
    es_cor : xarray.DataArray
        Saturation vapour pressure over water above 0 C and over ice below 0 C (hPa)
    '''
    # Saturation vapour pressure above 0 C (hPa)
    es_wtr = 10**(-7.90298 * (T_100 / (t + T_0) - 1) + 5.02808 * np.log10(T_100 / (t + T_0))
                  - 1.3816E-7 * (10**(11.344 * (1 - (t + T_0) / T_100)) - 1)
                  + 8.1328E-3 * (10**(-3.49149 * (T_100 / (t + T_0) - 1)) - 1) + np.log10(es_100))
    # Saturation vapour pressure below 0 C (hPa)
    es_ice = 10**(-9.09718 * (T_0 / (t + T_0) - 1) - 3.56654
                  * np.log10(T_0 / (t + T_0)) + 0.876793
                  * (1 - (t + T_0) / T_0)
                  + np.log10(es_0))
    # Saturation vapour pressure (hPa)
    es_cor = xr.where(t < 0, es_ice, es_wtr)
    return es_wtr, es_cor
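
As a sanity check of the averaging order, here is a minimal sketch (toy hourly values, not station data) that computes a daily RH both ways using the function above; the two results differ because the saturation vapour pressure varies non-linearly with temperature:

    import numpy as np
    import pandas as pd
    import xarray as xr

    # toy hourly air temperature (C) and relative humidity (%) for a single day
    time = pd.date_range('2024-05-01', periods=24, freq='H')
    t = xr.DataArray(np.linspace(-10, 5, 24), dims='time', coords={'time': time})
    rh = xr.DataArray(np.linspace(60, 100, 24), dims='time', coords={'time': time})

    es_wtr, es_cor = calculateSaturationVaporPressure(t)
    p_vap = rh / 100 * es_wtr  # partial vapour pressure (hPa)

    # ratio of the daily-mean vapour pressures (the approach adopted here)
    rh_daily_new = (p_vap.to_series().resample('D').mean()
                    / es_wtr.to_series().resample('D').mean()) * 100
    # daily mean of the hourly ratios (the previous approach)
    rh_daily_old = rh.to_series().resample('D').mean()

    print(float(rh_daily_new.iloc[0]), float(rh_daily_old.iloc[0]))  # the two values differ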

Removing percentile QC

See #242.

Smoothing and gap-filling tilt (#176, #123)

This is necessary for the calculation of dsr_cor and usr_cor. It is done here:

# smoothing tilt
for v in ['tilt_x', 'tilt_y']:
    threshold = 0.2
    ds[v] = ds[v].where(
        ds[v].to_series().resample('H').median().rolling(
            3*24, center=True, min_periods=2
        ).std().reindex(ds.time, method='bfill').values < threshold
    ).ffill(dim='time').bfill(dim='time')

# rotation gets harsher treatment
v = 'rot'
ds[v] = ('time', (ds[v].where(
    ds[v].to_series().resample('H').median().rolling(
        3*24, center=True, min_periods=2
    ).std().reindex(ds.time, method='bfill').values < 4
    ).ffill(dim='time')
    .to_series().resample('D').median()
    .rolling(7*2, center=True, min_periods=2).median()
    .reindex(ds.time, method='bfill').values
))

@@ -66,24 +66,39 @@ def toL2(
    except Exception:
        logger.exception('Flagging and fixing failed:')

    if ds.attrs['format'] == 'TX':
Contributor

Consider how the "period" parameter is interpreted by persistence_qc. I am not sure the sample rates are handled correctly, which means that a 10-minute raw dataset will be processed differently than hourly tx data.

Member Author

Although the documentation of persistence_qc said that period is the number of hours a value can stay the same without being set to NaN, it was, as you suspected, the number of persistent values that was counted.

I adapted the code to handle the duration of persistent periods:

def duration_consecutive_true(
    series: Union[pd.Series, pd.DataFrame]
) -> Union[pd.Series, pd.DataFrame]:
    """
    From a boolean series, calculates the duration, in hours, of the periods with consecutive true values.

    Examples
    --------
    >>> duration_consecutive_true(pd.Series([False, True, False, False, True, True, True, False, True]))
    pd.Series([0, 1, 0, 0, 1, 2, 3, 0, 1])

    Parameters
    ----------
    series
        Boolean pandas Series or DataFrame

    Returns
    -------
    consecutive_true_duration
        Pandas Series or DataFrame with values representing the duration, in hours, of consecutive true values.
    """
    # assert series.dtype == bool
    cumsum = ((series.index - series.index[0]).total_seconds() / 3600).to_series(index=series.index)
    is_first = series.astype("int").diff() == 1
    offset = (is_first * cumsum).replace(0, np.nan).fillna(method="ffill").fillna(0)
    return (cumsum - offset) * series
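
A quick usage sketch (toy 10-minute index, hypothetical values): the duration is now derived from the timestamps rather than from the number of samples, which is why the series needs a DatetimeIndex:

    import numpy as np
    import pandas as pd

    idx = pd.date_range('2024-05-01', periods=6, freq='10min')
    static = pd.Series([False, True, True, True, False, True], index=idx)
    print(duration_consecutive_true(static).round(2).tolist())
    # -> [0.0, 0.0, 0.17, 0.33, 0.0, 0.0]
    # hours elapsed since the first sample of each consecutive-True run
    # (so the first sample of a run counts as 0 hours)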

I updated the default threshold values to reflect this (but they are still up for discussion):

# period is given in hours, 2 persistent 10 min values will be flagged if period < 0.333
DEFAULT_VARIABLE_THRESHOLDS = {
    "t": {"max_diff": 0.0001, "period": 0.3},
    "p": {"max_diff": 0.0001, "period": 0.3},
    'gps_lat_lon':{"max_diff": 0.000001, "period": 6}, # gets special handling to remove simultaneously constant gps_lat and gps_lon
    'gps_alt':{"max_diff": 0.0001, "period": 6},
    't_rad':{"max_diff": 0.0001, "period": 0.3},
    "rh": {"max_diff": 0.0001, "period": 0.3}, # gets special handling to allow constant 100%
    "wspd": {"max_diff": 0.0001, "period": 6},
}

@@ -224,7 +224,7 @@ def loadL0(self):

    except pd.errors.ParserError as e:
        # ParserError: Too many columns specified: expected 40 and found 38
        logger.info(f'-----> No msg_lat or msg_lon for {k}')
        # logger.info(f'-----> No msg_lat or msg_lon for {k}')
Contributor

Why did you remove this log message? Is it a common case?

Member Author
@BaptisteVandecrux May 13, 2024

msg_lat and msg_lon are no longer printed in the output files because they are deemed unreliable (#199).
They are still extracted from the SBD messages and appended to some, but not all, transmission files. In the future we should stop extracting them, but in the meantime there is no need to warn anyone when they are not found in transmission files.

@@ -103,6 +118,28 @@ def toL2(
    lat = ds['gps_lat'].mean()
    lon = ds['gps_lon'].mean()

    # smoothing tilt
Contributor

It looks like you are doing a 72-hour standard deviation filter with different thresholds for tilt and rotation, and in addition a subsequent smoothing of the rotation data. Correct?

I find this part difficult to read. Consider splitting the functionality out into helper functions for readability and testability.
I've asked ChatGPT for a suggestion 😄 :

import numpy as np
import xarray as xr
import pandas as pd

def apply_std_threshold_filter(data: xr.DataArray, resample_period: str, rolling_window_size: int,
                               std_threshold: float, time_dim: str = 'time') -> xr.DataArray:
    """Apply a standard deviation threshold filter to time series data."""
    # Convert to pandas Series, resample, and calculate rolling standard deviation
    series: pd.Series = data.to_series()
    medians: pd.Series = series.resample(resample_period).median()
    std_dev: pd.Series = medians.rolling(window=rolling_window_size, center=True, min_periods=2).std()
    reindexed_std: np.ndarray = std_dev.reindex(data[time_dim], method='bfill').values

    # Filter data based on standard deviation threshold
    filtered_data: xr.DataArray = data.where(reindexed_std < std_threshold)
    return filtered_data.ffill(dim=time_dim).bfill(dim=time_dim)

def process_tilt(ds: xr.Dataset, variables: list[str], std_threshold: float = 0.2) -> None:
    """Process tilt data variables using a standard deviation threshold."""
    for variable in variables:
        ds[variable] = apply_std_threshold_filter(ds[variable], 'H', 3*24, std_threshold)

def process_rotation(ds: xr.Dataset, variable: str = 'rot', std_threshold: float = 4,
                     smoother_window_size: int = 14) -> None:
    """Process rotation data with a harsher treatment and additional smoothing."""
    # Apply harsher std threshold filtering
    filtered_data: xr.DataArray = apply_std_threshold_filter(ds[variable], 'H', 3*24, std_threshold)

    # Additional smoothing by resampling to daily and applying median rolling
    daily_medians: pd.Series = filtered_data.to_series().resample('D').median()
    smoothed_daily: pd.Series = daily_medians.rolling(window=smoother_window_size, center=True, min_periods=2).median()
    reindexed_smoothed: np.ndarray = smoothed_daily.reindex(ds.time, method='bfill').values

    # Update the dataset
    ds[variable] = ('time', reindexed_smoothed)

# Example usage:
ds = xr.Dataset({
    'tilt_x': ('time', pd.Series(range(100))),
    'tilt_y': ('time', pd.Series(range(100))),
    'rot': ('time', pd.Series(range(100)))
})
ds['time'] = pd.date_range(start='2020-01-01', periods=100, freq='T')

process_tilt(ds, ['tilt_x', 'tilt_y'])
process_rotation(ds, 'rot')

Member Author
@BaptisteVandecrux May 13, 2024

I guess it is a matter of taste.

I don't think a two-line function like process_tilt makes sense.
I also think that declaring the variable type at each operation (e.g. daily_medians: pd.Series = ...) decreases readability.

I have moved these lines into two separate functions:

# smoothing tilt and rot
ds['tilt_x'] = smoothTilt(ds['tilt_x'])
ds['tilt_y'] = smoothTilt(ds['tilt_y'])
ds['rot'] = smoothRot(ds['rot'])

def smoothTilt(da: xr.DataArray, threshold=0.2):
    '''Smooth the station tilt

    Parameters
    ----------
    da : xarray.DataArray
        either X or Y tilt inclinometer measurements
    threshold : float
        threshold used in a standard-deviation based filter

    Returns
    -------
    xarray.DataArray
        either X or Y smoothed tilt inclinometer measurements
    '''
    # we calculate the moving standard deviation over a 3-day sliding window
    # hourly resampling is necessary to make sure the same threshold can be used
    # for 10 min and hourly data
    moving_std_gap_filled = da.to_series().resample('H').median().rolling(
        3*24, center=True, min_periods=2
    ).std().reindex(da.time, method='bfill').values
    # we select the good timestamps and gapfill assuming that
    # - when tilt goes missing the last available value is used
    # - when tilt is not available for the very first time steps, the first
    #   good value is used for backfill
    return da.where(
        moving_std_gap_filled < threshold
    ).ffill(dim='time').bfill(dim='time')


def smoothRot(da: xr.DataArray, threshold=4):
    '''Smooth the station rotation

    Parameters
    ----------
    da : xarray.DataArray
        rotation measurements from inclinometer
    threshold : float
        threshold used in a standard-deviation based filter

    Returns
    -------
    xarray.DataArray
        smoothed rotation measurements from inclinometer
    '''
    moving_std_gap_filled = da.to_series().resample('H').median().rolling(
        3*24, center=True, min_periods=2
    ).std().reindex(da.time, method='bfill').values
    # same as for tilt with, in addition:
    # - a resampling to daily values
    # - a two-week median smoothing
    # - a resampling from these daily values to the original temporal resolution
    return ('time', (da.where(moving_std_gap_filled < threshold).ffill(dim='time')
                     .to_series().resample('D').median()
                     .rolling(7*2, center=True, min_periods=2).median()
                     .reindex(da.time, method='bfill').values
                     ))

Is that better?

ds['usr_cor'] = ds['usr_cor'].interpolate_na(dim='time', use_coordinate=False)


ds['dsr_cor'] = ds.dsr_cor.where(ds.dsr.notnull())
Contributor

What is the intention of this line? Is it to ensure we get NaNs instead of Nones?

Member Author

This was just correcting an issue I found.

When calculating dsr_cor and usr_cor, some additional checks are performed and some timestamps are set to NaN.
The original code meant to linearly interpolate the filtered timestamps in order to increase data coverage.
But what I saw is that sometimes entire days were set to NaN while the previous and following nights (when usr = dsr = 0) remained. The interpolation then bridged the two nights and set the whole day to dsr_cor = usr_cor = 0.

Another way of putting it is that we should not interpolate over gaps without a thorough check, and even less as early as L2. So I removed the interpolation and made sure that no values were added in dsr_cor compared to dsr.
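
A minimal sketch (toy values, not station data) of the failure mode described above: if the daytime samples are flagged as NaN but the surrounding night-time zeros remain, interpolate_na bridges the gap with zeros, whereas masking dsr_cor on dsr keeps the gap as missing.

    import numpy as np
    import pandas as pd
    import xarray as xr

    # toy 3-hourly dsr_cor: the daytime samples were flagged (NaN) but the
    # night-time zeros before and after remained
    time = pd.date_range('2024-05-01', periods=8, freq='3H')
    dsr_cor = xr.DataArray([0., 0., np.nan, np.nan, np.nan, np.nan, 0., 0.],
                           dims='time', coords={'time': time})

    print(dsr_cor.interpolate_na(dim='time', use_coordinate=False).values)
    # -> all zeros: the interpolation bridges the two nights and fills the whole day with 0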

@ladsmund
Contributor

Next time:
Please split up the updates into multiple pull requests. 🙏
This will make it much easier to read and discuss the changes 😅

- better doc for calculateSaturationVaporPressure
- default threshold values converted to hours in persistence_qc
- removed unnecessary comment regarding cloud coverage calculation for bedrock stations
'gps_alt':{"max_diff": 0.0001, "period": 12},
't_rad':{"max_diff": 0.0001, "period": 2},
"rh": {"max_diff": 0.0001, "period": 2}, # gets special handling to allow constant 100%
"t": {"max_diff": 0.0001, "period": 0.3},
Contributor

I think the period length for t and p was meant to be 2 hours and not 20 minutes.

Member Author
@BaptisteVandecrux May 13, 2024

Ah, yes, because it was meant to handle transmissions.
But now that we are talking about it, I suggest that we keep period = 0.3 hours, because we know that the loggers were, for a time, wrongly programmed to report the last valid number instead of NaN. So we cannot differentiate a good measurement with the same value as the previous one (rather unlikely) from a missing observation filled with the last valid value.

Also, when period = 2 hours, 12 persistent 10-min values would remain at the start of a long static period. I would like those to be removed, which would be the case with period = 0.3 hours.
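
A back-of-the-envelope check of those numbers, assuming samples are flagged once their static duration reaches the period threshold (the exact comparison inside find_persistent_regions may differ in detail):

    import pandas as pd

    # a long, fully static stretch of 10-minute samples
    idx = pd.date_range('2024-05-01', periods=30, freq='10min')
    duration = duration_consecutive_true(pd.Series(True, index=idx))  # hours since the run started

    print((duration < 2).sum())    # 12 samples would survive with period = 2 h
    print((duration < 0.3).sum())  # only 2 samples would survive with period = 0.3 h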

Contributor

Maybe it is necessary to define configurations specifically for tx, raw, hourly and 10-minute data.

p_u and t_u only have single-decimal precision in the tx data files, hence I think there is a significant risk of filtering out valid data.

Member Author

> p_u and t_u only have single-decimal precision in the tx data files, hence I think there is a significant risk of filtering out valid data.

I didn't know that, good point! I'll switch back to period = 2 hours until we have specific handling for raw data.

@BaptisteVandecrux merged commit 6ad5893 into main on May 13, 2024
4 checks passed
@BaptisteVandecrux deleted the added-qc-on-gps-data branch on May 13, 2024 at 19:02