Added functionality to output to parquet #761
Conversation
In general, I like the idea of adding this as an option for possible output formats. There are some details of the implementation that need to be discussed/considered a bit, I think, however.
Also, a more general note: configuring your local dev environment's git to ignore whitespace changes may be helpful on PRs like this which touch many files. It looks like some auto formatting was applied, and the formatting changes can make review more challenging when trying to see all the places in the code which have changed in functionality. If you want to contribute formatting changes, I would suggest putting them all into a single commit on a PR, or an independent PR.
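For example, whitespace-only noise can be hidden when reviewing locally with git's standard diff flags (nothing here is specific to this repo):

```sh
# show the working-tree diff with whitespace-only changes ignored
git diff -w

# the same flag applies when reviewing a range of commits
git log -p -w master..feature-branch
```

GitHub's "Files changed" view offers the equivalent via its "Hide whitespace" option.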
timeseries_df['units'] = timeseries_df['variable_name'].map(variable_to_units_map)
timeseries_df['reference_time'] = start_datetime.date()
timeseries_df['location_id'] = timeseries_df['location_id'].astype('string')
timeseries_df['location_id'] = 'nex-' + timeseries_df['location_id']
I don't think it is generically appropriate to label all locations in the routing data frame with a nex- prefix. This may be applicable to routing which uses a hy_features network, but wouldn't be applicable to one which uses an nhd network. Is there any guarantee before this point that this function only gets applied to hy_features network results? Also, even within a hy_features network, the routed segments this output relates to are the waterbodies, which typically have a wb- prefix in the hydrofabric identifiers, not nex-. The nexus and waterbody features are related, but distinct, concepts.
See comment below...
df.index.name = 'location_id'
df.reset_index(inplace=True)
timeseries_df = df.melt(id_vars=['location_id'], var_name='var')
I think if you use value_vars=['q', 'v', 'd'] as a kwarg to melt, you might have an easier time extracting from the un-pivoted table. For the method below, it seems like there should be a better way besides casting to string, manipulating the string, and recasting to numeric/datetime types. I'm not sure exactly what the df looks like at the point where it is being manipulated, but I would try to consider different methods for extracting the needed data from it. If for some reason this is the only way, then please comment this implementation to describe the state of the df and why it needs to be manipulated this way.
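For reference, a minimal sketch of the suggested call, assuming a frame with plain 'q'/'v'/'d' columns (which, per the reply below, the real df does not have):

```python
import pandas as pd

# hypothetical frame with one plain column per variable
df = pd.DataFrame({'location_id': [101], 'q': [1.0], 'v': [0.5], 'd': [0.2]})

# melt only the named value columns into long format
timeseries_df = df.melt(id_vars=['location_id'],
                        value_vars=['q', 'v', 'd'],
                        var_name='var')
print(timeseries_df)
```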
I looked into options and this seems to be the only way. I cannot use value_vars=['q', 'v', 'd'] as a kwarg to melt because of the format of the df column names. Also, the column names needed to be manipulated as strings. Please look at the screenshot below. Each column name consists of a time step and a variable name (q, v, or d) in string format. The time step values are used to get the value_time.
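Given that column format, a regex split of the melted names is one way to avoid repeated slice-and-recast steps. A sketch with made-up data, assuming column names like '1q', '1v', '1d' (the real flowveldepth columns may differ):

```python
import pandas as pd

# hypothetical frame whose column names combine a time step and a variable
df = pd.DataFrame(
    {f'{step}{var}': [0.0] for step in (1, 2) for var in ('q', 'v', 'd')},
    index=pd.Index([101], name='location_id'),
)
df.reset_index(inplace=True)

timeseries_df = df.melt(id_vars=['location_id'], var_name='var')

# split '<timestep><variable>' in one pass with named capture groups
parts = timeseries_df['var'].str.extract(r'(?P<timestep>\d+)(?P<variable_name>[qvd])')
timeseries_df = timeseries_df.join(parts)

# derive value_time from the numeric time step; start_datetime and dt
# stand in for the converter's real arguments
start_datetime, dt = pd.Timestamp('2021-01-01'), 300
timeseries_df['value_time'] = (
    start_datetime
    + pd.to_timedelta(timeseries_df['timestep'].astype(int) * dt, unit='s')
)
```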
timeseries_df = _parquet_output_format_converter(flowveldepth, restart_parameters.get("start_datetime"), dt,
                                                 output_parameters["parquet_output"].get("configuration"))

parquet_output_segments_str = ['nex-' + str(segment) for segment in parquet_output_segments]
Another use of the nex- prefix that may not be generically appropriate.
Would suggest we make it a variable with a default argument in the wrapping function and a value in the yaml.
parquet_output_segments_str = [prefix_str + str(segment) for segment in parquet_output_segments]
We would need to do something more comprehensive to update T-Route to comprehend the naming/labeling of the IDs...
I have modified the PR to add a user-defined value in the yaml file for the prefix string.
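A minimal sketch of the resulting plumbing, reusing the prefix_ids name from the diff below; the helper name and configuration key here are illustrative only:

```python
# prefix_ids would default to 'nex-' and be overridden from the yaml,
# e.g. output_parameters['parquet_output'].get('prefix_ids', 'nex-')
def _build_segment_labels(segments, prefix_ids='nex-'):
    # hypothetical helper: prefix each segment ID with the configured string
    return [prefix_ids + str(segment) for segment in segments]

print(_build_segment_labels([101, 102], prefix_ids='wb-'))  # ['wb-101', 'wb-102']
```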
 class StreamOutput(BaseModel):
     # NOTE: required if writing StreamOutput files
     stream_output_directory: Optional[DirectoryPath] = None
     stream_output_time: int = 1
-    stream_output_type: streamOutput_allowedTypes = ".nc"
+    stream_output_type:streamOutput_allowedTypes = ".nc"
Remove this formatting change from the commit?
-lakeids = np.fromiter(crosswalk.keys(), dtype=int)
+lakeids = np.fromiter(crosswalk.keys(), dtype = int)
 idxs = target_df.index.to_numpy()
 lake_index_intersect = np.intersect1d(
     idxs,
     lakeids,
-    return_indices=True
+    return_indices = True
 )

 # replace lake ids with link IDs in the target_df index array
-linkids = np.fromiter(crosswalk.values(), dtype=int)
+linkids = np.fromiter(crosswalk.values(), dtype = int)
 idxs[lake_index_intersect[1]] = linkids[lake_index_intersect[2]]

 # (re) set the target_df index
-target_df.set_index(idxs, inplace=True)
+target_df.set_index(idxs, inplace = True)

 return target_df


-def _parquet_output_format_converter(df, start_datetime, dt, configuration):
+def _parquet_output_format_converter(df, start_datetime, dt, configuration, prefix_ids):
We can drop most of these format-only changes to keep the PR super clean.
I have reverted the format changes.
@hellkite500 @jameshalgren does this PR still need changes in your opinions, or can we review/approve?
Thanks for asking -- please proceed with final review. GitHub still shows a conflict -- do you need us to resolve that?
Apologies for the slow review. Can you rebase this on the master branch to resolve conflicts?
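For reference, a typical rebase flow, assuming the main repository is configured as a remote named upstream (adjust to your remote naming):

```sh
git fetch upstream
git rebase upstream/master
# resolve any conflicts, continue the rebase, then
git push --force-with-lease origin <your-branch>
```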
@@ -4,13 +4,16 @@
 from typing_extensions import Literal
 from .types import FilePath, DirectoryPath

-streamOutput_allowedTypes = Literal['.csv', '.nc', '.pkl']
+streamOutput_allowedTypes = Literal['.csv', '.nc', '.pkl', '.parquet']
streamOutput_allowedTypes is specific to the stream_output class of parameters. It looks like you've created a new class, ParquetOutput, that doesn't use this. Is this change needed?
Thank you for pointing this out. I have removed .parquet from streamOutput_allowedTypes.
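For illustration, a separate parameter class along these lines would keep the parquet options out of StreamOutput entirely; the field names below are guesses, not necessarily those in the final PR:

```python
from typing import Optional
from pydantic import BaseModel, DirectoryPath

class ParquetOutput(BaseModel):
    # NOTE: required if writing parquet output files (hypothetical fields)
    parquet_output_folder: Optional[DirectoryPath] = None
    configuration: Optional[str] = None
    # user-configurable ID prefix, per the discussion above
    prefix_ids: str = 'nex-'
```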
I have rebased the PR to resolve conflicts.
PR to add functionality to output flow, velocity, and depth to a parquet file. The output parquet is timeseries data (details in the Notes section below). It is required as input for TEEHR, which comprises a set of tools for hydrologic model and forecast evaluation. Storing the t-route output in parquet format allows efficient querying of the data, and it will help with automation by connecting TEEHR to the NextGen water model. CIROH, along with Lynker, has developed a containerized version of the NextGen National Water Model, NextGen In A Box (NGIAB). TEEHR uses DuckDB to query the parquet output from NGIAB stored on the cloud.
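For context, DuckDB can scan such a parquet file in place; a minimal sketch (the file name and filter value are illustrative):

```python
import duckdb

# read_parquet queries the file directly, without a separate load step
con = duckdb.connect()
con.sql("""
    SELECT location_id, value_time, variable_name, value
    FROM read_parquet('troute_output.parquet')
    WHERE variable_name = 'q'
    ORDER BY value_time
""").show()
```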
Additions
Changes
Testing
Screenshots
Notes
The flowveldepth dataframe is modified to create a timeseries containing the following variables:

Todos
Checklist
Testing checklist
Target Environment support