
Pipeline setup step 2: Fill forecast.py #93

Merged: 12 commits into main from fy_config_forecast on Jan 15, 2025

Conversation

Contributor

@Fuhan-Yang Fuhan-Yang commented Jan 10, 2025

This PR adds a forecast function and runs the forecasts sequentially over the time frame given in config.yaml. The results are saved as parquet files in the designated path. However, this function hasn't been tested yet because of several errors:

  1. The error "type Float64 is incompatible with expected type String" is returned in split_train_test. This happens when the uptake data is a single polars DataFrame (instead of a list of uptake DataFrames).

  2. NaNs in the training data prevent LinearIncidenceUptakeModel from fitting; checks or handling (imputing or dropping the NAs) are needed to ensure fitting can run.

@Fuhan-Yang Fuhan-Yang requested review from swo and eschrom January 10, 2025 17:28
Collaborator

@eschrom eschrom left a comment


Similar to my review for #92, I think we should get this to pass pre-commit and root out errors before merging.

If split_train_test breaks when given a single polars data frame, let's fix it. It looks like preprocess.py intends to always return a single data frame, not a list. This can easily remain true even after preprocess.py knows how to process multiple data sets, because it can just concatenate them into one before returning it. So we can just edit split_train_test to expect only a single data frame.

Likewise, if NaNs in the data cause problems, let's find out where they come from and fix them if possible. I don't recall missing values in the raw data - I wonder if there are further issues lurking in our code that cause the NaNs.

Normally, such things could be separate issues for separate PRs, but if they prevent us from testing the content of this PR properly, then I think they should be solved here.

Please also see my separate comment on the way to increment the forecast date.

cumulative_projections.with_columns(estimate_type=pl.lit("cumulative")),
incident_projections.with_columns(estimate_type=pl.lit("incident")),
]
)
Collaborator


Same question I had on #92 - why combine cumulative and incident uptake into a single data frame? We could just return the cumulative data and derive the incident data in a future step if needed.

Contributor Author


Fixed in 862faf8

clean_data,
grouping_factors=config["groups"],
forecast_start=config["timeframe"]["start"],
forecast_end=forecast_date,
Collaborator


Shouldn't it be forecast_start that is being incremented?

As written here, with forecast_end being incremented, we are asking questions like:

  • suppose we have data through Nov 1, 2024 - how well can we estimate uptake through May 1, 2025?
  • suppose we have data through Nov 1, 2024 - how well can we estimate uptake through May 8, 2025?
  • suppose we have data through Nov 1, 2024 - how well can we estimate uptake through May 15, 2025?
  • and so on...
    These questions do not require fitting the model multiple times in a loop! Just fit it once for the latest forecast_end date (e.g. May 29, 2025) and look up the projections "along the way" for all the intermediate dates (May 22, May 15, May 8, May 1, etc.).

But instead, I think we had intended to increment forecast_start to ask questions like:

  • suppose we have data through Nov 1, 2024 - how well can we estimate uptake through May 29, 2025?
  • suppose we have data through Nov 8, 2024 - how well can we estimate uptake through May 29, 2025?
  • suppose we have data through Nov 15, 2024 - how well can we estimate uptake through May 29, 2025?
  • and so on...
    In this case, because the starting date for forecasting keeps getting pushed later, the training data set keeps expanding, so the model does need to be refit multiple times in a loop.

So the latter is what we should be doing here.
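The intended loop could be sketched like this (dates taken from the examples above; `fit_and_forecast` is a hypothetical stand-in for the real model pipeline):

```python
from datetime import date, timedelta

# forecast_end stays fixed while forecast_start advances, so the training
# window keeps expanding and the model must be refit on every iteration.
forecast_start = date(2024, 11, 1)
forecast_end = date(2025, 5, 29)
interval = timedelta(weeks=1)

starts = []
while forecast_start <= forecast_end:
    # fit on all observed data before forecast_start, then project
    # uptake from forecast_start through the fixed forecast_end:
    # projections = fit_and_forecast(clean_data, forecast_start, forecast_end)
    starts.append(forecast_start)
    forecast_start += interval
```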

Contributor Author


Fixed in 857c080

@Fuhan-Yang Fuhan-Yang requested a review from eschrom January 13, 2025 22:29
@Fuhan-Yang
Contributor Author

> Similar to my review for #92, I think we should get this to pass pre-commit and root out errors before merging.
>
> If split_train_test breaks when given a single polars data frame, let's fix it. It looks like preprocess.py intends to always return a single data frame, not a list. This can easily remain true even after preprocess.py knows how to process multiple data sets, because it can just concatenate them into one before returning it. So we can just edit split_train_test to expect only a single data frame.
>
> Likewise, if NaNs in the data cause problems, let's find out where they come from and fix them if possible. I don't recall missing values in the raw data - I wonder if there are further issues lurking in our code that cause the NaNs.
>
> Normally, such things could be separate issues for separate PRs, but if they prevent us from testing the content of this PR properly, then I think they should be solved here.
>
> Please also see my separate comment on the way to increment the forecast date.

For now, I would suggest merging the PRs once the questions are well addressed. Once the entire pipeline is merged, we can test it and open follow-up PRs if needed!

Collaborator

@eschrom eschrom left a comment


I added one more comment regarding cumulative vs. incident. I think it will be useful to only pass cumulative data around, converting to incident only at the point where incident is needed (e.g. fitting the linear model, calculating mspe, etc.). But take a look and let me know what you think!

]
)
# Note: this returns incident projections only, for evaluation
return incident_projections
Collaborator

@eschrom eschrom Jan 15, 2025


I think we should return cumulative projections here. The reason is that incident data for a partial season loses information. Suppose you are making forecasts from Jan 1 - May 31. Cumulative forecasts contain all the incident information, but incident forecasts "forget" what the total cumulative uptake had been up to Jan 1. So if you convert to incident here, it will be harder to convert back to cumulative later (e.g. for calculating end-of-season absolute error) - you'll have to look up the observed data on Jan 1 again. But if you keep the cumulative version here, it is easy to convert to incident later (e.g. for calculating mspe).

Collaborator


I agree. After all our discussions, my thinking now is that the inputs and outputs from the models should be in cumulative space (although many of the models might internally convert back and forth with incident)

Collaborator


I just looked back at your eval.py code, and I realized that end-of-season absolute error is, more specifically, the error in cumulative uptake since the forecast date. There is no error before the forecast date - that data is known. So your approach makes more sense to me now.

I still propose we follow the convention only to pass cumulative data around and only to convert to incident at the points it's required, but I wanted you to know that I resolved some of my own confusion!

Contributor Author


Agreed that cumulative projections are more informative and should be passed along. Just fixed this in c2cbdd0. Since #92 and #98 are merged, I think we can merge this one if there are no other issues.

Collaborator

@swo swo left a comment


I'd like to resolve #92 and #98 before working deeply on this one

Also, it would be nice if this were a PR into #92's branch, rather than into main, so that we see only the incremental changes that this makes (and so we don't see, e.g., the changes in eval.py)

@Fuhan-Yang
Contributor Author

> I'd like to resolve #92 and #98 before working deeply on this one
>
> Also, it would be nice if this were a PR into #92's branch, rather than into main, so that we see only the incremental changes that this makes (and so we don't see, e.g., the changes in eval.py)

Sure, I will use this stacked-branch approach for incremental changes in the future!

Collaborator

@eschrom eschrom left a comment


Looks good to me.

swo and others added 6 commits January 15, 2025 16:32
This mismatch might be causing some problems about line breaks that
differ between GitHub and local VSCode
This PR fixes some errors when calling `make all`. The changes are made only to get the code to run successfully; they do not guarantee that no errors are introduced in the output.

---------

Co-authored-by: Scott Olesen <ulp7@cdc.gov>
- Remove the concept of datasets
- Only return the cumulative uptake data
- Rename "rollout" to "rollouts" everywhere, to emphasize that you
can insert multiple dates
@Fuhan-Yang Fuhan-Yang merged commit c543676 into main Jan 15, 2025
2 checks passed
@Fuhan-Yang Fuhan-Yang deleted the fy_config_forecast branch January 15, 2025 21:59