The Milk Price Forecasting Model is a pilot project of the Federal Office for Agriculture. It is implemented in an R script. It integrates various statistical and machine learning models, offering a detailed analysis of milk price trends and predictions.
In order to run the milk price forecasting model, a few conditions must be met:
- The Excel file milk-price-data.xlsx with the relevant data must be located in the same directory as the R script.
- Additional values may be added directly to the Excel file, but the header names must remain the same.
- Before running the R script, the working directory must be set correctly.
Running the R script will trigger a series of processes including data preprocessing, model training and evaluation, and the generation of visualizations and reports.
The output from the script is multi-faceted, but everything is written into a directory called output. Two PDF reports are produced: the file time-series-decomposition.pdf summarizes the data and visualizes the seasonal decomposition of each variable, while machine-learning-report.pdf summarizes the models applied and their respective forecasts. In addition, various tables with numerical results are written into the directory output.
Figure 1: Overview of the different files written into the directory output by the R script script.R.
This documentation describes the script's capabilities and underlying principles, and offers guidance for users to leverage its functionalities effectively.
The script begins by setting up the necessary data environment, using milk-price-data.xlsx as its primary data source. By default, it targets CH_Milchpreis for prediction (this can be changed by adjusting the variable target) and reserves the final 18 months of data for testing. This number can be adjusted via the variable test_set_size.
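A minimal sketch of what this configuration might look like; the variable names target and test_set_size are taken from the documentation, while the use of the readxl package and the exact reading code are assumptions:

```r
# Hypothetical configuration block (the actual script may differ)
library(readxl)

milk_data <- read_excel("milk-price-data.xlsx")   # file must sit in the working directory

target        <- "CH_Milchpreis"   # column to forecast; change to predict another series
test_set_size <- 18                # number of most recent months reserved for testing
```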
The models incorporate a broad spectrum of features, encompassing various milk price types and related economic indicators, allowing for a comprehensive analysis.
The first part of the code generates a PDF file called time-series-decomposition.pdf. As the name suggests, this file mainly incorporates seasonal decompositions, but also some more exploratory analyses.
The PDF report starts by showing a visualization of all missing values in the data later used to train the forecasting models. This simply serves as an overview – for the purpose of forecasting, NA values are replaced by zero.
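As a rough illustration of this preprocessing step (the object name features is hypothetical):

```r
# Overview of missingness per column, then zero-fill NAs before model fitting
colSums(is.na(features))
features[is.na(features)] <- 0
```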
Next, the report produces a correlation matrix of all features in the data set (Figure 2). Positively correlated features are highlighted in red, negatively correlated features in blue. As can be seen, most features are either positively or negatively correlated with each other – there are relatively few uncorrelated variables. This means we are dealing with a high degree of multicollinearity in the feature matrix.
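A plot of this kind could be produced along the following lines; the use of the corrplot package and the object name features are assumptions, not necessarily the script's approach:

```r
# Correlation matrix of all features, coloured from blue (negative) to red (positive)
library(corrplot)

cor_mat <- cor(features, use = "pairwise.complete.obs")
corrplot(cor_mat, method = "color",
         col = colorRampPalette(c("blue", "white", "red"))(200),
         tl.cex = 0.6)
```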
Figure 2: Visualization of the correlation matrix of the feature data set. As can be seen, there are many positively and negatively correlated features – as well as some that are seemingly independent from the others.

The next part generates plots of a seasonal decomposition of the data (Figure 3). It breaks down the time series into trend, seasonal, and residual elements, offering insights into the underlying patterns that govern changes over time.
Figure 3: Example of a seasonal decomposition plot for CH_Milchpreis. The milk price is decomposed into a trend (via a moving average), a seasonal effect and the remainder. The bars on the right indicate the relative magnitude of the effects.
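A decomposition of this kind can be obtained with base R; the monthly frequency follows from the data, everything else is a sketch rather than the script's exact code:

```r
# Classical decomposition of the Swiss milk price into trend, seasonal effect and remainder
milk_ts <- ts(milk_data$CH_Milchpreis, frequency = 12)   # monthly series
plot(decompose(milk_ts))   # trend is estimated via a moving average, as in Figure 3
```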
Time series plots are generated for every single feature. However, if no significant seasonal effect is detected using Ollech and Webel’s combined seasonality test, the plot shows the feature value over time only (Ollech 2021; Ollech and Webel 2020).
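The seasonality check could look roughly like this, using the seastests package cited above (the exact call in the script is an assumption):

```r
# Ollech and Webel's combined seasonality test; returns TRUE if seasonality is detected
library(seastests)
isSeasonal(milk_ts, test = "combined")
```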
First, the data is prepared: the time feature is replaced by respective sine and cosine transformations. The target variable, i.e. CH_Milchpreis, is lagged with different forecast horizons, h = 1, h = 2, and h = 3. All NA values in the features are replaced by zero – except for the ones created by lagging the target variable.
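A minimal sketch of this preparation step; the column name time (assumed to be of class Date) and the use of dplyr are assumptions, and aligning future target values with current features via lead() is one common way to implement the lagging described above:

```r
# Encode the month cyclically and create the shifted targets for horizons h = 1, 2, 3
library(dplyr)

milk_data <- milk_data |>
  mutate(month     = as.numeric(format(time, "%m")),
         month_sin = sin(2 * pi * month / 12),
         month_cos = cos(2 * pi * month / 12),
         target_h1 = lead(CH_Milchpreis, 1),   # milk price one month ahead
         target_h2 = lead(CH_Milchpreis, 2),
         target_h3 = lead(CH_Milchpreis, 3))
```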
Furthermore, all original features are z-score normalized. This ensures more reliable model fitting and gives all variables equivalent weight.
To evaluate the different forecasting models, the root mean squared error is used (RMSE, Equation 1). It compares the predicted values of the milk price for any following period, $\hat{y}_t$, with the observed values $y_t$ over the $n$ test observations:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2} \tag{1}$$

One apparent advantage of the RMSE as a performance metric is its interpretability – it shares the same unit as the variable of interest (i.e., CHF) and reflects the typical magnitude of the deviation between forecast and observed value.
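In R terms, the metric amounts to a one-liner (the function name rmse is illustrative):

```r
# Root mean squared error between observed and predicted milk prices
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
```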
The prepared data is split into training and testing sets. The training set consists of all data up until test_set_size = 18 months before the last entry. These observations are used to calibrate the model. Correspondingly, the latest test_set_size = 18 observations are used to test the models. In the machine-learning-report.pdf file, the model performance on the test set is reported (Figure 4).
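A chronological split of this kind might look as follows (the object name prepared_data is hypothetical):

```r
# Hold out the last test_set_size = 18 observations for testing
n_obs      <- nrow(prepared_data)
train_data <- prepared_data[seq_len(n_obs - test_set_size), ]
test_data  <- prepared_data[(n_obs - test_set_size + 1):n_obs, ]
```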
The machine learning models are trained using cross-validation. Since the data is a time series, the values are heavily autocorrelated over time. To avoid overfitting, the folds for cross-validation are therefore split over time, specifically into six-month periods (Figure 5); a possible fold assignment is sketched below.
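One simple way to build such folds with monthly data (an assumption about the implementation, not the script's exact code):

```r
# Assign each training observation to a consecutive six-month fold
fold_id <- ceiling(seq_len(nrow(train_data)) / 6)
```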
Figure 5: Visualization of the data division into test and training data, as well as the subdivision of the training data into folds for cross-validation.

The pairs panel included in the machine-learning-report.pdf file also illustrates the performance of the different models (Figure 6). Furthermore, it shows how the different model predictions are correlated – lasso and ridge regression make similar predictions, and so do the ARIMA and SARIMA models.
For the milk price forecasting, four different models are fit to the data: a lasso regression model, a ridge regression model, an ARIMA model as well as a SARIMA model. These models are briefly described here. The file machine-learning-report.pdf displays all the forecast values of the Swiss milk price (CH_Milchpreis, Figure 7).
Generally speaking, any forecasting model predicts the value of the target variable some number of periods ahead from the information available at the current time, i.e. $\hat{y}_{t+h} = f(x_t)$, where $h$ denotes the forecast horizon.
For the purpose of predicting milk prices more than one month into the future, three different models are trained in the script – each with a different forecast horizon (h = 1, h = 2, and h = 3).
Ridge and lasso regression are especially effective in handling multicollinearity and preventing overfitting. Both models are linear in nature, i.e. the response variable is modeled as a linear combination of the features.
Ridge regression introduces an L2 penalty on the size of the coefficients, shrinking them towards zero, while lasso regression uses an L1 penalty that can set individual coefficients exactly to zero and thus performs variable selection.
The mathematical formulations for these regressions are centered around minimizing the sum of squared residuals, with added regularization terms:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2, \qquad \hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert$$
The hyperparameter λ is tuned via cross-validation using cv.glmnet from the glmnet package. Figure 8 shows the magnitude of the coefficients under the selected penalty. Figure 9 shows both the cross-validated error as well as the coefficient magnitudes as a response to an increasing penalty term λ.
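A sketch of this tuning step; the objects x_train, y_train, feature_cols and fold_id are assumptions about the script's internals:

```r
# Tune the penalty lambda by cross-validation; alpha = 0 gives ridge, alpha = 1 gives lasso
library(glmnet)

x_train <- as.matrix(train_data[, feature_cols])   # 'feature_cols' is hypothetical
y_train <- train_data$target_h1

ridge_fit <- cv.glmnet(x_train, y_train, alpha = 0, foldid = fold_id)
lasso_fit <- cv.glmnet(x_train, y_train, alpha = 1, foldid = fold_id)

plot(lasso_fit)                      # cross-validated error vs. the penalty, cf. Figure 9
coef(lasso_fit, s = "lambda.min")    # coefficient magnitudes at the best penalty, cf. Figure 8
```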
Additionally, the script employs an autoregressive integrated moving average (ARIMA) and its seasonal variant (SARIMA). ARIMA is most suitable for non-seasonal data and combines autoregressive and moving average components. In contrast, SARIMA extends this to accommodate seasonal fluctuations in data, making it more robust for datasets with seasonal trends.
The two models are mainly fit to serve as a benchmark for the ridge and lasso regression models.
Both models are fit using the arima function from the R package stats. In both cases, the order (p, d, q) of the model – and, for the SARIMA model, additionally the seasonal order – has to be specified.
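A rough sketch of such benchmark fits; the concrete orders below are placeholders, not the values chosen in the script:

```r
# ARIMA and SARIMA benchmarks fit with stats::arima (orders are illustrative only)
arima_fit  <- arima(milk_ts, order = c(1, 1, 1))
sarima_fit <- arima(milk_ts, order = c(1, 1, 1),
                    seasonal = list(order = c(1, 0, 1), period = 12))

predict(sarima_fit, n.ahead = test_set_size)$pred   # forecast the held-out months
```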
Ollech, Daniel. 2021. Seastests: Seasonality Tests. https://CRAN.R-project.org/package=seastests.
Ollech, Daniel, and Karsten Webel. 2020. “A Random Forest-Based Approach to Identifying the Most Informative Seasonality Tests.”
Footnotes
- The autoregressive models (ARIMA and SARIMA) don't need to be retrained for specific forecast horizons; they can simply forecast the next values based on the ones already forecast. Thus, the ARIMA and SARIMA models are only trained once. ↩