Home
A comparison of `sktime` and `HCrystalball` API designs for forecasting, and a proposed way forward.
Both `sktime` and `HCrystalball` adopt a sklearn-like fit/predict design, and a unified interface.
The table below summarizes the main differences:
Area | sktime | HCrystalball | HCrystalball comments |
---|---|---|---|
data container | pandas series | pandas DataFrame | pandas DataFrame |
supports multivariate | no | yes | not natively on wrapper level (e.g. Prophet is not a multivariate model by construction, as opposed to e.g. VAR models) |
supports exogenous | experimental | yes | yes |
supports iloc use | yes | no | yes: `X.iloc[-5:]` returns the last 5 rows even with a datetime index |
supports loc use | no | yes | yes: `X.loc["2020-05-01":]` returns all rows from "2020-05-01" onwards |
type consistent composition | yes | no | unsure: HCrystalball aims to reuse as much of sklearn as possible, with minimal custom reimplementation of existing objects. We do not have a custom implementation of sklearn's `GridSearchCV` but use it directly. Discussing concrete points would help us understand this issue. |
task interoperability | yes | no | no: HCrystalball aims to support only time series forecasting. The limited scope is a design decision. |
For explanation:
- type consistent composition means: composites inherit from, and follow the same interface as, a class type ancestor. For example, `GridSearchCV` in `sklearn` behaves as a classifier when constructed with a classifier. The compositor itself is an estimator class.
- task interoperability means: the interface is designed to allow reduction to other time series related tasks
- loc and iloc usage implies support for integer and date/time indices, and specification of the forecasting horizon as relative steps ahead and absolute time points respectively - HCrystalball's implementation allows you to leverage both indexing schemes: integers and datetimes
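To make the `iloc`/`loc` distinction concrete, here is a minimal pandas sketch. The frame `X`, its `value` column, and the dates are made up for illustration; only `iloc`, `loc`, and `date_range` are real pandas API.

```python
import pandas as pd

# Hypothetical daily data, used only for illustration.
index = pd.date_range("2020-04-27", periods=7, freq="D")
X = pd.DataFrame({"value": range(7)}, index=index)

# iloc: positional selection, independent of the index type.
last_five = X.iloc[-5:]

# loc: label-based selection; a date string slices a DatetimeIndex.
from_may = X.loc["2020-05-01":]

print(len(last_five), len(from_may))
```

Positional selection corresponds to "relative steps" and label selection to "absolute time points", which is the same split that shows up in how the forecasting horizon is specified.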
At a high level, `HCrystalball`'s interface seems inspired by Facebook's `prophet`. `sktime`'s interface is closer to `statsmodels` and the Hyndman interfaces in R (e.g. `forecast`, `fable`).
This section highlights advantages, disadvantages, and problems, in our opinion.
- "natural" interface in univariate case
- higher-order operations, including composition and reduction, are well-handled
- lack of loc support
- no good multivariate support
- support for multivariate and exogenous
- uses abc
- higher-order operations are not well-designed or consistent - example would help to see the point
- lack of iloc support - (see above)
- interface is unintuitive in the univariate case - HCrystalball intends to be as compatible with sklearn as possible, with one exception: it uses pandas as the main data interface instead of NumPy. This design decision leads to the natural choice of a two-dimensional X (pandas DataFrame) and a y pandas series (1D NumPy is also supported) as input for fit, and X (DataFrame) as input for the predict method. This implies an empty DataFrame with a datetime index in the univariate case. In the past, HCrystalball also supported just one input for fit and an integer (horizon) for the predict method in the univariate case, but experience showed that the more generic interface leads to a better modeling experience (no need to change the interface after adding one column, frequent usage of many exogenous variables, less error-prone and cleaner implementations, direct compatibility with the whole sklearn ecosystem...). The decision to stick with the sklearn API also reflects our intention to primarily address the ML community rather than the more traditional statistical community around statsmodels.
- does not consistently cover both univariate, multivariate use well - user frustration in at least one sub-case
- user cannot use series and DataFrame
- no support for both iloc and loc (indexed, e.g., datetime) indexing
Up to naming of variables, both sktime and HCrystalball adopt a fit/predict API, of the type

`fit(y_past, [x_past], horizon)`

`predict([x_future], horizon)`

where:
- `y_past` is the time series in the past
- `horizon` is the indices (loc or iloc) to predict at; note that some methods already require this in `fit`
- `x_past` is the exogenous time series in the past
- `x_future` is the exogenous time series in the future
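The abstract signature above can be sketched as a small class. This is only an illustration, not the API of either library: `LastValueForecaster` is a hypothetical name, the keyword defaults are an assumption, and the horizon is read as relative steps ahead.

```python
import pandas as pd


class LastValueForecaster:
    """Hypothetical forecaster that repeats the last observed value."""

    def fit(self, y_past, x_past=None, horizon=None):
        # x_past and horizon are accepted but unused by this naive method.
        self.last_value_ = y_past.iloc[-1]
        self.cutoff_ = y_past.index[-1]
        # Assumes y_past has a DatetimeIndex with a set frequency.
        self.freq_ = y_past.index.freq
        return self

    def predict(self, x_future=None, horizon=(1,)):
        # horizon is interpreted as relative steps ahead of the cutoff.
        index = pd.DatetimeIndex([self.cutoff_ + h * self.freq_ for h in horizon])
        return pd.Series(self.last_value_, index=index)


y_past = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)
forecast = LastValueForecaster().fit(y_past).predict(horizon=[1, 2])
print(forecast)
```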
The differences are mainly in expected type:
variable | sktime | HCrystalball | HCrystalball comments |
---|---|---|---|
`y_past` | pandas series | pandas DataFrame | pandas series (on wrapper level) |
`horizon` in `fit` | integer sequence | not supported (instead fitting is moved to predict in cases where horizon is required for fitting) | in order to follow the sklearn API we agreed to stick with the original fit and predict signature (fitting in predict is also done in e.g. the KNN implementation in sklearn) |
`horizon` in `predict` | integer sequence | empty DataFrame with loc indices | (see above) |
`x_past` | pandas DataFrame (experimental) | pandas DataFrame | pandas DataFrame |
`x_future` | pandas DataFrame (experimental) | pandas DataFrame | pandas DataFrame |
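The two `horizon` conventions in the table can carry the same information. The sketch below builds both for a made-up series; the variable names (`steps_ahead`, `x_future`, `future_index`) are illustrative and not taken from either library.

```python
import pandas as pd

# Hypothetical past observations, ending on 2020-04-30.
y_past = pd.Series(
    [10.0, 12.0, 11.0],
    index=pd.date_range("2020-04-28", periods=3, freq="D"),
)

# Integer-sequence horizon: relative steps ahead of the last observation.
steps_ahead = [1, 2, 3]

# Empty DataFrame with loc indices: no columns, the index holds the
# absolute time points to predict at.
x_future = pd.DataFrame(index=pd.date_range("2020-05-01", periods=3, freq="D"))

# Relative steps become absolute time points once the last observed
# time point and the frequency are known.
cutoff = y_past.index[-1]
future_index = pd.DatetimeIndex([cutoff + h * y_past.index.freq for h in steps_ahead])

print(x_future.shape, list(future_index) == list(x_future.index))
```

The integer form needs the training data to be resolved to dates, while the DataFrame form is self-contained and extends naturally once exogenous columns are added.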
The interface differences suggest:
- different signature and type choices cover different use cases well (e.g., univariate vs multivariate) - a joint/merged interface may therefore be desirable.
- the interfaces are currently incompatible, while compatibility will require support for both series and DataFrames, and support for both `loc` and `iloc` indexing.
- the `sktime` interface has an advantage in composition and other higher-order operations. A joint interface should perhaps adopt this.
More precisely, a "good" consensus interface should satisfy the following requirements:
- support for both series and DataFrames as inputs/outputs. **We prefer just one way to do things. As sklearn expects 2D for X, this wouldn't allow us to leverage the whole sklearn ecosystem directly.**
- support for both `loc` and `iloc` indexing
- support for exogenous variables
- `horizon` can be passed in `fit`
- consistent typing in higher-order motifs including composition, wrappers, reduction (inherits from resultant type class, components passed in constructor)
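The last requirement, consistent typing in composition, can be illustrated with a plain-Python sketch. All class names here (`BaseForecaster`, `MeanForecaster`, `ClippedForecaster`) are hypothetical and exist in neither library; the point is only that the composite inherits from the forecaster type and receives its component in the constructor.

```python
class BaseForecaster:
    """Hypothetical common ancestor of all forecasters."""

    def fit(self, y_past, horizon=None):
        raise NotImplementedError

    def predict(self, horizon):
        raise NotImplementedError


class MeanForecaster(BaseForecaster):
    """Predicts the mean of the past values at every horizon step."""

    def fit(self, y_past, horizon=None):
        self.mean_ = sum(y_past) / len(y_past)
        return self

    def predict(self, horizon):
        return [self.mean_ for _ in horizon]


class ClippedForecaster(BaseForecaster):
    """Composite: wraps a forecaster and is itself a forecaster."""

    def __init__(self, forecaster, lower):
        # The component is passed in the constructor.
        self.forecaster = forecaster
        self.lower = lower

    def fit(self, y_past, horizon=None):
        self.forecaster.fit(y_past, horizon)
        return self

    def predict(self, horizon):
        return [max(self.lower, v) for v in self.forecaster.predict(horizon)]


composite = ClippedForecaster(MeanForecaster(), lower=5.0).fit([1.0, 2.0, 3.0])
print(isinstance(composite, BaseForecaster), composite.predict([1, 2]))
```

Because the composite has the same type and interface as its component, it can be nested inside another composite or passed to any code that expects a forecaster.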
We therefore suggest:
- `sktime` and `HCrystalball` work together towards a unified forecasting interface in the next release.
- This unified interface should satisfy the requirements outlined above
- `HCrystalball` becomes an `affiliated` package of `sktime` (means: compatible interface)
- displayed on the landing page with other affiliated and coordinated packages
- `HCrystalball` specifies a scope and roadmaps, e.g., adapters to advanced forecasters with major package dependencies?
- individual `HCrystalball` team members are acknowledged as contributors to `sktime`, insofar as they contribute to the re-factor
- optionally, Heidelberg Cement is acknowledged as a contributing organisation to `sktime` post-refactor, pending approval of Heidelberg Cement comms
The proposed re-design is based on two work items:
- `HCrystalball` adopts `sktime`'s higher-order composition/reduction interface (correct class inheritance structure)
- re-factor of `fit`/`predict` signatures towards a consensus, which is type union based
The consensus could be as follows:
variable | consensus type |
---|---|
`y_past` | pandas series or DataFrame |
return of `predict` | same as type of `y_past` |
`horizon` | integer sequence (`iloc`) or sequence of `loc` indices or empty DataFrame with `loc` indices |
`x_past` | pandas series or DataFrame |
`x_future` | pandas series or DataFrame, needs same type and variables as `x_past` |
There may be an additional flag for whether `loc` or `iloc` indices are used.
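One way a type-union `horizon` could be handled internally is to normalize every accepted form to absolute time points. The helper below is a sketch under assumptions: `normalize_horizon`, `cutoff`, and `freq` are hypothetical names, and the function infers `iloc` versus `loc` from the element type rather than from an explicit flag.

```python
import numbers

import pandas as pd


def normalize_horizon(horizon, cutoff, freq):
    """Hypothetical helper: map any accepted horizon form to a DatetimeIndex."""
    if isinstance(horizon, pd.DataFrame):
        # Empty DataFrame with loc indices: the index is the horizon.
        return pd.DatetimeIndex(horizon.index)
    values = list(horizon)
    if all(isinstance(h, numbers.Integral) for h in values):
        # Integer sequence: relative steps ahead of the cutoff (iloc-style).
        return pd.DatetimeIndex([cutoff + int(h) * freq for h in values])
    # Anything else is treated as a sequence of loc indices (time points).
    return pd.DatetimeIndex(values)


cutoff = pd.Timestamp("2020-04-30")
freq = pd.offsets.Day()

from_steps = normalize_horizon([1, 2], cutoff, freq)
from_labels = normalize_horizon(["2020-05-01", "2020-05-02"], cutoff, freq)
from_frame = normalize_horizon(
    pd.DataFrame(index=pd.date_range("2020-05-01", periods=2, freq="D")),
    cutoff,
    freq,
)
print(list(from_steps) == list(from_labels) == list(from_frame))
```

Type-based inference breaks down when the data itself uses an integer index, since an integer could then be either a relative step or a label; that is the case where an explicit `loc`/`iloc` flag would be needed.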
The low-level design could look similar to this, though the linked proposal is mainly concerned with support for `datetime`.