Merge pull request #7 from adamingas/development
Adds new documentation
adamingas authored Jan 17, 2024
2 parents d837413 + 8e39011 commit 37c4698
Showing 10 changed files with 373 additions and 28 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -4,4 +4,5 @@ dist/
.vscode/
.ipynb_checkpoints/
**/*.egg-info
**/build
**/build
.DS_Store
27 changes: 27 additions & 0 deletions docs/_static/feature_high_label.svg
27 changes: 27 additions & 0 deletions docs/_static/feature_low_label.svg
30 changes: 30 additions & 0 deletions docs/_static/feature_medium_label.svg
63 changes: 63 additions & 0 deletions docs/_static/thresholds.svg
78 changes: 78 additions & 0 deletions docs/_static/thresholds_max_proba.svg
10 changes: 7 additions & 3 deletions docs/conf.py
@@ -5,6 +5,7 @@

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
import os

project = "OrdinalGBT"
copyright = "2023, Adamos Spanashis"
@@ -19,14 +20,17 @@
"sphinx.ext.napoleon",
"sphinx.ext.viewcode",
"sphinx_rtd_theme",
"sphinx.ext.mathjax"
]
if os.environ.get("NO_MATHJAX", False):
    extensions.append("sphinx.ext.imgmath")
    imgmath_latex_preamble = "\\usepackage{amsmath}"
else:
    extensions.append("sphinx.ext.mathjax")
    mathjax_path = "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
autoapi_dirs = ["../ordinalgbt"] # location to parse for API reference
html_theme = "sphinx_rtd_theme"
exclude_patterns = []
nb_execution_mode = "off"
mathjax_path = "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

26 changes: 4 additions & 22 deletions docs/maths.rst
@@ -27,8 +27,8 @@ In a three ordered labeled problem, we only need two thresholds,
are associated with each label
:math:`(-\infty,\theta_1], (\theta_1, \theta_2], (\theta_2, \infty)`.

Deriving the probabilities
~~~~~~~~~~~~~~~~~~~~~~~~~~

A property we want our mapping from latent variable to probability to
have is for the cumulative probability of label :math:`z` being at most
@@ -60,15 +60,8 @@ Naturally, the probability of :math:`z` being any particular label is
then:

.. math::

   \newcommand{\problessthank}{P(z \leq k; y,\Theta )}
   \begin{align*}
   P(z = k; y,\Theta ) &= P(z \leq k; y,\Theta) - P(z \leq k-1; y,\Theta ) \\
   &= F(\theta_k - y) - F(\theta_{k-1} - y)
   \end{align*}
A function that satisfies all these conditions is the sigmoid function,
hereafter denoted as :math:`\sigma`.

Deriving the loss function
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -85,15 +78,12 @@ As is usual in machine learning we use the negative log likelihood as
our loss:

.. math::

   \begin{align*}
   l({\bf y};\Theta) &= -\log L({\bf y},\Theta)\\
   &= -\sum_{i=0}^n I(z_i=k)\log(P(z_i = k; y_i,\Theta)) \\
   &= -\sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right)
   \end{align*}
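
The loss above translates directly into array code. The following NumPy
sketch is purely illustrative (the helper names are not part of the
ordinalgbt API): it builds the label probabilities as sigmoid differences
and evaluates the negative log likelihood for integer-encoded labels, with
the indicator :math:`I(z_i=k)` realised as an index into the probability
matrix.

.. code-block:: python

   import numpy as np

   def sigmoid(x):
       return 1 / (1 + np.exp(-x))

   def label_probabilities(y, thresholds):
       # Pad the thresholds with -inf and +inf so the cumulative
       # probabilities start at 0 and end at 1, then take differences:
       # P(z = k) = sigma(theta_k - y) - sigma(theta_{k-1} - y)
       cuts = np.concatenate(([-np.inf], thresholds, [np.inf]))
       cdf = sigmoid(cuts[None, :] - y[:, None])
       return np.diff(cdf, axis=1)

   def ordinal_nll(y, labels, thresholds, eps=1e-15):
       # The indicator I(z_i = k) picks, for each sample, the probability
       # column of its observed label; clipping avoids log(0).
       proba = np.clip(label_probabilities(y, thresholds), eps, 1 - eps)
       return -np.sum(np.log(proba[np.arange(len(labels)), labels]))

   # Example: three labels (0, 1, 2) and two thresholds.
   y = np.array([-2.0, 0.3, 3.1])
   print(ordinal_nll(y, labels=np.array([0, 1, 2]), thresholds=np.array([-1.0, 2.0])))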
Deriving the gradient and hessian
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To use a custom loss function with gradient boosting tree frameworks
(i.e. lightgbm), we have to first derive the gradient and hessian of the
@@ -106,7 +96,6 @@ We denote the first and second order derivative of the sigmoid as
The gradient is denoted as:

.. math::

   \begin{align*}
   \mathcal{G} &= \frac{\partial l({\bf y};\Theta)}{\partial {\bf y}} \\
   &= -\frac{\partial }{\partial {\bf y}} \sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right) \\
   &=
   \begin{pmatrix}
   I(z_1 = k) \left( \frac{\sigma'(\theta_k-y_1) - \sigma'(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right) \\
   ... \\
   I(z_n = k) \left( \frac{\sigma'(\theta_k-y_n) - \sigma'(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right) \\
   \end{pmatrix}
   \end{align*}
The summation disappears when taking the derivative with respect to
variable :math:`y_i`, as every element of the summation depends on only
one latent variable:

.. math::

   \begin{align*}
   \frac{\partial f(y_1)+f(y_2)+f(y_3)}{\partial {\bf y}} &=
   \begin{pmatrix}
   \frac{\partial f(y_1)+f(y_2)+f(y_3)}{\partial y_1} \\
   \frac{\partial f(y_1)+f(y_2)+f(y_3)}{\partial y_2} \\
   \frac{\partial f(y_1)+f(y_2)+f(y_3)}{\partial y_3} \\
   \end{pmatrix}
   =
   \begin{pmatrix}
   \frac{\partial f(y_1)}{\partial y_1} \\
   \frac{\partial f(y_2)}{\partial y_2} \\
   \frac{\partial f(y_3)}{\partial y_3} \\
   \end{pmatrix}
   \end{align*}
The hessian is the partial derivative of the gradient with respect to
the latent variable vector. This means that for each element of the
@@ -173,7 +159,6 @@ The hessian is then reduced to a vector:
.. math::

   \begin{align*}
   \mathcal{H} &=
   \begin{pmatrix}
   \frac{\partial}{\partial y_1 y_1} \\
   ... \\
   \frac{\partial}{\partial y_n y_n} \\
   \end{pmatrix}
   =
   \begin{pmatrix}
   -I(z_1 = k) \left( \frac{\sigma''(\theta_k-y_1) - \sigma''(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right) +
   I(z_1 = k)\left( \frac{\sigma'(\theta_k-y_1) - \sigma'(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right)^2 \\
   ... \\
   -I(z_n = k) \left( \frac{\sigma''(\theta_k-y_n) - \sigma''(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right) +
   I(z_n = k)\left( \frac{\sigma'(\theta_k-y_n) - \sigma'(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right)^2 \\
   \end{pmatrix}
   \end{align*}
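
As a numerical companion to the expressions above, the sketch below
(illustrative only, not the ordinalgbt implementation) evaluates the
per-sample gradient and hessian of the loss with respect to the latent
predictions, using the sigmoid derivative identities from the next
section:

.. code-block:: python

   import numpy as np

   def sigmoid(x):
       return 1 / (1 + np.exp(-x))

   def d_sigmoid(x):
       # sigma'(x) = sigma(x)(1 - sigma(x))
       s = sigmoid(x)
       return s * (1 - s)

   def dd_sigmoid(x):
       # sigma''(x) = sigma(x)(1 - sigma(x))(1 - 2*sigma(x))
       s = sigmoid(x)
       return s * (1 - s) * (1 - 2 * s)

   def gradient_hessian(y, labels, thresholds):
       cuts = np.concatenate(([-np.inf], thresholds, [np.inf]))
       upper = cuts[labels + 1] - y   # theta_k     - y_i for the observed label k
       lower = cuts[labels] - y       # theta_{k-1} - y_i
       denom = sigmoid(upper) - sigmoid(lower)   # P(z_i = k); clipped in practice
       ratio = (d_sigmoid(upper) - d_sigmoid(lower)) / denom
       grad = ratio
       hess = -(dd_sigmoid(upper) - dd_sigmoid(lower)) / denom + ratio ** 2
       return grad, hess

   y = np.array([-2.0, 0.3, 3.1])
   grad, hess = gradient_hessian(y, labels=np.array([0, 1, 2]),
                                 thresholds=np.array([-1.0, 2.0]))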
Miscellaneous
~~~~~~~~~~~~~
@@ -212,12 +196,10 @@ and the hessian is:
.. math::

   \begin{align*}
   \sigma''(x) &= \frac{d}{dx}\sigma(x)(1-\sigma(x)) \\
   &= \sigma'(x)(1-\sigma(x)) - \sigma'(x)\sigma(x)\\
   &= \sigma(x)(1-\sigma(x))(1-\sigma(x)) - \sigma(x)(1-\sigma(x))\sigma(x) \\
   &= (1-\sigma(x))\left(\sigma(x)-2\sigma(x)^2\right)
   \end{align*}
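
These identities are easy to verify numerically; a short, purely
illustrative check against finite differences:

.. code-block:: python

   import numpy as np

   def sigmoid(x):
       return 1 / (1 + np.exp(-x))

   x = np.linspace(-4, 4, 9)
   h = 1e-4

   # sigma'(x) = sigma(x)(1 - sigma(x))
   analytic_first = sigmoid(x) * (1 - sigmoid(x))
   numeric_first = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

   # sigma''(x) = (1 - sigma(x)) * (sigma(x) - 2*sigma(x)**2)
   analytic_second = (1 - sigmoid(x)) * (sigmoid(x) - 2 * sigmoid(x) ** 2)
   numeric_second = (sigmoid(x + h) - 2 * sigmoid(x) + sigmoid(x - h)) / h ** 2

   print(np.allclose(analytic_first, numeric_first, atol=1e-6))   # True
   print(np.allclose(analytic_second, numeric_second, atol=1e-6)) # True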
.. raw:: html

133 changes: 133 additions & 0 deletions docs/motivation.ipynb
@@ -0,0 +1,133 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Overview\n",
"## Motivation\n",
"\n",
"Usually when faced with prediction problems involving ordered labels (i.e. low, medium, high) and tabular data, data scientists turn to regular multinomial classifiers from the gradient boosted tree family of models, because of their ease of use, speed of fitting, and good performance. Parametric ordinal models have been around for a while, but they have not been popular because of their poor performance compared to the gradient boosted models, especially for larger datasets.\n",
"\n",
"Although classifiers can predict ordinal labels adequately, they require building as many classifiers as there are labels to predict. This approach, however, leads to slower training times, and confusing feature interpretations. For example, a feature which is positively associated with the increasing order of the label set (i.e. as the feature's value grows, so do the probabilities of the higher ordered labels), will va a positive association with the highest ordered label, negative with the lowest ordered, and a \"concave\" association with the middle ones."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
" <table>\n",
" <tr>\n",
" <th>\n",
" <img src=\"_static/feature_low_label.svg\" width=\"250\"/>\n",
" </th>\n",
" <th>\n",
" <img src=\"_static/feature_high_label.svg\" width=\"250\"/>\n",
" </th>\n",
" <th>\n",
" <img src=\"_static/feature_medium_label.svg\" width=\"250\"/>\n",
" </th>\n",
" </tr>\n",
" </table>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"## Creating an ordinal loss\n",
"\n",
"We build an ordinal model using the \"threshold\" approach where a regressor learns a latent variable $y$, which is then contrasted to the real line split into regions using a set of thresholds $\\Theta$ to produce probabilities for each label. For a K labeled problem, we use K-1 thresholds $\\{\\theta_1,...,\\theta_{k-1}\\}$ that produce K regions in the real line. Each of these regions is associated with one of the levels of the label, and when the latent variable $y$ lies within their region, the probability of the label being on that level is maximised.\n",
"<div>\n",
"<img src=\"_static/thresholds.svg\" width=\"500\"/>\n",
"</div>\n",
"<div>\n",
"<img src=\"_static/thresholds_max_proba.svg\" width=\"500\"/>\n",
"</div>\n",
"\n",
"\n",
"Because of learning a single latent variable, we can calculate the cumulative probability of the label $z$ being at most at a certain level n [0,...,n,...,K] (contrasted to the regular classifier which assumes all labels are independent). This probability is **higher** when the latent variable gets smaller or when the level we consider is larger. In other words, in a 5 leveled ordinal problem, given a latent variable value $y$, the cumulative probability that our label is at most the third level is always going to be higher than being at most on the second level.\n",
"$$\n",
"P(z \\leq 3^{\\text{rd}};y,\\Theta) > P(z \\leq 2^{\\text{nd}};y,\\Theta)\n",
"$$\n",
"\n",
"Using the same setup, given that we are calculating the cumulative probability of our label being at most on third level, a **lower** latent value will lead to a higher probability.\n",
"\n",
"$$\n",
" \\text{Given that } y_1 > y_2,\n",
"$$\n",
"$$\n",
" P(z \\leq 3^{\\text{rd}};y_1,\\Theta) < P(z \\leq 3^{\\text{rd}};y_2,\\Theta)\n",
"$$\n",
"\n",
"We can create a cumulative distribution function $F$ that calculates this probability and satisfies the aforementioned conditions, in addition to the following that makes it into a good candidate for being a CDF:\n",
"* Is continuous differentiable, and so is it's derivative\n",
"* It's domain is between 0 and 1\n",
"* Is monotonically increasing\n",
"\n",
"The probability of the label being a particular level is then just a subtraction of the cumulative probability of being up to that level and that of being up to a level below\n",
"$$\n",
" P(z = n;y,\\Theta) = P(z \\leq n;y,\\Theta) - P(z \\leq n-1;y,\\Theta)\n",
"$$\n",
"\n",
"With this, [the negative log likelihood as our loss, and by calculating it's gradient and hessian](maths.rst), we can (almost) build a gradient boosted ordinal model.\n"
]
},
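{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numerical illustration of the two properties above, using the sigmoid as the CDF $F$ and an arbitrary set of thresholds. The threshold values and the helper `cumulative_probability` are illustrative only and not part of the ordinalgbt API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def sigmoid(x):\n",
"    return 1 / (1 + np.exp(-x))\n",
"\n",
"# Four illustrative thresholds for a 5-level problem.\n",
"thresholds = np.array([-2.0, -0.5, 1.0, 2.5])\n",
"\n",
"def cumulative_probability(level, y):\n",
"    # P(z <= level; y, Theta) = sigmoid(theta_level - y), levels counted from 1\n",
"    return sigmoid(thresholds[level - 1] - y)\n",
"\n",
"y1, y2 = 1.5, -1.0  # y1 > y2\n",
"\n",
"# Being at most the third level is more likely than being at most the second ...\n",
"print(cumulative_probability(3, y1) > cumulative_probability(2, y1))\n",
"\n",
"# ... and a lower latent value makes 'at most the third level' more likely.\n",
"print(cumulative_probability(3, y1) < cumulative_probability(3, y2))"
]
},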
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optimising the thresholds\n",
"\n",
"GBT frameworks allow only for building trees by looking at the gradient and hessian of the loss with respect to the raw predictions of the model. Therefore, they won't allow us to also optimise the thresholds at the same time, as is done in other ordinal models. \n",
"\n",
"Instead we could view this as a two step optimisation problem. We first pick reasonable thresholds, build some trees, then otpimise the thresholds given the predictions, and then re-build the trees given the new thresholds. This could be repeated as many times as we want, and with any reasonable scalar as the stopping point for starting the threshold optimisation. In the current approach we do this only once and call it hot-starting the model."
]
},
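{
"cell_type": "markdown",
"metadata": {},
"source": [
"A schematic sketch of the second step: re-optimising the thresholds while the latent predictions are held fixed, by minimising the negative log likelihood with scipy. The helper `optimise_thresholds` is hypothetical and only illustrates the idea; it is not how ordinalgbt implements hot-starting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.optimize import minimize\n",
"\n",
"def sigmoid(x):\n",
"    return 1 / (1 + np.exp(-x))\n",
"\n",
"def ordinal_nll(thresholds, y, labels, eps=1e-15):\n",
"    # Negative log likelihood of the labels given fixed latent predictions y.\n",
"    cuts = np.concatenate(([-np.inf], np.sort(thresholds), [np.inf]))\n",
"    cdf = sigmoid(cuts[None, :] - y[:, None])\n",
"    proba = np.clip(np.diff(cdf, axis=1), eps, 1 - eps)\n",
"    return -np.log(proba[np.arange(len(labels)), labels]).sum()\n",
"\n",
"def optimise_thresholds(y, labels, initial_thresholds):\n",
"    # Step two of the two-step scheme: the trees are frozen, only the thresholds move.\n",
"    result = minimize(ordinal_nll, x0=np.asarray(initial_thresholds, float),\n",
"                      args=(y, labels), method='Nelder-Mead')\n",
"    return np.sort(result.x)\n",
"\n",
"y = np.array([-2.0, -1.5, 0.2, 0.4, 2.5, 3.0])\n",
"labels = np.array([0, 0, 1, 1, 2, 2])\n",
"optimise_thresholds(y, labels, initial_thresholds=[0.0, 1.0])"
]
},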
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problems\n",
"\n",
"Life would be boring if implementation followed smoothly from theory, but fortunately this is not the case here.\n",
"\n",
"When the latent variable becomes too large, the probability of the label being any level other than the highest one tends to 0 really fast (depending on the choice of the $F$ CDF). This single probability estimate dominates the loss function and creates problems when optimising the thresholds. To combat this the probabilities are capped to a lower and upper limit when calculating the loss.\n",
"\n",
"Another issue is that the sigmoid (which we have chosen to be the $F$ function), tends to 1 and 0 fairly quickly, which presents two possibilities:\n",
"* if the thresholds are close to each other then only the lowest and highest levels can reach a probability of 1. All other levels max out at a much lower level.\n",
"* If the thresholds are further apart, most of the probability mass is concentrated on a smaller set of levels. \n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
4 changes: 2 additions & 2 deletions setup.cfg
@@ -3,7 +3,7 @@
packages=
ordinalgbt
install_requires=
lightgbm
lightgbm<4
numpy
scipy
scikit-learn
@@ -14,7 +14,7 @@ install_requires=
[metadata]
name = ordinalgbt
description = A library to build Gradient boosted trees for ordinal labels
version = 0.1.1
version = 0.1.2
long_description = file:README.md
long_description_content_type = text/markdown
author = Adamos Spanashis