Merge pull request #7 from adamingas/development
Adds new documentation
adamingas authored Jan 17, 2024
2 parents d837413 + 8e39011 commit 37c4698
Showing 10 changed files with 373 additions and 28 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -4,4 +4,5 @@ dist/
.vscode/
.ipynb_checkpoints/
**/*.egg-info
**/build
**/build
.DS_Store
27 changes: 27 additions & 0 deletions docs/_static/feature_high_label.svg
27 changes: 27 additions & 0 deletions docs/_static/feature_low_label.svg
30 changes: 30 additions & 0 deletions docs/_static/feature_medium_label.svg
63 changes: 63 additions & 0 deletions docs/_static/thresholds.svg
78 changes: 78 additions & 0 deletions docs/_static/thresholds_max_proba.svg
10 changes: 7 additions & 3 deletions docs/conf.py
@@ -5,6 +5,7 @@

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
import os

project = "OrdinalGBT"
copyright = "2023, Adamos Spanashis"
@@ -19,14 +20,17 @@
"sphinx.ext.napoleon",
"sphinx.ext.viewcode",
"sphinx_rtd_theme",
"sphinx.ext.mathjax"
]
if os.environ.get("NO_MATHJAX", False):
    extensions.append("sphinx.ext.imgmath")
    imgmath_latex_preamble = "\\usepackage{amsmath}"
else:
    extensions.append("sphinx.ext.mathjax")
    mathjax_path = "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
autoapi_dirs = ["../ordinalgbt"] # location to parse for API reference
html_theme = "sphinx_rtd_theme"
exclude_patterns = []
nb_execution_mode = "off"
mathjax_path = "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

26 changes: 4 additions & 22 deletions docs/maths.rst
@@ -27,8 +27,8 @@ In a three ordered labeled problem, we only need two thresholds,
are associated with each label
:math:`(-\infty,\theta_1], (\theta_1, \theta_2], (\theta_2, \infty)`.

Deriving the probabilities
~~~~~~~~~~~~~~~~~~~~~~~~~~

A property we want our mapping from latent variable to probability to
have is for the cumulative probability of label :math:`z` being at most
@@ -60,15 +60,8 @@ Naturally, the probability of :math:`z` being any particular label is
then:

.. math::

   \newcommand{\problessthank}{P(z \leq k; y,\Theta )}
   \begin{align*}
   P(z = k; y,\Theta ) &= P(z \leq k; y,\Theta) - P(z \leq k-1; y,\Theta ) \\
   &= F(\theta_k - y) - F(\theta_{k-1} - y)
   \end{align*}
A function that satisfies all these conditions is the sigmoid function,
hereafter denoted as :math:`\sigma`.

Deriving the loss function
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -85,15 +78,12 @@ As is usual in machine learning we use the negative log likelihood as
our loss:

.. math::

   \begin{align*}
   l({\bf y};\Theta) &= -\log L({\bf y},\Theta)\\
   &= -\sum_{i=0}^n I(z_i=k)\log(P(z_i = k; y_i,\Theta)) \\
   &= -\sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right)
   \end{align*}
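
The loss above translates directly into array code. The following NumPy
sketch is purely illustrative (the helper names are not part of the
ordinalgbt API): it builds the label probabilities as sigmoid differences
and evaluates the negative log likelihood for integer-encoded labels, with
the indicator :math:`I(z_i=k)` realised as an index into the probability
matrix.

.. code-block:: python

   import numpy as np

   def sigmoid(x):
       return 1 / (1 + np.exp(-x))

   def label_probabilities(y, thresholds):
       # Pad the thresholds with -inf and +inf so the cumulative
       # probabilities start at 0 and end at 1, then take differences:
       # P(z = k) = sigma(theta_k - y) - sigma(theta_{k-1} - y)
       cuts = np.concatenate(([-np.inf], thresholds, [np.inf]))
       cdf = sigmoid(cuts[None, :] - y[:, None])
       return np.diff(cdf, axis=1)

   def ordinal_nll(y, labels, thresholds, eps=1e-15):
       # The indicator I(z_i = k) picks, for each sample, the probability
       # column of its observed label; clipping avoids log(0).
       proba = np.clip(label_probabilities(y, thresholds), eps, 1 - eps)
       return -np.sum(np.log(proba[np.arange(len(labels)), labels]))

   # Example: three labels (0, 1, 2) and two thresholds.
   y = np.array([-2.0, 0.3, 3.1])
   print(ordinal_nll(y, labels=np.array([0, 1, 2]), thresholds=np.array([-1.0, 2.0])))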
Deriving the gradient and hessian
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To use a custom loss function with gradient boosting tree frameworks
(i.e. lightgbm), we have to first derive the gradient and hessian of the
@@ -106,7 +96,6 @@ We denote the first and second order derivative of the sigmoid as
The gradient is denoted as:

.. math::

   \begin{align*}
   \mathcal{G} &= \frac{\partial l({\bf y};\Theta)}{\partial {\bf y}} \\
   &= -\frac{\partial }{\partial {\bf y}} \sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right) \\
   &=
   \begin{pmatrix}
   I(z_1 = k) \left( \frac{\sigma'(\theta_k-y_1) - \sigma'(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right) \\
   ... \\
   I(z_n = k) \left( \frac{\sigma'(\theta_k-y_n) - \sigma'(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right) \\
   \end{pmatrix}
   \end{align*}
The summation disappears when taking the derivative with respect to
variable :math:`y_i`, as every element of the summation depends on only
one latent variable:

.. math::

   \begin{align*}
   \frac{\partial f(y_1)+f(y_2)+f(y_3)}{\partial {\bf y}} &=
   \begin{pmatrix}
   \frac{\partial f(y_1)+f(y_2)+f(y_3)}{\partial y_1} \\
   \frac{\partial f(y_1)+f(y_2)+f(y_3)}{\partial y_2} \\
   \frac{\partial f(y_1)+f(y_2)+f(y_3)}{\partial y_3} \\
   \end{pmatrix}
   =
   \begin{pmatrix}
   \frac{\partial f(y_1)}{\partial y_1} \\
   \frac{\partial f(y_2)}{\partial y_2} \\
   \frac{\partial f(y_3)}{\partial y_3} \\
   \end{pmatrix}
   \end{align*}
The hessian is the partial derivative of the gradient with respect to
the latent variable vector. This means that for each element of the
@@ -173,7 +159,6 @@ The hessian is then reduced to a vector:
.. math::

   \begin{align*}
   \mathcal{H} &=
   \begin{pmatrix}
   \frac{\partial}{\partial y_1 y_1} \\
   ... \\
   \frac{\partial}{\partial y_n y_n} \\
   \end{pmatrix}
   =
   \begin{pmatrix}
   -I(z_1 = k) \left( \frac{\sigma''(\theta_k-y_1) - \sigma''(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right) +
   I(z_1 = k)\left( \frac{\sigma'(\theta_k-y_1) - \sigma'(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right)^2 \\
   ... \\
   -I(z_n = k) \left( \frac{\sigma''(\theta_k-y_n) - \sigma''(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right) +
   I(z_n = k)\left( \frac{\sigma'(\theta_k-y_n) - \sigma'(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right)^2 \\
   \end{pmatrix}
   \end{align*}
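
As a numerical companion to the expressions above, the sketch below
(illustrative only, not the ordinalgbt implementation) evaluates the
per-sample gradient and hessian of the loss with respect to the latent
predictions, using the sigmoid derivative identities from the next
section:

.. code-block:: python

   import numpy as np

   def sigmoid(x):
       return 1 / (1 + np.exp(-x))

   def d_sigmoid(x):
       # sigma'(x) = sigma(x)(1 - sigma(x))
       s = sigmoid(x)
       return s * (1 - s)

   def dd_sigmoid(x):
       # sigma''(x) = sigma(x)(1 - sigma(x))(1 - 2*sigma(x))
       s = sigmoid(x)
       return s * (1 - s) * (1 - 2 * s)

   def gradient_hessian(y, labels, thresholds):
       cuts = np.concatenate(([-np.inf], thresholds, [np.inf]))
       upper = cuts[labels + 1] - y   # theta_k     - y_i for the observed label k
       lower = cuts[labels] - y       # theta_{k-1} - y_i
       denom = sigmoid(upper) - sigmoid(lower)   # P(z_i = k); clipped in practice
       ratio = (d_sigmoid(upper) - d_sigmoid(lower)) / denom
       grad = ratio
       hess = -(dd_sigmoid(upper) - dd_sigmoid(lower)) / denom + ratio ** 2
       return grad, hess

   y = np.array([-2.0, 0.3, 3.1])
   grad, hess = gradient_hessian(y, labels=np.array([0, 1, 2]),
                                 thresholds=np.array([-1.0, 2.0]))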
Miscellaneous
~~~~~~~~~~~~~
@@ -212,12 +196,10 @@ and the hessian is:
.. math::

   \begin{align*}
   \sigma''(x) &= \frac{d}{dx}\sigma(x)(1-\sigma(x)) \\
   &= \sigma'(x)(1-\sigma(x)) - \sigma'(x)\sigma(x)\\
   &= \sigma(x)(1-\sigma(x))(1-\sigma(x)) - \sigma(x)(1-\sigma(x))\sigma(x) \\
   &= (1-\sigma(x))\left(\sigma(x)-2\sigma(x)^2\right)
   \end{align*}
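
These identities are easy to verify numerically; a short, purely
illustrative check against finite differences:

.. code-block:: python

   import numpy as np

   def sigmoid(x):
       return 1 / (1 + np.exp(-x))

   x = np.linspace(-4, 4, 9)
   h = 1e-4

   # sigma'(x) = sigma(x)(1 - sigma(x))
   analytic_first = sigmoid(x) * (1 - sigmoid(x))
   numeric_first = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

   # sigma''(x) = (1 - sigma(x)) * (sigma(x) - 2*sigma(x)**2)
   analytic_second = (1 - sigmoid(x)) * (sigmoid(x) - 2 * sigmoid(x) ** 2)
   numeric_second = (sigmoid(x + h) - 2 * sigmoid(x) + sigmoid(x - h)) / h ** 2

   print(np.allclose(analytic_first, numeric_first, atol=1e-6))   # True
   print(np.allclose(analytic_second, numeric_second, atol=1e-6)) # True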
.. raw:: html

133 changes: 133 additions & 0 deletions docs/motivation.ipynb
@@ -0,0 +1,133 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Overview\n",
"## Motivation\n",
"\n",
"Usually when faced with prediction problems involving ordered labels (i.e. low, medium, high) and tabular data, data scientists turn to regular multinomial classifiers from the gradient boosted tree family of models, because of their ease of use, speed of fitting, and good performance. Parametric ordinal models have been around for a while, but they have not been popular because of their poor performance compared to the gradient boosted models, especially for larger datasets.\n",
"\n",
"Although classifiers can predict ordinal labels adequately, they require building as many classifiers as there are labels to predict. This approach, however, leads to slower training times, and confusing feature interpretations. For example, a feature which is positively associated with the increasing order of the label set (i.e. as the feature's value grows, so do the probabilities of the higher ordered labels), will va a positive association with the highest ordered label, negative with the lowest ordered, and a \"concave\" association with the middle ones."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
" <table>\n",
" <tr>\n",
" <th>\n",
" <img src=\"_static/feature_low_label.svg\" width=\"250\"/>\n",
" </th>\n",
" <th>\n",
" <img src=\"_static/feature_high_label.svg\" width=\"250\"/>\n",
" </th>\n",
" <th>\n",
" <img src=\"_static/feature_medium_label.svg\" width=\"250\"/>\n",
" </th>\n",
" </tr>\n",
" </table>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"## Creating an ordinal loss\n",
"\n",
"We build an ordinal model using the \"threshold\" approach where a regressor learns a latent variable $y$, which is then contrasted to the real line split into regions using a set of thresholds $\\Theta$ to produce probabilities for each label. For a K labeled problem, we use K-1 thresholds $\\{\\theta_1,...,\\theta_{k-1}\\}$ that produce K regions in the real line. Each of these regions is associated with one of the levels of the label, and when the latent variable $y$ lies within their region, the probability of the label being on that level is maximised.\n",
"<div>\n",
"<img src=\"_static/thresholds.svg\" width=\"500\"/>\n",
"</div>\n",
"<div>\n",
"<img src=\"_static/thresholds_max_proba.svg\" width=\"500\"/>\n",
"</div>\n",
"\n",
"\n",
"Because of learning a single latent variable, we can calculate the cumulative probability of the label $z$ being at most at a certain level n [0,...,n,...,K] (contrasted to the regular classifier which assumes all labels are independent). This probability is **higher** when the latent variable gets smaller or when the level we consider is larger. In other words, in a 5 leveled ordinal problem, given a latent variable value $y$, the cumulative probability that our label is at most the third level is always going to be higher than being at most on the second level.\n",
"$$\n",
"P(z \\leq 3^{\\text{rd}};y,\\Theta) > P(z \\leq 2^{\\text{nd}};y,\\Theta)\n",
"$$\n",
"\n",
"Using the same setup, given that we are calculating the cumulative probability of our label being at most on third level, a **lower** latent value will lead to a higher probability.\n",
"\n",
"$$\n",
" \\text{Given that } y_1 > y_2,\n",
"$$\n",
"$$\n",
" P(z \\leq 3^{\\text{rd}};y_1,\\Theta) < P(z \\leq 3^{\\text{rd}};y_2,\\Theta)\n",
"$$\n",
"\n",
"We can create a cumulative distribution function $F$ that calculates this probability and satisfies the aforementioned conditions, in addition to the following that makes it into a good candidate for being a CDF:\n",
"* Is continuous differentiable, and so is it's derivative\n",
"* It's domain is between 0 and 1\n",
"* Is monotonically increasing\n",
"\n",
"The probability of the label being a particular level is then just a subtraction of the cumulative probability of being up to that level and that of being up to a level below\n",
"$$\n",
" P(z = n;y,\\Theta) = P(z \\leq n;y,\\Theta) - P(z \\leq n-1;y,\\Theta)\n",
"$$\n",
"\n",
"With this, [the negative log likelihood as our loss, and by calculating it's gradient and hessian](maths.rst), we can (almost) build a gradient boosted ordinal model.\n"
]
},
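{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numerical illustration of the two properties above, using the sigmoid as the CDF $F$ and an arbitrary set of thresholds. The threshold values and the helper `cumulative_probability` are illustrative only and not part of the ordinalgbt API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def sigmoid(x):\n",
"    return 1 / (1 + np.exp(-x))\n",
"\n",
"# Four illustrative thresholds for a 5-level problem.\n",
"thresholds = np.array([-2.0, -0.5, 1.0, 2.5])\n",
"\n",
"def cumulative_probability(level, y):\n",
"    # P(z <= level; y, Theta) = sigmoid(theta_level - y), levels counted from 1\n",
"    return sigmoid(thresholds[level - 1] - y)\n",
"\n",
"y1, y2 = 1.5, -1.0  # y1 > y2\n",
"\n",
"# Being at most the third level is more likely than being at most the second ...\n",
"print(cumulative_probability(3, y1) > cumulative_probability(2, y1))\n",
"\n",
"# ... and a lower latent value makes 'at most the third level' more likely.\n",
"print(cumulative_probability(3, y1) < cumulative_probability(3, y2))"
]
},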
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optimising the thresholds\n",
"\n",
"GBT frameworks allow only for building trees by looking at the gradient and hessian of the loss with respect to the raw predictions of the model. Therefore, they won't allow us to also optimise the thresholds at the same time, as is done in other ordinal models. \n",
"\n",
"Instead we could view this as a two step optimisation problem. We first pick reasonable thresholds, build some trees, then otpimise the thresholds given the predictions, and then re-build the trees given the new thresholds. This could be repeated as many times as we want, and with any reasonable scalar as the stopping point for starting the threshold optimisation. In the current approach we do this only once and call it hot-starting the model."
]
},
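{
"cell_type": "markdown",
"metadata": {},
"source": [
"A schematic sketch of the second step: re-optimising the thresholds while the latent predictions are held fixed, by minimising the negative log likelihood with scipy. The helper `optimise_thresholds` is hypothetical and only illustrates the idea; it is not how ordinalgbt implements hot-starting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.optimize import minimize\n",
"\n",
"def sigmoid(x):\n",
"    return 1 / (1 + np.exp(-x))\n",
"\n",
"def ordinal_nll(thresholds, y, labels, eps=1e-15):\n",
"    # Negative log likelihood of the labels given fixed latent predictions y.\n",
"    cuts = np.concatenate(([-np.inf], np.sort(thresholds), [np.inf]))\n",
"    cdf = sigmoid(cuts[None, :] - y[:, None])\n",
"    proba = np.clip(np.diff(cdf, axis=1), eps, 1 - eps)\n",
"    return -np.log(proba[np.arange(len(labels)), labels]).sum()\n",
"\n",
"def optimise_thresholds(y, labels, initial_thresholds):\n",
"    # Step two of the two-step scheme: the trees are frozen, only the thresholds move.\n",
"    result = minimize(ordinal_nll, x0=np.asarray(initial_thresholds, float),\n",
"                      args=(y, labels), method='Nelder-Mead')\n",
"    return np.sort(result.x)\n",
"\n",
"y = np.array([-2.0, -1.5, 0.2, 0.4, 2.5, 3.0])\n",
"labels = np.array([0, 0, 1, 1, 2, 2])\n",
"optimise_thresholds(y, labels, initial_thresholds=[0.0, 1.0])"
]
},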
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problems\n",
"\n",
"Life would be boring if implementation followed smoothly from theory, but fortunately this is not the case here.\n",
"\n",
"When the latent variable becomes too large, the probability of the label being any level other than the highest one tends to 0 really fast (depending on the choice of the $F$ CDF). This single probability estimate dominates the loss function and creates problems when optimising the thresholds. To combat this the probabilities are capped to a lower and upper limit when calculating the loss.\n",
"\n",
"Another issue is that the sigmoid (which we have chosen to be the $F$ function), tends to 1 and 0 fairly quickly, which presents two possibilities:\n",
"* if the thresholds are close to each other then only the lowest and highest levels can reach a probability of 1. All other levels max out at a much lower level.\n",
"* If the thresholds are further apart, most of the probability mass is concentrated on a smaller set of levels. \n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
4 changes: 2 additions & 2 deletions setup.cfg
@@ -3,7 +3,7 @@
packages=
ordinalgbt
install_requires=
lightgbm
lightgbm<4
numpy
scipy
scikit-learn
@@ -14,7 +14,7 @@ install_requires=
[metadata]
name = ordinalgbt
description = A library to build Gradient boosted trees for ordinal labels
version = 0.1.1
version = 0.1.2
long_description = file:README.md
long_description_content_type = text/markdown
author = Adamos Spanashis