Skip to content

Commit

Permalink
new-posts (#8)
Browse files Browse the repository at this point in the history
* add westerlind2021

* complete westerlind2021

- added more info on methods, results and discussion of westerlind2021, though very minimal regarding the latter two chapters
- modified .gitignore to also ignore .csv, .jpg and .png

* add an article about mathematical notation

- currently only includes basics of set notation
  • Loading branch information
simonsteiger authored Jul 18, 2024
1 parent 4971f82 commit a0c5dae
Show file tree
Hide file tree
Showing 3 changed files with 208 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
/.quarto/
/_site/
*.html
*.csv
*.png
*.jp*g
155 changes: 155 additions & 0 deletions content/articles/westerlind2021/index.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,155 @@
---
title: Westerlind et al., 2021
date: 2024-07-17
author: Simon Steiger
draft: true
categories:
- Rheumatoid arthritis
- Treatment response
- Prediction
- Swedish registers
subtitle: >
What is the persistence to methotrexate in rheumatoid arthritis, and does machine learning outperform hypothesis-based approaches to its prediction?
---

::: {.callout-note}

### At a glance

Objectives
: To assess the 1-year persistence to methotrexate (MTX) and to compare data-driven and hypothesis-based methods to predict this persistence.

Key findings
: Two thirds of patients with early RA who start MTX remain on this therapy at 1 year after initiation. Predicting persistence is challenging with either hypothesis-based or data-driven methods, and may require additional types of data.

Related articles
: Other studies working on persistence, using similar methods (Anton's?), or the description of the SRQ by Erikson et al. could fit here.

Link
: DOI: [https://doi.org/10.1002/acr2.11266](https://doi.org/10.1002/acr2.11266)

:::

## Background

Prediction of clinical outcomes in RA at diagnosis and later throughout the disease progression is often attempted but rarely achieved (see also [Duong et al., 2022](../duong2022/index.qmd), [Castrejón et al., 2016](../castrejon2016/index.qmd), and [Myasoedova et al., 2021](../myasoedova2021/index.qmd)).
Identifying those patients who are likely to respond well to the common first line treatment MTX would allow prescribing other treatments, which are more likely to have an effect than MTX, to the remaining group of patients.

::: {.callout-tip}

### Utility functions

This sounds like the utility of identifying patients who are very likely to respond poorly is higher than that of identifying patients who are very likely to respond well to MTX.

Maybe it's worthwhile considering modelling this utility? See [Michael Betancourt's blog](https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html) as a possible starting point.

:::

Previous studies have rarely achieved an area under the [receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (AUROC) greater than 0.70.
Most studies attempted to predict treatment response at shorter follow-up (three or six months) with smaller sample sizes.

At the time of writing, machine learning had not been widely used for modeling in RA research.
There are large amounts of additional data on medical and family history of patients which is difficult to manually incorporate into regression models.
Since machine learning methods promise automatic identification of relevant predictors from thousands of predictors, they may improve predictive power by detecting and incorporating useful predictors from these unexplored data sources.

::: {.callout-warning}

### Reading the Lasso

I have to do more reading here, but have heard that variable selection techniques like [Lasso](https://en.wikipedia.org/wiki/Lasso_(statistics)) are fine for prediction tasks, reading into the specific variables they use to get the job done can be misleading.

:::

## Methods

### Data sources

- Swedish Rheumatology Quality Register (**SRQ**) for clinical data registered at inclusion and later visits
- National Patient Register (**NPR**) for visits to specialty care
- Prescribed Drug Register (**PDR**) for drug dispensations
- Total Population Register for sociodemographics
- Longitudinal Integration Database for Health Insurance and Labor Market Studies (**LISA**) for sick leave and disability pension
- Multi-Generatio Register for data on first-degree relatives

For more detail on the data linkage used, see Erikson et al. (TODO)

### Inclusion criteria

Patients included were

- new-onset RA
- registered in SRQ between 2006 and 2016
- registration within 1 year of RA symptom onset
- started MTX DMARD monotherapy as the first ever DMARD
- any of `M05`, `M06`, and `M13` ICD10 codes

### Defining persistence

To be classified as MTX persistent, a patient had to

- have a treatment record of MTX in the SRQ spanning 365 days after initiation
- not have received any other DMARD in this period

Further sub-outcomes were analysed and reported in the supplementary material but these are ommitted here.

### Covariates

Four nested covariate data sets were created.
All included the SRQ and sociodemographic data.

Set A
: This set included a traditional expert opinion-based set of predictors.

Set B
: This set expanded this information by using all primary diagnoses and prescriptions from the NPR and PDR.

Set C
: This set split information added in set B into time intervals (the year before, 1-5 years before, 5-10 years before).

Set D
: This set added contributory codes to the previously recorded main codes in set C.

### Statistics

Models were trained on a random 90% partition of the data and validated on the remaining 10%.

For all data sets and outcomes, the authors ran univariate logistic regressions to assess the association with the outcomes.

#### Hypothesis-based modelling

Hypothesis-based models all included data from covariate **Set A**.

After inspecting univariate associations and distributions of the individual covariates with the outcome, the epidemiologist built two models.
One model was based on manually entering and removing individual variables and testing interaction terms and nonlinear terms for continuous variables (this model was labelled *manual model*).
The second model was a simple backward selection[^1] logistic regression model starting with the full covariate set A.

[^1]: This variable selection technique is common but has come under [intense criticism](https://bookdown.org/max/FES/greedy-stepwise-selection.html) for violating principles of statistical hypothesis testing and failing to propagate the uncertainty arising in the selection step to when inferences are drawn. It is not yet clear to me in what way this may or may not pose issues in prediction settings.

#### Machine-learning models

The authors compared several machine learning methods, including

- Lasso
- Elastic net regularisation
- Support vector machines with a linear kernel
- Random forests
- Extreme gradient boosting

Fivefold cross-validation was applied in all machine learning models.

Finally, all but the elastic net model were combined into an ensemble model.
All models were used to predict the holdout data, and the AUROC was estimated for all covariate data sets and outcomes.

## Results

### MTX persistence

Out of 5475 patients, 3835 (70%) remained on MTX DMARD monotherapy one year after RA diagnosis.

### Model comparison

The best AUROCs from the hypothesis-based and machine-learning models were similar, scoring around 0.66 and 0.67, respectively.

## Conclusions

Machine-learning approaches are on par with hypothesis-based approaches and may offer valuable opportunities for integrating large numbers of potentially meaningful predictors from currently unavailable data types.
50 changes: 50 additions & 0 deletions content/statistics/mathnotation/index.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
---
title: "Mathematical notation for probability theory"
author: Simon Steiger
draft: true
date: "2024-07-18"
categories:
- Cheatsheet
- Reporting
subtitle: >
From sets to sparsity and product spaces to priors.
---

# General

Since my school days, I have lacked confidence in writing mathematical notation.
This cheatsheet is here to remedy that!

The notations I list here is all but set in stone, but seems to be least one of the common ways to approach notation in this subject area.
It is adapted from Michael Betancourt's [blog](https://betanalpha.github.io/writing/)

# Set notation

| Notation | Interpretation |
|:--------:|:--------|
| $X$ | The **ambient set** captures all objects of interest. |
| $x$ | A variable element in the ambient set $X$. |
| $x_n \in X$ | A specific element $x_n$ from the ambient set $X$. |
| $\text{x}$ | A variable subset of the ambient set $X$. |
| $X = \{\text{🥦, 🥫, 🥐}\}$ | A set with a finite number of elements. |
| $\text{x} \subset X$ | A **subset** of $X$. |
| $\text{x} \subseteq X$ | A subset potentially containing all elements of $X$. |
| $\text{x}'$ | A subset of another subset $\text{x}$. |
| $\emptyset = \{\}$ | The **empty set** which contains no elements at all. |
| $\{x_n\}$ | A subset with a single element is the **atomic set**[^1]. |
| $2^X$ | The **power set** of $X$ is the collection of all its subsets. |
| $\text{x} = \{\text{🥦}\}, \text{x}^c = \{\text{🥫, 🥐}\}$ | The **complement** $\text{x}^c$ of $\text{x}$ includes all $x_n \in X$ not already in $\text{x}$. |
| $\{\text{🥦, 🥫}\} \cup \{\text{🥫, 🥐}\} = \{\text{🥦, 🥫, 🥐}\}$ | A **union** includes all elements found in *either* subset. |
| $\{\text{🥦, 🥫}\} \cap \{\text{🥫, 🥐}\} = \{\text{🥫}\}$ | An **intersection** includes all elements found in *both* subsets. |

: Set notation and its interpretation {tbl-colwidths="[30,70]" .striped .borderless}

[^1]: Atomic vectors finally make sense (looking at you here, `R`!).

::: {.callout-caution}

### Work in progress

This document will be extended with more chapters!

:::

0 comments on commit a0c5dae

Please sign in to comment.