version | author |
---|---|
1.0 | GB, AC, SJ, HL |
Coursework submitted for module #2491 Data Challenge
MSc Health Data Science, London School of Hygiene and Tropical Medicine
Contributors:
AC @Syn4pt1c
HL @malaporpism
Supervisor: Dr Rosalind "Roz" Eggo
Client: UK Health Security Agency (UKHSA)
Influenza is a highly infectious disease caused by a family of viruses of the same name. Up to 650,000 deaths a year worldwide can be directly attributed to seasonal influenza outbreaks. There are several types of influenza virus, with types A and B predominant among humans. Flu is usually self-limiting: for most people, symptoms last 2 to 7 days. Severe cases may require hospitalisation or even result in death, particularly in at-risk groups (children under 2 years old, pregnant women, adults over 65, and people with underlying conditions).
The UK's flu surveillance system comprises numerous data sources, for example the weekly percentage of GP consultations presenting with influenza-like illness (ILI), or death registrations where the leading cause is flu. To better understand and compare these data sources, we defined our research question as follows:
What is the temporal relationship between different influenza data sources (lab-confirmed infections, GP consultations, hospitalisations) in the UK, and has it changed in 2022/23 compared with the pre-pandemic years 2016-2019?
We hypothesise that there is a measurable lag between one surveillance source and another. Statistical tools can then be applied to predict the timing and scale of a 'later' data source from an 'earlier' one. Additionally, influenza activity may appear at different times and on different scales in different age groups, so one age group might be predictive of another. Changes in reporting procedures for respiratory infections in recent years (post-COVID) are also suspected to cause some variation in the data.
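One common way to look for such a lag is the cross-correlation function. The sketch below is only an illustration on made-up curves, not project data, and it is not necessarily the method used in the analysis scripts. It assumes two weekly series of equal length, here called `earlier` and `later` (hypothetical names), and uses base R's `ccf()` to find the shift at which they line up best.

```r
# Synthetic weekly series (illustrative only, not project data):
# 'earlier' peaks in week 15; 'later' is the same curve shifted 2 weeks on.
weeks   <- 1:33
earlier <- dnorm(weeks, mean = 15, sd = 3)
later   <- dnorm(weeks, mean = 17, sd = 3)

# ccf(x, y) estimates the correlation between x[t + k] and y[t], so with
# x = later and y = earlier a positive k means 'later' trails 'earlier'.
cc <- ccf(later, earlier, lag.max = 8, plot = FALSE)

# Lag (in weeks) with the highest cross-correlation
best_lag <- cc$lag[which.max(cc$acf)]
best_lag
```

Real surveillance series are noisy and strongly seasonal, so the peak is rarely this clean; the argument order also matters, since swapping `later` and `earlier` flips the sign of the lag.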
All data used in this project were publicly available on various governmental websites.
Primary care data are reported weekly by the RCGP Research and Surveillance Centre in its communicable and respiratory disease reports. Swab data are reported weekly by PHE. Secondary care data are extracted from the UKHSA Severe Acute Respiratory Infection Watch (SARI Watch) system, which is updated weekly during the flu season (week 40 to week 20 of the following year). Mortality data are reported weekly by the ONS.
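Because a flu season straddles two calendar years, weekly records usually need a season label before seasons can be compared. The following is a minimal sketch of that mapping, assuming each record carries a calendar year and a week number; the function name `flu_season()` and the `"2022/23"` label format are illustrative choices, not taken from the project code.

```r
# Hypothetical helper: map (year, week) to a season label such as "2022/23".
flu_season <- function(year, week) {
  year <- as.integer(year)
  week <- as.integer(week)
  # Weeks 40+ open a season; weeks 1-20 close the season that began the year before.
  start <- ifelse(week >= 40L, year,
                  ifelse(week <= 20L, year - 1L, NA_integer_))
  label <- sprintf("%d/%02d", start, (start + 1L) %% 100L)
  label[is.na(start)] <- NA_character_  # weeks 21-39 fall outside the season
  label
}

flu_season(c(2022, 2023, 2023), c(40, 20, 30))
```

Week 40 of 2022 and week 20 of 2023 both map to the 2022/23 season, while week 30 falls outside the surveillance window and is returned as `NA`.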
All cleaning and processing were done in R (v4.2.2) using the RStudio IDE.
If you have SSH configured, great, just fire up the terminal and run:
$ cd ~/YOUR/PATH
$ git clone git@github.com:gabrielbattcock/data_challenge.git
If you don't, all is not lost my friend, you can still clone over HTTPS:
$ cd ~/YOUR/PATH
$ git clone https://github.com/gabrielbattcock/data_challenge.git
This project cannot be built without the numerous packages from the R community. We use `pacman::p_load()` to install, if needed, and load these packages without user intervention. If you prefer to install them selectively or individually, below is a comprehensive list of the packages, grouped by purpose.
- GENERAL PURPOSE
  - Tidyverse (core)
  - pacman for loading and installing packages
  - Knitr for dynamic report generation
  - shiny to build interactive web applications
- READING FILES
- DATA MANIPULATION
  - [Reshape2 [retired]](https://cran.r-project.org/web/packages/reshape2/index.html) for `melt()`
  - Magrittr for the almighty double pipe `%<>%`
  - kableExtra to construct complex tables within the pipe syntax
- STATISTICAL ANALYSIS
- VISUALISATION
  - ggrepel to avoid overlapping text labels
  - ggpubr for publication-ready plots
  - gt for presenting tables
  - gtsummary for presenting data summary and analytic results
  - hrbrthemes for additional themes and utils for `ggplot2`
  - RColorBrewer for additional colour schemes
  - robvis for risk-of-bias (ROB) assessments
  - scales to map data to aesthetics
  - wesanderson for Wes Anderson inspired palettes and themes
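As a rough sketch, the packages listed above could be installed and loaded in one go as shown here. The helper name `load_project_packages()` is hypothetical, and the snippet assumes every package is available from CRAN under the name given; defining the helper installs nothing until it is called.

```r
# Package names taken from the list above (assumed to be their CRAN names).
pkgs <- c("tidyverse", "pacman", "knitr", "shiny",
          "reshape2", "magrittr", "kableExtra",
          "ggrepel", "ggpubr", "gt", "gtsummary", "hrbrthemes",
          "RColorBrewer", "robvis", "scales", "wesanderson")

# Hypothetical helper: bootstrap pacman if absent, then let p_load()
# install whatever is missing and attach everything.
load_project_packages <- function(packages = pkgs) {
  if (!requireNamespace("pacman", quietly = TRUE)) {
    install.packages("pacman")
  }
  pacman::p_load(char = packages)
}

# Check which packages are not yet installed (this line installs nothing).
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs
```

Calling `load_project_packages()` then does the actual work; passing a shorter character vector loads only a subset.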
You don't need to run `source_data_entry.R` by itself. Load the `.Rproj`, then just open the script you want to run; it will `source()` the data it needs.
Within the directory `../R_scripts`, there is a script called `source_data_entry.R`, which gathers the various files we have collected in `../allData` and outputs data frames that are used throughout the rest of the project.
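To see which data frames a data-entry script like this produces without cluttering the global environment, one option is to source it into a fresh environment. The sketch below is illustrative only: `data_frames_from()` is a hypothetical helper, and a throwaway stand-in script with made-up values is used so that the example runs anywhere.

```r
# Hypothetical helper: run a script in its own environment and return
# only the data frames it created.
data_frames_from <- function(script) {
  env <- new.env()
  source(script, local = env)
  Filter(is.data.frame, mget(ls(env), envir = env))
}

# Stand-in script with made-up values so the sketch is self-contained.
# In the project this would be the path to source_data_entry.R instead.
demo_script <- tempfile(fileext = ".R")
writeLines(c("gp <- data.frame(week = 40:42, ili_rate = c(1.2, 1.5, 2.1))",
             "n_weeks <- nrow(gp)"),
           demo_script)

frames <- data_frames_from(demo_script)
names(frames)
```

When pointing this at the real script, it is assumed that the working directory is the project root (which loading the `.Rproj` sets), since the script presumably reads its input files through relative paths.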
The remainder of the code is saved in folders named by purpose: `../R_scripts/Season_data` holds plots and analysis comparing the sources within a season, and `../R_scripts/source` compares the seasons for a given source, such as GP data.
data_challenge/
├─ README.md <<<<<<<<<<<<<<<<<<<<<<<<<< YOU ARE HERE
├─ allData/ * all data, in various formats
│ ├─ gp/
│ ├─ hospitalisation/
│ ├─ mortality/
│ ├─ swab/
│ ├─ vaccine/
│ └─ ...
├─ bibliography/
│ └─ UKHSA_bibliography.bib
├─ images/
│ └─ ... * full-resolution copy of all plots and infographic
├─ Meeting_notes/
│ └─ ...
├─ Presentation/
│ ├─ images/
│ ├─ Presentation_files/
│ ├─ Presentation.html
│ └─ Presentation.qmd * the presentation, with editable and executable code
├─ Presentation_files/
│ └─ ...
├─ R_scripts/
│ ├─ Season_data/
│ ├─ source/
│ ├─ corr.R
│ ├─ source_data_entry.R
│ └─ web_scraping_all_flu_subtypes.R
├─ report/
│ ├─ images/
│ ├─ UKHSA_report_files/
│ ├─ method_notes.R
│ ├─ UKHSA_report.html
│ ├─ UKHSA_report.pdf * the report rendered into pdf
│ └─ UKHSA_report.qmd * the report, with editable and executable code
├─ new_folder/
│ ├─ new_folder/
│ ├─ new_folder/
│ └─ new_folder/
├─ data_challenge.Rproj
└─ .gitignore