Merge pull request #15 from RobertASmithBresMed/RS_introduction_edits

Writing academic paper
RobertASmithBresMed · May 6, 2022 · 644a2fa
2 parents 3836c14 + 5603904
commit 644a2fa
Showing 2 changed files with 138 additions and 21 deletions.
diff --git a/report/academic_paper.Rmd b/report/academic_paper.Rmd
@@ -12,7 +12,7 @@ knitr::opts_chunk$set(echo = TRUE)
 
 # Living HEOR: Automating HTA with R {.tabset}
 
-*Robert A Smith^1^ & Paul P Schneider^1^*
+*Robert A Smith^1^, Paul P Schneider^2,3^ & Wael Mohammed^2,3^*
 
 <br>
 
@@ -27,25 +27,47 @@ knitr::opts_chunk$set(echo = TRUE)
   __Keywords__:  `HEOR`, `HTA`, `APIs`, `R`, `Plumber`.  
 
   __Intended Journal__: Wellcome Open Research
+
+  __Corresponding Author__: Robert A Smith (*rasmith3@sheffield.ac.uk*)
 
 ****  
 
 ## Introduction
 
 The process of updating economic models is time-consuming and expensive, and often involves the transfer of sensitive data between parties. This paper aims to demonstrate how updates to models in the Health Economics & Outcomes Research (HEOR) industry can be conducted in a way that allows clients, primarily those in the pharmaceutical industry, to retain full control of their data. We continue on to show how to automate reporting as new information becomes available, without the transfer of data between parties.
 
+A previous publication by @adibi2022programmable made the case for cloud-based platforms to improve the accessibility, transparency and standardization of health economic models, particularly highlighting the benefits of hosting computationally burdensome models on remote servers. However, to our knowledge, this is the first publication to demonstrate that APIs can remove the need for companies to share sensitive data, and the first to provide open source code for the semi-automation of model updates in health economics.
+
 ## Method {.tabset}
 
 We developed an automated analysis and reporting pipeline for health economic modelling and made the source code openly available on a GitHub repository. It consists of three parts:
--	An economic model is constructed by the consultant using pseudo data (i.e. random data, which has the same format as the real data).
--	On the company side, an application programming interface (API), generated using the R package plumber, is hosted on a server. An automated workflow is created. This workflow sends the economic model to the company API. The model is then run within the company server. The results are sent back to the consultant, and a (PDF) report is automatically generated using RMarkdown.
-- This API hosts all sensitive data, so that data does not have to be provided to the consultant.
-- All of these processes can be controlled through an RShiny app, based on the tutorial application in our previous paper [@smith2020making].
+
+*	An economic model is constructed by the consultant using pseudo data (i.e. random data, which has the same format as the real data).
+
+*	On the company side, an application programming interface (API), generated using the R package plumber, is hosted on a server. An automated workflow is created. This workflow sends the economic model to the company API. The model is then run within the company server. The results are sent back to the consultant, and a (PDF) report is automatically generated using RMarkdown.
+
+* This API hosts all sensitive data, so that data does not have to be provided to the consultant.
+
+* All of these processes can be controlled through an RShiny app, based on the tutorial application in our previous paper [@smith2020making].
 
 ![Schematic showing the interaction between the Company API and the Consultant Automated Workflow](../app_files/www/process_diagram.PNG)
 
 All of the scripts discussed in this paper, as well as the code for the demonstration app can be found contained within an open access  [GitHub repository](https://github.com/RobertASmithBresMed/plumberHE).
 
+### The model code
+
+This model code has been amended from the DARTH group's open source cohort state-transition model (the Sick-Sicker model), which can be found in this [GitHub repository](https://github.com/DARTH-git/Cohort-modeling-tutorial/) and is discussed in @alarid2020cohort. The code includes several functions, but for the purpose of this example we can treat the model as a black box: a single function called *run_model*, which runs the DARTH Sick-Sicker model. The *run_model* function takes a single argument, *psa_inputs*, which is a data-frame containing the Probabilistic Sensitivity Analysis (PSA) parameter inputs for the model variables that are allowed to vary.
+
+The data-frame has four columns:
+
+* *parameter* - the name of the parameter (e.g. p_HS1)
+* *distribution* - the distribution of that parameter (e.g. "beta")
+* *V1* - the first parameter for the distribution in R (for beta this would be shape1, for normal this would be mean)
+* *V2* - the second parameter for the distribution in R (for beta this would be shape2, for normal this would be sd)
+
+The *run_model* function returns a data-frame with six columns and a thousand rows. Each row is the result of a model run for one random draw from the PSA inputs. The first three columns are the costs for each treatment option, and the last three columns are the QALYs for each treatment option.
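
To make the input format concrete, the sketch below builds a small *psa_inputs* data-frame in the four-column layout described above and draws PSA samples from it. Only p_HS1 is named in the text; the other parameter names, all numeric values, and the helper `draw_psa()` are hypothetical and chosen for illustration. The sampling code in the repository may well differ.

```r
# Hypothetical PSA inputs in the four-column format described above.
# Only p_HS1 is named in the text; other rows and all values are illustrative.
psa_inputs <- data.frame(
  parameter    = c("p_HS1", "c_Trt", "u_S1"),
  distribution = c("beta", "normal", "beta"),
  V1           = c(30, 12000, 75),
  V2           = c(170, 1000, 25),
  stringsAsFactors = FALSE
)

# draw_psa() is a hypothetical helper: it maps each distribution name to the
# matching base R random number generator and returns one column per parameter.
draw_psa <- function(psa_inputs, n_draws = 1000) {
  draws <- lapply(seq_len(nrow(psa_inputs)), function(i) {
    switch(psa_inputs$distribution[i],
      beta   = rbeta(n_draws, shape1 = psa_inputs$V1[i], shape2 = psa_inputs$V2[i]),
      normal = rnorm(n_draws, mean = psa_inputs$V1[i], sd = psa_inputs$V2[i]),
      stop("Unsupported distribution: ", psa_inputs$distribution[i])
    )
  })
  names(draws) <- psa_inputs$parameter
  as.data.frame(draws)
}

set.seed(1)
psa_draws <- draw_psa(psa_inputs, n_draws = 1000)
dim(psa_draws)
```

Each row of `psa_draws` then corresponds to one set of sampled parameter values, which is the unit a function like *run_model* would evaluate once per row of its output.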
+
+The full model code is available on GitHub [here](https://github.com/RobertASmithBresMed/plumberHE/blob/main/R/darth_funcs.R). However, understanding this specific example model is not essential, so we move on to describe the creation of the API.
+
 ### The API
 
 An application programming interface is a set of rules, in the form of code, that allow different computers to interact with one another in real time. Whereas user-interfaces such as those generated by *shiny* allow humans to interact with data, APIs are designed to enable computers to interact with data.
@@ -62,13 +84,16 @@ There are lots of different implementations of APIs, but the main focus of this
 In the examples below we use JSON to pass information to and from our API.
 
 
+```{r, file='../R/darth_funcs.R', eval = F, echo = F}
+```
+
 #### Plumber
 
-Plumber allows programmers to create web APIs by decorating R source code with roxygen-style comments. The source code looks like a series of functions, which are not assigned to objects, with roxygen style comments specifying the parameters input to the function, the name of the function, and the output from the function. These functions are then made available as API endpoints by plumber. 
+The R package plumber allows programmers to create web APIs by decorating R source code with roxygen-style comments. The source code looks like a series of functions, which are not assigned to objects, with roxygen-style comments specifying the parameters input to the function, the name of the function, and the output from the function [@plumber2021citation; @R2021citation]. These functions are then made available as API endpoints by plumber.
+
+The code below gives an example function which echoes a message. The function takes one input, a string with the message, and outputs the message contained within a list. If this function were run in R it would return a list containing some text, like this:  `r    list(msg = paste0("The message is: '", "example_msg", "'"))`.
 
-The code below give an example function which echo's a message (returns the message).
-The function takes one input, a string with the message, and outputs the message.
-The API can be called using any type of request, the below shows and example of the GET request (the default for web-browsers).
+The API can be called using any type of request; the example below shows a 'GET' request (the default for web browsers).
 
 ```{r, eval = F, echo = T}
 
@@ -81,44 +106,80 @@ function(msg="") {
 
 ```
 
+The code for the model function uses the same principles, but is much more developed. There are three parameter inputs:
 
+* *path_to_psa_inputs* 
+* *model_functions*
+* *param_updates*
+
+The function sources the model functions from GitHub, obtains the model parameter data from within the API, and then overwrites any rows of that parameter data for which an update is supplied in *param_updates*. It then runs the model functions using the updated parameters, post-processes the results, checks that no sensitive data is included in the results, and then returns a data-frame of results. This entire process occurs on the server on which the API is hosted, with inputs and outputs passed to the API over the web in JSON format.
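
The overwrite step described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the helper `overwrite_params()`, the parameter names other than p_HS1 and all values are hypothetical, and the function in `darthAPI/plumber.R` may implement this step differently.

```r
# Parameter data held on the API server (illustrative values only).
psa_inputs <- data.frame(
  parameter    = c("p_HS1", "c_Trt", "u_S1"),
  distribution = c("beta", "normal", "beta"),
  V1           = c(30, 12000, 75),
  V2           = c(170, 1000, 25),
  stringsAsFactors = FALSE
)

# Updates supplied by the caller, in the same four-column format.
param_updates <- data.frame(
  parameter    = "c_Trt",
  distribution = "normal",
  V1           = 10000,
  V2           = 800,
  stringsAsFactors = FALSE
)

# overwrite_params() is a hypothetical helper: rows of `inputs` whose parameter
# name appears in `updates` are replaced; all other rows are left untouched.
overwrite_params <- function(inputs, updates) {
  unknown <- setdiff(updates$parameter, inputs$parameter)
  if (length(unknown) > 0) {
    stop("Unknown parameter(s): ", paste(unknown, collapse = ", "))
  }
  idx <- match(updates$parameter, inputs$parameter)
  inputs[idx, names(updates)] <- updates
  inputs
}

updated_inputs <- overwrite_params(psa_inputs, param_updates)
updated_inputs[updated_inputs$parameter == "c_Trt", ]
```

Rejecting unknown parameter names, as this sketch does, is one way to stop a caller from silently adding rows to the server-side data; whether the real API does so is not stated in the text.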
 
 ```{r, file='../darthAPI/plumber.R', eval = F, echo = T, class.source = 'fold-hide'}
 ```
 
-#### RStudio Connect
+#### Deploying an API
 
-Once you have an account, it is simple to deploy the API direct to [RStudio connect](https://www.rstudio.com/products/connect/) from the Rstudio IDE. RStudio have a blog on how to publish an API created using Plumber to RStudio connect [here](https://www.rstudio.com/blog/rstudio-1-2-preview-plumber-integration/#:~:text=%20Resources%20%201%20Creating%20an%20API.%20On,APIs%20defined%20in%20your%20project%20and...%20More%20)
+In this example we have deployed the API on RStudio Connect. An account is required for this, but once you have one it is possible to deploy the API directly to [RStudio Connect](https://www.rstudio.com/products/connect/) from the RStudio IDE. RStudio have a blog post on how to publish an API created using Plumber to RStudio Connect [here](https://www.rstudio.com/blog/rstudio-1-2-preview-plumber-integration/#:~:text=%20Resources%20%201%20Creating%20an%20API.%20On,APIs%20defined%20in%20your%20project%20and...%20More%20). There are numerous other providers of cloud computing services, many at cheaper prices, but, to our knowledge, none with such ease of deployment from RStudio.
 
+### Automating the model run
 
-### The model code
+The model run can be automated. We first show how to run the model from an R script, calling the API. We then continue to show how to use GitHub actions to automate the process.
 
-This model code has been amended from the DARTH group's open source Cohort state-transition model (the Sick-Sicker Model) which can be found in this [GitHub repository](https://github.com/DARTH-git/Cohort-modeling-tutorial/) and is discussed in @alarid2020cohort.
+#### Interact with API from RScript
 
-The code includes several functions, but for the purpose of this example we can treat the model as a black box, as a single function called 'run_model' which runs the DARTH Sick Sicker model. The 'run_model' function requires as an input a data-frame containing the PSA iteration values for each variable. The function runs the model and returns a data-frame with costs and QALYs for three distinct scenarios.
+We use the *POST* function from the *httr* package to query the API [@httr2020citation], as shown in the code chunk below. This function requires an internet connection. We provide values for several arguments:
 
-```{r, file='../R/darth_funcs.R', eval = F, echo = F}
-```
+* *url* - the URL of the RStudio Connect server hosting the API we have created using plumber. 
+* *path* - the path to the API within the server URL.
+* *query* & *body* - objects passed to the API in list format, with names matching the plumber function arguments.
+* *config* - allows the user to specify the API key needed to access the API.
 
-### Automating the model run
+The *content* function attempts to determine the correct format for the output from the API based upon the content type. This function ensures that the results object is a dataframe.
 
-The model run can be automated. We first show how to run the model from an R script, calling the API. We then continue to show how to use GitHub actions to automate the process.
+The script then goes on to save the data and generate a PDF report from the outputs using the RMarkdown package [@rmarkdown2020citation], the code for which can be found [here](https://github.com/RobertASmithBresMed/plumberHE/blob/main/report/darthReport.Rmd). The markdown report uses functions adapted from the [*darkpeak*](https://github.com/dark-peak-analytics/darkpeak) R package.
 
-#### Interact with API from RScript
 ```{r, file='../scripts/run_darthAPI.R',eval = F, echo = T}
 ```
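
The script itself is pulled in from file and is not reproduced in this diff. As a rough guide to the arguments listed above, a call of this kind might look like the sketch below. The wrapper `call_model_api()`, the server URL, the API path and the environment variable name are all placeholders rather than values from the repository, and the `Authorization: Key ...` header assumes the API-key scheme used by RStudio Connect.

```r
# Hypothetical wrapper around httr::POST(); the URL, path and key handling
# are placeholders, not taken from the repository.
call_model_api <- function(server_url, api_path, param_updates, api_key) {
  response <- httr::POST(
    url    = server_url,
    path   = api_path,
    body   = list(param_updates = param_updates),
    encode = "json",
    config = httr::add_headers(Authorization = paste("Key", api_key))
  )
  # Fail early on HTTP errors rather than trying to parse an error page.
  httr::stop_for_status(response)
  # Parse the JSON response; simplifyVector = TRUE asks jsonlite to return
  # a data-frame where the JSON is an array of records.
  httr::content(response, as = "parsed", type = "application/json",
                simplifyVector = TRUE)
}

# Example call (not run): requires an internet connection and a valid key.
# results <- call_model_api(
#   server_url    = "https://connect.example.com",
#   api_path      = "my-api/run-model",
#   param_updates = data.frame(parameter = "c_Trt", distribution = "normal",
#                              V1 = 10000, V2 = 800),
#   api_key       = Sys.getenv("CONNECT_KEY")
# )
```

Reading the key from an environment variable keeps it out of the script, which matters once the same script is run unattended from a scheduled workflow.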
 
 #### Use GitHub actions to automate the process
+
+Once the API is created and hosted online, it can be called any time. The advantage of this is that any updates to either the model code, or the data used by the model, can be undertaken separately and the model re-run by either party. Calls to the API can also be scheduled at routine intervals. This would enable the health economic evaluation model report to be updated, without human interaction, at regular intervals to reflect the most up-to-date data.
+
+In the example below we show how a GitHub Actions workflow (other providers are available) can be used to automate an update to a health economic evaluation [@chandrasekara2021introduction]. The workflow runs at 00:01 on the first day of every month. It first clones the GitHub repository on a GitHub Actions Windows 2019 server, then installs the necessary dependencies, before running the script described above to generate the model report. It then creates a pull request to the repository with the updated report. If GitHub is not the preferred location for report storage, it is possible to send the report via email or save it to cloud storage solutions such as Google Drive or Dropbox.
+
 ```{r, file='../.github/workflows/auto_model_run.yml',eval = F, echo = T}
 ```
 
 ## Discussion
 
-The method is relatively complex, and requires a strong understanding of R, APIs, RMarkdown and GitHub Actions. However, the end result is a process, which allows the consultant to conduct health economic (or any other) analyses on company data, without having direct access – the company does not need to share their sensitive data. The workflow can be scheduled to run at defined time points (e.g. monthly), or when triggered by an event (e.g. an update to the underlying data or model code). Results are generated automatically and wrapped into a full report. Documents no longer need to be revised manually.
+As the collection & storage of large data sets has become more commonplace in health & health care settings, this data is increasingly being used to inform decision making. However, concerns about the security of this data, and the ethical implications of linked data sets, make the owners of this valuable resource particularly reluctant to share data with health economic modeling teams. The ability to host APIs on data-owners' servers, and send the model to the data rather than the data to the model, is one potential solution to this problem. The example described in this paper may be relatively simple, but it gives a tech-savvy health economist everything they need to set up a modelling framework which does not rely on the sharing of data by a company (or other data-owner).
+
+The framework described has a number of benefits.
+
+* Firstly, no data needs to leave the data-owner's server. This is likely to significantly reduce administrative burden for both the company and the consultant, and reduce the risk of data leaks.
+
+* Separating the model code from the data can significantly improve the transparency of the health economic model. Allowing others to critique methods & hidden structural assumptions, test the code and identify bugs should improve the quality of models in the long run.
+
+* The computational burden of the model is handled on a remote server. The power of these servers is typically much greater than that of a laptop, which can speed up model run time considerably.
+
+* API calls can be made at any time, and will always reflect the data held by the company. In many cases these datasets are updated regularly, allowing companies, and other stakeholders, to see the results of the decision model based on the most up to date data, without needing human intervention to: send new datasets, re-run analysis, write a report, and provide that report in a suitable format for the company. Automating model updates at set schedules, or when data is updated, may be invaluable where data is updated regularly, as has been the case throughout the COVID-19 pandemic.
+
+However, the framework has a number of limitations: 
+
+* Firstly, the method is relatively complex, and requires a strong understanding of health economic modeling in R, API creation and hosting, RMarkdown or other automated reporting packages, and GitHub Actions. While we hope that this paper provides a useful resource to health economists seeking to utilise these methods, the bulk of the industry still operates in MS Excel. Providing tuition to upskill health economists, or creating teams consisting of health economists, data-scientists & software engineers, may mitigate this limitation somewhat. The [R for HTA consortium](https://r-hta.org/) has the potential to play a crucial role in upskilling the industry.
+
+* There are still likely to be concerns about data security, even with the authentication procedures built in to the API functionality. Collaboration with experts in this field may mitigate this significantly, since there is no fundamental reason why health data is any more sensitive, or vulnerable, than the plethora of other data (including banking data) that relies on APIs every day. It will be important to reassure companies that the use of APIs is likely to reduce, not increase, the risk of data breaches, and that every interaction with the data can be logged.
+
+* There is a risk that running the model remotely will result in the perception that the model is a 'black box'. The use of user-interfaces (such as those increasingly being created in shiny) to interrogate the model, as well as the increased transparency associated with being able to share code on sites such as GitHub, should reassure stakeholders that this framework is more transparent than the existing spreadsheet based solutions.
+
+* Often, when building a model, it is helpful to have access to the underlying data in order to investigate it, typically through the generation of descriptive statistics. The process of sharing pseudo-data enables modelers to ensure that the models they create conform to the structure of the data input. However, the modeler still needs to be able to write code that is versatile enough to cope with data with unknown distributions & ranges: code that will not break when the number of observations changes, or when the range or distribution of a variable changes. This can be addressed, again, by improved training and the use of standard packages.
+
+A recent working paper by @adibi2022programmable has provided a similar call to action, extolling the virtues of the API for decision modeling, and showing how APIs can be used to shift much of the computational burden away from key stakeholders & make models more accessible. This paper goes one step further, providing open source code for the creation and deployment of an API with an accompanying automated health economic evaluation update framework. It also provides clearly described open source code for two pieces of functionality not previously described elsewhere: first, it demonstrates how companies can host APIs themselves to negate the need to share data with subject experts, and second, it demonstrates how model updates can be automated with GitHub Actions.
+
 
 ## Conclusions
 
-This example demonstrates that it is possible, within a HEOR setting, to separate the health economic model from the data, and automate the main steps of the analysis pipeline. We believe this is the first application of this procedure for a HEOR project.
+This example framework, with accompanying open source code base, demonstrates that it is possible, within a HEOR setting, to separate the health economic model from the data, and automate the main steps of the analysis pipeline. We believe this is the first application of this procedure for a HEOR project and, to our knowledge, the first example to be made open source for the benefit of the wider community. We hope that this framework will improve the transparency of health economic models, reduce the cost & administrative burden of updating models, and increase the speed at which updates can occur.
 
 \newpage
 ## References