Skip to content

Latest commit

 

History

History
270 lines (194 loc) · 14.6 KB

SQL-generation-process.md

File metadata and controls

270 lines (194 loc) · 14.6 KB

Index

SQL generation for data extracts

The SQL required for data extracts should be generated from templates rather than created freehand. This has two main advantages:

  1. Fragments of SQL can be reused across multiple queries.
  2. Clinical codes sets can be stored and version controlled separately to the SQL and injected when required.

This doc describes the process for the SQL generation.

Clinical code sets

The GMCR receives data from EMIS, Vision and TPP practices. This means there is a combination of Read v2, CTV3, SNOMED and EMIS clinical codes. GraphNet attempt to link the codes (using the standard linking files from NHS Digital) so that in theory a single SNOMED code set could be used to match patients from any clinical system. However in practice this does not seem to match all codes. Therefore we ideally will have 4 code sets (Read v2, CTV3, SNOMED and EMIS) for each concept of interest. EMIS codes are most frequently found for medications, so it is likely for non-medication concepts we will only have 3 code sets (Read v2, CTV3 and SNOMED).

Structure

The clinical code sets live in this repository in the ./shared/clinical-code-sets directory. The directory structure is as follows:

clinical-code-sets
├─ conditions // conditions and diseases
  ├─ atrial-fibrillation // one folder per concept
    ├─ 1 // First version of the code set
      ├─ atrial-fibrillation.ctv3.txt
      ├─ atrial-fibrillation.readv2.txt
      └─ atrial-fibrillation.snomed.txt
    ├─ 2 // Second version of the code set
    ├─ ... // Further versions
  └─ hypertension
     ├─ 1
     └─ 2
├─ medications
├─ patient // patient-related such as ethnicity, occupation etc.
├─ procedures

The folder naming convention of the concepts, whereby the name is converted to lower case and spaces replaced with "-"s, is important, as this is used to identify the code sets in the SQL templates.

The numbering of the version folders, whereby newer versions get a higher version number, is also important, so that the SQL templates can use the latest version or reference a specific versions.

The naming convention of the code set files themselves is less important. They must end with .[ctv3|emis|readv2|snomed].txt so that we can identify the code terminology. The filename itself is not used, simply the code sets position in the directory hierarchy. However, it is useful to include the conditions name so that the code set can be identified when viewed in isolation.

If metadata on the code set exists, such as that generated by tools such as GetSet, that can be stored in the same directory with the following naming convention:

concept.[readv2|ctv3|snomed|emis].metadata.txt -- for unstructured terminology-specific metadata
concept.[readv2|ctv3|snomed|emis].metadata.json -- for structured terminology-specific metadata
README.md -- for general information about the code set

Versioning

Different projects may require different code sets for the same concept. Also code sets may be improved or added to over time. Sometimes a project may just want to use the most up to date code set, and other times projects may want to ensure consistency and methods reproducibility by having the code set fixed. Therefore it is sensible to allow for different code set versions.

It is unlikely that a code set would have more than a handful of versions. Therefore a simple versioning system where each version is given an integer (1,2,3,...) is sufficient. When a code set is referenced, the template SQL can specify a particular version. If no version is specified then the highest version number is used instead.

NB although in general a higher version number indicates a more up to date code set, this will not always be the case. E.g. a code set for a diagnosis could be highly sensitive to ensure no one is missed, or highly specific to ensure that only definite cases are matched. These might be versions 1 and 2, but projects would need to decide which code set was most appropriate for their use case.

New code sets

To make creating a new code set simple, an interactive command line tool can be used. It requires nodejs to be installed. For Windows users, launch the run - windows.bat file from the root of this project, then choose the option to Create a new code set, and follow on screen instructions. Linux/Mac users can execute npm start in a terminal from the root of the project. This creates the file and folder structure.

Validation

It is also possible to use the interactive command line tool to validate the existing code sets. This checks that the directory structure is as above, and also that the code set files have the correct format - namely two tab-separated columns code and description.

Project structure

Each research proposal should have it's own directory under .\projects. The naming convention for the project directories is XXX - Name where XXX is the project's assigned number, and Name is the name of the PI. The project directory is structured as follows:

projects
├─ 001 - Grant
  ├─ extraction-sql
    ├─ file1.sql
    ├─ file2.sql
    ├─ ...
  ├─ output-for-analysts // this gets generated by the extract-data.bat file and is not present initially
  ├─ scripts // js code for data extraction
  ├─ template-sql
    ├─ file1.template.sql
    ├─ file2.template.sql
    ├─ ...
  ├─ extract-data.bat
  ├─ generate-sql-windows.bat
  └─ generate-sql.sh
├─ 002 - Williams

The template-sql directory contains the SQL templates for each project. An automated script takes the templates and outputs SQL files into the extraction-sql directory. The output-for-analysts directory is where the extracted data files will be saved. This directory is auto-generated during the extraction process and does not need to be present in the repository.

There should be one *.template.sql file for each data file required by the project. Examples for naming might be:

copd-cohort-demographics.template.sql
copd-cohort-records.template.sql
matched-cohort-demographics.template.sql
matched-cohort-records.template.sql
cohort-hospitalisations.template.sql

SQL Templates

There should be one template.sql file per data file to be extracted. The SQL must end with a single SELECT statement, the output of which is the required data file. Interim temporary tables can be created within each SQL file.

SQL templates should include a header describing the data file that will be extracted. More precisely this will describe the columns. It is probably easiest to copy an existing one as an example, but the syntax is as follows:

--┌─────────────────────────────────┐
--│ Descriptive name of the file    │
--└─────────────────────────────────┘

-- OUTPUT: Data with the following fields
-- 	•	Date (YYYY-MM-DD)
--	•	ConsultationType (face2face/remote/home visit/ooh/other)
-- 	•	CCG (Bolton/Salford/HMR/etc.)

Please note the following for the final SELECT statement:

  1. All columns to be extracted should have a name. So if using CASE statements, or aggregate functions, you need to append AS ColumnName to the field. This ensures we can have column headers in the data extract.
  2. Data extraction is comma-separated by default, so if extracting data that may contain ","s, these can be removed prior to extraction using the REPLACE command.
  3. NULL values tend not to work with stats packages so it's usually better to replace with an empty string, 'NA' or 0 depending on the situation. ISNULL can be used for this.

An example with all 3 is as follows:

SELECT
  PatientId,
  ISNULL(IndexDate,'') AS IndexDate, -- replacing null index dates with the empty string
  Code,
  ISNULL(REPLACE(Description,',',' '),'') AS Description, -- replacing ","s with spaces and replacing NULL values
  MAX(CodeValue) AS HighestValue -- giving each column a name
FROM ...

Reusing SQL

Any SQL code that can be reused is placed under ./shared/Reusable queries for data extraction. Reusable code must have a header similar to the individual templates above. The syntax is below, but again, copying an existing query is likely the easiest option.

--┌─────────────────────────────┐
--│ Descriptive name of the sql │
--└─────────────────────────────┘

-- OBJECTIVE: The objective of this SQL. This is required. It can also
--            wrap onto more than one line like this.

-- ASSUMPTIONS:
--	-	This is an optional field
--	-	But where there are key assumptions that the analysts should know about
--    then they go here like this.
--	-	This means when the SQL is used in a data extract, we can provide the
--    analysts with a report showing the assumptions and limiations which they
--    can then optionally use in their work and final publication.

-- INPUT: Some templates only work if there already exists a temporary
--        table in the SQL session. E.g. a chunk of SQL that finds the
--        ethnicity information for a cohort of patients, might require
--        that there already exists a temporary table called #Patients
--        with an integer column called PatientId. The description of what
--        is required can go here. If no pre-requisites, please state that
--        explicitly.

-- OUTPUT: A description of the temporary tables that this query creates
-- #TableName (ColumnName1, ColumnName2, ColumnName3)
-- 	- ColumnName1 - description of this column
--	- ColumnName2 - state the type e.g. date, int, varchar etc..
--	- ColumnName3 - and anything else of interest
--

The syntax for then injecting these reusable chunks of SQL is as follows:

--> EXECUTE sql-file-to-execute.sql

Sometimes the SQL files require parameters passing. The header of the SQL file will describe which parameters are required. The format is as follows:

--> EXECUTE sql-file-to-execute.sql param1name:param1value param2name:param2value

The SQL file referenced should exist under ./shared/Reusable queries for data extraction.

Referencing code sets

Any query that requires code sets should include the following:

--> CODESET space-separated-list-of-required-code-sets-with-specific-number-of-version

E.g.

--> CODESET diabetes-type-i:1 metformin:1 copd:2 asthma:1

A SQL file can contain multiple rows like this. The only condition is that if a code set is required the call to --> CODESET must occur before it is used in the file.

If a script requires any code sets, then the following five temporary tables will be populated:

#AllCodes
Concept, Version, Code, Description

#CodeSets
FK_Reference_Coding_ID, Concept, Description

#SnomedSets
FK_Reference_SnomedCT_ID, Concept, Description

#VersionedCodeSets
FK_Reference_Coding_ID, Concept, Version, Description

#VersionedSnomedSets
FK_Reference_SnomedCT_ID, Concept, Version, Description

The #AllCodes table allows you to retrieve all clinical codes relating to a concept. The remaining four table link concepts to foreign key ids specific to the GraphNet database. The two #Versioned... tables allow queries based on the concept and the version. The other two tables just contain the most recent version (highest version number) of each concept. It is recommended to explicitly state the version of code set required.

FK_Reference_Coding_ID and FK_Reference_SnomedCT_ID are the (bigint) id fields that the GMCR database use to identify codes. All codes are mapped to one of both of these fields.

The temporary tables can be used as follows on any table containing the FK_Reference_Coding_ID or FK_Reference_SnomedCT_ID fields:

-- Finding all hypertension records using the a particular version of the code set
SELECT * FROM [RLS].[vw_GP_Events]
WHERE (
  FK_Reference_Coding_ID IN (SELECT FK_Reference_Coding_ID FROM #VersionedCodeSets WHERE Concept = 'hypertension' AND Version = 2) OR
  FK_Reference_SnomedCT_ID IN (SELECT FK_Reference_SnomedCT_ID FROM #VersionedSnomedSets WHERE Concept = 'hypertension' AND Version = 2)
)
AND ...

Occasionally it is necessary to search based on the clinical code rather than the foreign key ids. For example when new codes are recorded in clinical systems but have not yet been added to the Graphnet DB. In which case the #AllCodes table can be used instead as follows:

-- Find COVID vaccinations
SELECT * FROM [RLS].[vw_GP_Events]
WHERE SuppliedCode IN (SELECT Code FROM #AllCodes WHERE Concept = 'covid-vaccination' AND [Version] = 1)
AND ...

Process for generating SQL

The sql can be generated by executing either of the generate-sql scripts. There is a .bat file for Windows users and a .sh file for Linux/MacOS users.

Process for extracting data

Once the SQL has been generated, the entire project directory (XXX-Name) should be copied to the VDE. Launching the extract-data.bat file will guide you through the data extraction process, including deciding whether to run all, or just some, of the SQL files. The output-for-analysts directory will be created. This contains a skeleton folder structure, and a _README_FIRST.txt file to remind users of their responsibilities when extracting content from their VDE. The entire contents of the output-for-analysts directory should be copied to the shared directory for this project.

Patient id pseudonymisation

Analysts may receive multiple extracts of data from different projects. To reduce the chance of combining the data we will provide different patient ids for each project. This is done automatically during the extraction process.

NB Any patient ids that are to be pseudonymised must be extracted as a column whose name ends 'PatientId'. E.g. PatientId, AnotherPatientId, YetAnotherPatientId

During the data extraction you can choose to refresh the project's pseudonymisation, or to reuse the previous one.