# SQL generation process
The SQL required for data extracts should be generated from templates rather than created freehand. This has two main advantages:
- Fragments of SQL can be reused across multiple queries.
- Clinical code sets can be stored and version controlled separately from the SQL and injected when required.

This document describes the SQL generation process.
The GMCR receives data from EMIS, Vision and TPP practices, so the data contains a combination of Read v2, CTV3, SNOMED and EMIS clinical codes. GraphNet attempts to link the codes (using the standard linking files from NHS Digital) so that, in theory, a single SNOMED code set could be used to match patients from any clinical system. In practice, however, this does not match all codes. Ideally we will therefore have 4 code sets (Read v2, CTV3, SNOMED and EMIS) for each concept of interest. EMIS codes are most frequently found for medications, so for non-medication concepts we will typically only have 3 code sets (Read v2, CTV3 and SNOMED).
The clinical code sets live in this repository in the `./shared/clinical-code-sets` directory. The directory structure is as follows:
```
clinical-code-sets
├─ conditions                // conditions and diseases
│  ├─ atrial-fibrillation    // one folder per concept
│  │  ├─ 1                   // first version of the code set
│  │  │  ├─ atrial-fibrillation.ctv3.txt
│  │  │  ├─ atrial-fibrillation.readv2.txt
│  │  │  └─ atrial-fibrillation.snomed.txt
│  │  ├─ 2                   // second version of the code set
│  │  ├─ ...                 // further versions
│  └─ hypertension
│     ├─ 1
│     └─ 2
├─ medications
├─ patient                   // patient-related such as ethnicity, occupation etc.
└─ procedures
```
The folder naming convention for concepts, whereby the name is converted to lower case and spaces are replaced with "-"s, is important, as it is used to identify the code sets in the SQL templates.

The numbering of the version folders, whereby newer versions get a higher version number, is also important, so that the SQL templates can use the latest version or reference a specific version.
The naming convention of the code set files themselves is less important. They must end with `.[ctv3|emis|readv2|snomed].txt` so that we can identify the code terminology. The filename itself is not used, only the code set's position in the directory hierarchy. However, it is useful to include the concept's name so that the code set can be identified when viewed in isolation.
If metadata on the code set exists, such as that generated by tools like GetSet, it can be stored in the same directory with the following naming convention:

```
concept.[readv2|ctv3|snomed|emis].metadata.txt   -- for unstructured terminology-specific metadata
concept.[readv2|ctv3|snomed|emis].metadata.json  -- for structured terminology-specific metadata
README.md                                        -- for general information about the code set
```
Different projects may require different code sets for the same concept, and code sets may be improved or added to over time. Sometimes a project may simply want the most up-to-date code set; at other times a project may want to fix the code set to ensure consistency and reproducibility of methods. It is therefore sensible to allow for multiple versions of each code set.
It is unlikely that a code set would have more than a handful of versions, so a simple versioning system where each version is given an integer (1, 2, 3, ...) is sufficient. When a code set is referenced, the template SQL can specify a particular version. If no version is specified then the highest version number is used.
NB: although in general a higher version number indicates a more up-to-date code set, this will not always be the case. For example, a code set for a diagnosis could be highly sensitive (to ensure no one is missed) or highly specific (to ensure that only definite cases are matched). These might be versions 1 and 2, and projects would need to decide which code set was most appropriate for their use case.
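The folder-naming and version-resolution conventions above can be sketched as follows. This is a minimal illustration only; the helper names are hypothetical and the real tooling lives in the project scripts:

```javascript
// Hypothetical helpers illustrating the conventions described above.

// Concept folder names are the concept name lower-cased with spaces
// replaced by hyphens.
function conceptFolderName(concept) {
  return concept.toLowerCase().replace(/\s+/g, '-');
}

// Given the entries of a concept directory, pick the highest integer
// version folder (the fallback when a template does not pin a version).
function latestVersion(entries) {
  const versions = entries
    .map((name) => Number(name))
    .filter((n) => Number.isInteger(n) && n > 0);
  if (versions.length === 0) throw new Error('No version folders found');
  return Math.max(...versions);
}

console.log(conceptFolderName('Atrial Fibrillation')); // atrial-fibrillation
console.log(latestVersion(['1', '2', 'README.md'])); // 2
```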
To make creating a new code set simple, an interactive command line tool can be used. It requires Node.js to be installed. Windows users should launch the `run - windows.bat` file from the root of this project, choose the option to `Create a new code set`, and follow the on-screen instructions. Linux/Mac users can execute `npm start` in a terminal from the root of the project. This creates the file and folder structure.
It is also possible to use the interactive command line tool to validate the existing code sets. This checks that the directory structure is as above, and that the code set files have the correct format: namely two tab-separated columns, code and description.
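The per-file format check can be illustrated with the sketch below. The function name and exact behaviour are assumptions; the actual validator may report errors differently:

```javascript
// Sketch of the per-file check: every non-empty line must contain
// exactly two tab-separated columns (code and description).
function validateCodeSet(text) {
  const errors = [];
  text.split('\n').forEach((line, i) => {
    if (line.trim() === '') return; // ignore blank lines
    const columns = line.split('\t');
    if (columns.length !== 2 || columns.some((c) => c.trim() === '')) {
      errors.push(`Line ${i + 1}: expected "code<TAB>description"`);
    }
  });
  return errors;
}

console.log(validateCodeSet('G30..\tAcute myocardial infarction')); // []
console.log(validateCodeSet('a line with no tab').length); // 1
```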
Each research proposal should have its own directory under `./projects`. The naming convention for the project directories is `XXX - Name`, where `XXX` is the project's assigned number and `Name` is the name of the PI. The project directory is structured as follows:
```
projects
├─ 001 - Grant
│  ├─ extraction-sql
│  │  ├─ file1.sql
│  │  ├─ file2.sql
│  │  ├─ ...
│  ├─ output-for-analysts   // generated by the extract-data.bat file and not present initially
│  ├─ scripts               // js code for data extraction
│  ├─ template-sql
│  │  ├─ file1.template.sql
│  │  ├─ file2.template.sql
│  │  ├─ ...
│  ├─ extract-data.bat
│  ├─ generate-sql-windows.bat
│  └─ generate-sql.sh
├─ 002 - Williams
```
The `template-sql` directory contains the SQL templates for each project. An automated script takes the templates and outputs SQL files into the `extraction-sql` directory. The `output-for-analysts` directory is where the extracted data files will be saved. This directory is auto-generated during the extraction process and does not need to be present in the repository.
There should be one `*.template.sql` file for each data file required by the project. Example names might be:

```
copd-cohort-demographics.template.sql
copd-cohort-records.template.sql
matched-cohort-demographics.template.sql
matched-cohort-records.template.sql
cohort-hospitalisations.template.sql
```
Each template's SQL must end with a single `SELECT` statement, the output of which is the required data file. Interim temporary tables can be created within each SQL file.
SQL templates should include a header describing the data file that will be extracted; more precisely, it describes the columns. It is probably easiest to copy an existing one as an example, but the syntax is as follows:

```sql
--┌─────────────────────────────────┐
--│ Descriptive name of the file    │
--└─────────────────────────────────┘

-- OUTPUT: Data with the following fields
-- • Date (YYYY-MM-DD)
-- • ConsultationType (face2face/remote/home visit/ooh/other)
-- • CCG (Bolton/Salford/HMR/etc.)
```
Please note the following for the final `SELECT` statement:

- All columns to be extracted should have a name. If using `CASE` statements or aggregate functions, append `AS ColumnName` to the field. This ensures we can have column headers in the data extract.
- Data extraction is comma-separated by default, so if extracting data that may contain ","s, these can be removed prior to extraction using the `REPLACE` command.
- `NULL` values tend not to work with stats packages, so it is usually better to replace them with an empty string, 'NA' or 0 depending on the situation. `ISNULL` can be used for this.
An example with all three is as follows:

```sql
SELECT
  PatientId,
  ISNULL(IndexDate, '') AS IndexDate, -- replacing null index dates with the empty string
  Code,
  ISNULL(REPLACE(Description, ',', ' '), '') AS Description, -- replacing ","s with spaces and replacing NULL values
  MAX(CodeValue) AS HighestValue -- giving each column a name
FROM ...
```
Any SQL code that can be reused is placed under `./shared/Reusable queries for data extraction`. Reusable code must have a header similar to the individual templates above. The syntax is below, but again, copying an existing query is likely the easiest option.
```sql
--┌─────────────────────────────┐
--│ Descriptive name of the sql │
--└─────────────────────────────┘

-- OBJECTIVE: The objective of this SQL. This is required. It can also
--            wrap onto more than one line like this.
-- ASSUMPTIONS:
-- - This is an optional field
-- - But where there are key assumptions that the analysts should know about
--   then they go here like this.
-- - This means when the SQL is used in a data extract, we can provide the
--   analysts with a report showing the assumptions and limitations which they
--   can then optionally use in their work and final publication.
-- INPUT: Some templates only work if there already exists a temporary
--        table in the SQL session. E.g. a chunk of SQL that finds the
--        ethnicity information for a cohort of patients might require
--        that there already exists a temporary table called #Patients
--        with an integer column called PatientId. The description of what
--        is required can go here. If there are no prerequisites, please
--        state that explicitly.
-- OUTPUT: A description of the temporary tables that this query creates
--         #TableName (ColumnName1, ColumnName2, ColumnName3)
--         - ColumnName1 - description of this column
--         - ColumnName2 - state the type e.g. date, int, varchar etc.
--         - ColumnName3 - and anything else of interest
```
The syntax for injecting these reusable chunks of SQL is as follows:

```sql
--> EXECUTE sql-file-to-execute.sql
```

Some SQL files require parameters to be passed. The header of the SQL file will describe which parameters are required. The format is as follows:

```sql
--> EXECUTE sql-file-to-execute.sql param1name:param1value param2name:param2value
```

The SQL file referenced should exist under `./shared/Reusable queries for data extraction`.
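Parsing an `--> EXECUTE` directive might look like the sketch below. This is illustrative only; the generation script's actual parsing logic may differ, and the function name is hypothetical:

```javascript
// Parse a "--> EXECUTE file.sql name:value ..." directive into the
// referenced file and a map of its parameters.
function parseExecuteDirective(line) {
  const match = line.match(/^-->\s*EXECUTE\s+(\S+)\s*(.*)$/);
  if (!match) return null; // not an EXECUTE directive
  const params = {};
  for (const pair of match[2].split(/\s+/).filter(Boolean)) {
    const [name, value] = pair.split(':');
    params[name] = value;
  }
  return { file: match[1], params };
}

console.log(parseExecuteDirective(
  '--> EXECUTE sql-file-to-execute.sql param1name:param1value'
));
```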
Any query that requires code sets should include the following:

```sql
--> CODESET space-separated-list-of-required-code-sets-with-specific-number-of-version
```

E.g.

```sql
--> CODESET diabetes-type-i:1 metformin:1 copd:2 asthma:1
```

A SQL file can contain multiple rows like this. The only condition is that if a code set is required, the `--> CODESET` directive must occur before the code set is used in the file.
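The `--> CODESET` directive can be parsed along the same lines (again a sketch with a hypothetical helper name, not the real implementation):

```javascript
// Parse "--> CODESET concept:version ..." into {concept, version} pairs.
function parseCodesetDirective(line) {
  const match = line.match(/^-->\s*CODESET\s+(.*)$/);
  if (!match) return null; // not a CODESET directive
  return match[1]
    .trim()
    .split(/\s+/)
    .map((pair) => {
      const [concept, version] = pair.split(':');
      return { concept, version: Number(version) };
    });
}

console.log(parseCodesetDirective('--> CODESET diabetes-type-i:1 metformin:1'));
```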
If a script requires any code sets, then the following five temporary tables will be populated:

```
#AllCodes            (Concept, Version, Code, Description)
#CodeSets            (FK_Reference_Coding_ID, Concept, Description)
#SnomedSets          (FK_Reference_SnomedCT_ID, Concept, Description)
#VersionedCodeSets   (FK_Reference_Coding_ID, Concept, Version, Description)
#VersionedSnomedSets (FK_Reference_SnomedCT_ID, Concept, Version, Description)
```
The `#AllCodes` table allows you to retrieve all clinical codes relating to a concept. The remaining four tables link concepts to foreign key ids specific to the GraphNet database. The two `#Versioned...` tables allow queries based on the concept and the version. The other two tables contain only the most recent version (highest version number) of each concept. It is recommended to explicitly state the version of the code set required.

`FK_Reference_Coding_ID` and `FK_Reference_SnomedCT_ID` are the (bigint) id fields that the GMCR database uses to identify codes. All codes are mapped to one or both of these fields.
The temporary tables can be used as follows on any table containing the `FK_Reference_Coding_ID` or `FK_Reference_SnomedCT_ID` fields:

```sql
-- Finding all hypertension records using a particular version of the code set
SELECT * FROM [RLS].[vw_GP_Events]
WHERE (
  FK_Reference_Coding_ID IN (SELECT FK_Reference_Coding_ID FROM #VersionedCodeSets WHERE Concept = 'hypertension' AND Version = 2) OR
  FK_Reference_SnomedCT_ID IN (SELECT FK_Reference_SnomedCT_ID FROM #VersionedSnomedSets WHERE Concept = 'hypertension' AND Version = 2)
)
AND ...
```
Occasionally it is necessary to search based on the clinical code rather than the foreign key ids, for example when new codes have been recorded in clinical systems but have not yet been added to the GraphNet database. In that case the `#AllCodes` table can be used instead, as follows:
```sql
-- Find COVID vaccinations
SELECT * FROM [RLS].[vw_GP_Events]
WHERE SuppliedCode IN (SELECT Code FROM #AllCodes WHERE Concept = 'covid-vaccination' AND [Version] = 1)
AND ...
```
The SQL can be generated by executing either of the `generate-sql` scripts: there is a `.bat` file for Windows users and a `.sh` file for Linux/macOS users.
Once the SQL has been generated, the entire project directory (`XXX - Name`) should be copied to the VDE. Launching the `extract-data.bat` file will guide you through the data extraction process, including deciding whether to run all, or just some, of the SQL files. The `output-for-analysts` directory will be created. This contains a skeleton folder structure and a `_README_FIRST.txt` file to remind users of their responsibilities when extracting content from their VDE. The entire contents of the `output-for-analysts` directory should be copied to the shared directory for this project.
Analysts may receive multiple extracts of data from different projects. To reduce the chance of data being combined across projects, we provide different patient ids for each project. This is done automatically during the extraction process.

NB: any patient ids that are to be pseudonymised must be extracted as a column whose name ends `PatientId`, e.g. `PatientId`, `AnotherPatientId`, `YetAnotherPatientId`.

During the data extraction you can choose to refresh the project's pseudonymisation, or to reuse the previous one.