
Design Documentation

The ELM to OMOP translator takes CQL or ELM (parsed CQL) as input, and generates an OHDSI cohort definition as output. The translator can also submit the cohort definition to the OHDSI WebAPI to create it, initiate the cohort generation job, and poll for the cohort results.

As this is a PhEMA project, the ultimate goal of this work is to evaluate whether the CQL language can be used for cross-platform EHR-driven phenotyping.

Circe Overview

The library used internally by the OHDSI WebAPI to represent and execute cohort definitions is Circe. The normal way that users create cohort definitions is by using the OHDSI web interface, called Atlas. The user creates a list of inclusion rules by clicking buttons and selecting from dropdown menus. To each inclusion rule they add criteria groups, and to each group they add specific domain criteria by using standard user interface controls.

When the user saves the cohort definition, the Atlas application generates a JSON representation of it and submits this to the WebAPI, which saves it to the OHDSI database. The user must then take a separate action to initiate the cohort generation job. As part of this asynchronous job, the Circe library deserializes the JSON version of the cohort definition into Java objects that it uses internally. Circe then uses a set of builder classes, along with SQL templates, to construct the database queries needed to generate the cohort from the cohort definition.

There are several benefits to this approach. First, the graphical user interface allows users without any knowledge of SQL, and with only minimal knowledge of the OMOP Common Data Model (CDM), to construct cohorts of clinical and/or research interest. Second, separating the logic into distinct inclusion rules allows for more efficient database queries, since successive inclusion rules are only applied to the results of the previous rule. Third, separate inclusion rules allow for the generation of attrition statistics and visualizations. Finally, using SQL enables the creation of manually optimized queries that may be better than those generated by an object-relational mapper.
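The effect of applying inclusion rules successively can be illustrated with a minimal Python sketch. This is not how Circe works internally (Circe generates SQL); the patient records, rule names, and the `apply_rules` helper are hypothetical and exist only to show where attrition counts come from.

```python
from typing import Callable, Dict, List, Tuple

Patient = Dict[str, object]
Rule = Tuple[str, Callable[[Patient], bool]]


def apply_rules(
    patients: List[Patient], rules: List[Rule]
) -> Tuple[List[Patient], List[Tuple[str, int]]]:
    """Apply each rule only to the patients that passed the previous rule.

    Returns the final cohort and the number of patients left after each rule.
    """
    remaining = list(patients)
    attrition = []
    for name, predicate in rules:
        remaining = [p for p in remaining if predicate(p)]
        attrition.append((name, len(remaining)))
    return remaining, attrition


# Hypothetical toy data, for illustration only.
patients = [
    {"id": 1, "age": 70, "has_procedure": True},
    {"id": 2, "age": 40, "has_procedure": True},
    {"id": 3, "age": 66, "has_procedure": False},
]
rules = [
    ("Age >= 65", lambda p: p["age"] >= 65),
    ("Has procedure", lambda p: p["has_procedure"]),
]

cohort, attrition = apply_rules(patients, rules)
print(attrition)  # [('Age >= 65', 2), ('Has procedure', 1)]
```

Because the second rule only sees the two patients that survived the first, each rule's contribution to attrition can be reported separately.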

The following Circe UML diagram shows how cohort expressions are assembled, as well as what types of criteria exist.

Circe UML Diagram

CQL Overview

The Clinical Quality Language (CQL) is a domain-specific language focused on the clinical quality and decision support domains. It is parsed into a canonical abstract syntax tree called the Expression Logical Model (ELM). Libraries exist to parse CQL into the equivalent ELM representation, so internally all our computation is done on ELM.

CQL is a highly expressive language approaching the complexity of a general purpose programming language. As such, it is able to represent significantly more constructs than Circe, as can be seen in the ELM UML diagram below. One simple example is that CQL is able to evaluate the expression 1 + 1 and return the result of 2. Furthermore, there are usually many different ways that the same logic can be represented in CQL. CQL is also data model independent, which means that libraries must specify the data model that they are written for, and the evaluation engine must know about this data model in order to evaluate the library.
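To make the 1 + 1 example concrete, the following Python sketch evaluates a trimmed ELM-style tree. The node shapes follow the ELM JSON serialization as we understand it, with most attributes omitted; the `evaluate` function is a hypothetical toy and not part of any CQL engine.

```python
# Simplified ELM-style tree for the CQL expression `1 + 1`.
INTEGER = "{urn:hl7-org:elm-types:r1}Integer"
one_plus_one = {
    "type": "Add",
    "operand": [
        {"type": "Literal", "valueType": INTEGER, "value": "1"},
        {"type": "Literal", "valueType": INTEGER, "value": "1"},
    ],
}


def evaluate(node: dict) -> int:
    """Evaluate the tiny subset of ELM used in this sketch."""
    if node["type"] == "Literal":
        return int(node["value"])
    if node["type"] == "Add":
        left, right = (evaluate(operand) for operand in node["operand"])
        return left + right
    raise ValueError(f"Unsupported node type: {node['type']}")


print(evaluate(one_plus_one))  # 2
```

Circe has no equivalent of this kind of general expression evaluation, which is one reason only a subset of CQL can be translated.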

A CQL library can contain many statements, each of which is evaluated separately (although statements can reference each other). This means that evaluating a CQL library returns multiple results - one for each statement in the library. Further, each result can be one of many different data types.

ELM UML Diagram

Implementation Considerations

Language Support

For the above reasons, we can only support translating a limited subset of the CQL language. We therefore define a set of supported language constructs, along with some conventions that must be followed so that we can successfully translate the CQL library to a Circe cohort definition. We aim to support the following language constructs:

  • The CalculateAge() function, used to determine the age of the patient [docs]
  • Simple Retrieve operation with terminology filtering, to access the underlying data [docs]
  • The following numeric comparisons: =, <, <=, >, >= [docs]
  • Some Query operations with where clause filtering [docs]
  • Some correlated Query operations with a single relationship [docs]
  • Timing relationships in Query constructs (including correlated queries) [docs]
  • The And and Or logical operators [docs]

The Query operations that we are able to support are limited by the criteria and criteria attributes that Circe is able to represent.

More operations will be added to the above list over time, for example, additional demographic characteristics.
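One way to enforce such a subset is to walk the ELM tree and report any node types outside an allow-list. The sketch below assumes a mapping from the constructs listed above to ELM node type names, plus `Literal` for comparison operands; the set contents, the helper names, and the trimmed example tree are assumptions for illustration, not the translator's actual code.

```python
from typing import Iterator, Set

# Assumed ELM node types for the supported constructs listed above.
SUPPORTED_TYPES: Set[str] = {
    "CalculateAge", "Retrieve", "Query", "And", "Or",
    "Equal", "Less", "LessOrEqual", "Greater", "GreaterOrEqual",
    "Literal",
}


def node_types(node: object) -> Iterator[str]:
    """Yield the `type` of every node in a nested dict/list structure."""
    if isinstance(node, dict):
        if "type" in node:
            yield node["type"]
        for value in node.values():
            yield from node_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from node_types(item)


def unsupported_types(expression: dict) -> Set[str]:
    """Return the node types in the expression that are not in the allow-list."""
    return set(node_types(expression)) - SUPPORTED_TYPES


# Trimmed, hypothetical ELM-style expression.
expression = {
    "type": "And",
    "operand": [
        {"type": "GreaterOrEqual", "operand": [
            {"type": "CalculateAge", "precision": "Year"},
            {"type": "Literal", "value": "18"},
        ]},
        {"type": "Greater", "operand": [
            {"type": "Count", "source": {"type": "Retrieve", "dataType": "Procedure"}},
            {"type": "Literal", "value": "2"},
        ]},
    ],
}
print(unsupported_types(expression))  # {'Count'}
```

A check like this lets a translator fail early with a clear message instead of producing a partial cohort definition.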

Data Model

A simple approach would be to use the OMOP CDM as the data model in our CQL libraries, but this would limit the environments in which the library can be executed. We have therefore taken the decision to use the QUICK data model, which is a set of FHIR profiles and data type mappings that are focused on quality measurement and decision support use cases. It is likely that many CQL libraries will be written using the QUICK data model, and supporting this data model means that we are able to re-use logic written for many clinical quality measurement and decision support use cases.

In order to map the QUICK model references to the OMOP CDM, we use the mappings published by the Common Data Model Harmonization project.
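At its simplest, such a mapping can be thought of as a lookup from a QUICK type name to a Circe domain criteria class name. The two entries below are illustrative assumptions only and are not taken from the Common Data Model Harmonization mappings; `circe_criteria_for` is a hypothetical helper.

```python
# Illustrative entries only; real mappings come from the published project.
QUICK_TO_CIRCE = {
    "Condition": "ConditionOccurrence",
    "Procedure": "ProcedureOccurrence",
}


def circe_criteria_for(quick_type: str) -> str:
    """Look up the Circe criteria class name for a QUICK type name."""
    try:
        return QUICK_TO_CIRCE[quick_type]
    except KeyError:
        raise ValueError(f"No mapping defined for QUICK type {quick_type!r}") from None


print(circe_criteria_for("Procedure"))  # ProcedureOccurrence
```

Raising on an unmapped type keeps an unsupported data reference from being silently dropped during translation.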

Conventions

Patient Context Only

CQL libraries may contain zero or more context statements. A context statement tells the interpreter whether to apply some data filtering. For example, if the Patient context is specified, then only data for a specific patient is included in the evaluation. If the Unfiltered context is used, then data for all patients is considered. Data models may optionally specify additional evaluation contexts [docs].

We support only the Patient context, which means that each statement should be written with knowledge that it will be applied to a single patient only. This also means that we cannot support the population-based aggregate operators [docs].

Phenotype Entry Point Statement

Since a CQL library may contain multiple statements, but we only create one Circe cohort definition, we must somehow determine which statement represents the phenotype definition. Currently, the translator is written in such a way that it takes the name of the phenotype definition statement as a parameter. An alternative approach could be to use a statement naming convention, or some other way to annotate the correct statement definition.
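Selecting the entry point by name can be sketched as a lookup over the library's statement definitions. The nesting `library.statements.def` follows the ELM JSON serialization as we understand it; the statement names and the `find_statement` helper are hypothetical.

```python
from typing import Any, Dict


def find_statement(library: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Return the named statement definition from an ELM library."""
    statements = library["library"]["statements"]["def"]
    for statement in statements:
        if statement["name"] == name:
            return statement
    available = ", ".join(s["name"] for s in statements)
    raise KeyError(f"Statement {name!r} not found; available: {available}")


# Trimmed, hypothetical ELM library with two statements.
elm_library = {
    "library": {
        "statements": {
            "def": [
                {"name": "Is Adult", "expression": {"type": "GreaterOrEqual"}},
                {"name": "Phenotype Case", "expression": {"type": "And"}},
            ]
        }
    }
}

entry = find_statement(elm_library, "Phenotype Case")
print(entry["expression"]["type"])  # And
```

Listing the available names in the error message helps when the parameter is mistyped.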

Boolean Return Types

The ultimate decision that must be made for each patient is whether or not they should be included in the cohort specified by the Circe cohort definition, or equivalent CQL library. In Circe, this decision is made by taking the logical conjunction of each inclusion rule applied to each patient.

The approach we have taken is to require that all CQL statements return a boolean value. At first this may seem limiting, but it matches exactly how Circe represents cohort definitions: each Circe criteria yields a boolean result indicating whether or not a given patient meets the criteria.

Implementation Details

Some technical implementation details are described below.

Inclusion Rules

When users create cohort definitions using Atlas, it is convenient to group conceptually similar criteria together in a single inclusion rule. For example, two demographic criteria, such as one for age and one for gender, may both be added to one inclusion rule. Without introducing additional conventions, it is unfortunately not possible to detect these conceptually similar criteria in the translator code. As a result, all criteria are added to a single inclusion rule.

There are two unfortunate consequences. First, it is not possible to determine the attrition contribution from groups of criteria. Second, this limitation may result in poorer performing database queries, since criteria are not applied only to the results of preceding inclusion rules.

It may be worth introducing additional conventions to overcome these limitations.

🤯 Criteria Groups, Correlated Criteria, and Criteria

In Circe, criteria groups are used to group collections of criteria together. Criteria groups must also specify how the contained criteria should apply. For example, the user can specify whether all criteria must apply, whether any one can apply, or whether a minimum or maximum number of criteria must apply.
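The group semantics described above can be sketched as a function over the boolean results of the contained criteria. The type names `ALL`, `ANY`, `AT_LEAST`, and `AT_MOST` are used here as plain strings, and `group_satisfied` is a hypothetical helper, not a Circe method.

```python
from typing import List


def group_satisfied(group_type: str, results: List[bool], count: int = 0) -> bool:
    """Decide whether a group is met, given each contained criteria's result."""
    matched = sum(results)
    if group_type == "ALL":
        return matched == len(results)
    if group_type == "ANY":
        return matched >= 1
    if group_type == "AT_LEAST":
        return matched >= count
    if group_type == "AT_MOST":
        return matched <= count
    raise ValueError(f"Unknown group type: {group_type}")


results = [True, False, True]
print(group_satisfied("ALL", results))          # False
print(group_satisfied("ANY", results))          # True
print(group_satisfied("AT_LEAST", results, 2))  # True
print(group_satisfied("AT_MOST", results, 1))   # False
```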

Criteria themselves can be correlated or uncorrelated. For example, looking for a condition that matches a specific value set is an uncorrelated criteria. Looking for a measurement with a specific value that occurs within some time frame of a specific procedure is an example of a correlated criteria.

Unfortunately, in version 1.7.0 of the Circe library, criteria groups can only contain instances of the CorelatedCriteria class (or DemographicCriteria or other CriteriaGroups). This means that specific domain criteria (e.g. ProcedureOccurrence) must always be wrapped in a CorelatedCriteria, even when uncorrelated. Further, the Criteria parent class of all domain criteria has a field called CorelatedCriteria, which is actually of type CriteriaGroup, which can be very confusing. However, this field does determine which criteria groups are correlated to the specific domain criteria instance.

Consider the very simple case of a cohort definition where we are only looking for patients that have had a procedure matching a specific value set. To accomplish this, we would create a CohortDefinition instance, to which we would add a CohortExpression containing a single InclusionRule. The InclusionRule class contains a single expression member that is of type CriteriaGroup.

To construct the logic, we begin by creating an instance of the ProcedureOccurrence criteria referencing the appropriate value set (more on value sets below). We leave the CorelatedCriteria field (note: this is the name of the field, not its type, which is unfortunately actually CriteriaGroup) null, since this is an uncorrelated criteria.

We must then create a CorelatedCriteria, and set the criteria field to the ProcedureOccurrence instance just created. Finally, we can add this CorelatedCriteria to a CriteriaGroup, which we can then add to the InclusionRule. We end up with something that looks like the following:

• CohortDefinition
   • CohortExpression
      • List<InclusionRule>
         - 0: CriteriaGroup
               • List<CorelatedCriteria>
                  - 0: CorelatedCriteria
                        • ProcedureOccurrence
                           • CriteriaGroup = null (name: CorelatedCriteria, type: CriteriaGroup)
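The nesting above can be mirrored with a few Python dataclasses. These are stand-ins that borrow the Circe class names for readability; they are not the Circe Java API, and the `codeset_id` field, the rule name, and the snake_case field names are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CriteriaGroup:
    type: str = "ALL"
    criteria_list: List["CorelatedCriteria"] = field(default_factory=list)
    groups: List["CriteriaGroup"] = field(default_factory=list)


@dataclass
class ProcedureOccurrence:
    codeset_id: int  # hypothetical reference to a value set
    # Named after the field described in the text; note its type is CriteriaGroup.
    corelated_criteria: Optional[CriteriaGroup] = None


@dataclass
class CorelatedCriteria:
    criteria: ProcedureOccurrence


@dataclass
class InclusionRule:
    name: str
    expression: CriteriaGroup


@dataclass
class CohortExpression:
    inclusion_rules: List[InclusionRule] = field(default_factory=list)


# Build the structure from the inside out, as described in the text.
procedure = ProcedureOccurrence(codeset_id=0)    # uncorrelated: field stays None
wrapped = CorelatedCriteria(criteria=procedure)  # wrapper is always required
group = CriteriaGroup(type="ALL", criteria_list=[wrapped])
expression = CohortExpression(inclusion_rules=[InclusionRule("Has procedure", group)])

print(expression.inclusion_rules[0].expression.criteria_list[0].criteria.codeset_id)  # 0
```

Building from the innermost criteria outward makes the mandatory wrapper step explicit.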

One reason why all Criteria must be wrapped in a CorelatedCriteria is that cohorts in OHDSI are modeled using the idea of a cohort entry event, and all criteria are actually correlated in some way to this entry event, either directly or indirectly. The CorelatedCriteria class therefore has startWindow and endWindow fields, which are relative (directly or indirectly) to the entry event.

Finally, CorelatedCriteria also has an occurence field of type Occurence, which is used to describe how domain criteria should apply. Continuing the above example, if the procedure should occur at least three times, then we specify this using an Occurence instance.
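The "at least three times" case can be sketched as follows. The `Occurence` stand-in below borrows the class name as spelled in the text; its string-valued `type` field and the `occurrence_met` helper are assumptions for illustration and do not reflect the actual Circe class layout.

```python
from dataclasses import dataclass


@dataclass
class Occurence:  # stand-in, spelled as in the text above
    type: str     # assumed values: "EXACTLY", "AT_LEAST", "AT_MOST"
    count: int


def occurrence_met(occurrence: Occurence, observed: int) -> bool:
    """Check an observed number of matching events against the requirement."""
    if occurrence.type == "EXACTLY":
        return observed == occurrence.count
    if occurrence.type == "AT_LEAST":
        return observed >= occurrence.count
    if occurrence.type == "AT_MOST":
        return observed <= occurrence.count
    raise ValueError(f"Unknown occurrence type: {occurrence.type}")


at_least_three = Occurence(type="AT_LEAST", count=3)
print(occurrence_met(at_least_three, observed=3))  # True
print(occurrence_met(at_least_three, observed=2))  # False
```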

Note that in all cases except for Boolean logic (see below), we translate CQL constructs to their equivalent CorelatedCriteria representations.

Below is a table with a short summary of some of the important Circe classes.

| Class | Description |
| --- | --- |
| CohortDefinition | This is actually a class in the WebAPI, not Circe, but it is the outermost wrapper of the payload that is sent to the WebAPI. |
| CohortExpression | This class is the container for the expression logic, including the InclusionRule instances, and the cohort entry event (an instance of PrimaryCriteria). |
| InclusionRule | An inclusion rule just wraps a single CriteriaGroup, giving it a name. |
| CriteriaGroup | A criteria group contains any number of CorelatedCriteria, DemographicCriteria and/or other CriteriaGroup instances. It also specifies how these criteria are applied (e.g. ALL, ANY, AT_LEAST 3, etc). |
| CorelatedCriteria | This class wraps all domain criteria, and associates start and end windows with them (relative to the parent criteria or entry event). This class also has an Occurence field specifying how many, say, procedures should be found. |
| Criteria | This abstract class is the parent of all the domain criteria (e.g. ConditionOccurrence, Observation, etc). Criteria also contains a field (unfortunately) called CorelatedCriteria of type CriteriaGroup which facilitates temporal correlation between criteria. |
| DemographicCriteria | This is a special type of criteria allowing the user to filter cohort members based on demographic characteristics. |

Nested Boolean Logic

Boolean logic can only be represented in Circe using CriteriaGroup instances of type ALL and ANY, representing Boolean AND and OR statements respectively. We support arbitrarily nested Boolean AND and OR statements, and implement such nesting using nested CriteriaGroup instances.
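The nesting strategy can be sketched as a recursive function that turns ELM `And`/`Or` nodes into groups of type `ALL`/`ANY`. Groups are plain dictionaries here rather than Circe objects, the leaf nodes are placeholders, and `to_group` is a hypothetical helper that assumes the root node is itself an `And` or `Or`.

```python
from typing import Any, Dict

GROUP_TYPE = {"And": "ALL", "Or": "ANY"}


def to_group(node: Dict[str, Any]) -> Dict[str, Any]:
    """Translate nested And/Or nodes into nested group dictionaries.

    Operands that are not And/Or are kept as-is in the group's criteria list.
    """
    group = {"type": GROUP_TYPE[node["type"]], "criteria": [], "groups": []}
    for operand in node["operand"]:
        if operand["type"] in GROUP_TYPE:
            group["groups"].append(to_group(operand))
        else:
            group["criteria"].append(operand)
    return group


# Placeholder tree for: A and (B or C)
elm = {
    "type": "And",
    "operand": [
        {"type": "Leaf", "name": "A"},
        {"type": "Or", "operand": [
            {"type": "Leaf", "name": "B"},
            {"type": "Leaf", "name": "C"},
        ]},
    ],
}

group = to_group(elm)
print(group["type"], group["groups"][0]["type"])  # ALL ANY
```

In the real translation each leaf would first be converted to its CorelatedCriteria form before being added to the group.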

Value Sets

Circe uses the ConceptSet class to represent value sets. All referenced value sets are included inline in the CohortExpression instance. We believe this is so that descendant, mapped, and excluded concepts specified in existing OHDSI concept sets can all be resolved ahead of evaluating the cohort expression.

That said, our implementation does not make use of existing concept sets. Instead, we define a service interface used to retrieve the relevant concept sets. We have service implementations that read PhEMA value sets from CSV files, and resolve concepts using the OHDSI WebAPI. This supports using publicly accessible value sets based on standard terminologies, and decouples the implementation from the OHDSI platform.

We also have a service implementation that reads pre-resolved concept sets from a file in JSON format. This is more efficient, especially for large concept sets, since concepts must otherwise be resolved one at a time by performing a search using the WebAPI, which is an expensive operation.
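The service abstraction can be sketched in Python as a protocol with two interchangeable implementations. Everything here is hypothetical: the `ValueSetService` name, the CSV column names, the JSON layout, and the placeholder codes. The sketch also omits the WebAPI concept resolution step that the CSV-based implementation performs.

```python
import csv
import io
import json
from typing import Dict, List, Protocol


class ValueSetService(Protocol):
    def get_concepts(self, value_set_id: str) -> List[Dict[str, str]]: ...


class CsvValueSetService:
    """Reads value sets from CSV text with assumed column names."""

    def __init__(self, csv_text: str) -> None:
        self._rows = list(csv.DictReader(io.StringIO(csv_text)))

    def get_concepts(self, value_set_id: str) -> List[Dict[str, str]]:
        return [
            {"code": row["code"], "system": row["system"]}
            for row in self._rows
            if row["value_set_id"] == value_set_id
        ]


class JsonValueSetService:
    """Reads pre-resolved concept sets from JSON text keyed by value set id."""

    def __init__(self, json_text: str) -> None:
        self._sets = json.loads(json_text)

    def get_concepts(self, value_set_id: str) -> List[Dict[str, str]]:
        return self._sets.get(value_set_id, [])


# Placeholder data; the codes and system name are not real terminology entries.
csv_text = "value_set_id,code,system\nvs1,A1,EXAMPLE\nvs1,A2,EXAMPLE\nvs2,B1,EXAMPLE\n"
json_text = json.dumps({"vs1": [{"code": "A1", "system": "EXAMPLE"},
                                {"code": "A2", "system": "EXAMPLE"}]})

services: List[ValueSetService] = [CsvValueSetService(csv_text), JsonValueSetService(json_text)]
for service in services:
    print(len(service.get_concepts("vs1")))  # 2
```

Because the translator depends only on the protocol, the source of value sets can be swapped without touching the translation logic.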

Alternative Approaches

The approach described above is not the only possibility. Another approach would be to extend the reference implementation of the CQL engine (or create a new implementation) to directly support the OMOP CDM as a data model. The advantage of this approach is that the CQL library author would have full access to the expressiveness of the CQL language, and could write queries of any type, for any purpose, not just phenotyping.

The downside of this approach is that CQL libraries written against this data model would then be tied to the OHDSI platform, and would not be cross-platform, as in the current implementation. Further, the full set of OHDSI tools for cohort analysis would no longer be available to the user. Importantly, in the current approach, a user can look at the generated cohort definition in the existing Atlas interface to manually inspect whether or not the logic is correct, which would not be possible in a pure CQL data provider implementation.