Understanding Circe-be Logic Through Capr for Generating Complex Cohort Definitions
1 Introduction
1.1 ATLAS
Typically, we define cohorts for OHDSI studies using ATLAS. ATLAS has several benefits, in particular a clean user interface for visualizing the cohort definition we are trying to create. However, ATLAS can become tedious, particularly when we must create several cohort definitions that share a similar structure (a template). We can handle these situations by copying and pasting, but this is time consuming and can introduce errors in cohort logic.
1.2 Capr
Given the challenges of templating in ATLAS, the R package Capr (pronounced like the edible flower bud, caper) was created as a programmatic interface for defining cohort logic on OMOP data, serving as an alternative route to generating cohort definitions for OHDSI studies. The advantage of scripting cohort definitions is that we can define a template of our definition and iterate across multiple possibilities. Capr emphasizes the DRY principle in coding (“Do not Repeat Yourself”), which encourages programmers to define something once instead of multiple times. This sounds great, but it comes with a slight change in mindset when defining cohorts. To use Capr properly, users need to understand the underlying logic expressed in circe-be. Capr attempts to populate the same json structure that one would populate in ATLAS; it is essentially a backdoor to circe-be over which we have a bit more control.
1.3 circe-be
Under the hood of ATLAS lies the circe-be software, essentially a bridge from clinical concept to computational query. When users fill out a cohort definition in ATLAS they are populating a json file. Think of the json like a “Mad Lib”: you enter pieces into a structure that then formulates a coherent message. circe-be takes these instructions and translates them into a sql query that we can run against OMOP data. This is a powerful tool because it standardizes queries across the OHDSI network. To create this standardized query, circe-be builds elements of a sql script from underlying components. Some of these components are familiar (primary criteria, inclusion rules, etc.) while others are less well known (query, count, group). The purpose of this demo is to use Capr to help users understand the underlying constructs of circe-be. Understanding these constructs will improve users' ability to create complex cohort definitions in ATLAS and Capr, and will introduce the ideas behind templating in Capr.
2 Tutorial
In this tutorial we use Capr to show the circe-be structures. In particular we will demonstrate five structures: 1) Concept Set Expression (CSE), 2) Query, 3) Attribute, 4) Count and 5) Group. We provide code blocks showing how to create each circe-be structure in Capr. Each code block is also accompanied by dplyr code that expresses how the circe-be structure is constructed on its own. The idea is to show how the Capr object would be deployed once it goes through the conversion to standardized sql.
For our example we walk through the eMERGE phenotype for defining a Type 2 Diabetes (T2D) case. This is a complex algorithm with five potential pathways to define a T2D case, as shown in figure 1. To construct this full pathway we need to define the sub-components in the circe-be logic. We use Capr as a means of demonstrating each component of the circe-be semantic model and interfacing with these sub-components used to build cohort definitions. We build this cohort definition using the test CMS Synpuf database which includes the latest OMOP vocabulary used to define the logic. At the end of the tutorial we provide the full Capr code to build this complex cohort.
2.1 Concept Set Expression
The first circe-be structure is the concept set expression. This is essentially a code list used to define a clinical event of interest. The expression aspect of this structure adds relational structure to the code set, incorporating descendant logic and adding exceptions to the code list for a more refined definition. In the eMERGE algorithm, some of the paths require T2D medications in order to find a case of T2D. From the documentation we can get the list of RxNorm codes and then find the OMOP concept IDs for them. However, databases record more than just the ingredient: there can be different dosages, brands and delivery methods, and we want to count all of these variations. This can be done using a concept set expression. In the Capr code we look up the drug IDs and then check the includeDescendants toggle to add all concepts that descend from them in the hierarchy. Now when we want to look up a record of a T2D medication, we do not just look up the ingredient concept, we look at all the descendants (quite powerful 💪).
The code below is how we construct a concept set expression using Capr. The first step is to look up the concept ids in the concept table of the OMOP vocabularies and then merge this with the concept_ancestor table to find all descendants. When we run this line of code, remember that you need to establish a connection to your database to access the vocabulary tables in the defined schema.
To give further context as to what is going on here, we use dplyr to abstract the sql query that takes place behind the scenes. Again we take our list of ingredient concepts and find all the descendants through the concept_ancestor table. The OHDSI SQL generated by circe-be typically holds this result in a temp table.
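As a minimal sketch of the descendant lookup, the snippet below uses small invented tibbles in place of the real vocabulary tables, so no database connection is needed. The concept IDs and names are hypothetical; only the table and column names (`concept`, `concept_ancestor`, `ancestor_concept_id`, `descendant_concept_id`) follow the OMOP CDM.

```r
library(dplyr)

# Toy stand-ins for the OMOP vocabulary tables (IDs and names are invented).
concept <- tibble(
  concept_id   = c(1L, 11L, 12L, 2L, 21L, 3L),
  concept_name = c("ingredient A", "A 500 mg tablet", "A branded tablet",
                   "ingredient B", "B 5 mg tablet", "unrelated ingredient")
)
# concept_ancestor also lists each concept as its own descendant.
concept_ancestor <- tibble(
  ancestor_concept_id   = c(1L, 1L, 1L, 2L, 2L, 3L),
  descendant_concept_id = c(1L, 11L, 12L, 2L, 21L, 3L)
)

# Ingredient-level codes, as they might be listed in a phenotype document.
ingredient_ids <- c(1L, 2L)

# "Include descendants": keep every concept that descends from an ingredient.
cse <- concept_ancestor %>%
  filter(ancestor_concept_id %in% ingredient_ids) %>%
  transmute(concept_id = descendant_concept_id) %>%
  distinct() %>%
  inner_join(concept, by = "concept_id")

print(cse)
```

The resulting `cse` tibble plays the role of the temp table: one row per concept that should count as a "hit", whether it is the ingredient itself or one of its descendants.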
In Capr, our first step in defining a cohort is to define the CSE. This construct holds all of the codes we want to look up across the different tables in the CDM. To build a cohort definition we need to make sure our list of concepts is thorough.
2.2 Query
Recall the structure of the CDM (shown in the margin). We have a relational database, so to extract data from this format we need to merge tables using keys. For example, say we are looking for metformin users. We would merge the concept_id for metformin from the concept table to the drug_concept_id in the drug_exposure table and find the person_id of each patient who took metformin. Simply put, we are performing a type of query on the relational database.
The Capr code for a query is very simple. We need to define the domain in which to look up “hits” of the concept set. Queries in Capr are defined by the create verb followed by the name of the clinical table. For example, if we want a condition occurrence the Capr signature is createConditionOccurrence. The input of the query must be a Concept Set Expression object. Using Capr we are simply telling the circe-be engine that we want to look up a particular concept set in the designated domain.
Further we can show how this declaration would be deployed in circe-be through the code below. We join the CSE we made earlier with the drug exposure table looking for persons that have a record of the code in their patient history.
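A minimal sketch of this join, using invented rows rather than the synpuf data: the `cse` tibble stands in for the concept set built earlier, and the column names follow the OMOP `drug_exposure` table. The concept IDs and dates are hypothetical.

```r
library(dplyr)

# Concept set expression: every concept that counts as a T2D medication hit.
cse <- tibble(concept_id = c(11L, 12L, 21L))

# Toy drug_exposure table (99 is a drug outside the concept set).
drug_exposure <- tibble(
  person_id                = c(1L, 1L, 2L, 3L, 4L),
  drug_concept_id          = c(11L, 99L, 21L, 99L, 12L),
  drug_exposure_start_date = as.Date(c("2010-01-05", "2010-02-01", "2011-03-10",
                                       "2012-06-01", "2009-11-20"))
)

# The query: keep only exposure records whose concept is in the CSE.
query_hits <- drug_exposure %>%
  inner_join(cse, by = c("drug_concept_id" = "concept_id"))

# Persons with at least one record of a concept in the set.
persons_with_hit <- query_hits %>% distinct(person_id)

print(persons_with_hit)
```

Person 3 drops out because their only exposure is to a drug outside the concept set.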
2.3 Attribute
Closely associated with a query is an attribute. An attribute modifies the query by keeping only the persons whose records have a particular value in another column of the clinical table. All attributes are based on columns in the clinical domain table or in the person table. For example, in the T2D algorithm we want measurements where the random glucose has a value greater than 200 mg/dL (which designates an abnormal measure). In this case we would look up all persons with a random glucose concept and then search the value_as_number column to see if the listed value is greater than 200. When constructing cohort definitions, remember that the attribute complements the query.
Using Capr, an attribute object is first defined outside the query and then placed in a list within the query command. In the code block below we create an object called value200, which holds an attribute to modify the query. This attribute is called an OpAttribute. We are deploying a mathematical operator or inequality to describe the logic of interest in our query. Other attributes include a ConceptAttribute and a LogicAttribute.
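The effect of such an operator attribute can be sketched on invented data. This is plain dplyr, not Capr: the concept ID 501 and the lab values are hypothetical, and only the column names follow the OMOP `measurement` table.

```r
library(dplyr)

# Hypothetical concept set holding a single random glucose concept.
glucose_cse <- tibble(concept_id = 501L)

# Toy measurement table (777 is some other lab test).
measurement <- tibble(
  person_id              = c(1L, 2L, 3L, 4L),
  measurement_concept_id = c(501L, 501L, 501L, 777L),
  value_as_number        = c(250, 180, 200, 300)
)

# Query (join to the concept set) plus operator attribute (value > 200).
abnormal_glucose <- measurement %>%
  inner_join(glucose_cse, by = c("measurement_concept_id" = "concept_id")) %>%
  filter(value_as_number > 200)

print(abnormal_glucose)
```

Only person 1 remains: person 3 sits exactly at 200, which does not satisfy a strict "greater than", and person 4 has a high value for a different concept.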
The Capr example is a tad tricky to reproduce in synpuf because lab values are limited, so we show an example using a gender attribute instead. Again, the attribute is a modifier of the query: we filter the matching persons by the existence of another value. For gender we have a concept ID for female (8532), so to find females who have taken a T2D medication we first do a filter join to find the persons with a “hit” from the CSE and then join to the person table by person_id. From this set of persons we keep only those with a concept ID of 8532 in the gender_concept_id column of the person table. As you can see, the attribute is additional filtering logic that modifies the query.
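A minimal sketch of this gender attribute on toy tibbles (the drug concept IDs are invented; 8532 and 8507 are the OMOP gender concepts for female and male):

```r
library(dplyr)

cse <- tibble(concept_id = c(11L, 12L, 21L))

drug_exposure <- tibble(
  person_id       = c(1L, 1L, 2L, 3L, 4L),
  drug_concept_id = c(11L, 99L, 21L, 99L, 12L)
)

person <- tibble(
  person_id         = 1:4,
  gender_concept_id = c(8532L, 8507L, 8532L, 8532L)
)

female_t2d_rx <- drug_exposure %>%
  # Filter join: keep exposure records with a hit in the concept set.
  semi_join(cse, by = c("drug_concept_id" = "concept_id")) %>%
  # Bring in person-level columns, then apply the attribute.
  inner_join(person, by = "person_id") %>%
  filter(gender_concept_id == 8532L) %>%
  distinct(person_id)

print(female_t2d_rx)
```

Person 2 has a medication hit but is male, and person 3 is female but has no hit, so only persons 1 and 4 remain.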
2.4 Count
So far we have not incorporated time into our queries, only the existence of a code in a table. However, timing is vital when determining a cohort of patients. We need to ensure that, of the initial set of patients, we restrict to people who have experienced a medical event at some plausible point in their patient history. For example, if we want persons with T2D, we want to ensure they do not have prior type 1 diabetes. This is the essence of the circe-be count structure: we enumerate patients based on the temporal occurrence of a medical event. Counts are typically only defined in Additional Criteria and Inclusion Rules because we need the occurrence of a prior event in order to define a window in the patient history over which to enumerate.
In Capr, counts require: 1) a query, 2) a count and 3) a timeline. The timeline sets the window of observation relative to another event; in circe-be this is typically relative to the primary criteria (unless we are building a correlated criteria attribute). In the example below we define two counts: 1) at least 1 occurrence of a T2D medication and 2) no occurrence of a T2D medication. Note these are for different pathways in the eMERGE T2D algorithm. Relative to the primary criteria we define our window as all time before and no time after. Now we can begin to enumerate 🧮! We want to observe x occurrences of a query, where x is a value that we pair with an inequality (for example at least, at most, or exactly). If we want at least 1 instance we follow the first example in the Capr code, and if we want no occurrence we follow the second example in the Capr code.
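The two counts can be sketched in dplyr on invented data. This is an illustration of the idea, not Capr code or the SQL that circe-be emits: `index_event` is a hypothetical stand-in for the primary criteria, and `rx_hits` for the result of a T2D medication query.

```r
library(dplyr)

# One index event per person (stand-in for the primary criteria).
index_event <- tibble(
  person_id  = c(1L, 2L, 3L),
  index_date = as.Date(c("2012-01-01", "2012-01-01", "2012-01-01"))
)

# Medication hits from a query; person 3 has none.
rx_hits <- tibble(
  person_id                = c(1L, 1L, 2L),
  drug_exposure_start_date = as.Date(c("2010-05-01", "2011-07-15", "2013-02-01"))
)

# Window: all time before the index date, no time after.
# The left join keeps persons with zero hits so they can be counted as 0.
occurrences <- index_event %>%
  left_join(rx_hits, by = "person_id") %>%
  group_by(person_id) %>%
  summarise(
    n_before = sum(!is.na(drug_exposure_start_date) &
                     drug_exposure_start_date <= index_date),
    .groups = "drop"
  )

at_least_one  <- occurrences %>% filter(n_before >= 1)
no_occurrence <- occurrences %>% filter(n_before == 0)

print(occurrences)
```

Note the design choice of a left join: an inner join would silently drop person 3, who has no medication record at all, and the "no occurrence" count would then miss exactly the people it is meant to find.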
So if we want to create an inclusion rule, we need to understand the primary criteria before we build any rule. Next we define the window of time relative to this index event. Then we create a query for the medical event we want to observe in this window. Finally we define how many times we must observe this event in the patient history for the subject to be included or excluded.
An example of how circe-be deploys a count construct can be seen in the naive example below. We want to include people in the cohort if they have experienced an exposure to a T2D medication between 365 days and 1 day prior to a T2D diagnosis. As we can see, the idea of a count is to enumerate an event temporally, relative to some prior event. Counts are usually used within additional criteria and inclusion rules, where the temporal bounds are set by the primary criteria.
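A naive sketch of this bounded window on toy data (all rows are invented; `t2d_dx` is a hypothetical stand-in for the T2D diagnosis index event and `rx_hits` for the medication query):

```r
library(dplyr)

t2d_dx <- tibble(
  person_id  = 1:3,
  index_date = as.Date(c("2012-06-01", "2012-06-01", "2012-06-01"))
)

rx_hits <- tibble(
  person_id                = c(1L, 2L, 3L),
  drug_exposure_start_date = as.Date(c("2012-01-15", "2010-01-15", "2012-06-01"))
)

# Keep medication records that start 365 to 1 days before the diagnosis.
in_window <- t2d_dx %>%
  inner_join(rx_hits, by = "person_id") %>%
  mutate(days_before = as.integer(index_date - drug_exposure_start_date)) %>%
  filter(days_before >= 1, days_before <= 365)

# Count records per person and require at least 1 occurrence.
included <- in_window %>%
  count(person_id, name = "n_rx") %>%
  filter(n_rx >= 1)

print(included)
```

Person 2's exposure is more than a year before the diagnosis and person 3's falls on the diagnosis date itself, so only person 1 satisfies the count.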
2.5 Group
Groups are the most complex, yet most powerful, structure in the underlying circe-be semantic model. A group bundles counts and other groups together into a single piece of logic that determines whether a person is added to or omitted from a cohort. The eMERGE T2D algorithm offers excellent examples of a group. The first path towards a T2D case is no occurrence of a T2D diagnosis, at least 1 T2D medication and at least 1 abnormal lab measurement. The patient needs to pass all three of these rules in order to be added to the cohort. Interestingly, in this example we are using two counts and a group. Group objects in circe-be can hold other groups 🤯! The T2D algorithm defines abnormal labs as any one of: 1) random glucose \(> 200\) mg/dL, 2) HbA1c \(\geq 6.5\%\) and 3) fasting glucose \(\geq 125\) mg/dL. After defining this group we then need to bundle it with the count substructures for no occurrence of a T2D diagnosis and at least 1 T2D medication. The Capr code below shows how to build this structure from start to finish.
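The nesting of this first pathway can be sketched with plain logical expressions, independent of Capr. The sketch assumes, for simplicity, that each person has already been reduced to one row holding a diagnosis count, a medication count and a single value per lab; the column names and values are invented for illustration.

```r
library(dplyr)

# One row per person: counts and lab values are made up.
person_summary <- tibble(
  person_id       = 1:4,
  n_t2d_dx        = c(0L, 0L, 1L, 0L),
  n_t2d_rx        = c(2L, 1L, 3L, 0L),
  random_glucose  = c(250, 150, 300, 260),  # mg/dL
  hba1c           = c(5.9, 6.0, 7.2, 6.8),  # percent
  fasting_glucose = c(110, 120, 140, 130)   # mg/dL
)

path1 <- person_summary %>%
  mutate(
    # Inner group: ANY of the three abnormal lab criteria.
    abnormal_lab = random_glucose > 200 | hba1c >= 6.5 | fasting_glucose >= 125,
    # Outer group: ALL of no diagnosis, at least 1 medication, abnormal lab.
    in_path1 = n_t2d_dx == 0 & n_t2d_rx >= 1 & abnormal_lab
  ) %>%
  filter(in_path1)

print(path1$person_id)
```

The inner "any" group collapses to a single TRUE/FALSE value, which the outer "all" group then treats like any other criterion; this is what it means for a group to hold another group.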
Again, the group example is hard to depict in synpuf data, so we simplify it to provide a {dplyr} representation. Suppose we have two count objects: persons who take T2D medications and persons who take ACE inhibitors. Of the people who are diagnosed with T2D, we want to keep those who have taken both of these medications. A group allows us to combine the logic of both of these counts using a join statement, as shown below.
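A minimal sketch of this simplified group, on invented per-person counts (the tibble and column names are hypothetical). An "all" group corresponds to inner joins; an "any" variant, added here for contrast, keeps everyone with a left join and then tests either count.

```r
library(dplyr)

# Persons with a T2D diagnosis (the index population).
t2d_dx <- tibble(person_id = 1:5)

# Two count objects: occurrences per person of each medication class.
t2d_rx_count <- tibble(person_id = c(1L, 2L, 4L), n_t2d_rx = c(2L, 1L, 1L))
ace_count    <- tibble(person_id = c(1L, 3L, 4L), n_ace    = c(1L, 2L, 3L))

# Group of type ALL: a person must satisfy both counts.
group_all <- t2d_dx %>%
  inner_join(t2d_rx_count, by = "person_id") %>%
  inner_join(ace_count, by = "person_id") %>%
  filter(n_t2d_rx >= 1, n_ace >= 1)

# Group of type ANY: a person must satisfy at least one count.
group_any <- t2d_dx %>%
  left_join(t2d_rx_count, by = "person_id") %>%
  left_join(ace_count, by = "person_id") %>%
  filter(coalesce(n_t2d_rx, 0L) >= 1 | coalesce(n_ace, 0L) >= 1)

print(group_all$person_id)
print(group_any$person_id)
```

Persons 1 and 4 pass the "all" group, persons 1 through 4 pass the "any" group, and person 5, who has neither medication, passes neither.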
3 eMERGE T2D
In the tutorial above we defined the 5 essential circe-be substructures that are needed to build the elements of a cohort definition. Capr builds cohort definitions from the bottom up, so we need to understand these sub-structures to create complex cohort definitions effectively. Our understanding of these sub-structures improves the way we build cohorts and create templates in Capr. The following is how one would create the full T2D algorithm from eMERGE. While this is a long code block, it shows how the fundamental pieces can be created and deployed in different iterations to formulate complex algorithms.