Automated Terminology Harmonization, Extraction and Normalization for Analytics - ATHENA
Preliminary analysis
Application Design
I would build the logic slightly differently:
1. Concepts.
- We have one authoritative source: SNOMED International in combination with SNOMED UK. Other components might follow later (DM+D, other country-specific versions).
- We get a stream of concepts from them:
- We get a stream of concept-to-concept relationships
- We get a stream of update relationships (inactive to active); only one must exist per deprecated concept
Makes sense?
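The three streams above could be modelled roughly as follows. This is only a sketch: the class and field names (`Concept`, `Relationship`, `Replacement`, `index_replacements`) are made up for illustration and are not taken from SNOMED or OMOP. The helper rejects a second update relationship for the same deprecated concept; checking that every deprecated concept actually has one is left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Concept:
    concept_id: str
    name: str
    active: bool = True


@dataclass(frozen=True)
class Relationship:
    source_id: str
    target_id: str
    kind: str  # free-text relationship type, hypothetical field


@dataclass(frozen=True)
class Replacement:
    deprecated_id: str   # the inactive concept
    replacement_id: str  # the active concept that supersedes it


def index_replacements(stream: Iterable[Replacement]) -> Dict[str, str]:
    """Map each deprecated concept id to its single replacement id.

    Raises ValueError if a deprecated concept appears more than once.
    """
    result: Dict[str, str] = {}
    for item in stream:
        if item.deprecated_id in result:
            raise ValueError(
                f"more than one replacement for {item.deprecated_id}"
            )
        result[item.deprecated_id] = item.replacement_id
    return result
```

Failing loudly on a duplicate, rather than silently keeping the first or last entry, makes a bad release file visible at import time.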
I am not sure we need UMLS for them. UMLS is really only a reformatting of SNOMED. There isn't much going on. Unless you found something in http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/SNOMEDCT_US. Take a look.
The source SNOMED vocabulary can be acquired from SNOMED. SNOMED is also included in UMLS. Both of these resources are suggested for use in the import process.
A local copy of the vocabulary will be used to extract the concepts, and the UMLS web API will be used for additional concept analysis. The advantages of this approach are:
We'll start from the basic import process, which will give additional knowledge about the process itself.
Each concept in the source dictionary can be:
In the current scope, identification means that OMOP and UMLS already have information about the current Concept. Once the Concept is identified, it can be validated. Each Concept is described by its type, its set of attributes and its relations with other Concepts. During validation, we must compare the Source and UMLS descriptions of the Concept with the OMOP one. If the translation can be performed in both directions without violating data integrity or validity, we can say that the Concept is valid.
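One way to read "translation in both directions" is as a round-trip check: translate the Source description to OMOP and back, and see whether the original comes out unchanged. The sketch below assumes that reading. `Description`, `round_trips` and the toy type mapping are hypothetical and stand in for whatever the real translation logic turns out to be.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet, Tuple


@dataclass(frozen=True)
class Description:
    concept_type: str
    attributes: FrozenSet[Tuple[str, str]]  # (attribute name, value)
    relations: FrozenSet[Tuple[str, str]]   # (relationship kind, target id)


Translator = Callable[[Description], Description]


def round_trips(source: Description,
                to_omop: Translator,
                from_omop: Translator) -> bool:
    """True if Source -> OMOP -> Source reproduces the original description."""
    return from_omop(to_omop(source)) == source


# Toy translators, purely illustrative: only the type name is mapped.
TYPE_TO_OMOP = {"finding": "Condition"}
TYPE_FROM_OMOP = {v: k for k, v in TYPE_TO_OMOP.items()}


def toy_to_omop(d: Description) -> Description:
    return replace(d, concept_type=TYPE_TO_OMOP[d.concept_type])


def toy_from_omop(d: Description) -> Description:
    return replace(d, concept_type=TYPE_FROM_OMOP[d.concept_type])


def lossy_to_omop(d: Description) -> Description:
    # Drops all attributes, so the round trip cannot restore them.
    return replace(toy_to_omop(d), attributes=frozenset())
```

With these toy functions, a description that carries attributes passes `round_trips` with `toy_to_omop` but fails with `lossy_to_omop`, which is the kind of integrity violation the validation step is meant to catch.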
To identify the Concept we must:
After these checks we will have the following data:
Records processed | X |
Records recognized only by OMOP | Y |
Records recognized only by UMLS | Z |
Records recognized by OMOP and UMLS | N |
Records not recognized | M |
From this table we can say that:
N - stable records, recognized by both systems; most likely they are valid.
Z - missing records that should be added to OMOP. We can use UMLS data for validation purposes.
Y - this data should be inspected. There might be invalid records, or we are importing a newer version of SNOMED than the one included in UMLS.
M - new records that have just been added in the new version of SNOMED. We need to validate them using the source description.
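The N / Y / Z / M split can be sketched as a simple set-membership test. This assumes that "recognized" means the concept id is present in the corresponding system and that the source ids are unique; the function names `classify` and `summarize` are made up for illustration.

```python
from typing import Dict, Iterable, Set


def classify(source_ids: Iterable[str],
             omop_ids: Set[str],
             umls_ids: Set[str]) -> Dict[str, Set[str]]:
    """Sort source concept ids into the N, Y, Z and M subsets."""
    buckets: Dict[str, Set[str]] = {"N": set(), "Y": set(), "Z": set(), "M": set()}
    for cid in source_ids:
        in_omop = cid in omop_ids
        in_umls = cid in umls_ids
        if in_omop and in_umls:
            buckets["N"].add(cid)   # recognized by both
        elif in_omop:
            buckets["Y"].add(cid)   # recognized only by OMOP
        elif in_umls:
            buckets["Z"].add(cid)   # recognized only by UMLS
        else:
            buckets["M"].add(cid)   # not recognized
    return buckets


def summarize(buckets: Dict[str, Set[str]]) -> Dict[str, int]:
    """Counts for the summary table; X is the total number processed."""
    counts = {key: len(ids) for key, ids in buckets.items()}
    counts["X"] = sum(counts.values())
    return counts
```

Keeping the ids in each subset, rather than only the counts, means any subset can later be listed or written out on request.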
We must also be able to view each of these subsets as a table, or export it to a file, at the user's request.
This process lets us ensure that OMOP describes the Concept exactly as the Source vocabulary does. We can also use the UMLS API for additional checks. First we should determine the Concept's type. It can be:
Once the type of the Concept has been determined, we can perform the additional checks that are specific to each type.
If the current Concept is a Domain, we can verify that:
If the current Source Concept is a Relationship, we should compare it with the Relationship Mapping.