Documentation
Common Data Model (CDM)
Convert Database to CDM (ETL)
Tool Specific Documentation
Common Data Model (CDM)
Convert Database to CDM (ETL)
Tool Specific Documentation
This is an old revision of the document!
LAERTES provides an evidence base of a wide variety of sources with information relevant for assessing associations between drugs and health outcomes of interest. Please see the following publication for a description of the goal and use cases driving the system:
Boyce. RD., Ryan. PB., Noren. N., et al., Bridging islands of information to establish an integrated knowledge base of drugs and health outcomes of interest. Drug Safety. 2014. Volume 37, Issue 8 (2014), Page 557-567. DOI: 10.1007/s40264-014-0189-0, PubMed PMID: 24985530. PMCID: PMC4134480.http://link.springer.com/article/10.1007%2Fs40264-014-0189-0.
Conceptually, the data model is a hybrid relational/RDF schema where the RDF component applies the Open Data Annotation (OA) model (http://www.openannotation.org/spec/core/) to represent specific evidence items for or against an association between a drug and a health outcome of interest (HOIs) from any of a variety of sources, and the relational component provides a summary and index for drug-HOI evidence represented in the RDF component.
The relational component extends the OHDSI Standard Vocabulary, enumerates the evidence data sources, and provides counts for the records associated with every drug-HOI pair in each source noting the modality of the association (i.e., a positive or negative association). Associated with the counts from a given source is a URL that can be used in the RDF component to pull back a list of OA records typed using custom OHDSI evidence types. Each OA record provides data that client programs can use to render data about the source of the evidence (the “target”) and the semantic tags used to identify the record as a source of evidence for a drug-HOI pair (the “body” or “bodies”).
This model decouples the data sources from the various copies of the sources that might have been processed in many different ways. It also decouples what can be said about and evidence item (i.e., the semantic tags) from the information artifact. All of this allows for greater flexibility with respect to inclusion of sources and analysis. Moreover, the 'drill down' use case can be supported by following URIs provided by in the 'evidence target' table to Linked Data resources.
This model will be developed to support all of the sources here: https://docs.google.com/document/d/13UwoqjPyqKr-MLpcflzNs8WD5Za4Ynqnue1xWU2cgaA/edit#
NOTE: The OHDSI standard vocabulary may require extensions from the following terminologies:
This wiki page provides a high level guide or “cookbook” for the ETL process that the LAERTES team follows to periodically update the LAERTES database when new source data sets are released.
This ETL process will be improved and extended with new data sources over time. It is somewhat complex, it has very specific pre-requisites, it requires a good knowledge of the source data sets and it includes some manual steps.
This ETL is not intended to be executed by typical consumers of LAERTES data, instead they should use the following approaches:
Internet access is required to download source dataset files from external websites and to download the ETL source code and schema from github: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES
A zip file decompression program is required (e.g. 7-zip for windows or gzip/gunzip for linux)
Notes. The ETL process combines the source data into a PostgreSQL database and then the last step populates the final tables in ANSI standard DDL tables which can be created and loaded in the three supported OHDSI DBMSs: Oracle, PostgreSQL and SQL Server.
* Note. As a general principle to be followed when running the ETL scripts You will get the best performance by first running the create table statements, then loading the tables and then running the create index statements. The reason is that it is faster to add an index to a populated table than to load data into a table that already has an existing index.
Notes.
SemMedDB, a repository of semantic predications (subject-predicate-object triples) extracted from the entire set of PubMed citations is only available as a MySQL database dump file. Therefore a MySQL database is required to load the data for ETL processing. The SemMedDB data is relatively large. (80 to 100 GBs).
Notes.
LAERTES is a hybrid relational/RDF schema model. The RDF schema is created and maintained in a Virtuoso server.
* Note. instructions on faceted browser installation: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtFacetBrowserInstallConfig
PostgreSQL DBMS server
Virtuoso server
MySQL DBMS server
ETL and URL shortening service server
This table shows the web sites to visit and the files to download for this ETL process.
Data Feed | Website | Website Download Page | Example Download File Name | Notes |
---|---|---|---|---|
EU SPC ADRs | PROTECT Website | http://www.imi-protect.eu/adverseDrugReactions.shtml | http://www.imi-protect.eu/documents/Finalrepository_2Sep2014_DLP30June2013.xls | A structured Excel database of all adverse drug reactions (ADRs) listed in section 4.8 of the Summary of Product Characteristics (SPC) of medicinal products authorised in the EU according to the centralised procedure |
PubMed/MEDLINE using MeSH tags ADEs | NLM ftp server | NLM ftp server | Releases folder, MedlineXmlToDatabase*.zip files | MEDLINE records for indexed literature reporting adverse drug events |
SPLICER extracted US SPL ADEs | SPLICER Data set provided to OHDSI by Regenstrief | SPLICER Data set provided to OHDSI by Regenstrief | SPLICER Data set provided to OHDSI by Regenstrief | Structured Product Label Information Coder and Extractor (SPLICER) natural language processing to extract ADE data from FDA mandated Structured Product Labels (SPL) |
NLM's SemMedDB (semantic representation of PubMed content) | http://skr3.nlm.nih.gov/ | http://skr3.nlm.nih.gov/ | semmedVER24_2_WHOLEDB_to06302014.sql | NLM's SemMedDB (semantic representation of PubMed content) |
Use a web browser or the command-line program wget to download the following zip file containing all the ETL scripts:
https://github.com/OHDSI/KnowledgeBase/archive/master.zip unzip the master.zip file using 7-Zip on Windows or unzip on Linux.
This will result in the following directory structure:
/KnowledgeBase-master/LAERTES
There is one sub-directory containing the ETL scripts for each data feed, plus the Schema, URLShortener and terminology-mappings sub-directories explained below.
This directory contains the zipped source code to deploy and run a local URL shortening service on a local server for resolving URLs
Download, the UML datasets from the UMLS website. Decompress the gzipped files. Run the ETL script to create the UML database staging tables schema and load the datasets into the UMLS staging tables.
Note. There are multiple sets of extracted files. Each set of files must be concatenated together to create a single file.
A. Linux commands to unzip and concatenate files (run in shell):
cd /VOCABETL/Vocabulary-v5.0-master/UMLS/YYYYAB\META where YYYY is the year (e.g. 2014). gunzip *-meta.nlm gunzip *.gz --cat MRCONSO.RRF.aa MRCONSO.RRF.ab > MRCONSO.RRF --cat MRHIER.RRF.aa MRHIER.RRF.ab > MRHIER.RRF --cat MRREL.RRF.aa MRREL.RRF.ab MRREL.RRF.ac MRREL.RRF.ad > MRREL.RRF --cat MRSAT.RRF.aa MRSAT.RRF.ab MRSAT.RRF.ac MRSAT.RRF.ad > MRSAT.RRF
B. Alternative commands to run for Windows (run in powershell):
cd C:\VOCABETL\Vocabulary-v5.0-master\UMLS\2014AB\META where YYYY is the year (e.g. 2014). unzip *-meta.nlm gzip -d *.gz MRCONSO.RRF.aa.gz MRCONSO.RRF.ab.gz MRHIER.RRF.aa.gz MRHIER.RRF.ab.gz MRREL.RRF.aa.gz MRREL.RRF.ab.gz MRREL.RRF.ac.gz MRREL.RRF.ad.gz MRSAT.RRF.aa.gz MRSAT.RRF.ab.gz MRSAT.RRF.ac.gz MRSAT.RRF.ad.gz cmd /c copy MRCONSO.RRF.aa + MRCONSO.RRF.ab MRCONSO.RRF cmd /c copy MRHIER.RRF.aa + MRHIER.RRF.ab MRHIER.RRF cmd /c copy MRREL.RRF.aa + MRREL.RRF.ab + MRREL.RRF.ac + MRREL.RRF.ad MRREL.RRF cmd /c copy MRSAT.RRF.aa +MRSAT.RRF.ab + MRSAT.RRF.ac + MRSAT.RRF.ad MRSAT.RRF
Run the following commands in a linux shell or Windows command-line to load each file into the staging table in PRODV5 with the same name.
Note. Each file has a separate control file which is in the vocabETL/vocabulary-v5.0-master/UML sub-directory and the full path to the data sets to be loaded must be specified in the command (replace <password> with the database password of the PRODV5 user in the sqlldr statements below).
Note. You can review and then ignore any single .bad record for each file which just contains the Windows EOF file char added by the above batch copy statements.
sqlldr 'PRODV5/<password>' CONTROL=MRCONSO.ctl LOG=MRCONSO.log BAD=MRCONSO.bad sqlldr 'PRODV5/<password>' CONTROL=MRHIER.ctl LOG=MRHIER.log BAD=MRHIER.bad sqlldr 'PRODV5/<password>' CONTROL=MRMAP.ctl LOG=MRMAP.log BAD=MRMAP.bad sqlldr 'PRODV5/<password>' CONTROL=MRREL.ctl LOG=MRREL.log BAD=MRREL.bad sqlldr 'PRODV5/<password>' CONTROL=MRSAT.ctl LOG=MRSAT.log BAD=MRSAT.bad sqlldr 'PRODV5/<password>' CONTROL=MRSMAP.ctl LOG=MRSMAP.log BAD=MRSMAP.bad
Download, the SNOMED datasets from the website. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
alter table concept_stage modify (concept_code varchar2(50));
The international SNOMED-CT and the UK SNOMED-CT will be loaded into separate staging tables and then unioned together in views ready for processing in the load_stage.sql step below.
Staging table load statements
sqlldr PRODV5/<password> CONTROL=SCT2_CONCEPT_FULL_INT.ctl LOG=SCT2_CONCEPT_FULL_INT.log BAD=SCT2_CONCEPT_FULL_INT.bad sqlldr PRODV5/<password> CONTROL=SCT2_DESC_FULL_EN_INT.ctl LOG=SCT2_DESC_FULL_EN_INT.log BAD=SCT2_DESC_FULL_EN_INT.bad sqlldr PRODV5/<password> CONTROL=SCT2_RELA_FULL_INT.ctl LOG=SCT2_RELA_FULL_INT.log BAD=SCT2_RELA_FULL_INT.bad sqlldr PRODV5/<password> CONTROL=SCT2_CONCEPT_FULL_UK.ctl LOG=SCT2_CONCEPT_FULL_INT.log BAD=SCT2_CONCEPT_FULL_UK.bad sqlldr PRODV5/<password> CONTROL=SCT2_DESC_FULL_UK.ctl LOG=SCT2_DESC_FULL_UK.log BAD=SCT2_DESC_FULL_UK.bad sqlldr PRODV5/<password> CONTROL=SCT2_RELA_FULL_UK.ctl LOG=SCT2_RELA_FULL_UK.log BAD=SCT2_RELA_FULL_UK.bad sqlldr PRODV5/<password> CONTROL=der2_cRefset_AssRefFull_INT.ctl LOG=der2_cRefset_AssRefFull_INT.log BAD=der2_cRefset_AssRefFull_INT.bad sqlldr PRODV5/<password> CONTROL=der2_cRefset_AssRefFull_UK.ctl LOG=der2_cRefset_AssRefFull_UK.log BAD=der2_cRefset_AssRefFull_UK.bad
Download, the RxNorm datasets from the website. Decompress the gzipped files. Run the ETL script to create database staging tables schema and load the datasets.
Update the update vocabulary latest_update date statement in create_source_tables.sql to the correct date for the file you are loading. Make this change in load_stage.sql for each data feed.
This ETL step processes RxNorm, ATC, NDFRT, VA Product , VA Class and ATC vocabularies.
At step 15 of create_source_tables.sql you must run the generic_update_sql script (from “working” directory) to load the RxNorm concepts from the concept staging tables. You also need to run that same script after the load_stage.sql script finishes.
Download the HCPCS datasets from the website. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
load_stage.sql Steps 9 and 10 are completely manual – a select statement to generate the set of Medical Coder rows which needs to be exported as a file followed by an instruction “Append resulting file from Medical Coder (in concept_relationship_stage format) to concept_relationship_stage”. In the future this could be converted into an insert append statement. Note. The original HCPCS HCPYYYY_CONTR_ANWEB_V2.txt file is a fixed field format file that will not load with the ANWEB_V2.ctl file which requires a comma separated file so see details below on creating a csv file to use for the load.
Staging table load statements
sqlldr PRODV5/<password> CONTROL=ANWEB_V2.ctl DATA=HCPC2015_CONTR_ANWEB_v2.tsv LOG=ANWEB_V2.log BAD=ANWEB_V2.bad
Download, the ICD9CM datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
Staging table load statements
sqlldr PRODV5/<password> CONTROL=CMS32_DESC_LONG_DX.ctl LOG=CMS32_DESC_LONG_DX.log BAD=CMS32_DESC_LONG_DX.bad sqlldr PRODV5/<password> CONTROL=CMS32_DESC_SHORT_DX.ctl LOG=CMS32_DESC_SHORT_DX.log BAD=CMS32_DESC_SHORT_DX.bad
Download, the ICD9Proc datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
load_stage.sql Steps 8 and 9 are completely manual – a select statement to generate the set of Medical Coder rows which need to be exported as a file called 'concept_relationship_manual9cm'. In the future these two steps could be converted into an insert append sub-select statement.
Staging table load statements
sqlldr PRODV5/<password> CONTROL=CMS32_DESC_LONG_SG.ctl LOG=CMS32_DESC_LONG_SG.log BAD=CMS32_DESC_LONG_SG.bad sqlldr PRODV5/<password> CONTROL=CMS32_DESC_SHORT_SG.ctl LOG=CMS32_DESC_SHORT_SG.log BAD=CMS32_DESC_SHORT_SG.bad
Download, the LOINC datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
Add the missing create table statement for LOINC_HIERARCHY and add missing columns to LOINC_ANSWERS create statement in create_source_tables.sql, The added/update two tables are shown below:
CREATE TABLE LOINC_HIERARCHY ( PATH_TO_ROOT VARCHAR2(4000), SEQUENCE NUMBER(38,0), IMMEDIATE_PARENT VARCHAR2(255), CODE VARCHAR2(255), CODE_TEXT VARCHAR2(255) ); CREATE TABLE LOINC ( LOINC_NUM VARCHAR2(10), COMPONENT VARCHAR2(255), PROPERTY VARCHAR2(30), TIME_ASPCT VARCHAR2(15), SYSTEM VARCHAR2(100), SCALE_TYP VARCHAR2(30), METHOD_TYP VARCHAR2(50), CLASS VARCHAR2(20), SOURCE VARCHAR2(8), DATE_LAST_CHANGED DATE, CHNG_TYPE VARCHAR2(3), COMMENTS CLOB, STATUS VARCHAR2(11), CONSUMER_NAME VARCHAR2(255), MOLAR_MASS VARCHAR2(13), CLASSTYPE VARCHAR2(20), FORMULA VARCHAR2(255), SPECIES VARCHAR2(20), EXMPL_ANSWERS CLOB, ACSSYM CLOB, BASE_NAME VARCHAR2(50), NAACCR_ID VARCHAR2(20), CODE_TABLE VARCHAR2(10), SURVEY_QUEST_TEXT CLOB, SURVEY_QUEST_SRC VARCHAR2(50), UNITSREQUIRED VARCHAR2(1), SUBMITTED_UNITS VARCHAR2(30), RELATEDNAMES2 CLOB, SHORTNAME VARCHAR2(40), ORDER_OBS VARCHAR2(15), CDISC_COMMON_TESTS VARCHAR2(1), HL7_FIELD_SUBFIELD_ID VARCHAR2(50), EXTERNAL_COPYRIGHT_NOTICE CLOB, EXAMPLE_UNITS VARCHAR2(255), LONG_COMMON_NAME VARCHAR2(255), HL7_V2_DATATYPE VARCHAR2(255), HL7_V3_DATATYPE VARCHAR2(255), CURATED_RANGE_AND_UNITS CLOB, DOCUMENT_SECTION VARCHAR2(255), EXAMPLE_UCUM_UNITS VARCHAR2(255), EXAMPLE_SI_UCUM_UNITS VARCHAR2(255), STATUS_REASON VARCHAR2(9), STATUS_TEXT CLOB, CHANGE_REASON_PUBLIC CLOB, COMMON_TEST_RANK VARCHAR2(20), COMMON_ORDER_RANK VARCHAR2(20), COMMON_SI_TEST_RANK VARCHAR2(20), HL7_ATTACHMENT_STRUCTURE VARCHAR2(15) );
Staging table load statements
sqlldr PRODV5<password> CONTROL=LOINC.ctl LOG=LOINC.log BAD=LOINC.bad sqlldr PRODV5/<password> CONTROL=MAP_TO.ctl LOG=MAP_TO.log BAD=MAP_TO.bad sqlldr PRODV5/<password> CONTROL=SOURCE_ORGANIZATION.ctl LOG=SOURCE_ORGANIZATION.log BAD=SOURCE_ORGANIZATION.bad sqlldr PRODV5/<password> CONTROL=LOINC_HIERARCHY.ctl LOG=LOINC_HIERARCHY.log BAD=LOINC_HIERARCHY.bad sqlldr PRODV5/<password> CONTROL=LOINC_CLASS.ctl LOG=LOINC_CLASS.log BAD=LOINC_CLASS.bad sqlldr PRODV5/<password> CONTROL=LOINC_ANSWERS.ctl LOG=LOINC_ANSWERS.log BAD=LOINC_ANSWERS.bad
Download, the MEDDRA datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
You must have a MedDRA license to download the data set. The processing of this data set follows the same approach as the other data feeds.
Download, the NDC-SPL datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
This data feed loads the NDC vocabulary and the SPL vocabularies.
Staging table load statements
sqlldr PRODV5/<password> CONTROL=PRODUCT.ctl LOG=PRODUCT.log BAD=PRODUCT.bad
Download, the READ code datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
staging table load statements
sqlldr PRODV5/<password> CONTROL=Keyv2.ctl LOG=Keyv2.log BAD=Keyv2.bad sqlldr PRODV5/<password> CONTROL=rcsctmap2_uk.ctl LOG=rcsctmap2_uk.log BAD=rcsctmap2_uk.bad
In the final_assembly directory, run the two SQL scripts to populate the concept_ancestor and drug_strength tables.
In the “working” directory, run the manual change scripts.
These scripts are collections of ad-hoc inserts / updates that have been applied over time. Before running each block of sql statements review them against the current V5 vocabulary table values to avoid potential duplicate inserts or inappropriate updates. Contact Christian Reich and the Athena vocabulary team for more info on these scripts.
In the “working” directory run a script that will update the V4 schema tables with the new data in the V5 schema tables.
The v4 tables and the v5 tables should already exist and be populated with data. The prodv4 user should have been granted read access to the prodv5 schema tables.