User Tools

Site Tools


documentation:vocabulary:data_etl

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
documentation:vocabulary:data_etl [2015/01/24 12:46]
cgreich
documentation:vocabulary:data_etl [2024/09/02 09:43] (current)
jcruz
Line 1: Line 1:
 ===== Data ETL ===== ===== Data ETL =====
-The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data: +Thank you for visiting ​the wiki! This page has moved to https://github.com/OHDSI/Vocabulary-v5.0/wiki/General-Structure,​-Download-and-Use#data-etl.
-  * ** Retrieve the source code** from the record. +
-  * **Find the Concept** that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:​cdm:​concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:​vocabulary:​icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:​vocabulary:​ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each vocabulary for specific recommendations of the lookup process. +
-  * **Map to a Standard Concept** (standard_concept='​S'​) by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:​cdm:​concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept ​you want to map and concept_id_2 ​for the destination Concept, with relationship_id='​Maps ​to'. If the source code is a local code and you have a [[documentation:cdm:​source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three. +
-  * **Write a record in the corresponding CDM table** for each destination Standard Concept based on the content of the domain_id field. Most Standardized Clinical Data Tables require a start_date and a Type Concept (see below). Some have more fields that need consideration. See the details of each table in the [[documentation:​cdm:​standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows: +
- +
-^domain_id^CDM table^Field^Comment^ +
-|Generic|Any|Any|Generic Concepts can be in any field that ends in concept_id.| +
-|Gender|[[documentation:​cdm:​person|PERSON]]|gender_concept_id| | +
-|Race|[[documentation:​cdm:​person|PERSON]]|race_concept_id| | +
-|Ethnicity|[[documentation:​cdm:​person|PERSON]]|ethnicity_concept_id| | +
-|Visit|[[documentation:​cdm:​visit_occurrence|VISIT_OCCURRENCE]]|visit_concept_id| | +
-|Procedure|[[documentation:​cdm:​procedure_occurrence|PROCEDURE_OCCURRENCE]]|procedure_concept_id| | +
-|Modifier|[[documentation:​cdm:​procedure_occurrence|PROCEDURE_OCCURRENCE]]|modifier_concept_id| | +
-|Drug|[[documentation:​cdm:​drug_exposure|DRUG_EXPOSURE]]|drug_concept_id| | +
-|Route|[[documentation:​cdm:​drug_exposure|DRUG_EXPOSURE]]|route_concept_id| | +
-|Unit|[[documentation:​cdm:​measurement|MEASUREMENT]] or [[documentation:​cdm:​observation|OBSERVATION]] or [[documentation:​cdm:​specimen|SPECIMEN]]|unit_concept_id|Units are used in different contexts. *| +
-|Device|[[documentation:​cdm:​device_exposure|DEVICE_EXPOSURE]]|device_concept_id| | +
-|Condition|[[documentation:​cdm:​condition_occurrence|CONDITION_OCCURRENCE]]|condition_concept_id| | +
-|Measurement|[[documentation:​cdm:​measurement|MEASUREMENT]]|measurement_concept_id| | +
-|Meas Value Operator|[[documentation:​cdm:​measurement|MEASUREMENT]]|operator_concept_id| | +
-|Meas Value|[[documentation:​cdm:​measurement|MEASUREMENT]]|value_as_concept_id| | +
-|Observation|[[documentation:​cdm:​observation|OBSERVATION]]|observation_concept_id| | +
-|Relationship|[[documentation:​cdm:​fact_relationship|FACT_RELATIONSHIP]]|relationship_concept_id| | +
-|Place of Service|[[documentation:​cdm:​care_site|CARE_SITE]]|place_of_service_concept_id| | +
-|Provider Specialty|[[documentation:​cdm:​provider|PROVIDER]]|specialty_concept_id| | +
-|Currency|[[documentation:​cdm:​visit_cost|VISIT_COST]] or [[documentation:​cdm:​procedure_cost|PROCEDURE_COST]] or [[documentation:​cdm:​drug_cost|DRUG_COST]] or [[documentation:​cdm:​device_cost|DEVICE_COST]]|currency_concept_id|Currency values could appear in any of the *_COST tables. *| +
-|Revenue Code|[[documentation:​cdm:​procedure_cost|PROCEDURE_COST]]|revenue_code_concept_id| | +
-|Specimen|[[documentation:​cdm:​specimen|SPECIMEN]]|specimen_concept_id| | +
-|Spec Anatomic Site|[[documentation:​cdm:​specimen|SPECIMEN]] or [[documentation:​cdm:​measurement|MEASUREMENT]] or [[documentation:​cdm:​observation|OBSERVATION]]|anatomic_site_concept_id or value_as_concept_id or value_as_concept_id|Anatomical site Concepts can be used to characterize the origin of a Specimen, or the result of a Measurement or Observation. *| +
-|Spec Disease Status|[[documentation:​cdm:​specimen|SPECIMEN]]|disease_status_concept_id| | +
- +
-  * For some Source Concepts, the Standardized Vocabularies do not provide a mapping to a Standard Concept (there is no record in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables), usually because the Source Concept is too generic or otherwise ill-defined. In these cases, the domain_id field of the Source Concept itself should be used to place the record in the right table. Some Source Concepts have combination Domains, such as "​Device/Procedure"​In these cases, write a record into each of the combination Domain table (in this case to the DEVICE_OCCURRENCE and PROCEDURE_OCCURRENCE tables). As the Standard Concept write Concept 0 (concept_id=0) into the respective *_concept_id field. +
- +
-If the Domain of the destination Concept is "​Domain"​ or "​Metadata",​ an error has occurred in the construction of the mapping table. Please report in the [[http://forums.ohdsi.org/c/cdm-builders/​l/​latest|CDM-Builder]] forum if you believe there is such an error in the Standardized Vocabularies data. +
- +
-The same is true if you find a Type Concept Domain, like "Obs Period Type", "Death Type", "Visit Type", "​Procedure Type", etc.These Type Concepts are valid concepts ​and have to be placed into the respective *_type_concept_id field of the respective CDM tables. However, they should never be introduced as part of a mapping process, as there is no equivalent information in the source ​data, because they denote the origin of where a current record is coming from. Type Concepts are hard-wired in the ETL process, because it depends on the structure of the source data how to assign Type Concepts to records. Please also report if you find a situation like that. +
- +
documentation/vocabulary/data_etl.1422103579.txt.gz · Last modified: 2015/01/24 12:46 by cgreich