This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
documentation:vocabulary:data_etl [2015/01/24 12:54] cgreich |
documentation:vocabulary:data_etl [2024/09/02 09:43] (current) jcruz |
||
---|---|---|---|
Line 1: | Line 1: | ||
===== Data ETL ===== | ===== Data ETL ===== | ||
- | The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data: | + | Thank you for visiting the wiki! This page has moved to https://github.com/OHDSI/Vocabulary-v5.0/wiki/General-Structure,-Download-and-Use#data-etl. |
- | * ** Retrieve the source code** from the record. | + | |
- | * **Find the Concept** that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:cdm:concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:vocabulary:icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:vocabulary:ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each vocabulary for specific recommendations of the lookup process. | + | |
- | * **Map to a Standard Concept** (standard_concept='S') by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:cdm:concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='Maps to'. If the source code is a local code and you have a [[documentation:cdm:source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three. | + | |
- | * **Write a record in the corresponding CDM table** for each destination Standard Concept based on the content of the domain_id field. Most Standardized Clinical Data Tables require a start_date and a Type Concept (see below). Some have more fields that need consideration. See the details of each table in the [[documentation:cdm:standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows: | + | |
- | + | ||
- | ^domain_id^CDM table^Field^Comment^ | + | |
- | |Generic|Any|Any|Generic Concepts can be in any field that ends in concept_id.| | + | |
- | |Gender|[[documentation:cdm:person|PERSON]]|gender_concept_id| | | + | |
- | |Race|[[documentation:cdm:person|PERSON]]|race_concept_id| | | + | |
- | |Ethnicity|[[documentation:cdm:person|PERSON]]|ethnicity_concept_id| | | + | |
- | |Visit|[[documentation:cdm:visit_occurrence|VISIT_OCCURRENCE]]|visit_concept_id| | | + | |
- | |Procedure|[[documentation:cdm:procedure_occurrence|PROCEDURE_OCCURRENCE]]|procedure_concept_id| | | + | |
- | |Modifier|[[documentation:cdm:procedure_occurrence|PROCEDURE_OCCURRENCE]]|modifier_concept_id| | | + | |
- | |Drug|[[documentation:cdm:drug_exposure|DRUG_EXPOSURE]]|drug_concept_id| | | + | |
- | |Route|[[documentation:cdm:drug_exposure|DRUG_EXPOSURE]]|route_concept_id| | | + | |
- | |Unit|[[documentation:cdm:measurement|MEASUREMENT]] or [[documentation:cdm:observation|OBSERVATION]] or [[documentation:cdm:specimen|SPECIMEN]]|unit_concept_id|*| | + | |
- | |Device|[[documentation:cdm:device_exposure|DEVICE_EXPOSURE]]|device_concept_id| | | + | |
- | |Condition|[[documentation:cdm:condition_occurrence|CONDITION_OCCURRENCE]]|condition_concept_id| | | + | |
- | |Measurement|[[documentation:cdm:measurement|MEASUREMENT]]|measurement_concept_id| | | + | |
- | |Meas Value Operator|[[documentation:cdm:measurement|MEASUREMENT]]|operator_concept_id| | | + | |
- | |Meas Value|[[documentation:cdm:measurement|MEASUREMENT]]|value_as_concept_id| | | + | |
- | |Observation|[[documentation:cdm:observation|OBSERVATION]]|observation_concept_id| | | + | |
- | |Relationship|[[documentation:cdm:fact_relationship|FACT_RELATIONSHIP]]|relationship_concept_id| | | + | |
- | |Place of Service|[[documentation:cdm:care_site|CARE_SITE]]|place_of_service_concept_id| | | + | |
- | |Provider Specialty|[[documentation:cdm:provider|PROVIDER]]|specialty_concept_id| | | + | |
- | |Currency|[[documentation:cdm:visit_cost|VISIT_COST]] or [[documentation:cdm:procedure_cost|PROCEDURE_COST]] or [[documentation:cdm:drug_cost|DRUG_COST]] or [[documentation:cdm:device_cost|DEVICE_COST]]|currency_concept_id|Currency values appear in any of the *_COST tables. *| | + | |
- | |Revenue Code|[[documentation:cdm:procedure_cost|PROCEDURE_COST]]|revenue_code_concept_id| | | + | |
- | |Specimen|[[documentation:cdm:specimen|SPECIMEN]]|specimen_concept_id| | | + | |
- | |Spec Anatomic Site|[[documentation:cdm:specimen|SPECIMEN]] or [[documentation:cdm:measurement|MEASUREMENT]] or [[documentation:cdm:observation|OBSERVATION]]|anatomic_site_concept_id or value_as_concept_id or value_as_concept_id|Anatomical Site Concepts are used to characterize the origin of a Specimen, but also the result of a Measurement or Observation. *| | + | |
- | |Spec Disease Status|[[documentation:cdm:specimen|SPECIMEN]]|disease_status_concept_id| | | + | |
- | + | ||
- | * For some Source Concepts there is **no mapping to a Standard Concept** (there is no record in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables), usually because the Source Concept is too generic or otherwise ill-defined. In these cases, the domain_id field of the Source Concept itself should be used to place the record in the right table. Some Source Concepts have combination Domains, such as "Device/Procedure". In these cases, write a record into each of the combination Domain table (in this case to the DEVICE_OCCURRENCE and PROCEDURE_OCCURRENCE tables). As the Standard Concept write Concept 0 (concept_id=0) into the respective *_concept_id field. | + | |
- | + | ||
- | If the Domain of the destination Concept is "Domain" or "Metadata", an error has occurred in the construction of the mapping table. Please report in the [[http://forums.ohdsi.org/c/cdm-builders/l/latest|CDM-Builder]] forum if you believe there is such an error in the Standardized Vocabularies data. | + | |
- | + | ||
- | The same is true if you find a Type Concept Domain, like "Obs Period Type", "Death Type", "Visit Type", "Procedure Type", etc.These Type Concepts are valid concepts and have to be placed into the respective *_type_concept_id field of the respective CDM tables. However, they should never be introduced as part of a mapping process, as there is no equivalent information in the source data, because they denote the origin of the record. Type Concepts are hard-wired in the ETL process, because it depends on the structure of the source data how to assign Type Concepts to records. Please also report if you find a situation like that. | + | |
- | + |