What do I need in order to do an OMOP conversion?

Separate the process into modules.

Pre-analysis
  - OMOP vocabulary (reference codes)
  - Are any fields masked or modified for privacy, and if so, what is the algorithm?
  - Use min and max to measure the range of values for observations and for each field (see the profiling sketch after this list).
  - Define how to handle exceptions.
  - Identify a subject matter expert for the source.
  - Identify an OMOP subject matter expert.
  - Is the conversion feasible?
  - Draft ETL specs, including:
    - Privacy requirements
    - Definition of a visit
    - Specific codes that are of interest; ensure these are mapped.
    - Business rules for handling conflicting information (e.g. visits after patient death).
    - Specific metrics that are of interest.
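
A minimal profiling sketch for the min/max item above, assuming the source extract is reachable through Python's built-in sqlite3 driver; the database file and table name (source.db, observation) are placeholders for your own.

<code python>
# Profile min, max, and null count for every column of a source table.
# The driver, file, and table name are assumptions; swap in your own.
import sqlite3

def profile_table(conn, table):
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    for col in cols:
        lo, hi, nulls = conn.execute(
            f"SELECT MIN({col}), MAX({col}), "
            f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) FROM {table}"
        ).fetchone()
        print(f"{table}.{col}: min={lo} max={hi} nulls={nulls}")

conn = sqlite3.connect("source.db")   # hypothetical source extract
profile_table(conn, "observation")    # hypothetical table name
</code>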

Post-analysis
  - Revised data dictionary
  - Initial ETL spec, including:
    - Business rules for mapping in a detailed specification, preferably in a computer-readable format, like [[https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design/?ModPagespeed=noscript|White Rabbit]].
    - A view of the mapping, preferably in a computer-readable format, like [[https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design/?ModPagespeed=noscript|Rabbit-In-a-Hat]].
  - Identify any additional mapping needed (see the lookup sketch after this list):
    - Custom or local mapping of organizational codes
    - Country-specific codes not already in the OHDSI vocabulary, like OHIP codes
    - What to do with old codes that have been deactivated or invalidated but have no replacement codes
    - Add new values to be published in the next version of the vocabulary
  - Preferred development methodology
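
For the custom and local mapping items above, a simple lookup in the spirit of the CDM's SOURCE_TO_CONCEPT_MAP table is often enough. A minimal sketch; the sample code and target concept id are illustrative, not a vetted mapping.

<code python>
# Load a curated local-code -> OMOP-concept lookup and apply it.
# Inline sample rows keep the sketch self-contained; in practice this
# would be a reviewed CSV. Codes and concept ids are illustrative only.
import csv, io

SAMPLE = io.StringIO(
    "source_code,target_concept_id\n"
    "OHIP-A007,9202\n"
)

local_map = {row["source_code"]: int(row["target_concept_id"])
             for row in csv.DictReader(SAMPLE)}

def map_code(source_code):
    # concept_id 0 is the OMOP convention for "no standard concept"
    return local_map.get(source_code, 0)

print(map_code("OHIP-A007"), map_code("UNKNOWN-1"))   # 9202 0
</code>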
  - Dimension tables first
  - Era tables always last, because they are derived
  - Define the range of data to convert
    - Observation period
      - First/last?
      - EMR vs claims
    - Use min and max to measure the range for observations and for each field.
    - Exclude reversals for claims.
    - Exclude invalid dates (see the date-validation sketch after this list).
      - Identify broken processes that generate invalid dates, to help remove them at the source.
    - Use the start and end dates of vocabulary items.
  - Document what to do with records that are missing required fields.
    - Maybe you have a medical coder who can code from a description field.
  - Document what to do with records that have fields with invalid values.
  - Software lifecycle
    - How do you develop, test, and accept for production?
      - Jenkins for automated builds
      - SVN for source code control
  - How do you manage the effort and cost of converting millions of patient records in terabytes of data?
    - Use a sample subset of the total data, based on number of patients, amount of data, or processing time (see the sampling sketch after this list):
      - Develop and test using a small sample of the full data (e.g. 150 thousand patients).
      - Business acceptance test using a larger sample (e.g. 1 million patients).
      - Production run using the entire data set (millions of patients).
  - Define destination location(s).
  - Always get the latest vocabularies before each refresh (development, test, or production run).
    - Where do you get the most recent list of codes?
    - Frequency or schedule of reviews
    - How do you become aware of updates to the CDM?
    - How do you become aware of updates to the vocabularies?
  - Partitioning for parallelism to optimize performance
  - Guidelines for incremental updates
  - Reusable code/tables
    - Intermediate model?
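
For the invalid-date exclusion above, one approach is to parse each date and keep only values inside a plausible window; the window bounds below are assumptions to tune per source (e.g. the site's go-live date).

<code python>
# Keep records whose date parses and falls inside a plausible window.
from datetime import date, datetime

EARLIEST = date(1900, 1, 1)   # assumed lower bound
LATEST = date.today()         # no clinical events in the future

def valid_date(text):
    try:
        d = datetime.strptime(text, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return False          # unparseable, e.g. '0000-00-00'
    return EARLIEST <= d <= LATEST

records = [{"visit_date": "2016-05-01"}, {"visit_date": "0000-00-00"}]
kept = [r for r in records if valid_date(r["visit_date"])]
print(len(kept), "of", len(records), "records kept")
</code>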
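
For the sampling tiers above, hashing the person id is one way (an assumption here, not something the CDM prescribes) to get a subset that is stable across runs, and sampling by patient rather than by row keeps each sampled patient's history intact.

<code python>
# Deterministic patient-level sampling: a patient is in the sample when
# the hash of their id falls below the requested fraction.
import hashlib

def in_sample(person_id, fraction):
    digest = hashlib.sha256(str(person_id).encode()).hexdigest()
    return int(digest, 16) % 10_000 < fraction * 10_000

# Illustrative: ~1.5% of 100,000 patients for a dev/test tier.
dev_ids = [pid for pid in range(100_000) if in_sample(pid, 0.015)]
print(f"dev/test tier: {len(dev_ids)} of 100,000 patients")
</code>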

Quality Assurance (QA)
  - How do we ensure the ETL is good?
    - Metrics for success (see the metrics sketch after this list):
      - For each source and destination table:
        - Top mapped codes, with count of records
        - Top unmapped codes, with count of records
        - Variance between the previous and current run:
          - Count of records
          - % of records with mapped codes
          - % of unique codes that are mapped
      - For select fields (demographics), show a histogram of values.
    - Compare actual to expected results.
    - Ensure referential integrity on platforms that do not enforce it (see the integrity sketch after this list).
    - It would be awesome to compare the histogram of values for each source field with its equivalent destination field.
    - It would be awesome to show improvements between runs due to better mapping and coding.
    - It would be nice to show the average number of conditions per visit.
    - Note that some de-identification processes introduce variance in dates or ID values.
  - How do we get business units to participate?
  - How do we get approval from business units?
  - Validate the destination data with use cases and compare against the source data using the same use cases. Investigate or accept variance.
  - Standard model checks that are independent of data or volume
    - Automatic vs manual checks
  - Frequency or schedule of reviews
  - Tools
    - Achilles
    - Autosys
    - Oozie
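
A sketch of the per-table mapping metrics above, over hypothetical (source_code, concept_id) pairs where concept_id 0 marks an unmapped record, following the OMOP convention.

<code python>
# Top unmapped source codes and the share of records that mapped.
# The rows are illustrative stand-ins for a real source table.
from collections import Counter

rows = [("I10", 320128), ("LOCAL-9", 0), ("I10", 320128),
        ("LOCAL-9", 0), ("E11.9", 201826), ("LOCAL-3", 0)]

unmapped = Counter(code for code, cid in rows if cid == 0)
mapped_pct = 100 * sum(cid != 0 for _, cid in rows) / len(rows)

print("top unmapped codes:", unmapped.most_common(5))
print(f"records mapped: {mapped_pct:.1f}%")
</code>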
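
For the referential-integrity check above, an anti-join finds child rows with no parent. The sketch runs against an in-memory stand-in; in practice, point the connection at the real person and visit_occurrence tables.

<code python>
# Find visit_occurrence rows whose person_id has no row in person.
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the CDM database
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY);
    CREATE TABLE visit_occurrence (visit_occurrence_id INTEGER,
                                   person_id INTEGER);
    INSERT INTO person VALUES (1), (2);
    INSERT INTO visit_occurrence VALUES (10, 1), (11, 3);  -- 3 is orphaned
""")

orphans = conn.execute("""
    SELECT COUNT(*) FROM visit_occurrence v
    LEFT JOIN person p ON p.person_id = v.person_id
    WHERE p.person_id IS NULL
""").fetchone()[0]
print("visit_occurrence rows with no matching person:", orphans)
</code>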

Operation
  - Guidance for archiving
  - Tools
    - Monitoring
      - Kibana