What do I need in order to do an OMOP conversion?

Separate the process into modules.

Pre-analysis
  - OMOP vocabulary (reference codes)
  - Are any fields masked or modified for privacy, and if so, what is the algorithm?
  - Use min and max to measure the range of values for observations and for each field (see the profiling sketch after this list).
  - Define how to handle exceptions.
  - Identify a subject matter expert for the source.
  - Identify an OMOP subject matter expert.
  - Is the conversion feasible?
  - Draft ETL specs, including:
    - Privacy requirements
    - Definition of a visit
    - Specific codes that are of interest; ensure these are mapped.
    - Business rules for handling conflicting information (e.g. visits after patient death).
    - Specific metrics that are of interest.
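
A minimal profiling sketch for the min/max item above, assuming the source extract is reachable through Python's built-in sqlite3 driver; the database file and table name (source.db, observation) are placeholders for your own.

<code python>
# Profile min, max, and null count for every column of a source table.
# The driver, file, and table name are assumptions; swap in your own.
import sqlite3

def profile_table(conn, table):
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    for col in cols:
        lo, hi, nulls = conn.execute(
            f"SELECT MIN({col}), MAX({col}), "
            f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) FROM {table}"
        ).fetchone()
        print(f"{table}.{col}: min={lo} max={hi} nulls={nulls}")

conn = sqlite3.connect("source.db")   # hypothetical source extract
profile_table(conn, "observation")    # hypothetical table name
</code>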

Post-analysis
  - Revised data dictionary
  - Initial ETL spec, including:
    - Business rules for mapping in a detailed specification, preferably in a computer-readable format, like [[https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design/?ModPagespeed=noscript|White Rabbit]].
    - A view of the mapping, preferably in a computer-readable format, like [[https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design/?ModPagespeed=noscript|Rabbit-In-a-Hat]].
  - Identify any additional mapping needed (see the lookup sketch after this list):
    - Custom or local mapping of organizational codes
    - Country-specific codes not already in the OHDSI vocabulary, like OHIP codes
    - What to do with old codes that have been deactivated or invalidated but have no replacement codes
    - Add new values to be published in the next version of the vocabulary
  - Preferred development methodology
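
For the custom and local mapping items above, a simple lookup in the spirit of the CDM's SOURCE_TO_CONCEPT_MAP table is often enough. A minimal sketch; the sample code and target concept id are illustrative, not a vetted mapping.

<code python>
# Load a curated local-code -> OMOP-concept lookup and apply it.
# Inline sample rows keep the sketch self-contained; in practice this
# would be a reviewed CSV. Codes and concept ids are illustrative only.
import csv, io

SAMPLE = io.StringIO(
    "source_code,target_concept_id\n"
    "OHIP-A007,9202\n"
)

local_map = {row["source_code"]: int(row["target_concept_id"])
             for row in csv.DictReader(SAMPLE)}

def map_code(source_code):
    # concept_id 0 is the OMOP convention for "no standard concept"
    return local_map.get(source_code, 0)

print(map_code("OHIP-A007"), map_code("UNKNOWN-1"))   # 9202 0
</code>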
  - Dimension tables first
  - Era tables always last, because they are derived
  - Define the range of data to convert
    - Observation period
      - First/last?
      - EMR vs claims
    - Use min and max to measure the range for observations and for each field.
    - Exclude reversals for claims.
    - Exclude invalid dates (see the date-validation sketch after this list).
      - Identify broken processes that generate invalid dates, to help remove them at the source.
    - Use the start and end dates of vocabulary items.
  - Document what to do with records that are missing required fields.
    - Maybe you have a medical coder who can code from a description field.
  - Document what to do with records that have fields with invalid values.
  - Software lifecycle
    - How do you develop, test, and accept for production?
      - Jenkins for automated builds
      - SVN for source code control
  - How do you manage the effort and cost of converting millions of patient records in terabytes of data?
    - Use a sample subset of the total data, based on number of patients, amount of data, or processing time (see the sampling sketch after this list):
      - Develop and test using a small sample of the full data (e.g. 150 thousand patients).
      - Business acceptance test using a larger sample (e.g. 1 million patients).
      - Production run using the entire data set (millions of patients).
  - Define destination location(s).
  - Always get the latest vocabularies before each refresh (development, test, or production run).
    - Where do you get the most recent list of codes?
    - Frequency or schedule of reviews
    - How do you become aware of updates to the CDM?
    - How do you become aware of updates to the vocabularies?
  - Partitioning for parallelism to optimize performance
  - Guidelines for incremental updates
  - Reusable code/tables
    - Intermediate model?
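
For the invalid-date exclusion above, one approach is to parse each date and keep only values inside a plausible window; the window bounds below are assumptions to tune per source (e.g. the site's go-live date).

<code python>
# Keep records whose date parses and falls inside a plausible window.
from datetime import date, datetime

EARLIEST = date(1900, 1, 1)   # assumed lower bound
LATEST = date.today()         # no clinical events in the future

def valid_date(text):
    try:
        d = datetime.strptime(text, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return False          # unparseable, e.g. '0000-00-00'
    return EARLIEST <= d <= LATEST

records = [{"visit_date": "2016-05-01"}, {"visit_date": "0000-00-00"}]
kept = [r for r in records if valid_date(r["visit_date"])]
print(len(kept), "of", len(records), "records kept")
</code>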
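
For the sampling tiers above, hashing the person id is one way (an assumption here, not something the CDM prescribes) to get a subset that is stable across runs, and sampling by patient rather than by row keeps each sampled patient's history intact.

<code python>
# Deterministic patient-level sampling: a patient is in the sample when
# the hash of their id falls below the requested fraction.
import hashlib

def in_sample(person_id, fraction):
    digest = hashlib.sha256(str(person_id).encode()).hexdigest()
    return int(digest, 16) % 10_000 < fraction * 10_000

# Illustrative: ~1.5% of 100,000 patients for a dev/test tier.
dev_ids = [pid for pid in range(100_000) if in_sample(pid, 0.015)]
print(f"dev/test tier: {len(dev_ids)} of 100,000 patients")
</code>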

Quality Assurance (QA)
  - How do we ensure the ETL is good?
    - Metrics for success (see the metrics sketch after this list):
      - For each source and destination table:
        - Top mapped codes, with count of records
        - Top unmapped codes, with count of records
        - Variance between the previous and current run:
          - Count of records
          - % of records with mapped codes
          - % of unique codes that are mapped
      - For select fields (demographics), show a histogram of values.
    - Compare actual to expected results.
    - Ensure referential integrity on platforms that do not enforce it (see the integrity sketch after this list).
    - It would be awesome to compare the histogram of values for each source field with its equivalent destination field.
    - It would be awesome to show improvements between runs due to better mapping and coding.
    - It would be nice to show the average number of conditions per visit.
    - Note that some de-identification processes introduce variance in dates or ID values.
  - How do we get business units to participate?
  - How do we get approval from business units?
  - Validate the destination data with use cases and compare against the source data using the same use cases. Investigate or accept variance.
  - Standard model checks that are independent of data or volume
    - Automatic vs manual checks
  - Frequency or schedule of reviews
  - Tools
    - Achilles
    - Autosys
    - Oozie
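
A sketch of the per-table mapping metrics above, over hypothetical (source_code, concept_id) pairs where concept_id 0 marks an unmapped record, following the OMOP convention.

<code python>
# Top unmapped source codes and the share of records that mapped.
# The rows are illustrative stand-ins for a real source table.
from collections import Counter

rows = [("I10", 320128), ("LOCAL-9", 0), ("I10", 320128),
        ("LOCAL-9", 0), ("E11.9", 201826), ("LOCAL-3", 0)]

unmapped = Counter(code for code, cid in rows if cid == 0)
mapped_pct = 100 * sum(cid != 0 for _, cid in rows) / len(rows)

print("top unmapped codes:", unmapped.most_common(5))
print(f"records mapped: {mapped_pct:.1f}%")
</code>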
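
For the referential-integrity check above, an anti-join finds child rows with no parent. The sketch runs against an in-memory stand-in; in practice, point the connection at the real person and visit_occurrence tables.

<code python>
# Find visit_occurrence rows whose person_id has no row in person.
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the CDM database
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY);
    CREATE TABLE visit_occurrence (visit_occurrence_id INTEGER,
                                   person_id INTEGER);
    INSERT INTO person VALUES (1), (2);
    INSERT INTO visit_occurrence VALUES (10, 1), (11, 3);  -- 3 is orphaned
""")

orphans = conn.execute("""
    SELECT COUNT(*) FROM visit_occurrence v
    LEFT JOIN person p ON p.person_id = v.person_id
    WHERE p.person_id IS NULL
""").fetchone()[0]
print("visit_occurrence rows with no matching person:", orphans)
</code>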

Operation
  - Guidance for archiving
  - Tools
    - Monitoring
      - Kibana