What do I need in order to do an OMOP conversion?
Separate the process into modules.
Pre-analysis
  Data from the source (or at least the range of each field)
  Data dictionary including:
    taxonomies
    OMOP vocabulary (reference codes)
    are any fields masked or modified for privacy, and if so, what is the algorithm?
  Use min and max to measure the range of observations and of each field.
  Define how to handle exceptions.
  Identify a subject matter expert for the source.
  Identify an OMOP subject matter expert.
  Use cases for the destination.
  List of standard questions:
    Number of records
    How often is the data refreshed?
    Is the conversion feasible?
  Draft ETL specs:
    Privacy requirements
    Definition of a visit
    Specific codes that are of interest; ensure these are mapped.
    Business rules for handling conflicting information (e.g. visits after patient death)
    Specific metrics that are of interest
  Meeting with subject matter experts
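The min/max profiling step above can be sketched with pandas. This is a minimal illustration, not a production profiler; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical source extract; in practice this is read from the source system.
source = pd.DataFrame({
    "person_id": [1, 2, 3],
    "birth_year": [1950, 1984, 2001],
    "visit_date": pd.to_datetime(["2019-01-02", "2020-06-15", "1900-01-01"]),
})

# For each field, record min, max, and number of distinct values.
# Out-of-range values (like the 1900 visit date) surface immediately.
profile = pd.DataFrame({
    "min": source.min(),
    "max": source.max(),
    "n_distinct": source.nunique(),
})
print(profile)
```

Running this per table gives the "range for each field" deliverable of the pre-analysis and feeds the exception-handling rules.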
Post-analysis
  Revised data dictionary
  Initial ETL spec including:
    Business rules for mapping in a detailed specification, preferably in a computer-readable format, such as White Rabbit
    View of the mapping, preferably in a computer-readable format, such as Rabbit-In-a-Hat
  Identify any additional mapping needed:
    custom or local mapping of organizational codes
    country-specific codes not already in the OHDSI vocabulary, such as OHIP codes
    what to do with old codes that have been deactivated or invalidated but have no replacement codes
    add new values to be published in the next version of the vocabulary
  Preferred development methodology
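Finding the codes that need additional mapping can be sketched as a left join against the mapping table. This is an illustration only: in a real conversion the mapping comes from the OHDSI vocabulary tables (CONCEPT / CONCEPT_RELATIONSHIP "Maps to") plus any custom mappings, and the concept_ids below are made-up placeholders, not real vocabulary entries.

```python
import pandas as pd

# Hypothetical local mapping from source codes to OMOP concept_ids.
code_map = pd.DataFrame({
    "source_code": ["A123", "B456"],
    "target_concept_id": [1001, 1002],  # placeholder ids, not real concepts
})

records = pd.DataFrame({"source_code": ["A123", "B456", "Z999"]})

# Left join: any record whose target is null has no mapping yet.
mapped = records.merge(code_map, on="source_code", how="left")
unmapped = mapped[mapped["target_concept_id"].isna()]
print(unmapped["source_code"].tolist())  # codes needing custom/local mapping
```

The unmapped list is exactly the work queue for custom mapping or for proposing new vocabulary values.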
ETL Conversion
  CDM Builder
  SQL development
  Sample size for development
  Major pieces of work (Scrum epics):
    each table
    dimension tables first
    era tables always last, because they are derived
  Define the range of data to convert:
    observation period
      first/last?
      EMR vs. claims
    use min and max to measure the range of observations and of each field
    exclude reversals for claims
    exclude invalid dates
      identify the broken processes that generate invalid dates, to help remove them
    use the start and end dates of vocabulary items
    document what to do with records that are missing required fields
      maybe you have a medical coder who can code from a description field
    document what to do with records that have fields with invalid values
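The invalid-date exclusion above can be sketched as a small validity predicate. The cut-off values are assumptions for illustration, not OMOP requirements; real rules come from the ETL spec.

```python
from datetime import date

# Assumed plausibility window; tune per source system.
MIN_DATE = date(1900, 1, 1)

def valid_event_date(d, today=date(2024, 1, 1)):
    """Exclude obviously invalid dates: missing, before the plausible
    minimum, or in the future (often produced by a broken upstream process)."""
    return d is not None and MIN_DATE <= d <= today

records = [date(2019, 5, 1), date(1800, 1, 1), None, date(2030, 1, 1)]
kept = [d for d in records if valid_event_date(d)]
print(kept)
```

Logging which rule rejected each record helps identify the broken source processes mentioned above.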
  Software lifecycle
    How do you develop, test, and accept for production?
      Jenkins for automated builds
      SVN for source code control
    How do you manage the effort and cost of converting millions of patient records in terabytes of data?
      Use a sample subset of the total data, sized by number of patients, amount of data, or processing time:
        develop and test using a small sample of the data (150 thousand patients)
        business acceptance test using a larger sample of the data (1 million patients)
        production run using the entire data set (millions of patients)
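Patient-level sampling for the dev and acceptance runs can be sketched as below. The sizes are illustrative; the notes use 150 thousand for development and 1 million for acceptance. Sampling whole patients (not rows) keeps each sampled patient's record set complete.

```python
import random

def sample_patients(person_ids, n, seed=42):
    """Pick a reproducible subset of patients; the ETL then converts
    only the records belonging to those patients. Fixing the seed makes
    dev and test runs repeatable."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(person_ids), min(n, len(person_ids))))

all_patients = set(range(1, 1001))   # stand-in for the full patient list
dev_subset = sample_patients(all_patients, 150)
print(len(dev_subset))
```

The same function with a larger n produces the acceptance-test subset; the production run simply skips the sampling step.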
  Define destination location(s)
  Always get the latest vocabularies before each refresh (development, test, or production run)
    Where do you get the most recent list of codes?
    Frequency or schedule of reviews
    How do you become aware of updates to the CDM?
    How do you become aware of updates to the vocabularies?
  Partitioning for parallelism to optimize performance
  Guidelines for incremental updates
  Reusable code/tables
  Intermediate model?
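One common way to realize the partitioning-for-parallelism point is to hash-partition on the patient id, so that all of one patient's records land with the same worker and each worker can build that patient's CDM rows independently. A minimal sketch, with an assumed worker count:

```python
N_WORKERS = 4  # assumption; tune to the cluster

def partition_for(person_id: int) -> int:
    """Hash-partition by patient: every record for a given patient
    maps to the same partition, so no cross-worker joins are needed."""
    return person_id % N_WORKERS

# Assign a small set of patients to partitions.
partitions = {}
for pid in range(1, 11):
    partitions.setdefault(partition_for(pid), []).append(pid)
print(partitions)
```

The same key also works for incremental updates: a changed patient is reprocessed only in its own partition.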
Quality Assurance (QA)
  How do we ensure the ETL is good?
    Metrics for success:
      for each source and destination table:
        top mapped codes, with record counts
        top unmapped codes, with record counts
        variance between the previous and current runs:
          count of records
          % of records with mapped codes
          % of unique codes that are mapped
      for selected fields (demographics), show a histogram of values
      compare actual to expected results
      ensure referential integrity on platforms that do not enforce it
      it would be great to compare the histogram of source values with the equivalent destination values
      it would be great to show improvements between runs due to better mapping and coding
      it would be nice to show the average number of conditions per visit
      note that some de-identification processes introduce variance in dates or id values
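The mapping-coverage metrics above can be sketched in a few lines. The codes and the set of mapped codes are hypothetical; in practice the counts come from the source table and the mapped set from the finished ETL.

```python
from collections import Counter

source_codes = ["A1", "A1", "B2", "Z9", "Z9", "Z9", "C3"]  # example records
mapped_codes = {"A1", "B2", "C3"}                          # codes the ETL mapped

counts = Counter(source_codes)

# % of records whose code is mapped, and % of unique codes mapped.
mapped_records = sum(n for c, n in counts.items() if c in mapped_codes)
pct_records_mapped = 100.0 * mapped_records / len(source_codes)
pct_codes_mapped = 100.0 * sum(c in mapped_codes for c in counts) / len(counts)

# Top unmapped codes by record count: the highest-impact mapping work.
top_unmapped = [c for c, _ in counts.most_common() if c not in mapped_codes]

print(pct_records_mapped, pct_codes_mapped, top_unmapped)
```

Running these metrics after each refresh, and diffing against the previous run, gives the run-over-run variance figures the notes call for.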
  How do we get business units to participate?
  How do we get approval from business units?
  Validate the destination data with the use cases and compare against the source data run through the same use cases; investigate or accept any variance.
  Standard model checks that are independent of data or volume
  Automatic vs. manual checks
  Frequency or schedule of reviews
Tools
  Achilles
  Autosys
  Oozie
Operation
  Guidance for archiving
  Tools:
    monitoring
      Kibana