Predicting outcomes for new users of celecoxib

Background

Observational data make it possible to identify patterns that can be used to develop discriminative, personalized risk models for health outcomes. The most widely used framework for developing such models requires experts to define a set of independent variables, which are then entered into a logistic or Cox regression. The disadvantage of this framework is that defining a different set of independent variables for each health outcome and dataset is time-consuming. Furthermore, the model is generally restricted to a small selection of variables and therefore ignores large quantities of data whose inclusion might improve risk prediction performance. For example, restricting the independent variables to those supported by current expert knowledge may exclude highly predictive variables that are not yet known to be relevant. The PatientLevelPrediction package developed in OHDSI (Observational Health Data Sciences and Informatics) uses an adaptive framework in which a data-driven method explores all the data to find the independent variables that are predictive of the outcome. This is accomplished by fitting a logistic regression model that includes a large number of independent variables, with a Laplace prior placed on each coefficient. The prior acts as a form of regularization: it shrinks many coefficients to zero, which limits overfitting. The independent variables with non-zero coefficients are those the model has selected as predictive of the health outcome. If such a framework performs well, it could be applied efficiently across the network of observational data available to the OHDSI community to develop risk models for many health outcomes.

A further advantage of lasso logistic regression (logistic regression with a Laplace prior, which corresponds to an L1 penalty on the coefficients) is that it learns a sparse model that is easy to interpret, so it may also yield new medical insight by clearly highlighting previously unknown risk factors. In this study we aim to provide a proof of principle for the PatientLevelPrediction package and to evaluate the performance and robustness of its risk prediction models across a range of health outcomes and datasets. We also aim to show that the package can be easily deployed in a distributed research network to develop risk models using different observational datasets.
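To illustrate how a Laplace prior leads to variable selection, the following is a minimal sketch in Python of L1-penalized logistic regression fitted by proximal gradient descent. It is not the PatientLevelPrediction implementation (the package itself is written in R); the function names, the synthetic data, and the penalty, step size and iteration count are all assumptions chosen for illustration only.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lasso_logistic(X, y, penalty=0.05, step=0.5, iterations=300):
    """Minimise mean log-loss + penalty * sum(|beta_j|) by proximal gradient descent.

    The intercept is left unpenalised. All defaults are illustrative assumptions.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = 0.0
    for _ in range(iterations):
        grad = [0.0] * p
        grad0 = 0.0
        for row, label in zip(X, y):
            linear = intercept + sum(b * x for b, x in zip(beta, row))
            residual = sigmoid(linear) - label
            grad0 += residual / n
            for j in range(p):
                grad[j] += residual * row[j] / n
        # Plain gradient step for the intercept, gradient step followed by
        # soft-thresholding (the L1 proximal operator) for the coefficients.
        intercept -= step * grad0
        beta = [soft_threshold(b - step * g, step * penalty)
                for b, g in zip(beta, grad)]
    return intercept, beta


# Synthetic data (hypothetical): 10 candidate variables, only the first two
# actually influence the outcome.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(10)] for _ in range(300)]
y = [1 if random.random() < sigmoid(2.0 * r[0] - 2.0 * r[1]) else 0 for r in X]

intercept, beta = fit_lasso_logistic(X, y)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected variable indices:", selected)
```

The soft-thresholding step is what sets coefficients to exactly zero rather than merely making them small, which is why the fitted model can be read as a list of selected variables. A larger penalty selects fewer variables; in practice the penalty would be chosen by cross-validation rather than fixed as it is here.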

Objective

Primary objective

  1. Determine the predictive performance of the OHDSI PatientLevelPrediction package using standard measures, including model calibration and model discrimination.
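As a rough illustration of the two kinds of measure named above, the sketch below computes discrimination as the area under the ROC curve (AUC) and summarizes calibration by comparing mean predicted risk with observed outcome rate within groups of ranked predictions. The function names, the number of groups, and the toy data are assumptions made for this example; the protocol does not specify which calibration statistic will be reported.

```python
def auc(labels, scores):
    """Probability that a random case is scored above a random non-case.

    Ties count as half a win. Uses all case/non-case pairs, so it is only
    suitable for small illustrative datasets.
    """
    cases = [s for l, s in zip(labels, scores) if l == 1]
    non_cases = [s for l, s in zip(labels, scores) if l == 0]
    if not cases or not non_cases:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in non_cases)
    return wins / (len(cases) * len(non_cases))


def calibration_table(labels, scores, groups=2):
    """Return (mean predicted risk, observed outcome rate, size) per risk group."""
    ranked = sorted(zip(scores, labels))
    n = len(ranked)
    table = []
    for g in range(groups):
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        if chunk:
            mean_predicted = sum(s for s, _ in chunk) / len(chunk)
            observed_rate = sum(l for _, l in chunk) / len(chunk)
            table.append((mean_predicted, observed_rate, len(chunk)))
    return table


# Toy example (hypothetical values): six people with predicted risks and outcomes.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]

print("AUC:", auc(labels, scores))
for mean_predicted, observed_rate, size in calibration_table(labels, scores):
    print(f"predicted {mean_predicted:.2f}  observed {observed_rate:.2f}  n={size}")
```

A well-calibrated model has predicted and observed values that agree within each group, while a high AUC only shows that cases tend to be ranked above non-cases; a model can do well on one measure and poorly on the other, which is why both are assessed.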

Secondary objectives

  1. Investigate the robustness of the model across different datasets and identify any limitations that are specific to certain dataset attributes.
  2. Show the feasibility of running PatientLevelPrediction in a distributed data network.

Project Lead(s):

Coordinating Institution(s):

Additional Participants:

Full Protocol: Word doc for the protocol

Initial Proposal Date:

Launch Date: TBD

Study Closure Date: TBD

Results Submission: Email

Requirements

CDM: V5

Tables Accessed: CONCEPT, CONCEPT_ANCESTOR, CONDITION_ERA, CONDITION_OCCURRENCE, DRUG_ERA, DRUG_EXPOSURE, MEASUREMENT, OBSERVATION, OBSERVATION_PERIOD, PERSON, PROCEDURE_OCCURRENCE, VISIT_OCCURRENCE

Database Dialects: SQL Server, Postgres, Oracle, PDW

Software: SQL and R

Code

GitHub code for the study

Datasets Run