Authors: M. Tornqvist1, L. Dry1, G. Pinon1, A. Movschin1
1Quinten Health, Paris, France
Date: 19 November 2024
Code: MSR138
CONFERENCE/VALUE IN HEALTH INFO:
2024-11, ISPOR Europe 2024, Barcelona, Spain
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Introduction
The growing use of machine learning methods in medical research requires large volumes of high-quality patient data. However, concerns about confidentiality, cost, availability and timeliness limit access to such data. To overcome these obstacles, synthetic data that mimic real-world data (RWD) are emerging as a promising solution and are increasingly being considered by the pharmaceutical industry. Synthetic data generation methods can produce customized artificial patient datasets of various sizes, free of some of the limitations of RWD, such as missing values and class imbalance [1]. Recently, deep learning methods such as generative adversarial networks (GANs) have demonstrated remarkable performance in reproducing benchmark real-world datasets, notably from the field of economics [2], [3]. This study presents an in-depth evaluation of two GAN models, CTGAN [2] and CTABGAN [3], for health data synthesis using electronic health record (EHR) data from the MIMIC-III database [4].
Methods
MIMIC-III is a publicly accessible database of critical care electronic health records (EHRs). It contains clinical characteristics (e.g. diagnostic codes, gender, ethnicity, age, laboratory measurements), which were extracted and aggregated across each patient's entire history. CTGAN and CTABGAN are models specifically designed for tabular data synthesis. CTGAN addresses class imbalance by combining a conditional generator with a training-by-sampling mechanism, while CTABGAN can model a mixture of continuous and categorical variables thanks to a novel data encoding.
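To make the class-imbalance mechanism more concrete, the following is a minimal sketch of the sampling step behind conditional generation, as we understand it from the CTGAN paper [2]: a discrete column is chosen at random, a category is drawn with probability proportional to the logarithm of its frequency, and a real row matching that category is then selected. It is not the implementation used in this study. The function names, the `+ 1` inside the logarithm and the toy `outcome` column are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def log_frequency_pmf(values):
    """Probability of each category, proportional to log(count + 1).

    The "+ 1" is an assumption of this sketch; it keeps every weight positive.
    """
    counts = Counter(values)
    weights = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}


def sample_condition(rows, discrete_columns, rng):
    """Draw a (column, category) condition and a real row that satisfies it."""
    # 1. Pick one discrete column uniformly at random.
    column = rng.choice(discrete_columns)
    # 2. Pick a category using log-frequency weights, which flattens imbalance.
    pmf = log_frequency_pmf([row[column] for row in rows])
    categories = sorted(pmf)
    category = rng.choices(categories, weights=[pmf[c] for c in categories], k=1)[0]
    # 3. Pick a real row among those that match the chosen category.
    matching = [row for row in rows if row[column] == category]
    return column, category, rng.choice(matching)


# Toy table with a hypothetical column name (not an actual MIMIC-III field).
rows = [{"outcome": "survived"}] * 95 + [{"outcome": "died"}] * 5
pmf = log_frequency_pmf([row["outcome"] for row in rows])
rng = random.Random(0)
column, category, row = sample_condition(rows, ["outcome"], rng)
```

In this toy table the rare category makes up 5% of the rows, but under the log-frequency weighting it is proposed as a condition about 28% of the time, so the generator would see it far more often during training.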
Synthetic data were assessed for fidelity to the real data and for preservation of inter-variable correlations using statistical measures (KSComplement [5], TVComplement [6], Wasserstein and Jensen-Shannon distances, and difference in pairwise correlations [7]) as well as comparative visualizations of synthetic versus real distributions.
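As an illustration of what these measures compute, here is a minimal standard-library sketch on toy columns. It is not the code used in this study, where library implementations (for example SDMetrics or SciPy) would normally be preferred. The function names, the use of Pearson correlation for the pairwise comparison and the base-2 logarithm in the Jensen-Shannon distance are assumptions of this sketch.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synth):
    """1 - two-sample Kolmogorov-Smirnov statistic (1.0 means identical ECDFs)."""
    r, s = sorted(real), sorted(synth)
    # The largest gap between the two empirical CDFs occurs at an observed value.
    gap = max(abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
              for x in r + s)
    return 1.0 - gap


def _frequencies(values, categories):
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def tv_complement(real, synth):
    """1 - total variation distance between category frequencies."""
    cats = list(set(real) | set(synth))
    p, q = _frequencies(real, cats), _frequencies(synth, cats)
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def js_distance(real, synth):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    cats = list(set(real) | set(synth))
    p, q = _frequencies(real, cats), _frequencies(synth, cats)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


def wasserstein_1d(real, synth):
    """1-Wasserstein distance: area between the two empirical CDFs."""
    r, s = sorted(real), sorted(synth)
    points = sorted(set(r + s))
    total = 0.0
    for left, right in zip(points, points[1:]):
        gap = abs(bisect_right(r, left) / len(r) - bisect_right(s, left) / len(s))
        total += gap * (right - left)
    return total


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_gap(real_cols, synth_cols):
    """Mean absolute difference in pairwise Pearson correlations."""
    names = sorted(real_cols)
    diffs = [abs(pearson(real_cols[a], real_cols[b])
                 - pearson(synth_cols[a], synth_cols[b]))
             for i, a in enumerate(names) for b in names[i + 1:]]
    return sum(diffs) / len(diffs)
```

KSComplement and TVComplement are similarity scores (higher is better), whereas the Wasserstein distance, the Jensen-Shannon distance and the correlation gap are dissimilarities (lower is better), so the two groups should not be averaged together without rescaling.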
Results
CTGAN and CTABGAN generated synthetic data capturing the essential statistical features of the original dataset (Figure 1). CTGAN had more difficulty than CTABGAN in synthesizing data following distributions with multiple modes, and CTABGAN was slightly better at preserving inter-variable correlations.
Conclusion
This study highlights the potential of GAN-based deep learning approaches for generating synthetic patient data. The evaluation of two GANs on the MIMIC-III database demonstrated their ability to produce realistic synthetic health data while preserving the confidentiality of the original data. However, it should be noted that GANs require large quantities of data and significant computational resources, and can be difficult to train to convergence while still generating diverse data.
Finally, to promote the use of synthetic data by researchers, regulators and industry, it is necessary to establish a consensus for their evaluation in order to harmonize evidence generation and decision-making.