Marko Miletic

Heya! Welcome to my website.
news
latest posts
Sep 23, 2025 | Trump’s UN Claim Through the Lens of Bayesian Statistics
Jun 01, 2025 | Recap of The Mobi 100 Jackpot ⛰️🏃
Apr 28, 2025 | The Orange and Other Small Miracles
selected publications
- Synthetic data for pharmacogenetics: enabling scalable and secure research. JAMIA Open, Oct 2025
This study evaluates the performance of 7 synthetic data generation (SDG) methods (synthpop, avatar, copula, copulagan, ctgan, tvae, and the large language model-based tabula) for supporting pharmacogenetics (PGx) research.

We used PGx profiles from 142 patients with adverse drug reactions or therapeutic failures, considering 2 scenarios: (1) a high-dimensional genotype dataset (104 variables) and (2) a phenotype dataset (24 variables). Models were assessed for (1) broad utility using propensity score mean squared error (pMSE), (2) specific utility via weighted F1 score in a Train-Synthetic-Test-Real framework, and (3) privacy risk as ε-identifiability.

Copula and synthpop consistently achieved strong performance across both datasets, combining low ε-identifiability (0.25-0.35) with competitive utility. Deep learning models like tabula and tvae trained for 10 000 epochs achieved lower pMSE but had higher ε-identifiability (>0.4) and limited gains in predictive performance. Specific utility was only weakly correlated with broad utility, indicating that distributional fidelity does not ensure predictive relevance. Copula and synthpop often outperformed original data in weighted F1 scores, especially under noise or data imbalance.

While deep learning models can achieve high distributional fidelity (pMSE), they often incur elevated ε-identifiability, raising privacy concerns. Traditional methods like copula and synthpop consistently offer robust utility and lower re-identification risk, particularly for high-dimensional data. Importantly, general utility does not predict specific utility (F1 score), emphasizing the need for multimetric evaluation.

No single SDG method dominated across all criteria. For privacy-sensitive PGx applications, classical methods such as copula and synthpop offer a reliable trade-off between utility and privacy, making them preferable for high-dimensional, limited-sample settings.

Pharmacogenetics (PGx) is the study of how a person’s genes affect their response to medications. Sharing real PGx data for research can raise privacy concerns, especially when data samples are small or sensitive. One way to protect privacy is to use synthetic data: artificial data generated to mimic real data. In this study, we compared 7 methods for creating synthetic PGx data, including both traditional statistical approaches and newer artificial intelligence (AI) models. We tested them on genetic and medical information from 142 patients who had side effects or poor responses to medications. We evaluated each method in 3 ways: how well it preserves overall data patterns (broad utility), how well it supports accurate predictions in real-world tests (specific utility), and how much privacy risk remains (re-identification). Two traditional methods, copula and synthpop, performed well overall. They created useful data while keeping privacy risks low. In contrast, some AI-based methods, while better at copying data patterns, had higher privacy risks and only minor improvements in predictive performance. We found that good general data quality does not always mean better predictions. For PGx research, especially when privacy matters, simpler methods like copula and synthpop may offer the best balance between usefulness and safety.
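
For readers unfamiliar with the two utility metrics named in the abstract, here is a minimal sketch of how they are commonly defined. It is not the study's code. The function names `pmse` and `weighted_f1` are illustrative, and the propensity scores are assumed to come from some classifier trained to tell real rows from synthetic ones, which is not shown here.

```python
from collections import Counter


def pmse(propensity_scores, n_synthetic):
    """Propensity score mean squared error (broad utility).

    propensity_scores: predicted probability that each row of the combined
    (real + synthetic) dataset is synthetic. The classifier that produces
    these scores is assumed and not shown.
    n_synthetic: number of synthetic rows in the combined dataset.
    """
    n_total = len(propensity_scores)
    c = n_synthetic / n_total  # share of synthetic rows
    # If real and synthetic rows are indistinguishable, every score is
    # close to c and the pMSE is close to 0.
    return sum((p - c) ** 2 for p in propensity_scores) / n_total


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support.

    In a Train-Synthetic-Test-Real setup, y_pred would come from a model
    trained on synthetic data and y_true from held-out real data.
    """
    support = Counter(y_true)
    total = 0.0
    for label, n_label in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_label - tp
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += f1 * n_label
    return total / len(y_true)
```

A lower pMSE means the synthetic data are harder to distinguish from the real data, while a higher weighted F1 means a model trained on synthetic data predicts real outcomes better. The two can disagree, which is the point the abstract makes about broad versus specific utility.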