publications | Marko Miletic

2025

SHTI

Synthetic Data Generation Methods for Longitudinal and Time Series Health Data

Marko Miletic, and Murat Sariyar

In Global Healthcare Transformation in the Era of Artificial Intelligence and Informatics, 2025

While Synthetic Data Generation (SDG) is widely recognized in healthcare, especially for structured tabular data, longitudinal and time series data represent another crucial application. This rapid literature review analyzed 338 articles retrieved from PubMed and swisscovery, identifying and categorizing 14 prominent methods for generating synthetic longitudinal and time series health data. These methods encompass Generative Adversarial Networks (GANs), diffusion models, Variational Autoencoders (VAEs), transformer-based models, and Bayesian statistical methods. The review offers preliminary insights into the approaches’ utility, fidelity, and privacy implications, guiding future method development and adequate SDG model selection.
SHTI

The Relevance of General Intelligence Measurement in Deep Learning for Healthcare

Marko Miletic, and Murat Sariyar

In Envisioning the Future of Health Informatics and Digital Health, 2025

The integration of artificial intelligence (AI) into medical informatics presents significant opportunities to enhance healthcare through data-driven diagnostics, predictive analytics, and personalized therapeutic recommendations. This paper examines the role of general intelligence in improving the effectiveness and adaptability of AI systems in complex clinical environments. We explore various levels of generalization – local, broad, and extreme – highlighting their respective contributions and limitations in healthcare. Local generalization provides robust assessments based on well-defined risk factors, while broad generalization allows for nuanced patient stratification across diverse populations. Extreme generalization, however, presents the greatest challenge, requiring AI systems to adapt to entirely new contexts without prior exposure. Despite advancements, existing metrics for assessing generalization difficulty remain inadequate, necessitating the development of new evaluation methodologies.
JMIR AI

Utility-based Analysis of Statistical Approaches and Deep Learning Models for Synthetic Data Generation With Focus on Correlation Structures: Algorithm Development and Validation

Marko Miletic, and Murat Sariyar

JMIR AI, Mar 2025

Background: Recent advancements in Generative Adversarial Networks and large language models (LLMs) have significantly advanced the synthesis and augmentation of medical data. These and other deep learning–based methods offer promising potential for generating high-quality, realistic datasets crucial for improving machine learning applications in health care, particularly in contexts where data privacy and availability are limiting factors. However, challenges remain in accurately capturing the complex associations inherent in medical datasets. Objective: This study evaluates the effectiveness of various Synthetic Data Generation (SDG) methods in replicating the correlation structures inherent in real medical datasets. In addition, it examines their performance in downstream tasks using Random Forests (RFs) as the benchmark model. To provide a comprehensive analysis, alternative models such as eXtreme Gradient Boosting and Gated Additive Tree Ensembles are also considered. We compare the following SDG approaches: Synthetic Populations in R (synthpop), copula, copulagan, Conditional Tabular Generative Adversarial Network (ctgan), tabular variational autoencoder (tvae), and tabula for LLMs. Methods: We evaluated synthetic data generation methods using both real-world and simulated datasets. Simulated data consist of 10 Gaussian variables and one binary target variable with varying correlation structures, generated via Cholesky decomposition. Real-world datasets include the body performance dataset with 13,393 samples for fitness classification, the Wisconsin Breast Cancer dataset with 569 samples for tumor diagnosis, and the diabetes dataset with 768 samples for diabetes prediction. Data quality is evaluated by comparing correlation matrices, the propensity score mean-squared error (pMSE) for general utility, and F1-scores for downstream tasks as a specific utility metric, using training on synthetic data and testing on real data. 
Results: Our simulation study, supplemented with real-world data analyses, shows that the statistical methods copula and synthpop consistently outperform deep learning approaches across various sample sizes and correlation complexities, with synthpop being the most effective. Deep learning methods, including LLMs, show mixed performance, particularly with smaller datasets or limited training epochs. LLMs often struggle to replicate numerical dependencies effectively. In contrast, methods like tvae with 10,000 epochs perform comparably well. On the body performance dataset, copulagan achieves the best performance in terms of pMSE. The results also highlight that model utility depends more on the relative correlations between features and the target variable than on the absolute magnitude of correlation matrix differences. Conclusions: Statistical methods, particularly synthpop, demonstrate superior robustness and utility preservation for synthetic tabular data compared with deep learning approaches. Copula methods show potential but face limitations with integer variables. Deep learning methods underperform in this context. Overall, these findings underscore the dominance of statistical methods for synthetic data generation for tabular data, while highlighting the niche potential of deep learning approaches for highly complex datasets, provided adequate resources and tuning.
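
The simulation step named in the Methods (Gaussian variables with a prescribed correlation structure, generated via Cholesky decomposition) can be illustrated with a minimal sketch. It is not the authors' code: the function names, the 3x3 target matrix, the sample size, and the seed are all assumptions chosen for illustration, and the binary target variable is left out because the abstract does not say how it is derived.

```python
import math
import random


def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix (symmetric positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_correlated(corr, n_samples, seed=0):
    """Draw standard normal vectors z and return x = L z, whose correlation matrix is corr."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    dim = len(corr)
    rows = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        rows.append([sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(dim)])
    return rows


def pearson(xs, ys):
    """Empirical Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical target: three variables with pairwise correlation 0.6.
TARGET = [[1.0, 0.6, 0.6], [0.6, 1.0, 0.6], [0.6, 0.6, 1.0]]
samples = sample_correlated(TARGET, n_samples=5000, seed=42)
first = [row[0] for row in samples]
second = [row[1] for row in samples]
print(round(pearson(first, second), 2))  # empirical value, close to the 0.6 target
```

Comparing the empirical correlation of the generated columns with the target matrix is the same kind of check the abstract describes for synthetic data, applied here to the simulated data itself.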

2024

SHTI

Use and Evaluation of GANs for Synthetic Data Generation in Pharmacogenetics

Dominic Aeschbacher, Jessica Meisner, Marko Miletic, and 1 more author

In Collaboration across Disciplines for the Health of People, Animals and Ecosystems, Mar 2024

Pharmacogenetics (PGx) explores the influence of genetic variability on drug efficacy and tolerability. Synthetic Data Generation (SDG) has emerged as a promising alternative to the labor-intensive process of collecting real-world PGx data, which is required for high-quality prediction models. This study investigates the performance of two Generative Adversarial Network (GAN) models, CTGAN and CTAB-GAN+, in generating synthetic PGx data. The benchmarking is based on utility metrics (Hellinger distance and Random Forest accuracy) and ϵ-identifiability. Results demonstrate that synthetic data generated by CTAB-GAN+ can surpass the original dataset in terms of utility. For instance, CTAB-GAN+ achieves higher Random Forest accuracy compared to the original data, indicating better predictive performance. These improvements suggest that synthetic data not only capture the essential patterns of the original data but also enhance model generalization and prediction capabilities, providing a more robust training ground for machine learning models. Consequently, SDG offers a promising solution to address data scarcity and imbalance in pharmacogenetic research.
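
As an illustration of the Hellinger distance named among the utility metrics, here is a minimal sketch for a single categorical column. It assumes the standard definition for discrete distributions; the function names and the example category labels are hypothetical, and the abstract does not say how the study aggregates the distance across columns.

```python
import math
from collections import Counter


def category_frequencies(values):
    """Map each category to its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between the category distributions of two columns.

    Returns 0.0 for identical distributions and 1.0 for distributions
    that share no category.
    """
    p = category_frequencies(real_values)
    q = category_frequencies(synthetic_values)
    categories = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in categories
    )
    return math.sqrt(squared / 2.0)


# Hypothetical phenotype labels, used only to exercise the function.
real_column = ["normal", "normal", "poor", "rapid"]
synthetic_column = ["normal", "poor", "poor", "rapid"]
print(hellinger_distance(real_column, synthetic_column))
```

A lower value means the synthetic column reproduces the real category frequencies more closely.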
SHTI

What Kind of Transformer Models to Use for the ICD-10 Codes Classification Task

Mariem Mansour, Fatma Yilmaz, Marko Miletic, and 1 more author

In Digital Health and Informatics Innovations for Sustainable Health Care Systems, Mar 2024

Coding according to the International Classification of Diseases (ICD)-10 and its clinical modifications (CM) is inherently complex and expensive. Natural Language Processing (NLP) assists by simplifying the analysis of unstructured data from electronic health records, thereby facilitating diagnosis coding. This study investigates the suitability of transformer models for ICD-10 classification, considering both encoder and encoder-decoder architectures. The analysis is performed on clinical discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset, which contains an extensive collection of electronic health records. Pre-trained models such as BioBERT, ClinicalBERT, ClinicalLongformer, and ClinicalBigBird are adapted for the coding task, incorporating specific preprocessing techniques to enhance performance. The findings indicate that increasing context length improves accuracy, and that the difference in accuracy between encoder and encoder-decoder models is negligible.
SHTI

Modelling Events in Biomedical Decision Support Systems Using Ontologies

Marko Miletic, and Murat Sariyar

In Digital Health and Informatics Innovations for Sustainable Health Care Systems, Mar 2024

Biomedical decision support systems play a crucial role in modern healthcare by assisting clinicians in making informed decisions. Events, such as physiological changes or drug reactions, are integral components of these systems, influencing patient outcomes and treatment strategies. However, effectively modeling events within these systems presents significant challenges due to the complexity and dynamic nature of medical data. In particular, the differentiation between events and processes, as well as the nature of events, is often unclear. This paper explores approaches to modeling events in biomedical decision support systems, considering factors such as ontology-based representation. By addressing these challenges, we strive to provide the means for enhancing the functionality and interpretability of biomedical decision support systems concerning events.
LNCS

Assessing the Potentials of LLMs and GANs as State-of-the-Art Tabular Synthetic Data Generation Methods

Marko Miletic, and Murat Sariyar

In Privacy in Statistical Databases, Mar 2024

The abundance of tabular microdata constitutes a valuable resource for research, policymaking, and innovation. However, due to stringent privacy regulations, a significant portion of this data remains inaccessible. To address this, synthetic data generation methods have emerged as a promising solution. Here, we assess the potentials of two state-of-the-art GAN and LLM tabular synthetic data generators using different utility and risk measures and propose a robust risk estimation for individual records based on shared nearest neighbors. LLMs outperform CTGAN by generating synthetic data that more closely matches real data distributions, as evidenced by lower Wasserstein distances. LLMs also generally provide better predictive performance compared to CTGAN, with higher F1 and R² scores. Interestingly, this does not necessarily mean that LLMs better capture correlations. Our proposed risk measure, Shared Neighbor Identifiability (SNI), proves effective in accurately assessing identification risk, offering a robust tool for navigating the risk-utility trade-off. Furthermore, we identify the challenges posed by mixed feature types in distance calculation. Ultimately, the choice between LLMs and GANs depends on factors such as data complexity, computational resources, and the desired level of model interpretability, emphasizing the importance of informed decision-making in selecting the appropriate generative model for specific applications.
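
To make the Wasserstein comparison concrete, the sketch below computes the first-order Wasserstein distance between two one-dimensional samples of equal size, where it reduces to the mean absolute difference of the sorted values. This is an illustrative assumption about how a single numeric column might be compared; the abstract does not describe the exact computation used in the paper, and the function name and example values are hypothetical.

```python
def wasserstein_1d(real, synthetic):
    """First-order Wasserstein distance between two equally sized 1D samples.

    For empirical distributions with the same number of points, the optimal
    transport plan matches the i-th smallest real value with the i-th smallest
    synthetic value.
    """
    if len(real) != len(synthetic):
        raise ValueError("this sketch only handles samples of equal size")
    if not real:
        raise ValueError("samples must not be empty")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(a - b) for a, b in pairs) / len(real)


# Hypothetical numeric column (for example, an age attribute).
real_ages = [34, 51, 47, 29]
synthetic_ages = [36, 49, 45, 30]
print(wasserstein_1d(real_ages, synthetic_ages))
```

A smaller distance indicates that the synthetic marginal distribution lies closer to the real one, which is the sense in which the abstract reports lower Wasserstein distances for LLM-generated data.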
SHTI

Large Language Models for Synthetic Tabular Health Data: A Benchmark Study

Marko Miletic, and Murat Sariyar

In Digital Health and Informatics Innovations for Sustainable Health Care Systems, Mar 2024

Synthetic tabular health data plays a crucial role in healthcare research, addressing privacy regulations and the scarcity of publicly available datasets. This is essential for diagnostic and treatment advancements. Among the most promising models are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare LLMs of the Pythia LLM Scaling Suite, with model sizes ranging from 14M to 1B parameters, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest estimators for classification tasks to make predictions on the real-world data. Our findings indicate that as the number of parameters increases, LLMs outperform the reference GAN model. Even the smallest 14M parameter models perform comparably to GANs. Moreover, we observe a positive correlation between the size of the training dataset and model performance. We discuss implications, challenges, and considerations for the real-world usage of LLMs for synthetic tabular data generation.
Appl. Sci.

Challenges of Using Synthetic Data Generation Methods for Tabular Microdata

Marko Miletic, and Murat Sariyar

Applied Sciences, Jan 2024

The generation of synthetic data holds significant promise for augmenting limited datasets while avoiding privacy issues, facilitating research, and enhancing machine learning models’ robustness. Generative Adversarial Networks (GANs) stand out as promising tools, employing two neural networks—generator and discriminator—to produce synthetic data that mirrors real data distributions. This study evaluates GAN variants (CTGAN, CopulaGAN), a variational autoencoder, and copulas on diverse real datasets of different complexity encompassing numerical and categorical attributes. The results highlight CTGAN’s sensitivity to training parameters and TVAE’s robustness across datasets. Scalability challenges persist, with GANs demanding substantial computational resources. TVAE stands out for its high utility across all datasets, even for high-dimensional data, though it incurs higher privacy risks, which is indicative of the curse of dimensionality. While no single model universally excels, understanding the trade-offs and leveraging model strengths can significantly enhance synthetic data generation (SDG). Future research should focus on adaptive learning mechanisms, scalability enhancements, and standardized evaluation metrics to advance SDG methods effectively. Addressing these challenges will foster broader adoption and application of synthetic data.

2023

SHTI

What Kind of Ontologies Do We Need in the Biomedical Domain?

Marko Miletic, and Murat Sariyar

In Healthcare Transformation with Informatics and Artificial Intelligence, Jan 2023

We tackle the question as to what sort of ontologies we primarily need in the biomedical domain. For this purpose, we will first provide a simple categorization of ontologies and describe an important use case related to modeling and documenting events. Then, the impact of using upper-level ontologies as a basis to address our use case will be shown in order to derive an answer to our research question. Although formal ontologies can serve as a starting point to understand conceptualization in a domain and facilitate interesting inferences, it is even more important to account for the dynamic and changing nature of knowledge. Being unconstrained by pre-defined categories and relationships can facilitate timely enrichment of a conceptual scheme and provide links and dependency structures in an informal manner. Semantic enrichment can be achieved by other mechanisms such as tagging or the creation of synsets as, for example, provided in WordNet.
SHTI

Implementing Informative-Based Active Learning in Biomedical Record Linkage for the Splink Package in Python

Marko Miletic, and Murat Sariyar

In Healthcare Transformation with Informatics and Artificial Intelligence, Jan 2023

In biomedical record linkage, efficient determination of a threshold to decide at which level of similarity two records should be classified as belonging to the same patient is frequently still an open issue. Here, we describe how to implement an efficient active learning strategy that puts into practice a measure of usefulness of training sets for such a task. Our results show that active learning should always be considered when training data is to be produced via manual labeling. In addition, active learning gives a quick indication of how complex a problem is by looking into the label frequencies: if the most difficult entities always stem from the same class, then the classifier will probably have fewer problems in distinguishing the classes. In big data applications, these two properties are essential, as the problems of under- and overfitting are exacerbated in such contexts.

2022

SHTI

Explaining Contextualized Word Embeddings in Biomedical Research – A Qualitative Investigation

Marko Miletic, and Murat Sariyar

In Advances in Informatics, Management and Technology in Healthcare, Jan 2022

Contextualized word embeddings have proved to be highly successful quantitative representations of words that make it possible to efficiently solve various tasks such as clinical entity normalization in unstructured texts. In this paper, we investigate how the Saussurean sign theory can be used as a qualitative explainable AI method for word embeddings. Our assumption is that the main goal of XAI is to produce confidence and/or trust, which can be gained through qualitative as well as quantitative approaches. One important result is related to the fact that the differential structure of language as explained by Saussure corresponds to the possibility of adding and subtracting word embeddings. On the other hand, these mathematical structures provide insights into the inner workings of natural language.
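
The idea of adding and subtracting word embeddings can be shown with a toy sketch. The three-dimensional vectors below are hand-made, hypothetical values, not output of any trained model, and they are static rather than contextualized, which is a simplification relative to the embeddings the paper discusses. The function names are likewise assumptions.

```python
import math

# Hypothetical toy embeddings; the dimensions loosely encode (male, female, royal).
EMBEDDINGS = {
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "prince": [1.0, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative, embeddings):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dim = len(next(iter(embeddings.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, embeddings[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, embeddings[word])]
    excluded = set(positive) | set(negative)
    scores = {
        word: cosine(target, vector)
        for word, vector in embeddings.items()
        if word not in excluded
    }
    return max(scores, key=scores.get)


print(analogy(["king", "woman"], ["man"], EMBEDDINGS))  # prints "queen"
```

In this toy space, a word is characterized by its differences from other words, which is the correspondence to Saussure's differential view of language that the abstract points to.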
SHTI

The Sortal Concept in the Context of Biomedical Record Linkage

Marko Miletic, and Murat Sariyar

In Advances in Informatics, Management and Technology in Healthcare, Jan 2022

Biomedical Record Linkage is specifically designed for linking data of patients in different data repositories. An important question in this context is whether singling-out is sufficient for identifying a patient, and if not, what is in general required for identification. To provide hints for an answer, we will build on previous work on the concept of identity and extend the sortal concept, stemming from analytical philosophy and upper-level ontologies. A sortal is a concept that is associated with an identity criterion. For example, the concept "set" has the identity criterion "having the same members". Based on a description of a record linkage setting, we operationalize the sortal concept by providing a distinction between the digital representation of a person (d-sortal) and the person in the flesh (b-sortal).
SHTI

SERO – A New Mobile App for Suicide Prevention

Lea Meier, Caroline Gurtner, Stephan Nuessli, and 4 more authors

In Healthcare of the Future 2022, Jan 2022

Mobile apps have shown a positive effect on suicidal ideation and a potential impact on suicide attempts. As part of the SERO suicide prevention program, Lucerne Psychiatry, in collaboration with partner organizations, aims to reduce suicides and suicide attempts in its service area and to improve the self-management of suicidal individuals with a mobile app. The concept for such an app was developed in a trialog with health professionals, persons at risk and their relatives, and its functions were compared to six known essential app-based strategies for suicide prevention, such as the development of a safety plan, access to support networks and tracking of mood. We present the concept and architecture for the app and discuss potential added value, which may result from the intertwining of the strategies within the app, which will be available in its first version in late 2022.
SHTI

An Interoperable Resuscitation Registry for the University Hospital of Bern

Marko Miletic, Manuela Iten, Thomas Bürkle, and 1 more author

In Healthcare of the Future 2022, Jan 2022

During resuscitation, the patient is the primary focus, with the documentation of actions and outcomes being secondary. In most cases, a cardiac event leads to further treatment or hospitalization, in which complex patient pathways, independent documentation systems and information loss represent the key challenges for successful quality management. Hence, there is a need for a system that takes all these aspects into account. Market research, system analysis and requirements engineering for such a solution were performed and a prototype was created. A complete reference architecture for a web-based electronic data capture system was developed and implemented that enables healthcare professionals to enter resuscitation-relevant data uniformly and store it centrally in compliance with human research legislation. A qualitative evaluation concerning the process flows of the as-is and the to-be situation suggests that there is potential to achieve benefits in the form of improved data quality and quantity.