publications
A non-exhaustive list of publications.
2024
- SHTI: Use and Evaluation of GANs for Synthetic Data Generation in Pharmacogenetics. Dominic Aeschbacher, Jessica Meisner, Marko Miletic, and 1 more author. Studies in Health Technology and Informatics, Nov 2024
Pharmacogenetics (PGx) explores the influence of genetic variability on drug efficacy and tolerability. Synthetic Data Generation (SDG) has emerged as a promising alternative to the labor-intensive process of collecting real-world PGx data, which is required for high-quality prediction models. This study investigates the performance of two Generative Adversarial Network (GAN) models, CTGAN and CTAB-GAN+, in generating synthetic PGx data. The benchmarking is based on utility metrics (Hellinger distance and Random Forest accuracy) and ϵ-identifiability. Results demonstrate that synthetic data generated by CTAB-GAN+ can surpass the original dataset in terms of utility. For instance, CTAB-GAN+ achieves higher Random Forest accuracy compared to the original data, indicating better predictive performance. These improvements suggest that synthetic data not only capture the essential patterns of the original data but also enhance model generalization and prediction capabilities, providing a more robust training ground for machine learning models. Consequently, SDG offers a promising solution to address data scarcity and imbalance in pharmacogenetic research.
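The Hellinger distance named in this abstract as a utility metric can be illustrated for a single categorical column. The following is a minimal sketch, not the authors' evaluation code: the function name `hellinger_distance` and the toy genotype labels are hypothetical, and the distance is computed on the empirical category frequencies of each sample.

```python
import math
from collections import Counter


def hellinger_distance(real, synthetic):
    """Hellinger distance between the empirical distributions of two
    categorical samples. Returns 0.0 for identical distributions and
    1.0 for distributions with no category in common."""
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    n_real, n_synth = len(real), len(synthetic)
    total = 0.0
    # Iterate over every category seen in either sample.
    for category in set(real_counts) | set(synth_counts):
        p = real_counts[category] / n_real
        q = synth_counts[category] / n_synth
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2)


# Hypothetical toy column: genotype labels in a real and a synthetic dataset.
real_column = ["AA", "AG", "AG", "GG"]
synthetic_column = ["AA", "AG", "GG", "GG"]
distance = hellinger_distance(real_column, synthetic_column)
```

Values closer to 0 indicate that the synthetic column reproduces the category frequencies of the real column more closely.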
- SHTI: What Kind of Transformer Models to Use for the ICD-10 Codes Classification Task. Mariem Mansour, Fatma Yilmaz, Marko Miletic, and 1 more author. Studies in Health Technology and Informatics, Aug 2024
Coding according to the International Classification of Diseases (ICD)-10 and its clinical modifications (CM) is inherently complex and expensive. Natural Language Processing (NLP) assists by simplifying the analysis of unstructured data from electronic health records, thereby facilitating diagnosis coding. This study investigates the suitability of transformer models for ICD-10 classification, considering both encoder and encoder-decoder architectures. The analysis is performed on clinical discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset, which contains an extensive collection of electronic health records. Pre-trained models such as BioBERT, ClinicalBERT, ClinicalLongformer, and ClinicalBigBird are adapted for the coding task, incorporating specific preprocessing techniques to enhance performance. The findings indicate that increasing context length improves accuracy, and that the difference in accuracy between encoder and encoder-decoder models is negligible.
- SHTI: Modelling Events in Biomedical Decision Support Systems Using Ontologies. Marko Miletic and Murat Sariyar. Studies in Health Technology and Informatics, Aug 2024
Biomedical decision support systems play a crucial role in modern healthcare by assisting clinicians in making informed decisions. Events, such as physiological changes or drug reactions, are integral components of these systems, influencing patient outcomes and treatment strategies. However, effectively modeling events within these systems presents significant challenges due to the complexity and dynamic nature of medical data. In particular, the distinction between events and processes, as well as the nature of events, is often unclear. This paper explores approaches to modeling events in biomedical decision support systems, considering factors such as ontology-based representation. By addressing these challenges, we strive to provide the means for enhancing the functionality and interpretability of biomedical decision support systems concerning events.
- JMIR AI: Effect of Correlation Structures on Synthesizing Medical Data using GANs and Large Language Models: Analytical Study (Preprint). Marko Miletic and Murat Sariyar. Aug 2024
BACKGROUND: Recent advancements in Generative Adversarial Networks (GANs) and sophisticated language models have significantly impacted the synthesis and augmentation of medical data. These technologies facilitate the creation of high-quality, realistic datasets essential for enhancing machine learning (ML) applications in healthcare. GANs, through their adversarial framework, and Large Language Models (LLMs), with their advanced Natural Language Processing (NLP) capabilities, offer innovative solutions for generating synthetic data that mirrors real-world medical information. This is particularly valuable in scenarios constrained by data privacy and availability. However, challenges persist in accurately capturing complex associations within medical datasets. Misrepresentation of these can lead to synthetic data that poorly reflects the real-world data variability and relationships, impacting model performance in clinical applications.
OBJECTIVE: This study aims to evaluate the effectiveness of Synthetic Data Generation (SDG) methods in replicating the correlation structures of real medical data and assess their performance in downstream tasks using Random Forests (RF). We compare two SDG approaches, CTGAN and the Tabula-Framework, with a focus on their ability to maintain accurate data correlations and their implications for model accuracy and variable importance.
METHODS: We assess synthetic data generation methods using real-world and simulated datasets. Simulated data involve ten Gaussian variables with different correlation structures, generated via Cholesky decomposition to create binary target variables. Real-world datasets include Body Performance (BP) with 13,393 samples for fitness classification, Wisconsin Breast Cancer (BC) with 569 samples for tumor diagnosis, and Diabetes (DB) with 768 samples for diabetes prediction. Data quality is evaluated through the Euclidean Distance (L² norm) between original and synthetic correlation matrices and through downstream classification tasks using RF and computing F₁ scores. Variable importance (VIMP) measures, i.e., Gini impurity and permutation-based methods, are employed for assessing the mechanism behind the RF results. For each model and epoch combination, 100 samples are drawn and outlier analysis is conducted to ensure robust performance evaluation.
RESULTS: In smaller datasets (samples = 1000), synthetic data utility remains stable under both high and moderate correlations, with moderate correlations occasionally enhancing utility. However, as correlation complexity increases, particularly with stronger correlations across multiple features, models struggle, reflected by higher L² values for the correlation matrix distance. CTGAN improves with more training epochs but requires significant tuning to handle complex patterns, while LLMs show promise with larger datasets despite their computational demands. Real-world data mirrors these findings, with LLMs outperforming in scenarios with intricate dependency structures. VIMP score analysis underscores the importance of aligning model complexity with data correlation structures.
CONCLUSIONS: Our findings emphasize that correlation complexity, not strength, is the key challenge in synthetic data generation. While CTGAN and LLMs show varying success based on dataset size and complexity, careful tuning and model selection are essential. Further research should focus on optimizing training protocols, exploring simpler neural network architectures, and expanding simulations to better handle nonlinear and high-order interactions in complex datasets.
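Two steps from the methods of this abstract, drawing correlated Gaussian variables via Cholesky decomposition and measuring the Euclidean distance between correlation matrices, can be sketched in plain Python. This is an illustration under stated assumptions, not the study's code: all function names and the 3×3 target matrix are hypothetical, the distance is taken as the Frobenius norm of the matrix difference (one reading of "L² norm"), and the binary-target construction is not reproduced.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular factor L with L * L^T == matrix (matrix must be
    symmetric positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_correlated(corr, n_samples, rng):
    """Draw rows x = L z, where z is standard normal, so that the
    population correlation of x equals corr."""
    lower = cholesky(corr)
    dim = len(corr)
    rows = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        rows.append([sum(lower[i][k] * z[k] for k in range(i + 1))
                     for i in range(dim)])
    return rows


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of rows."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) for j in range(dim)]
           for i in range(dim)]
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])
             for j in range(dim)] for i in range(dim)]


def correlation_distance(a, b):
    """Euclidean (Frobenius) distance between two correlation matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


# Hypothetical target correlation structure for three variables.
target = [[1.0, 0.8, 0.2],
          [0.8, 1.0, 0.5],
          [0.2, 0.5, 1.0]]
rng = random.Random(0)  # fixed seed for reproducibility
simulated = sample_correlated(target, 2000, rng)
distance = correlation_distance(target, correlation_matrix(simulated))
```

In an evaluation along the lines described above, the second argument to `correlation_distance` would be the correlation matrix of the synthetic data, with a smaller distance indicating that the generator preserved the correlation structure more faithfully.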
- LNCS: Assessing the Potentials of LLMs and GANs as State-of-the-Art Tabular Synthetic Data Generation Methods. Marko Miletic and Murat Sariyar. Sep 2024
The abundance of tabular microdata constitutes a valuable resource for research, policymaking, and innovation. However, due to stringent privacy regulations, a significant portion of this data remains inaccessible. To address this, synthetic data generation methods have emerged as a promising solution. Here, we assess the potential of two state-of-the-art tabular synthetic data generators, one GAN-based and one LLM-based, using different utility and risk measures and propose a robust risk estimation for individual records based on shared nearest neighbors. LLMs outperform CTGAN by generating synthetic data that more closely matches real data distributions, as evidenced by lower Wasserstein distances. LLMs also generally provide better predictive performance compared to CTGAN, with higher F₁ and R² scores. Interestingly, this does not necessarily mean that LLMs better capture correlations. Our proposed risk measure, Shared Neighbor Identifiability (SNI), proves effective in accurately assessing identification risk, offering a robust tool for navigating the risk-utility trade-off. Furthermore, we identify the challenges posed by mixed feature types in distance calculation. Ultimately, the choice between LLMs and GANs depends on factors such as data complexity, computational resources, and the desired level of model interpretability, emphasizing the importance of informed decision-making in selecting the appropriate generative model for specific applications.
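The Wasserstein distance used in this abstract to compare real and synthetic distributions has a simple form for a single numeric column. The sketch below is illustrative only and is not the paper's implementation: the function name `wasserstein_1d` and the toy values are hypothetical, and it assumes both samples have the same size, in which case the 1-Wasserstein distance between the empirical distributions is the mean absolute difference of the sorted values.

```python
def wasserstein_1d(real, synthetic):
    """1-Wasserstein distance between two equally sized one-dimensional
    samples: the mean absolute difference between their sorted values."""
    if len(real) != len(synthetic):
        raise ValueError("this sketch assumes equally sized samples")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(a - b) for a, b in pairs) / len(real)


# Hypothetical toy column: a numeric feature in real and synthetic data.
real_values = [1.0, 2.0, 3.0]
synthetic_values = [5.0, 3.0, 4.0]
distance = wasserstein_1d(real_values, synthetic_values)
```

A lower value means the synthetic column lies closer to the real column in distribution; samples of different sizes would require comparing the empirical quantile functions instead.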
- SHTI: Large Language Models for Synthetic Tabular Health Data: A Benchmark Study. Marko Miletic and Murat Sariyar. Studies in Health Technology and Informatics, Aug 2024
Synthetic tabular health data plays a crucial role in healthcare research, addressing privacy regulations and the scarcity of publicly available datasets. This is essential for diagnostic and treatment advancements. Among the most promising models are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare LLMs of the Pythia LLM Scaling Suite, with model sizes ranging from 14M to 1B parameters, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest estimators for classification tasks, which then make predictions on the real-world data. Our findings indicate that as the number of parameters increases, LLMs outperform the reference GAN model. Even the smallest 14M-parameter models perform comparably to GANs. Moreover, we observe a positive correlation between the size of the training dataset and model performance. We discuss implications, challenges, and considerations for the real-world usage of LLMs for synthetic tabular data generation.
- Appl. Sci.: Challenges of Using Synthetic Data Generation Methods for Tabular Microdata. Marko Miletic and Murat Sariyar. Applied Sciences, Jul 2024
The generation of synthetic data holds significant promise for augmenting limited datasets while avoiding privacy issues, facilitating research, and enhancing machine learning models’ robustness. Generative Adversarial Networks (GANs) stand out as promising tools, employing two neural networks, a generator and a discriminator, to produce synthetic data that mirrors real data distributions. This study evaluates GAN variants (CTGAN, CopulaGAN), a variational autoencoder (TVAE), and copulas on diverse real datasets of different complexity encompassing numerical and categorical attributes. The results highlight CTGAN’s sensitivity to training parameters and TVAE’s robustness across datasets. Scalability challenges persist, with GANs demanding substantial computational resources. TVAE stands out for its high utility across all datasets, even for high-dimensional data, though it incurs higher privacy risks, which is indicative of the curse of dimensionality. While no single model universally excels, understanding the trade-offs and leveraging model strengths can significantly enhance synthetic data generation (SDG). Future research should focus on adaptive learning mechanisms, scalability enhancements, and standardized evaluation metrics to advance SDG methods effectively. Addressing these challenges will foster broader adoption and application of synthetic data.
2023
- SHTI: What Kind of Ontologies Do We Need in the Biomedical Domain? Marko Miletic and Murat Sariyar. Studies in Health Technology and Informatics, Jun 2023
We tackle the question as to what sort of ontologies we primarily need in the biomedical domain. For this purpose, we will first provide a simple categorization of ontologies and describe an important use case related to modeling and documenting events. Then, the impact of using upper-level ontologies as a basis to address our use case will be shown in order to derive an answer to our research question. Although formal ontologies can serve as a starting point to understand conceptualization in a domain and facilitate interesting inferences, it is even more important to account for the dynamic and changing nature of knowledge. Being unconstrained by pre-defined categories and relationships can facilitate timely enrichment of a conceptual scheme and provide links and dependency structures in an informal manner. Semantic enrichment can be achieved by other mechanisms such as tagging or the creation of synsets as, for example, provided in WordNet.
- SHTI: Implementing Informative-Based Active Learning in Biomedical Record Linkage for the Splink Package in Python. Marko Miletic and Murat Sariyar. Studies in Health Technology and Informatics, Jun 2023
In biomedical record linkage, efficient determination of a threshold to decide at which level of similarity two records should be classified as belonging to the same patient is frequently still an open issue. Here, we describe how to implement an efficient active learning strategy that puts into practice a measure of usefulness of training sets for such a task. Our results show that active learning should always be considered when training data is to be produced via manual labeling. In addition, active learning gives a quick indication of how complex a problem is by looking into the label frequencies: if the most difficult entities always stem from the same class, then the classifier will probably have fewer problems in distinguishing the classes. In big data applications, these two properties are essential, as the problems of under- and overfitting are exacerbated in such contexts.
2022
- SHTI: Explaining Contextualized Word Embeddings in Biomedical Research – A Qualitative Investigation. Marko Miletic and Murat Sariyar. Studies in Health Technology and Informatics, Jun 2022
Contextualized word embeddings have proved to be highly successful quantitative representations of words that allow various tasks, such as clinical entity normalization in unstructured texts, to be solved efficiently. In this paper, we investigate how the Saussurean sign theory can be used as a qualitative explainable AI (XAI) method for word embeddings. Our assumption is that the main goal of XAI is to produce confidence and/or trust, which can be gained through qualitative as well as quantitative approaches. One important result is related to the fact that the differential structure of language as explained by Saussure corresponds to the possibility of adding and subtracting word embeddings. On the other hand, these mathematical structures provide insights into the inner workings of natural language.
- SHTI: The Sortal Concept in the Context of Biomedical Record Linkage. Marko Miletic and Murat Sariyar. Studies in Health Technology and Informatics, Jun 2022
Biomedical Record Linkage is specifically designed for linking data of patients in different data repositories. An important question in this context is whether singling-out is sufficient for identifying a patient, and if not, what is in general required for identification. To provide hints for an answer, we build on previous work on the concept of identity and extend the sortal concept, stemming from analytical philosophy and upper-level ontologies. A sortal is a concept that is associated with an identity criterion. For example, the concept "set" has the identity criterion "having the same members". Based on a description of a record linkage setting, we operationalize the sortal concept by providing a distinction between the digital representation of a person (d-sortal) and the person in the flesh (b-sortal).
- SHTI: SERO – A New Mobile App for Suicide Prevention. Lea Meier, Caroline Gurtner, Stephan Nüssli, and 3 more authors. Studies in Health Technology and Informatics, May 2022
Mobile apps show a positive effect on suicidal ideation and a potential impact on suicide attempts. As part of the SERO suicide prevention program, Lucerne Psychiatry, in collaboration with partner organizations, aims to reduce suicides and suicide attempts in its service area and to improve the self-management of suicidal individuals with a mobile app. The concept for such an app was developed in a trialog with health professionals, persons at risk, and their relatives, and its functions were compared to six known essential app-based strategies for suicide prevention, such as the development of a safety plan, access to support networks, and tracking of mood. We present the concept and architecture for the app and discuss the potential added value that may result from the intertwining of the strategies within the app, which will be available in its first version in late 2022.
- SHTI: An Interoperable Resuscitation Registry for the University Hospital of Bern. Marko Miletic, Manuela Iten, Thomas Bürkle, and 1 more author. Studies in Health Technology and Informatics, May 2022
During resuscitation, the patient is the primary focus, with the documentation of actions and outcomes being secondary. In most cases, a cardiac event leads to further treatment or hospitalization, in which complex patient pathways, independent documentation systems, and information loss represent the key challenges for successful quality management. Hence, there is a need for a system that takes all these aspects into account. Market research, system analysis, and requirements engineering for such a solution were performed and a prototype was created. A complete reference architecture for a web-based electronic data capture system was developed and implemented that enables healthcare professionals to enter resuscitation-relevant data uniformly and store it centrally in compliance with human research legislation. A qualitative evaluation concerning the process flows of the as-is and the to-be situation suggests that there is potential to achieve benefits in the form of improved data quality and quantity.