Assessing the Potentials of LLMs and GANs as State-of-the-Art Tabular Synthetic Data Generation Methods
The abundance of tabular microdata constitutes a valuable resource for research, policymaking, and innovation. However, due to stringent privacy regulations, a significant portion of this data remains inaccessible. Synthetic data generation methods have emerged as a promising way to address this. Here, we assess the potential of two state-of-the-art tabular synthetic data generators, one GAN-based and one LLM-based, using different utility and risk measures, and we propose a robust risk estimate for individual records based on shared nearest neighbors. LLMs outperform CTGAN by generating synthetic data that more closely matches real data distributions, as evidenced by lower Wasserstein distances. LLMs also generally provide better predictive performance than CTGAN, with higher F_1 and R^2 scores. Interestingly, better predictive performance does not necessarily mean that LLMs capture correlations better. Our proposed risk measure, Shared Neighbor Identifiability (SNI), proves effective in assessing identification risk, offering a robust tool for navigating the risk-utility trade-off. Furthermore, we identify the challenges that mixed feature types pose for distance calculations. Ultimately, the choice between LLMs and GANs depends on factors such as data complexity, computational resources, and the desired level of model interpretability, which underlines the importance of informed decision-making when selecting a generative model for a specific application.
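As an illustration of the distributional comparison mentioned above, the following is a minimal sketch of how a per-column Wasserstein distance between real and synthetic data could be computed. It is not the evaluation code used in this work: the function names `wasserstein_1d` and `column_distances`, the dictionary-of-columns data layout, and the toy values are all hypothetical, and the sketch assumes numeric columns only, with no scaling or normalization.

```python
from bisect import bisect_right


def wasserstein_1d(real, synthetic):
    """Empirical 1-Wasserstein distance between two 1-D numeric samples.

    Computed as the area between the two empirical CDFs.
    """
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    # Both CDFs are step functions that only change at observed values.
    grid = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def column_distances(real_cols, synth_cols):
    """Per-column distances for columns present in both tables (hypothetical helper)."""
    shared = real_cols.keys() & synth_cols.keys()
    return {name: wasserstein_1d(real_cols[name], synth_cols[name])
            for name in sorted(shared)}


# Toy values, invented purely for illustration (not results from this study).
real = {"age": [23, 35, 41, 58]}
synth_close = {"age": [24, 34, 43, 57]}
synth_far = {"age": [60, 62, 65, 70]}

close = column_distances(real, synth_close)["age"]
far = column_distances(real, synth_far)["age"]
print(f"close generator: {close:.2f}, far generator: {far:.2f}")
```

A lower value means the synthetic column's marginal distribution lies closer to the real one. Because the distance is computed column by column, it says nothing about relationships between columns, which is consistent with the observation that a closer distributional match does not by itself imply better-captured correlations.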