Clinical Medicine Journal
Article Information
Clinical Medicine Journal, Vol. 7, No. 2, Jun. 2021, Pub. Date: May 31, 2021
Simulation of Synthetic Diabetes Tabular Data Using Generative Adversarial Networks
Pages: 49-59
Authors
[01] Heng Wee Lin Eunice, Department of Statistics & Applied Probability, Faculty of Science, National University of Singapore, Singapore.
[02] Carol Anne Hargreaves, Department of Statistics & Applied Probability, Faculty of Science, National University of Singapore, Singapore.
Abstract
Generative Adversarial Networks (GANs) are a relatively new research avenue in the domain of Deep Learning and Artificial Intelligence. Over the past few years, GANs have been researched extensively because of their ability to generate realistic synthetic data. Synthetic tabular data is especially useful when the original data set cannot be shared for privacy reasons. The objective of this paper was therefore to review the effectiveness of GANs in simulating synthetic tabular diabetes data. Methodology: Prior to GAN training, we applied min-max normalization to the features. To analyze the similarity between the real data set and each synthetic data set, we conducted exploratory data analysis and then applied several statistical methods. We compared the synthesized data with the original data using data visualizations, including histograms and boxplots. We also computed confidence intervals for the means of the real data variables, compared them with the confidence intervals for the means of the synthetic data, and checked whether the mean of each variable in the synthetic data set fell within the confidence interval of the same variable in the real data set. Results: 8 of the 9 confidence intervals overlapped. For each variable, we had two probability distributions: the true distribution (from the real data) and an approximation of that distribution (from the synthetic data). To quantify the difference between the two, we computed the Kullback-Leibler (KL) divergence score. The KL scores for all 8 predictors were small and close to 0, which is ideal. One model for classifying patients as having diabetes was built using only the real data, and a second was built using the combined real and synthetic data. The model trained on the combined real and synthetic data achieved a much higher accuracy of 87.0%, compared with the 78.7% attained when using only the real data. Conclusion: We built a realistic synthetic data set using generative adversarial networks. The synthetic data set proved to be very similar to the real data set and could successfully replace the real data for research purposes. Further, we verified that the availability of more training data for diabetes classification improved the accuracy of the classifier while achieving a relatively high recall.
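As an illustration of the pre-processing step, the sketch below shows min-max normalization in Python; the file path and label column name are assumptions for illustration, not details taken from the paper.

# Minimal sketch: min-max normalization of tabular features to [0, 1]
# before GAN training. File path and label column name are hypothetical.
import pandas as pd

df = pd.read_csv("diabetes.csv")           # hypothetical path
features = df.drop(columns=["Outcome"])    # hypothetical label column

# Scale each column x to (x - min) / (max - min), so every feature lies in [0, 1].
normalized = (features - features.min()) / (features.max() - features.min())

# Generated samples can be mapped back to the original scale with the
# inverse transform: x = x_norm * (max - min) + min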
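The confidence-interval check described in the Methodology can be sketched as follows; the t-based interval is one standard construction, and the randomly generated arrays stand in for a real and a synthetic column.

# Minimal sketch: 95% confidence interval for the mean of one variable,
# computed for a real and a synthetic column and checked for overlap.
import numpy as np
from scipy import stats

def mean_ci(x, confidence=0.95):
    # t-based confidence interval for the mean of a 1-D sample.
    x = np.asarray(x, dtype=float)
    m, se = x.mean(), stats.sem(x)
    h = se * stats.t.ppf((1 + confidence) / 2, len(x) - 1)
    return m - h, m + h

rng = np.random.default_rng(0)
real_col = rng.normal(120, 30, 500)    # stand-in for a real data column
synth_col = rng.normal(122, 31, 500)   # stand-in for a synthetic column

real_lo, real_hi = mean_ci(real_col)
synth_lo, synth_hi = mean_ci(synth_col)
intervals_overlap = (real_lo <= synth_hi) and (synth_lo <= real_hi)
synth_mean_in_real_ci = real_lo <= synth_col.mean() <= real_hi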
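Similarly, the KL divergence between the real distribution P and the synthetic approximation Q of a variable, D(P || Q) = sum_i p_i log(p_i / q_i), can be estimated by binning both samples on a shared grid; the bin count and smoothing constant below are assumptions.

# Minimal sketch: estimate D(P || Q) for one variable via shared histogram bins.
import numpy as np
from scipy.stats import entropy

def kl_divergence(real, synthetic, bins=20, eps=1e-10):
    lo = min(real.min(), synthetic.min())
    hi = max(real.max(), synthetic.max())
    # Bin both samples on the same edges so the distributions are comparable.
    p, edges = np.histogram(real, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(synthetic, bins=edges, density=True)
    # A small epsilon avoids division by zero in empty bins.
    p, q = p + eps, q + eps
    # scipy's entropy(p, q) normalizes both and returns sum(p * log(p / q)).
    return entropy(p, q)

# A score close to 0 indicates the synthetic distribution is close to the real one.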
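Finally, the classifier comparison reported in the Results can be sketched as below; the abstract does not name the model used, so the random forest and the generated stand-in data are assumptions.

# Minimal sketch: train the same classifier on real data only and on
# real + synthetic data, then score both on the same held-out real test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X_real, y_real = make_classification(n_samples=500, random_state=0)    # stand-in for the real table
X_synth, y_synth = make_classification(n_samples=500, random_state=1)  # stand-in for GAN output

X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.2, random_state=0)

for name, (X, y) in {
    "real only": (X_train, y_train),
    "real + synthetic": (np.vstack([X_train, X_synth]),
                         np.concatenate([y_train, y_synth])),
}.items():
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    pred = clf.predict(X_test)
    print(name, accuracy_score(y_test, pred), recall_score(y_test, pred))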
Keywords
Generative Adversarial Networks (GANs), Deep Learning, Artificial Intelligence, Tabular Data, Synthetic Data, Generator, Discriminator, Encoder, Decoder