In recent years, the use of machine learning (ML) techniques in medicine has grown significantly. Thanks to the high performance of deep learning models, machines can detect and classify diseases with high accuracy, sometimes even surpassing specialists. To do this, a model must be trained on data such as medical images, which means accessing patients’ personal information. Using this personal data raises privacy concerns, and one of the most significant barriers to research and development of Learning Health Systems (LHS) is the lack of access to patient data in electronic health records (EHRs).

Thanks to advances in synthetic data generation, synthetic patient data has recently been adopted as an alternative to real data for testing new processes that involve EHR data. It is now possible to generate synthetic patient data that can be used to develop ML-enabled LHS and shared between research communities without restrictions. Synthetic datasets have already emerged for cardiovascular disease and even cancer. In this context, a research team from California proposed a new, reproducible process that uses synthetic patients to build ML-based LHS for risk prediction.

Specifically, the authors of the paper performed an experimental simulation study. In this study, an LHS for risk prediction is built around a baseline XGBoost model for a target disease, such as lung cancer or stroke, trained on existing electronic health record (EHR) data. The study follows two steps. In the first step, a novel ML-enabled LHS process was proposed and used to construct an LHS for lung cancer risk prediction in synthetic patients. In the second step, a different target disease, stroke, was used to test whether the new process also produces accurate risk prediction for other target diseases. The authors proposed a high-level, data-driven, ML-enabled LHS design for risk prediction. The ML model is first built from the initial patient data in the EHR. The LHS learning cycles then continuously use live patient data to improve the ML model and quickly release a new model that doctors can use to make risk predictions.
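The learning cycle described above can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the `LearningHealthSystem` class, the `Record` layout, and the toy `train_prevalence_baseline` trainer are all hypothetical names introduced here. The toy trainer simply stands in for a real model such as XGBoost so that the example runs with the standard library alone.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical record layout: (feature vector, label), label 1 = disease present.
Record = Tuple[Sequence[float], int]
# A model maps a feature vector to a risk score between 0 and 1.
Model = Callable[[Sequence[float]], float]


@dataclass
class LearningHealthSystem:
    """Hypothetical container for the cumulative data and the current model."""
    train_fn: Callable[[List[Record]], Model]
    records: List[Record] = field(default_factory=list)
    model: Optional[Model] = None
    version: int = 0

    def learning_cycle(self, new_records: List[Record]) -> Model:
        """Add new patient records, retrain on all data, release a new model."""
        self.records.extend(new_records)
        self.model = self.train_fn(self.records)
        self.version += 1
        return self.model


def train_prevalence_baseline(records: List[Record]) -> Model:
    """Toy stand-in trainer: predicts the observed positive rate for everyone."""
    rate = sum(label for _, label in records) / len(records)
    return lambda features: rate


# Made-up records, not data from the paper.
lhs = LearningHealthSystem(train_fn=train_prevalence_baseline)
lhs.learning_cycle([([63.0, 1.0], 1), ([41.0, 0.0], 0)])  # initial EHR data
lhs.learning_cycle([([58.0, 1.0], 0), ([70.0, 1.0], 0)])  # newly arrived patients
print(lhs.version, len(lhs.records), lhs.model([55.0, 1.0]))
```

Keeping the trainer as a pluggable `train_fn` is a design choice of this sketch: a function that fits an XGBoost classifier on the accumulated records could be passed in instead without changing the cycle logic.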

The ML-enabled LHS was initialized using a dataset of 30,000 synthetic Synthea patients, and an XGBoost model was trained to predict lung cancer risk. Four additional datasets of 30,000 patients each were then generated. These were added sequentially to the initial dataset to simulate the arrival of new patients, resulting in datasets of 60,000, 90,000, 120,000, and 150,000 patients. For each dataset, a new XGBoost model was built. The results show that performance improves as the data size increases, reaching 0.936 recall and 0.962 AUC on the 150,000-patient dataset. The effectiveness of the new ML-enabled LHS process was then verified by applying XGBoost models to predict stroke risk in the same Synthea patient populations.
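To make the dataset sizes and the two reported metrics concrete, here is a small standard-library Python sketch. The batch size and number of batches come from the article; the `recall` and `roc_auc` helpers and the six-patient example are illustrative assumptions, not the authors' code or data. In practice, library routines such as scikit-learn's `recall_score` and `roc_auc_score` would normally be used instead.

```python
from itertools import accumulate

BATCH_SIZE = 30_000  # patients per generated Synthea dataset (from the article)
NUM_BATCHES = 5      # one initial dataset plus four additions

# Cumulative dataset size after each batch is added.
cumulative_sizes = list(accumulate([BATCH_SIZE] * NUM_BATCHES))


def recall(y_true, y_pred):
    """Share of true positives that were predicted positive: TP / (TP + FN).

    Assumes y_true contains at least one positive label.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. Assumes at least one positive and one negative label.
    This pairwise version is quadratic and only meant for small examples.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Tiny made-up example, not data from the paper.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print(cumulative_sizes)
print(round(recall(y_true, y_pred), 3), round(roc_auc(y_true, scores), 3))
```

Recall depends on the chosen decision threshold, while AUC is computed from the raw scores and does not, which is why the two are often reported together.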

This paper introduces an ML-enabled LHS process based on synthetic medical data. The study demonstrated the effectiveness of this new LHS approach, which can be applied to predicting risk for different diseases from EHR data. The proposed model continues to learn from newly generated patient data, improving its performance until recall and precision exceed 95%. Finally, the authors state that because real data differs from synthetic data, ML models built on real data can be further optimized by tuning hyperparameters.

This article is written as a research summary by Marktechpost Staff based on the research paper 'Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data'. All credit for this research goes to the researchers on this project.


Mahmoud is a PhD student in machine learning. He also holds a
bachelor’s degree in physical sciences and a master’s degree in
telecommunications and network systems. His current areas of
research are computer vision, stock market forecasting and deep
learning. He has produced several scientific papers on
re-identification and on the strength and stability of deep
networks.


Latest Artificial Intelligence (AI) Research Uses Synthetic Patient Data To Simulate A Machine Learning Enabled Learning Health System (LHS)