SMOTE-Based Comparative Analysis of Machine Learning Models for Stroke Risk Prediction Using Imbalanced Healthcare Data


Ratu Mutiara Siregar(1); Budy Satria(2*); Sandi Fadilah(3); Liga Mayola(4); Silky Safira(5)

(1) Institut Teknologi Sawit Indonesia
(2) Universitas Andalas
(3) Universiti Muhammadiyah Malaysia
(4) Universitas Putra Indonesia YPTK
(5) Universitas Putra Indonesia YPTK
(*) Corresponding Author

  

Abstract


Stroke remains one of the leading causes of mortality and long-term disability worldwide, and it imposes a significant burden in Indonesia. Early detection is crucial because up to 90% of stroke cases are potentially preventable through timely intervention. However, predictive modeling of stroke risk is often hampered by imbalanced datasets: non-stroke cases far outnumber stroke cases, which can bias classification models toward the majority class. This study presents a systematic comparative evaluation of six machine learning algorithms for stroke risk prediction under imbalanced data conditions: Logistic Regression, Decision Tree, Random Forest, Naïve Bayes, Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). The data comprise 5,110 patient records with 11 health-related features, drawn from a publicly available healthcare dataset. Preprocessing included anomaly removal, categorical encoding, feature scaling, and class balancing with the Synthetic Minority Oversampling Technique (SMOTE). Models were evaluated with 5-fold cross-validation using accuracy, precision, recall, and F1-score. The experimental results show that the ensemble-based models outperformed the single classifiers. Random Forest achieved the highest mean accuracy, 97.12% (±0.42), with an F1-score of 0.96, followed closely by XGBoost at 96.85% (±0.51). Both models also showed superior recall, indicating improved detection of the minority class. The novelty of this study lies in the systematic evaluation of multiple machine learning models using SMOTE-based balancing and cross-validation on publicly available healthcare data, which provides robust comparative insights for imbalanced medical classification problems.
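The class-balancing step named in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority-class neighbours. This is not the authors' implementation. The function name smote_sketch, the toy two-feature data, and the choice of k are assumptions made for illustration only; in practice a maintained implementation such as the SMOTE class in the imbalanced-learn library would normally be used.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between neighbours.

    Hypothetical helper for illustration; it covers only numeric features.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        # Place the new point at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (assumed): 20 majority points and 4 minority points, two features each.
majority = [(float(x), float(y)) for x in range(5) for y in range(4)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5), (12.0, 11.0)]

# Generate enough synthetic points to match the majority class size.
new_points = smote_sketch(minority, n_synthetic=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

When oversampling is combined with k-fold cross-validation, it is generally applied to the training portion of each fold only, so that no synthetic point derived from a held-out sample influences the model being tested.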


Keywords


Stroke prediction, Machine Learning, SMOTE, Imbalanced Data, Random Forest, XGBoost

  
  


Digital Object Identifier

https://doi.org/10.33096/ilkom.v18i1.3161.180-194
  




Copyright (c) 2026 Budy Satria, Ratu Mutiara Siregar, Liga Mayola, Silky Safira

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.