Churn Prediction in Credit Customers Using Random Forest and XGBoost Methods

  • Bagas Akbar Maulana, Universitas Semarang
  • Nurtriana Hidayati, Universitas Semarang

Keywords: Churn Prediction, Credit Card, XGBoost, Random Forest, SMOTE

Abstract

Introduction: Customer churn in the credit card industry presents a significant challenge for financial institutions, potentially resulting in substantial revenue loss. This study aims to develop predictive models for identifying credit card customers likely to churn, thereby enabling proactive retention strategies.

Methods: A dataset of 5,000 credit card customer records was used, comprising 800 churn and 4,200 non-churn instances; this class imbalance was addressed using the Synthetic Minority Over-sampling Technique (SMOTE). Two machine learning models, Random Forest and XGBoost, were implemented. Data pre-processing involved feature scaling, categorical encoding, and class balancing. Key predictive features included age, marital status, education level, transaction count, and total transaction value. Both models underwent hyperparameter tuning to optimize performance.

Results: The Random Forest model achieved a baseline accuracy of 95%, improving to 96% after tuning, with an F1-score of 88% for the churn class. XGBoost demonstrated a consistent accuracy of 96% before and after tuning, but proved stronger at minority class detection, with an F1-score of 87%, precision of 86%, and recall of 89% for the churn class. Analysis revealed that customers aged 40–55 were more likely to churn, influenced by behavioral and demographic factors.

Conclusions: Both the Random Forest and XGBoost models showed excellent performance in churn prediction. However, XGBoost was more effective at identifying minority class instances, making it the preferred model for credit customer churn prediction. These findings support the integration of predictive analytics into customer retention strategies within the banking sector.
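The pipeline the abstract describes (oversampling the minority churn class with SMOTE, then training an ensemble classifier on the balanced data) can be sketched as follows. This is a minimal illustration, not the authors' code: the hand-rolled `smote` helper stands in for the imbalanced-learn library, scikit-learn's `RandomForestClassifier` stands in for the paper's tuned models, and the synthetic data only mimics the 800/4,200 class split; all parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote(X, y, minority=1, k=5, seed=0):
    """Minimal SMOTE: synthesize minority points by interpolating
    between each minority sample and one of its k nearest minority
    neighbours, until both classes are the same size."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = (y != minority).sum() - len(X_min)   # how many to synthesize
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]       # skip self at position 0
        gap = rng.random()
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    X_bal = np.vstack([X] + samples)
    y_bal = np.concatenate([y, np.full(n_new, minority)])
    return X_bal, y_bal

# Imbalanced synthetic data mimicking the 800/4,200 churn split
X, y = make_classification(n_samples=5000, weights=[0.84], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Balance only the training split, then fit and score on untouched test data
X_bal, y_bal = smote(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_bal, y_bal)
print(round(f1_score(y_te, clf.predict(X_te)), 2))
```

Note that SMOTE is applied to the training split only, so the test set reflects the original class distribution; swapping in a tuned XGBoost classifier would follow the same balance-then-fit pattern.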



Published
2025-03-31
How to Cite
Maulana, B. A., & Hidayati, N. (2025). Churn Prediction in Credit Customers Using Random Forest and XGBoost Methods. Indonesian Journal of Data and Science, 6(1), 81-89. https://doi.org/10.56705/ijodas.v6i1.215