Optimization of Text Emotion Classification through the Combination of ITC Smoothed and Linear Models


Melki Garonga(1*); Mc Rore Rangga Punne(2); Irene Devi Damayanti(3)

(1) Universitas Kristen Indonesia Toraja
(2) Universitas Kristen Indonesia Toraja
(3) Universitas Kristen Indonesia Toraja
(*) Corresponding Author

  

Abstract


This research investigates four feature extraction techniques: TF-IDF, Smoothed TF-IDF, Inverse Term Counting (ITC), and ITC Smoothed. It aims to determine how effectively they enhance text-based emotion classification on imbalanced datasets and to identify the most effective pairing of feature extraction method and classification algorithm. Its key contributions are a methodical side-by-side comparison of these lesser-examined TF-IDF variants and an empirical demonstration that linear models are considerably resilient to class imbalance. The analysis used an Indonesian Twitter dataset of 4,132 tweets, categorized into six unequally distributed emotion classes: anger, fear, joy, love, sadness, and neutral. The four feature extraction approaches were assessed with five classifiers: Naive Bayes, Logistic Regression, SVM, Random Forest, and KNN. Performance was measured through accuracy, precision, recall, and F1-score. The findings indicate that the linear classifiers, Logistic Regression and SVM, delivered the best performance, with accuracy between 93.71% and 94.44%. These models consistently outperformed both probabilistic and distance-based algorithms regardless of the feature extraction method applied. The impact of smoothing proved context-dependent. While applying smoothing to both TF-IDF and ITC boosted the performance of linear models over their unsmoothed counterparts, it paradoxically reduced accuracy for the standard ITC method. This outcome questions the widely held belief that smoothing universally enhances model performance. The combination of Logistic Regression with the unsmoothed ITC method yielded the peak accuracy of 94.44%. The study offers actionable guidance, suggesting the pairing of Logistic Regression with ITC as a highly effective strategy for text-based emotion classification. It also contributes theoretically by underscoring the particular aptitude of linear models for managing high-dimensional text data in imbalanced class contexts.
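
To make the distinction between TF-IDF and Smoothed TF-IDF concrete, the following is a minimal Python sketch. It assumes the commonly used formulations idf = ln(N / df) + 1 and smoothed idf = ln((1 + N) / (1 + df)) + 1; the abstract does not state the exact formulas used in the study, so these may differ from the authors' definitions. ITC and ITC Smoothed are not sketched because their definitions are not given here. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs, smooth):
    """Return {term: idf} for a list of tokenized documents.

    Assumed formulas (not taken from the paper):
      smooth=False: ln(N / df) + 1
      smooth=True:  ln((1 + N) / (1 + df)) + 1
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    if smooth:
        return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return {t: math.log(n_docs / d) + 1 for t, d in df.items()}


def tfidf_vector(doc, idf):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    raw = {t: count * idf[t] for t, count in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw


# Toy corpus invented for illustration; it is not the study's dataset.
docs = [
    "aku senang sekali hari ini".split(),
    "aku takut sekali".split(),
    "aku marah hari ini".split(),
]

plain = idf_weights(docs, smooth=False)
smoothed = idf_weights(docs, smooth=True)
vec = tfidf_vector(docs[0], smoothed)

for term in ("aku", "senang"):
    print(term, round(plain[term], 4), round(smoothed[term], 4))
```

Under these assumed formulas, smoothing leaves a term that occurs in every document unchanged and lowers the weight of rare terms, which is one way smoothing could shift classifier behavior depending on the weighting scheme.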

Keywords


Emotion Classification; Feature Extraction; Natural Language Processing; Text Mining; TF-IDF Variants.

  
  


Digital Object Identifier

doi  https://doi.org/10.33096/ilkom.v18i1.2954.1-16
  



Copyright (c) 2026 Melki Garonga, Mc Rore Rangga Punne, Irene Devi Damayanti

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.