Revised 15 May 2023
Accepted 22 May 2023
Available Online 6 June 2023
- DOI
- https://doi.org/10.55060/j.nlpre.230606.001
- Keywords
- Natural Language Processing
Text pre-processing
Swahili language
Stop words
Slang
Typos
Machine Learning
- Abstract
Data pre-processing is an important step in machine learning text classification because it improves data quality and, in turn, the performance of trained algorithms. We experimentally compare five pre-processing techniques on a Swahili short message service (SMS) dataset for topic classification: punctuation removal, lowercasing, typo replacement, slang replacement and stop-word removal. We apply several machine learning algorithms, including Random Forest, Stochastic Gradient Descent, RNN LSTM Unidirectional, RNN LSTM Bidirectional and Support Vector Machine, and analyze the impact of each pre-processing technique on classification accuracy and F1-score. Our experiments show that every pre-processing step, when applied separately, has a positive impact on the performance of all evaluated classification algorithms. Among the pre-processing steps evaluated, stop-word removal has the largest impact on both accuracy and F1-score. Of the evaluated algorithms, Random Forest and Stochastic Gradient Descent benefit the most from the pre-processing steps.
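As a rough illustration of the five pre-processing steps named in the abstract, the following Python sketch implements each step as a separate function so that it can be applied on its own. It is a minimal sketch under stated assumptions, not the authors' implementation: the lookup tables `TYPO_MAP`, `SLANG_MAP` and `STOP_WORDS`, their entries, and all function names are hypothetical placeholders chosen for this example.

```python
import string

# Hypothetical lookup tables for illustration only; they are not taken
# from the article, and a real study would use curated Swahili resources.
TYPO_MAP = {"habri": "habari"}
SLANG_MAP = {"xaxa": "sasa", "cjui": "sijui"}
STOP_WORDS = {"na", "ya", "wa", "kwa", "ni"}


def remove_punctuation(text):
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def lowercase(text):
    """Convert the text to lower case."""
    return text.lower()


def replace_tokens(text, mapping):
    """Replace whitespace-separated tokens found in the mapping."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


def preprocess(text):
    """Chain all five steps; each step can also be called separately."""
    text = remove_punctuation(text)
    text = lowercase(text)
    text = replace_tokens(text, TYPO_MAP)   # typo replacement
    text = replace_tokens(text, SLANG_MAP)  # slang replacement
    return remove_stop_words(text)


print(preprocess("Habri, xaxa ni saa ngapi?"))
```

Keeping each step as its own function mirrors the experimental setup described above, where the steps are evaluated separately; the combined `preprocess` helper is only a convenience added for this sketch.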
- Highlights
- The study aims to evaluate the performance of classification algorithms which are vital in automating error-prone manual work.
- The study illustrates the importance and effects of different pre-processing steps for Swahili textual data.
- This article will enable future researchers to decide which pre-processing steps for Swahili textual data are best for their respective machine learning tasks.
- Copyright
- © 2023 The Authors. Published by Athena International Publishing B.V.
- Open Access
- This is an open access article distributed under the CC BY-NC 4.0 license (https://creativecommons.org/licenses/by-nc/4.0/).
Cite This Article
TY - JOUR
AU - Bernard Masua
AU - Noel Masasi
AU - Hellen Maziku
AU - Betty Mbwilo
PY - 2023
DA - 2023/06/06
TI - The Impact of Applying Different Pre-Processing Techniques on Swahili Textual Data Using Doc2Vec
JO - Natural Language Processing Research
SP - 1
EP - 13
VL - 3
IS - 1-2
SN - 2666-0512
UR - https://doi.org/10.55060/j.nlpre.230606.001
DO - https://doi.org/10.55060/j.nlpre.230606.001
ID - Masua2023
ER -