Publication
Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification
Muhammad Nabeel Asim; Muhammad Usman Ghani; Muhammad Ali Ibrahim; Waqar Mahmood; Andreas Dengel; Sheraz Ahmed
In: Neural Computing and Applications, Vol. 33, Pages 5437-5469, Springer, 9/2020.
Abstract
To establish benchmark performance for Urdu text document classification, this paper makes five contributions.
First, it provides a publicly available benchmark dataset manually tagged against 6 classes. Second, it investigates the
performance impact of traditional machine learning-based Urdu text document classification methodologies by embedding 10
filter-based feature selection algorithms which have been widely used for other languages. Third, for the very first time, it
assesses the performance of various deep learning-based methodologies for Urdu text document classification. In this regard,
for experimentation, we adapt 10 deep learning classification methodologies that have produced the best performance figures for
English text classification. Fourth, it also investigates the performance impact of transfer learning by utilizing the
Bidirectional Encoder Representations from Transformers (BERT) approach for the Urdu language. Fifth, it evaluates the integrity of a
hybrid approach which combines traditional machine learning-based feature engineering and deep learning-based automated
feature engineering. Experimental results show that the normalized difference measure feature selection approach, combined
with a support vector machine, surpasses state-of-the-art performance on two closed-source benchmark datasets, CLE Urdu
Digest 1000k and CLE Urdu Digest 1Million, by significant margins of 32% and 13%, respectively. Across all three
datasets, the normalized difference measure outperforms the other filter-based feature selection algorithms, as it significantly
improves the performance of all adopted machine learning, deep learning, and hybrid approaches.
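
To make the filter-based feature selection step concrete, the following is a minimal sketch of how a normalized difference measure (NDM) score might be computed for one class against the rest. It assumes the commonly cited form NDM = |tpr - fpr| / min(tpr, fpr), where tpr and fpr are the fractions of positive and negative documents containing a term; the abstract itself does not state the formula. The function names, the `eps` fallback for a zero denominator, and the toy English tokens are hypothetical and are not taken from the paper. The classifier stage (for example, a support vector machine trained on the selected terms) is not shown.

```python
from collections import Counter


def ndm_scores(docs, labels, positive, eps=1e-6):
    """Score each term as |tpr - fpr| / min(tpr, fpr), one class vs. the rest.

    Assumption: eps replaces a zero denominator; the value is illustrative.
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count document frequency, so each term counts once per document.
        target = pos_df if y == positive else neg_df
        target.update(set(tokens))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        tpr = pos_df[term] / n_pos if n_pos else 0.0
        fpr = neg_df[term] / n_neg if n_neg else 0.0
        denom = min(tpr, fpr) or eps
        scores[term] = abs(tpr - fpr) / denom
    return scores


def top_k_terms(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus with placeholder tokens (hypothetical data, not from the paper).
docs = [
    ["cricket", "match", "today"],
    ["cricket", "score"],
    ["budget", "today"],
    ["budget", "tax"],
]
labels = ["sports", "sports", "economy", "economy"]

scores = ndm_scores(docs, labels, positive="sports")
selected = top_k_terms(scores, k=2)
```

In this sketch, a term that appears equally often in both classes (here "today") scores zero, while terms confined to one class rank highest. The selected terms would then define the feature vocabulary passed to a downstream classifier.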