Automatic term extraction for Arabic text: Approaches, techniques, and challenges

Muhammed, Mariam; Azab, Shahira; Ali, Nesrine; Gheith, Mervat

doi:10.21608/jaiep.2025.423298.1025

Document Type : Original Article

Authors

¹ Department of Information Systems and Technology, Faculty of Graduate Studies for Statistical Research (FGSSR), Cairo University, Giza 12613, Egypt

² Department of Computer Science, Faculty of Graduate Studies for Statistical Research (FGSSR), Cairo University, Giza 12613, Egypt

³ Department of Computer Science, Faculty of Graduate Studies for Statistical Research (FGSSR), Cairo University, Giza 12613, Egypt

Abstract

Automatic Term Extraction (ATE) is a core task in Natural Language Processing (NLP) that aims to identify domain-specific terms in large corpora. For Arabic, ATE plays an essential role in applications such as ontology construction, dictionary development, information retrieval, and text mining. However, the rich morphological structure and orthographic ambiguities of Arabic present unique challenges for ATE. This paper provides a comprehensive survey of ATE for Arabic text, with a focus on approaches, techniques, and evaluation strategies. We review rule-based, statistical, machine learning, deep learning, and hybrid methods, assessing their strengths, limitations, and applicability to Arabic’s linguistic characteristics. We also discuss challenges that affect the ATE process, such as morphological richness, multiword expression extraction, named entity recognition, and the scarcity of annotated corpora. Furthermore, we outline the evaluation metrics needed to assess ATE performance on Arabic text. This paper aims to support the development of more accurate, adaptable, and domain-specific ATE systems for Arabic texts.

Keywords