A Dual-Task Large Language Model for Adding Diacritics and Translating Jordanian Arabic to Modern Standard Arabic

Conference: 2025 Asian Conference on Communication and Networks

Start Time: 2025-12-29 17:00:00

Duration: 15min

Session: Track 5: Emerging Trends of AI/ML

Room: Engineering Hall - 123

Abstract

Arabic poses distinctive challenges for natural language processing because of its complex grammar, diverse dialects, and frequent omission of diacritics. This paper proposes a unified, token-free model based on ByT5 that jointly performs spelling correction (including translation from Jordanian dialect to Modern Standard Arabic (MSA)) and diacritization. The approach uses task-specific prefixes (“correct:” for correction and “diacritize:” for combined correction and diacritization) to enable flexible multi-task learning. The model was fine-tuned on the JODA dataset (Jordanian dialect/MSA pairs) and on high-quality Tashkeela subsets (Clean-50 and Clean-400), with synthetic error injection to improve robustness. Automatic evaluation yielded an overall score of 78.06% on JODA and 92.45% on the combined JODA and Tashkeela test set. Manual evaluation of 200 JODA samples found a character error rate of 4.41% and a diacritic error rate of 1.32%, indicating that the model is practically effective at handling these complexities.
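As a rough illustration of the ideas in the abstract, the following Python sketch shows how a task prefix might be prepended to an input, how a byte-level ("token-free") model sees Arabic text, and how a character error rate can be computed. It is not the authors' code: the helper names (`build_input`, `to_byte_ids`, `strip_diacritics`, `cer`) are hypothetical, the ID offset of 3 assumes the usual ByT5 convention of reserving three special IDs, and the CER here is a plain Levenshtein-based definition that may differ from the one used in the paper.

```python
import unicodedata

# Prefixes quoted in the abstract; the trailing space is an assumption.
TASK_PREFIXES = ("correct: ", "diacritize: ")

def build_input(task: str, text: str) -> str:
    """Prepend a task prefix so one model can serve both tasks (hypothetical helper)."""
    prefix = f"{task}: "
    if prefix not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return prefix + text

def to_byte_ids(text: str, offset: int = 3) -> list[int]:
    """Map text to byte-level IDs; offset=3 assumes three reserved special IDs."""
    return [b + offset for b in text.encode("utf-8")]

def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which include Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

Each Arabic letter and each diacritic occupies two bytes in UTF-8, so a byte-level model sees fully diacritized text as a noticeably longer sequence than the undiacritized form; `strip_diacritics` makes the difference between the two forms easy to inspect.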

Speakers

Rabie Otoum
RAN Optimization and
University of Jordan

Details

Type: Online
Model: OFFLINE
Language: EN
Timezone: UTC+8