The Defense Advanced Research Projects Agency (DARPA) Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program faced many challenges in applying automated measures of translation quality to Iraqi Arabic-English speech translation dialogues. Features of speech data in general, and of Iraqi Arabic data in particular, undermine basic assumptions of automated measures that depend on matching system outputs to reference translations. We show that scores for translation into Iraqi Arabic exhibit higher correlations with human judgments when they are computed from normalized system outputs and reference translations. Orthographic normalization, lexical normalization, and operations involving light stemming all resulted in higher correlations with human judgments. Another challenge for the use of automated metrics in the TRANSTAC program was the relatively small amount of test data available for evaluation. We present evidence that the datasets of 500-600 utterances per language that we used to evaluate the systems are adequate for scoring individual systems and for comparing systems against one another.
Citation: Machine Translation
Pub Type: Journals
Keywords: Arabic, machine translation, evaluation, automated metrics, speech translation
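As an illustration of the kind of preprocessing the abstract describes, the sketch below normalizes Arabic hypothesis and reference strings before a reference-matching metric is computed. The specific rules (diacritic stripping, alef and ta marbuta normalization, the prefix/suffix lists for light stemming) and the helper names are assumptions chosen for the example, not the paper's actual normalization procedure, and BLEU is used only as a stand-in for any matching-based metric.

```python
# Illustrative sketch (not the paper's exact rules): normalize Arabic
# system outputs and references before computing an n-gram matching metric.
import re

# Arabic short vowels / tanween / sukun / dagger alef (assumed rule set)
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def normalize_arabic(text: str) -> str:
    """Apply simple orthographic and lexical normalization (assumed rules)."""
    text = DIACRITICS.sub("", text)                          # strip diacritics
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")                  # ta marbuta -> ha
    text = text.replace("\u0649", "\u064A")                  # alef maqsura -> ya
    return re.sub(r"\s+", " ", text).strip()

def light_stem(token: str) -> str:
    """Strip a few common prefixes and suffixes (very rough light stemming)."""
    for prefix in ("\u0648\u0627\u0644", "\u0627\u0644", "\u0648"):  # wal-, al-, wa-
        if token.startswith(prefix) and len(token) - len(prefix) >= 2:
            token = token[len(prefix):]
            break
    for suffix in ("\u0647\u0627", "\u0647", "\u0629"):              # -ha, -h, -a(t)
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            token = token[: -len(suffix)]
            break
    return token

def preprocess(sentence: str) -> list[str]:
    """Normalize, tokenize on whitespace, and light-stem each token."""
    return [light_stem(t) for t in normalize_arabic(sentence).split()]

# Usage: score a normalized hypothesis against a normalized reference,
# e.g. with NLTK's sentence-level BLEU (any matching metric works the same way).
# from nltk.translate.bleu_score import sentence_bleu
# score = sentence_bleu([preprocess(reference)], preprocess(hypothesis))
```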