NIST coordinates MetricsMaTr, a series of research challenge events for machine translation (MT) metrology, promoting the development of innovative, even revolutionary, MT metrics. MetricsMaTr focuses entirely on MT metrics.
NIST provides the evaluation infrastructure; the source files are MT system output, and participants develop MT metrics that assess the quality of those translations. The metrics are run on the test set at NIST in two tracks: one using a single reference translation and one using multiple reference translations.
The goal is to create intuitively interpretable automatic metrics which correlate highly with human assessment of MT quality. Different types of human assessment are used.
There are several drawbacks to the current methods employed for the evaluation of MT technology:
These problems, and the need to overcome them through the development of improved automatic (or even semi-automatic) metrics, have been a constant point of discussion at past NIST MT evaluation events.
MetricsMaTr aims to provide a platform to address these shortcomings. Specifically, the goals of MetricsMaTr are:
The MetricsMaTr challenge is designed to appeal to a wide and varied audience, including researchers in MT technology and metrology, acquisition programs such as MFLTS, and commercial vendors. We welcome submissions from a wide range of disciplines, including computer science, statistics, mathematics, linguistics, and psychology. NIST encourages submissions from participants not currently active in the field of MT.
The most recent MetricsMaTr challenge was MetricsMaTr10.
The MetricsMaTr evaluation tests automatic metric scores for correlation with human assessments of machine translation quality across a variety of languages, data genres, and human assessment types, which produces a large number of results. Below, we provide a very high-level summary of these extensive results.
The table below presents Spearman's rho correlations of automatic metric scores with human assessments on data with English as the target language (drawn from NIST OpenMT, DARPA GALE, and DARPA TRANSTAC test sets), limited to:
| Evaluation | Segment level (1 ref) | Document level (1 ref) | System level (1 ref) | Segment level (4 refs) | Document level (4 refs) | System level (4 refs) |
|---|---|---|---|---|---|---|
| MetricsMaTr10 | SVM_rank rho=0.69 | METEOR-next-rank rho=0.84 | METEOR-next-rank rho=0.92 | SVM_rank rho=0.74 | i_letter_BLEU rho=0.85 | SEPIA rho=0.93 |
| MetricsMaTr08 | TERp rho=0.68 | METEOR-v0.7 rho=0.84 | CDer rho=0.90 | SVM_RANK rho=0.72 | CDer rho=0.85 | ATEC3 rho=0.93 |
| Baseline | METEOR-v0.6 rho=0.68 | NIST rho=0.81 | TER-v0.7.25 rho=0.89 | METEOR-v0.6 rho=0.72 | NIST rho=0.84 | NIST rho=0.93 |
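To make the summary concrete, the sketch below shows one way such correlations can be computed with Spearman's rho: segment-level correlation ranks individual metric scores against the corresponding human judgments, while system-level correlation first aggregates scores per system (document-level grouping works the same way). This is only a minimal illustration, not NIST's scoring pipeline; the aggregation by simple averaging is an assumption, and every system name and score in it is invented.

```python
# Minimal sketch of the kind of correlation analysis summarized above; this is
# not NIST's official scoring code, and all systems, documents, and scores
# below are invented purely for illustration.
from collections import defaultdict
from statistics import mean

from scipy.stats import spearmanr

# Hypothetical per-segment records: (system, document, metric_score, human_score).
segments = [
    ("sysA", "doc1", 0.41, 5.0), ("sysA", "doc1", 0.35, 4.0), ("sysA", "doc2", 0.52, 6.0),
    ("sysB", "doc1", 0.30, 3.0), ("sysB", "doc1", 0.28, 3.5), ("sysB", "doc2", 0.44, 5.0),
    ("sysC", "doc1", 0.47, 4.5), ("sysC", "doc2", 0.39, 4.0),
]

# Segment level: rank each segment's metric score against its human assessment.
rho_seg, _ = spearmanr([m for _, _, m, _ in segments],
                       [h for _, _, _, h in segments])

# System level: aggregate per system (simple averaging here, as an assumption),
# then correlate the aggregates. Document level works analogously, grouping by document.
by_system = defaultdict(list)
for system, _, m, h in segments:
    by_system[system].append((m, h))
rho_sys, _ = spearmanr([mean(m for m, _ in pairs) for pairs in by_system.values()],
                       [mean(h for _, h in pairs) for pairs in by_system.values()])

print(f"segment-level rho = {rho_seg:.2f}")
print(f"system-level rho  = {rho_sys:.2f}")
```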
mt_poc [at] nist.gov