Metrics for Machine Translation Evaluation (MetricsMaTr)
NIST coordinates MetricsMaTr, a series of research challenge events for machine translation (MT) metrology, promoting the development of innovative, even revolutionary, MT metrics. MetricsMaTr focuses entirely on MT metrics.
NIST provides the evaluation infrastructure. The source files are MT system output, and participants develop MT metrics to assess the quality of those files. NIST runs the submitted metrics on the test set in two tracks: one using a single reference translation and one using multiple reference translations.
The goal is to create intuitively interpretable automatic metrics which correlate highly with human assessment of MT quality. Different types of human assessment are used.
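As an illustration of what "correlating with human assessment" can mean in practice, the sketch below computes Spearman's rho, the rank correlation used in the results summary on this page, between a list of automatic metric scores and a list of human scores. It is a minimal sketch and not NIST's scoring code: the function names (average_ranks, pearson, spearman_rho) and the sample scores are hypothetical, and tied values are assumed to receive the mean of the ranks they span.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman_rho(metric_scores, human_scores):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if len(metric_scores) < 2:
        raise ValueError("at least two items are required")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Hypothetical scores for five MT outputs (illustrative values only).
    metric = [0.31, 0.42, 0.27, 0.55, 0.48]
    human = [3, 4, 2, 5, 4]
    print(f"Spearman's rho: {spearman_rho(metric, human):.3f}")
```

A value near 1 means the metric orders the outputs much as the human assessors do; a value near 0 means the two orderings are largely unrelated. Because only ranks are compared, the metric and the human assessment do not need to share a scale.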
There are several drawbacks to the current methods employed for the evaluation of MT technology:
These problems, and the need to overcome them through the development of improved automatic (or even semi-automatic) metrics, have been a constant point of discussion at past NIST MT evaluation events.
MetricsMaTr aims to provide a platform to address these shortcomings. Specifically, the goals of MetricsMaTr are:
The MetricsMaTr challenge is designed to appeal to a wide and varied audience, including researchers of MT technology and metrology, acquisition programs such as MFLTS, and commercial vendors. We welcome submissions from a wide range of disciplines, including computer science, statistics, mathematics, linguistics, and psychology. NIST encourages submissions from participants not currently active in the field of MT.
MetricsMaTr is held at regular intervals; the most recent NIST MetricsMaTr challenge was MetricsMaTr10. Links to specific evaluation cycles are at the bottom of this page.
Summary of Results
The MetricsMaTr evaluation tests automatic metric scores for correlation with human assessments of machine translation quality across a variety of languages, data genres, and types of human assessment. This produces a large number of results. Archives of each year's complete release of results, including descriptions of the different components, are available for download:
Below, we provide a very high-level summary of these extensive results.
The table presents Spearman's rho correlations of automatic metric scores with human assessments on English target-language data (drawn from the NIST OpenMT, DARPA GALE, and DARPA TRANSTAC test sets), limited to: