
Metrics for Machine Translation Evaluation (MetricsMaTr)

NIST coordinates MetricsMaTr, a series of research challenge events for machine translation (MT) metrology, promoting the development of innovative, even revolutionary, MT metrics. MetricsMaTr focuses entirely on MT metrics.

NIST provides the evaluation infrastructure; the source files are MT system output. Participants develop MT metrics that assess the quality of these source files. NIST runs the metrics on the test set in two tracks: one using a single reference translation and one using multiple reference translations.
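To illustrate the kind of artifact participants submit, the sketch below shows a toy segment-level metric in Python: a unigram F-measure that scores one MT output segment against one or more reference translations, keeping the best score over references in the multi-reference case. It is purely illustrative and is not one of the metrics evaluated in MetricsMaTr.

    from collections import Counter

    def unigram_f1(hypothesis, references):
        """Score one MT output segment against one or more reference translations.

        In the multi-reference track, the segment keeps the best score over all
        references; with a single reference, the loop runs once.
        """
        hyp_counts = Counter(hypothesis.lower().split())
        best = 0.0
        for ref in references:
            ref_counts = Counter(ref.lower().split())
            overlap = sum((hyp_counts & ref_counts).values())
            if overlap == 0:
                continue
            precision = overlap / sum(hyp_counts.values())
            recall = overlap / sum(ref_counts.values())
            best = max(best, 2 * precision * recall / (precision + recall))
        return best

    # Example: one hypothesis segment scored against two reference translations.
    print(unigram_f1("the cat sat on the mat",
                     ["the cat is sitting on the mat", "a cat sat on a mat"]))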

The goal is to create intuitively interpretable automatic metrics which correlate highly with human assessment of MT quality. Different types of human assessment are used.

There are several drawbacks to the current methods employed for the evaluation of MT technology:

  • Automatic metrics have not yet been shown to predict, with confidence, the usefulness and reliability of MT technology in real applications.
  • Automatic metrics have not demonstrated that they are meaningful in target languages other than English.
  • Human assessments are expensive, slow, subjective, and difficult to standardize.

These problems, and the need to overcome them through the development of improved automatic (or even semi-automatic) metrics, have been a constant point of discussion at past NIST MT evaluation events.

MetricsMaTr aims to provide a platform to address these shortcomings. Specifically, the goals of MetricsMaTr are:

  • To inform other MT technology evaluation campaigns and conferences with regard to improved metrology.
  • To establish an infrastructure that encourages the development of innovative metrics.
  • To build a diverse community which will bring new perspectives to MT metrology research.
  • To provide a forum for MT metrology discussion and for establishing future directions of MT metrology.

The MetricsMaTr challenge is designed to appeal to a wide and varied audience, including researchers of MT technology and metrology, acquisition programs such as MFLTS, and commercial vendors. We welcome submissions from a wide range of disciplines, including computer science, statistics, mathematics, linguistics, and psychology. NIST encourages submissions from participants not currently active in the field of MT.

The most recent MetricsMaTr challenge was MetricsMaTr10.

Summary of Results

The MetricsMaTr evaluation tests automatic metric scores for correlation with human assessments of machine translation quality across a variety of languages, data genres, and human assessment types. This produces a large body of results; below, we provide a very high-level summary.

The table presents Spearman's rho correlations of automatic metric scores with human assessments on target-language English data (drawn from NIST OpenMT, DARPA GALE, and DARPA TRANSTAC test sets), limited to:

  • The highest-correlating new metric for each evaluation cycle
  • The highest-correlating baseline metric (out of a suite of metrics available to NIST prior to MetricsMaTr08)
  • Correlation with human assessments of semantic adequacy on a 7-point scale
Highest correlation of automatic metrics with human assessments of semantic adequacy

MetricsMaTr10
  • 1 reference translation:  segment level: SVM_rank (rho=0.69); document level: METEOR-next-rank (rho=0.84); system level: METEOR-next-rank (rho=0.92)
  • 4 reference translations: segment level: SVM_rank (rho=0.74); document level: i_letter_BLEU (rho=0.85); system level: SEPIA (rho=0.93)

MetricsMaTr08
  • 1 reference translation:  segment level: TERp (rho=0.68); document level: METEOR-v0.7 (rho=0.84); system level: CDer (rho=0.90)
  • 4 reference translations: segment level: SVM_RANK (rho=0.72); document level: CDer (rho=0.85); system level: ATEC3 (rho=0.93)

Baseline
  • 1 reference translation:  segment level: METEOR-v0.6 (rho=0.68); document level: NIST (rho=0.81); system level: TER-v0.7.25 (rho=0.89)
  • 4 reference translations: segment level: METEOR-v0.6 (rho=0.72); document level: NIST (rho=0.84); system level: NIST (rho=0.93)
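For context, each figure above is a rank correlation between automatic metric scores and human judgments at a given granularity (segment, document, or system). The sketch below, using hypothetical placeholder values, shows how a system-level Spearman's rho could be computed with SciPy; it is illustrative only and does not reproduce the official MetricsMaTr scoring pipeline.

    from scipy.stats import spearmanr

    # Hypothetical per-system averages over the same test set: an automatic
    # metric score and a mean human adequacy judgment (7-point scale).
    metric_scores  = [0.42, 0.51, 0.38, 0.60, 0.47]
    human_adequacy = [4.1, 4.9, 3.8, 5.6, 4.5]

    rho, p_value = spearmanr(metric_scores, human_adequacy)
    print(f"System-level Spearman's rho = {rho:.2f} (p = {p_value:.3f})")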

Contact

mt_poc [at] nist.gov
