NIST 2006 Machine Translation Evaluation Official Results

Date of Updated Release: November 1, 2006 (version 4)

The NIST 2006 Machine Translation Evaluation (MT-06) was part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state of the art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT-06 evaluation plan.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT-06 was an evaluation of research algorithms, the MT-06 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain or in the amount of data used to build a system can greatly influence system performance, and changes to the task protocols could reveal different performance strengths and weaknesses for these same systems.

For these reasons, this evaluation should not be interpreted as a product-testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.

Evaluation Tasks

The MT-06 evaluation consisted of two tasks. Each task required a system to perform translation from a given source language into the target language. The source languages were Arabic and Chinese, and the target language was English.

  • Translate Arabic text into English text
  • Translate Chinese text into English text

Evaluation Conditions

MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, different resource categories were defined as conditions of evaluation. The categories differed solely in the data that was available for use in system training and development:

  • Large Data Track – limited the training data to data in the LDC public catalogue existing before February 1st, 2006.
  • Unlimited Data Track – extended the training data to any publicly available data existing before February 1st, 2006.
  • Unlimited Plus Data Track – further extended the training data to include non-publicly available data existing before February 1st, 2006. [see end of page]

Submissions in categories other than those described above are not reported here.

Evaluation Data

Source Data

In an effort to reduce data creation costs, the MT-06 evaluation made use of GALE-06 evaluation data (the GALE subset). NIST augmented the GALE subset with additional data of equal or greater size for most of the genres (the NIST subset). This provided a larger and more diverse test set. Each subset contained newswire text documents, web-based newsgroup documents, human transcriptions of broadcast news, and human transcriptions of broadcast conversations. The source documents were encoded in UTF-8.

The test data was selected from a pool of data collected by the LDC during February 2006. The selection process sought a variety of sources (see below), publication dates, and difficulty ratings while meeting the target test set size.

Genre: Newswire
  Arabic sources: Agence France Presse, Assabah, Xinhua News Agency; target size: 30K reference words
  Chinese sources: Agence France Presse, Xinhua News Agency; target size: 30K reference words

Genre: Newsgroup
  Arabic sources: Google's groups, Yahoo's groups; target size: 20K reference words
  Chinese sources: Google's groups; target size: 20K reference words

Genre: Broadcast News
  Arabic sources: Dubai TV, Al Jazeera, Lebanese Broadcast Corporation; target size: 20K reference words
  Chinese sources: Central China TV, New Tang Dynasty TV, Phoenix TV; target size: 20K reference words

Genre: Broadcast Conversation
  Arabic sources: Dubai TV, Al Jazeera, Lebanese Broadcast Corporation; target size: 10K reference words
  Chinese sources: Central China TV, New Tang Dynasty TV, Phoenix TV; target size: 10K reference words

Reference Data

The GALE subset had one adjudicated high-quality translation that was produced by the National Virtual Translation Center. The NIST subset had four independently generated high-quality translations that were produced by professional translation companies. In both subsets, each translation agency was required to have native speakers of the source and target languages working on the translations.

Performance Measurement

Machine translation quality was measured automatically using an N-gram co-occurrence metric developed by IBM and referred to as BLEU. BLEU measures translation accuracy according to the N-grams, or sequences of N words, that the system translation shares with one or more high-quality reference translations: the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from 0 to 1, with 1 being the best possible score. A detailed description of BLEU can be found in Papineni, Roukos, Ward, and Zhu (2001), "Bleu: a Method for Automatic Evaluation of Machine Translation" (keyword = RC22176).
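To make the computation concrete, the short Python sketch below illustrates corpus-level BLEU-4 as described above: clipped (modified) N-gram precisions for N = 1 through 4, combined by a geometric mean and scaled by a brevity penalty. It is a minimal illustration only, not the scoring software used for MT-06; the function names are ours, and it assumes the input has already been tokenized and case-handled consistently with the evaluation's case-sensitive scoring.

    # Minimal BLEU-4 sketch: clipped n-gram precisions plus a brevity penalty.
    # Illustrative only -- not the official MT-06 scoring tool.
    import math
    from collections import Counter

    def ngrams(tokens, n):
        # Multiset of n-grams in a token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def corpus_bleu(candidates, reference_sets, max_n=4):
        # candidates: list of token lists; reference_sets: list of lists of token lists.
        clipped = [0] * max_n   # clipped n-gram matches, per order
        totals = [0] * max_n    # candidate n-gram counts, per order
        cand_len = ref_len = 0
        for cand, refs in zip(candidates, reference_sets):
            cand_len += len(cand)
            # Effective reference length: the reference closest in length to the candidate.
            ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
            for n in range(1, max_n + 1):
                cand_counts = ngrams(cand, n)
                max_ref = Counter()
                for r in refs:
                    for gram, cnt in ngrams(r, n).items():
                        max_ref[gram] = max(max_ref[gram], cnt)
                totals[n - 1] += sum(cand_counts.values())
                clipped[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if min(clipped) == 0:
            return 0.0
        log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
        bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
        return bp * math.exp(log_prec)

    # Toy example with one segment and two reference translations.
    hyp = "the cat sat on the mat".split()
    refs = ["the cat sat on a mat".split(), "there is a cat on the mat".split()]
    print(round(corpus_bleu([hyp], [refs]), 4))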

Although BLEU was the official metric for MT-06, measuring translation quality is an ongoing research topic in the MT community. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance. Three additional automatic metrics (METEOR, TER, and a BLEU refinement), as well as human assessment, were also used to report system performance. As stated in the evaluation specification document, this official public version of the results reports only the scores as measured by BLEU.

Evaluation Participants

The tables below list the organizations involved in submitting MT-06 evaluation results. Most submitted results representing their own organizations; some participated only in a collaborative effort (marked by the @ symbol), and some did both (marked by the + symbol).

Site ID       Organization (Location)
apptek        Applications Technology Inc. (USA)
arl           Army Research Laboratory+ (USA)
auc           (Egypt)
bbn           BBN Technologies (USA)
cu            Cambridge University@ (UK)
cmu           Carnegie Mellon University@ (USA)
casia         Institute of Automation Chinese Academy of Sciences (China)
columbia      (USA)
dcu           Dublin City University (Ireland)
google        Google (USA)
hkust         (China)
ibm           IBM (USA)
ict           Institute of Computing Technology Chinese Academy of Sciences (China)
iscas         (China)
isi           Information Sciences Institute+ (USA)
itcirst       ITC-irst (Italy)
jhu           Johns Hopkins University@ (USA)
ksu           Kansas State University (USA)
kcsl          KCSL Inc. (Canada)
lw            Language Weaver (USA)
lcc           Language Computer (USA)
lingua        Lingua Technologies Inc. (Canada)
msr           Microsoft Research (USA)
mit           MIT@ (USA)
nict          National Institute of Information and Communications Technology (Japan)
nlmp          National Laboratory on Machine Perception Peking University (China)
ntt           (Japan)
nrc           National Research Council Canada+ (Canada)
qmul          Queen Mary University of London (England)
rwth          RWTH Aachen University+ (Germany)
sakhr         Sakhr Software Co. (USA)
sri           SRI International (USA)
ucb           University of California Berkeley (USA)
edinburgh     University of Edinburgh+ (Scotland)
uka           University of Karlsruhe@ (Germany)
umd           University of Maryland@ (USA)
upenn         University of Pennsylvania (USA)
upc           Universitat Politecnica de Catalunya (Spain)
uw            University of Washington@ (USA)
xmu           Xiamen University (China)
Site ID            Team/Collaboration (Location)
arl-cmu            Army Research Laboratory & Carnegie Mellon University (USA)
cmu-uka            Carnegie Mellon University & University of Karlsruhe (USA, Germany)
edinburgh-mit      University of Edinburgh & MIT (Scotland, USA)
isi-cu             Information Sciences Institute & Cambridge University (USA, England)
rwth-sri-nrc-uw    RWTH Aachen University, SRI International, National Research Council Canada, University of Washington (Germany, USA, Canada, USA)
umd-jhu            University of Maryland & Johns Hopkins University (USA)
  • DFKI GmbH registered but dropped out of the evaluation on July 28, 2006.
  • Fitchburg State College registered but dropped out of the evaluation on August 3, 2006.

Evaluation Systems

Each site/team could submit one or more systems for evaluation, with one system marked as its primary system. The primary system indicated the site/team's best effort. This official public version of the results reports results only for the primary systems.

Evaluation Results

The tables below list the results of the NIST 2006 Machine Translation Evaluation. The results are sorted by BLEU score and reported separately for the GALE subset and the NIST subset because the two subsets do not have the same number of reference translations. The results are also reported for each data domain. Note that scoring was case-sensitive, so these scores reflect case errors.

Friedman's Rank Test for k Correlated Samples was used to test for significant differences among the systems. The initial null hypothesis was that all systems were the same. If the null hypothesis was rejected at the 95% level of confidence, the lowest-scoring system was removed from the pool of systems being tested, and Friedman's Rank Test was repeated for the remaining systems until no significant difference was found. The systems remaining in the pool were deemed to be statistically equivalent. The process was then repeated for the systems that had been removed from the pool. Alternating colors (white and yellow backgrounds) show the different groups.
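As a concrete illustration of this grouping procedure, the Python sketch below applies SciPy's implementation of the Friedman test, repeatedly dropping the lowest-scoring system and retesting until the remaining systems are not significantly different at the 95% level, then repeating on the removed systems. The function name, the use of per-document BLEU scores as the correlated samples, and the ranking by mean score are our assumptions for illustration; only the drop-and-retest logic follows the description above.

    # Sketch of the iterative Friedman grouping described above (assumptions noted in the text).
    from scipy.stats import friedmanchisquare

    def equivalence_groups(scores, alpha=0.05):
        # scores: dict mapping system name -> list of per-document BLEU scores,
        # with the same documents in the same order for every system.
        # Returns a list of groups of statistically equivalent systems, best first.
        ranked = sorted(scores, key=lambda s: sum(scores[s]) / len(scores[s]), reverse=True)
        groups = []
        while ranked:
            pool = list(ranked)
            # Friedman's test requires at least three correlated samples.
            while len(pool) >= 3:
                _, p = friedmanchisquare(*(scores[s] for s in pool))
                if p >= alpha:           # no significant difference among the pooled systems
                    break
                pool.pop()               # remove the lowest-scoring system and retest
            groups.append(pool)
            ranked = ranked[len(pool):]  # repeat the process on the removed systems
        return groups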

Key:

  • An asterisk (*) indicates that the submission was late.
  • A pound sign (#) indicates that the submission was a bug-fix: one or more errors were found in the system during the testing period, and the site fixed the error(s) and reran the test. Format errors do not count as bug-fixes.

Note: Site 'nlmp' was unable to process the entire test set. No result is listed for that site.

Arabic-to-English Results

Large Data Track

NIST Subset

Overall BLEU Scores

Site ID           BLEU-4
google            0.4281
ibm               0.3954
isi               0.3908
rwth              0.3906
apptek*#          0.3874
lw                0.3741
bbn               0.3690
ntt               0.3680
itcirst           0.3466
cmu-uka           0.3369
umd-jhu           0.3333
edinburgh*#       0.3303
sakhr             0.3296
nict              0.2930
qmul              0.2896
lcc               0.2778
upc               0.2741
columbia          0.2465
ucb               0.1978
auc               0.1531
dcu               0.0947
kcsl*#            0.0522

Newswire BLEU Scores

Site ID           BLEU-4
google            0.4814
ibm               0.4542
rwth              0.4441
isi               0.4426
lw                0.4368
bbn               0.4254
apptek*#          0.4212
ntt               0.4035
umd-jhu           0.3997
edinburgh*#       0.3945
cmu-uka           0.3943
itcirst           0.3798
qmul              0.3737
sakhr             0.3736
nict              0.3568
lcc               0.3089
upc               0.3049
columbia          0.2759
ucb               0.2369
auc               0.1750
dcu               0.0875
kcsl*#            0.0423

Newsgroup BLEU Scores

Site ID           BLEU-4
apptek*#          0.3311
google            0.3225
ntt               0.2973
isi               0.2895
ibm               0.2774
bbn               0.2771
rwth              0.2726
itcirst           0.2696
sakhr             0.2634
lw                0.2503
cmu               0.2436
edinburgh*#       0.2208
lcc               0.2135
columbia          0.2111
umd-jhu           0.2059
nict              0.1875
upc               0.1842
ucb               0.1690
dcu               0.1177
qmul              0.1116
auc               0.1099
kcsl*#            0.0770

Broadcast News BLEU Scores

Site ID           BLEU-4
google            0.3781
apptek*#          0.3729
lw                0.3646
isi               0.3630
ibm               0.3612
rwth              0.3511
ntt               0.3324
bbn               0.3302
umd-jhu           0.3148
itcirst           0.3128
edinburgh*#       0.2925
cmu               0.2874
sakhr             0.2814
qmul              0.2768
upc               0.2463
nict              0.2458
lcc               0.2445
columbia          0.2054
auc               0.1419
ucb               0.1114
dcu               0.0594
kcsl*#            0.0326

GALE Subset

Overall BLEU Scores

Site ID           BLEU-4
apptek*#          0.1918
google            0.1826
isi               0.1714
ibm               0.1674
sakhr             0.1648
rwth              0.1639
lw                0.1594
ntt               0.1533
itcirst           0.1475
bbn               0.1461
cmu               0.1392
umd-jhu           0.1370
qmul              0.1345
edinburgh*#       0.1305
nict              0.1192
upc               0.1149
lcc               0.1129
columbia          0.0960
ucb               0.0732
auc               0.0635
dcu               0.0320
kcsl*#            0.0176

Newswire BLEU Scores

Site ID           BLEU-4
google            0.2647
ibm               0.2432
isi               0.2300
rwth              0.2263
apptek*#          0.2225
sakhr             0.2196
lw                0.2193
ntt               0.2180
bbn               0.2170
itcirst           0.2104
umd-jhu           0.2084
cmu               0.2055
edinburgh*#       0.2052
qmul              0.1984
nict              0.1773
lcc               0.1648
upc               0.1575
columbia          0.1438
ucb               0.1299
auc               0.0937
dcu               0.0466
kcsl*#            0.0182

Newsgroup BLEU Scores

Site ID           BLEU-4
apptek*#          0.1747
sakhr             0.1331
google            0.1130
ibm               0.1060
rwth              0.1017
isi               0.0918
ntt               0.0906
lw                0.0853
cmu               0.0840
bbn               0.0837
itcirst           0.0821
qmul              0.0818
umd-jhu           0.0754
edinburgh*#       0.0681
lcc               0.0643
nict              0.0639
columbia          0.0634
upc               0.0603
ucb               0.0411
auc               0.0326
dcu               0.0254
kcsl*#            0.0089

Broadcast News BLEU Scores

Site ID           BLEU-4
apptek*#          0.1944
isi               0.1766
google            0.1721
lw                0.1649
rwth              0.1599
ibm               0.1588
sakhr             0.1495
itcirst           0.1471
ntt               0.1469
bbn               0.1391
cmu               0.1362
umd-jhu           0.1309
qmul              0.1266
edinburgh*#       0.1240
nict              0.1152
upc               0.1150
lcc               0.1016
columbia          0.0879
auc               0.0619
ucb               0.0412
dcu               0.0252
kcsl*#            0.0229

Broadcast Conversation BLEU Scores

Site ID           BLEU-4
isi               0.1756
apptek*#          0.1747
google            0.1745
rwth              0.1615
lw                0.1582
ibm               0.1563
ntt               0.1512
sakhr             0.1446
itcirst           0.1425
bbn               0.1400
umd-jhu           0.1277
qmul              0.1265
cmu               0.1261
edinburgh*#       0.1203
upc               0.1200
lcc               0.1157
nict              0.1156
columbia          0.0866
ucb               0.0783
auc               0.0620
dcu               0.0306
kcsl*#            0.0183

Unlimited Data Track

NIST Subset

Overall BLEU Scores

Site ID           BLEU-4
google            0.4535
lw                0.4008
rwth              0.3970
rwth+sri+nrc+uw*  0.3966
nrc               0.3750
sri               0.3743
edinburgh*#       0.3449
cmu               0.3376
arl-cmu           0.1424

Newswire BLEU Scores

Site ID           BLEU-4
google            0.5034
lw                0.4589
rwth+sri+nrc+uw*  0.4493
rwth              0.4458
nrc               0.4300
sri               0.4240
edinburgh*#       0.4133
cmu               0.3974
arl-cmu           0.1402

Newsgroup BLEU Scores

Site ID           BLEU-4
google            0.3652
lw                0.2851
rwth              0.2829
nrc               0.2799
rwth+sri+nrc+uw*  0.2755
sri               0.2534
cmu               0.2372
edinburgh*#       0.2287
arl-cmu           0.1485

Broadcast News BLEU Scores

Site ID           BLEU-4
google            0.4018
lw                0.3685
rwth              0.3662
rwth+sri+nrc+uw*  0.3639
sri               0.3326
nrc               0.3312
edinburgh*#       0.3049
cmu               0.2988
arl-cmu           0.1363

GALE Subset

Overall BLEU Scores

Site ID           BLEU-4
google            0.1957
lw                0.1721
rwth+sri+nrc+uw*  0.1710
rwth              0.1680
sri               0.1614
nrc               0.1517
cmu               0.1382
edinburgh*#       0.1365
arl-cmu           0.0736

Newswire BLEU Scores

Site ID           BLEU-4
google            0.2812
lw                0.2294
rwth+sri+nrc+uw*  0.2289
rwth              0.2258
nrc               0.2172
sri               0.2081
edinburgh*#       0.2068
cmu               0.2006
arl-cmu           0.0858

Newsgroup BLEU Scores

Site ID           BLEU-4
google            0.1267
rwth              0.1133
rwth+sri+nrc+uw*  0.1078
lw                0.1007
nrc               0.1007
sri               0.0953
cmu               0.0894
edinburgh*#       0.0722
arl-cmu           0.0558

Broadcast News BLEU Scores

Site ID           BLEU-4
google            0.1868
rwth+sri+nrc+uw*  0.1730
lw                0.1715
sri               0.1661
rwth              0.1625
nrc               0.1415
edinburgh*#       0.1293
cmu               0.1276
arl-cmu           0.0855

Broadcast Conversation BLEU Scores

Site ID           BLEU-4
google            0.1824
lw                0.1756
rwth+sri+nrc+uw*  0.1676
sri               0.1671
rwth              0.1658
nrc               0.1429
edinburgh*#       0.1341
cmu               0.1322
arl-cmu           0.0584

Chinese-to-English Results

Large Data Track

NIST Subset

Overall BLEU Scores

Site ID           BLEU-4
isi               0.3393
google            0.3316
lw                0.3278
rwth              0.3022
ict               0.2913
edinburgh*#       0.2830
bbn               0.2781
nrc               0.2762
itcirst           0.2749
umd-jhu           0.2704
ntt               0.2595
nict              0.2449
cmu               0.2348
msr               0.2314
qmul              0.2276
hkust             0.2080
upc               0.2071
upenn             0.1958
iscas             0.1816
lcc               0.1814
xmu               0.1580
lingua*           0.1341
kcsl*#            0.0512
ksu               0.0401

Newswire BLEU Scores

Site ID           BLEU-4
isi               0.3486
google            0.3470
lw                0.3404
ict               0.3085
rwth              0.3022
nrc               0.2867
umd-jhu           0.2863
edinburgh*#       0.2776
bbn               0.2774
itcirst           0.2739
ntt               0.2656
nict              0.2509
cmu               0.2496
msr               0.2387
qmul              0.2299
upenn             0.2064
upc               0.2057
hkust             0.1999
lcc               0.1721
iscas             0.1715
xmu               0.1619
lingua*           0.1412
kcsl*#            0.0510
ksu               0.0380

Newsgroup BLEU Scores

Site ID           BLEU-4
google            0.2620
isi               0.2571
lw                0.2454
edinburgh*#       0.2434
rwth              0.2417
nrc               0.2330
ict               0.2325
bbn               0.2275
itcirst           0.2264
umd-jhu           0.2061
ntt               0.2036
nict              0.2006
msr               0.1878
cmu               0.1865
hkust             0.1851
qmul              0.1840
iscas             0.1681
upenn             0.1665
lcc               0.1634
upc               0.1619
xmu               0.1406
lingua*           0.1207
kcsl*#            0.0531
ksu               0.0361

Broadcast News BLEU Scores

Site ID           BLEU-4
rwth              0.3501
google            0.3481
isi               0.3463
lw                0.3327
bbn               0.3197
edinburgh*#       0.3172
itcirst           0.3128
ict               0.2977
ntt               0.2928
umd-jhu           0.2928
nrc               0.2914
qmul              0.2571
nict              0.2568
msr               0.2527
cmu               0.2468
upc               0.2403
hkust             0.2376
iscas             0.2090
lcc               0.2046
upenn             0.2008
xmu               0.1652
lingua*           0.1323
kcsl*#            0.0475
ksu               0.0464

GALE Subset

Overall BLEU Scores

Site ID           BLEU-4
google            0.1470
isi               0.1413
lw                0.1299
edinburgh*#       0.1199
itcirst           0.1194
nrc               0.1194
rwth              0.1187
ict               0.1185
bbn               0.1165
umd-jhu           0.1140
cmu               0.1135
ntt               0.1116
nict              0.1106
hkust             0.0984
msr               0.0972
qmul              0.0943
upc               0.0931
upenn             0.0923
iscas             0.0860
lcc               0.0813
xmu               0.0747
lingua*           0.0663
ksu               0.0218
kcsl*#            0.0199

Newswire BLEU Scores

Site ID           BLEU-4
google            0.1905
isi               0.1685
lw                0.1596
ict               0.1515
edinburgh*#       0.1467
rwth              0.1448
bbn               0.1433
umd-jhu           0.1419
nrc               0.1404
itcirst           0.1377
cmu               0.1353
ntt               0.1350
msr               0.1280
hkust             0.1161
nict              0.1155
qmul              0.1102
upenn             0.1068
upc               0.1039
iscas             0.0947
lcc               0.0878
xmu               0.0861
lingua*           0.0657
kcsl*#            0.0178
ksu               0.0138

Newsgroup BLEU Scores

Site ID           BLEU-4
google            0.1365
isi               0.1235
edinburgh*#       0.1140
lw                0.1137
ict               0.1130
itcirst           0.1108
nrc               0.1098
nict              0.1075
rwth              0.1071
cmu               0.1054
bbn               0.1049
ntt               0.1026
umd-jhu           0.0978
upenn             0.0941
hkust             0.0892
qmul              0.0858
upc               0.0851
msr               0.0841
lcc               0.0765
iscas             0.0745
lingua*           0.0687
xmu               0.0681
ksu               0.0249
kcsl*#            0.0177

Broadcast News BLEU Scores

Site ID           BLEU-4
isi               0.1441
google            0.1409
lw                0.1343
rwth              0.1231
itcirst           0.1193
nrc               0.1192
cmu               0.1159
bbn               0.1146
ict               0.1146
edinburgh*#       0.1110
ntt               0.1096
nict              0.1090
umd-jhu           0.1084
hkust             0.1005
upc               0.0986
qmul              0.0951
msr               0.0922
iscas             0.0891
upenn             0.0882
lcc               0.0814
xmu               0.0705
lingua*           0.0609
kcsl*#            0.0204
ksu               0.0192

Broadcast Conversation BLEU Scores

Site ID           BLEU-4
isi               0.1280
google            0.1262
edinburgh*#       0.1119
lw                0.1112
itcirst           0.1106
nict              0.1106
umd-jhu           0.1102
nrc               0.1095
bbn               0.1060
ntt               0.1016
rwth              0.1013
ict               0.0990
cmu               0.0973
hkust             0.0891
msr               0.0873
qmul              0.0870
upc               0.0848
iscas             0.0842
upenn             0.0815
lcc               0.0796
xmu               0.0753
lingua*           0.0700
ksu               0.0270
kcsl*#            0.0223

Unlimited Data Track

NIST Subset

Overall BLEU Scores

Site ID           BLEU-4
google            0.3496
rwth              0.2975
edinburgh*#       0.2843
cmu               0.2449
casia             0.1894
xmu               0.1713

Newswire BLEU Scores

Site ID           BLEU-4
google            0.3634
rwth              0.2974
edinburgh*#       0.2852
cmu               0.2430
casia             0.1905
xmu               0.1696

Newsgroup BLEU Scores

Site ID           BLEU-4
google            0.2870
edinburgh*#       0.2450
rwth              0.2307
cmu               0.2004
casia             0.1709
xmu               0.1618

Broadcast News BLEU Scores

Site ID           BLEU-4
google            0.3649
rwth              0.3509
edinburgh*#       0.3142
cmu               0.2644
casia             0.1889
xmu               0.1818

GALE Subset

Overall BLEU Scores

Site ID           BLEU-4
google            0.1526
edinburgh*#       0.1187
rwth              0.1172
cmu               0.1034
casia             0.0900
xmu               0.0793

Newswire BLEU Scores

Site ID           BLEU-4
google            0.2057
edinburgh*#       0.1465
rwth              0.1436
cmu               0.1158
casia             0.1001
xmu               0.0817

Newsgroup BLEU Scores

Site ID           BLEU-4
google            0.1432
edinburgh*#       0.1070
rwth              0.1032
cmu               0.1015
casia             0.0916
xmu               0.0782

Broadcast News BLEU Scores

Site ID           BLEU-4
google            0.1482
rwth              0.1224
edinburgh*#       0.1090
cmu               0.1020
casia             0.0891
xmu               0.0775

Broadcast Conversation BLEU Scores

Site ID           BLEU-4
google            0.1206
edinburgh*#       0.1157
rwth              0.1010
cmu               0.0957
casia             0.0812
xmu               0.0801


Unlimited Plus Data Track

NIST Subset

BLEU-4

Site ID   Language   Overall   Newswire   Newsgroup   Broadcast News
google    Arabic     0.4569    0.5060     0.3727      0.4076
google    Chinese    0.3615    0.3725     0.2926      0.3859

GALE Subset

BLEU-4

Site ID   Language   Overall   Newswire   Newsgroup   Broadcast News   Broadcast Conversation
google    Arabic     0.2024    0.2820     0.1359      0.1932           0.1925
google    Chinese    0.1576    0.2086     0.1454      0.1532           0.1300


Release History

  • Version 1: Initial release of preliminary results to evaluation participants
  • Version 2: Added late and bug-fixed submissions; added METEOR, TER, and BLEU-refinement scores.
  • Version 3: Public version of the results (included only the BLEU scores for primary systems)
  • Version 4: Public version of the results, updated disclaimer