Date of Release: Tue, Jan. 10th, 2006
Version 2
The NIST 2005 Automatic Content Extraction Evaluation (ACE05) was part of an ongoing series of evaluations dedicated to the development of technologies that automatically infer meaning from language data. NIST conducts these evaluations to support information extraction research and help advance the state of the art in information extraction technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities.
Disclaimer: These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers were generally from research systems, not commercially available products. Since ACE05 was an evaluation of research algorithms, the ACE05 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for scoring. The systems themselves were not evaluated. The data, protocols, and metrics employed in this evaluation were chosen to support Information Extraction research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain, or in the amount of data used to build a system, can greatly influence system performance, and changing the task protocols could reveal different performance strengths and weaknesses for these same systems. For that reason, this should not be considered a product testing exercise.
The ACE05 evaluation consisted of five main tasks, three mention level tasks, and three diagnostic tasks. These tasks require systems to process language data in documents and then to output, for each of these documents, information about the entities, values, temporal expressions, relations, and events mentioned or discussed in them. These tasks are evaluated separately for one or more of the ACE languages (Arabic, Chinese, and/or English).
View the official ACE evaluation specification document for a complete listing of the evaluation protocols and the list of language and task combinations evaluated (page 4, table 8).
There were three evaluation source sets, one for each language under test. The English source set contained approximately 50,000 words from the following six domains: Broadcast Conversations, Broadcast News, Newswire, Telephone, Usenet Newsgroups, and Weblogs.
View the official ACE evaluation specification document for a complete listing of the sources and time epochs of the evaluation data (page 6, table 10).
Both the Chinese and English evaluation test sets were fully annotated by the Linguistic Data Consortium. The Arabic evaluation set was annotated by an outside source, and a shortage of funding resulted in less than 33% of the evaluation data being properly annotated. Arabic results are therefore not posted, because the lack of reference data reduces the statistical power of the results.
Each ACE task is a composite task involving detection and one or more of recognition, clustering, and normalization. Multiple attributes are considered important and are measured individually. Overall performance is measured using a value formula that applies a weight to each attribute.
The value score is defined to be the sum of all values of all of the system's output tokens, normalized by the sum of the values of the reference data. The possible value of a system output token depends on how closely it matches that of the reference token to which it is mapped. A value score can range from a negative score up to 100%.
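The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SystemToken` class, the `value_score` function, and the numeric values are hypothetical, and the sketch assumes that a false alarm (a system token with no reference counterpart) carries a negative value, which is how a score could fall below zero. The actual per-token value formulas and weights are those given in the appendices of the ACE evaluation specification document, not the ones shown here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class SystemToken:
    """One system output token (hypothetical structure for illustration).

    value: credit earned by this token; assumed negative for a false alarm.
    mapped_ref: id of the reference token it is mapped to, or None.
    """
    value: float
    mapped_ref: Optional[str] = None


def value_score(system_tokens: Iterable[SystemToken],
                reference_values: Dict[str, float]) -> float:
    """Sum of system token values, normalized by total reference value (in %)."""
    total_reference = sum(reference_values.values())
    if total_reference <= 0:
        raise ValueError("reference data must carry positive total value")
    total_system = sum(tok.value for tok in system_tokens)
    return 100.0 * total_system / total_reference


# Made-up values, not ACE05 data.
reference = {"e1": 1.0, "e2": 1.0, "e3": 0.5}
output = [
    SystemToken(1.0, "e1"),    # perfect match: full value
    SystemToken(0.5, "e2"),    # partial match: reduced value
    SystemToken(-0.75, None),  # false alarm: assumed negative value
]
print(value_score(output, reference))
```

In this sketch a system that outputs mostly false alarms ends up with a negative score, while a system that reproduces the reference exactly reaches 100%.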
View the appendices of the official ACE evaluation specification document for a complete description of the ACE scoring formulas.
The table below lists the sites that participated in one or more of the five main ACE tasks for the 2005 Automatic Content Extraction evaluation. The letters 'A', 'C', and 'E' are used to identify which language(s) were processed by each site's system:
| Site | entities | relations | events | time | values |
|---|---|---|---|---|---|
| University of Amsterdam | E | - | E | E | - |
| BBN Technologies | ACE | ACE | CE | - | CE |
| #Basis Technology, Inc. | ACE | - | - | - | - |
| University of Colorado | CE | CE | - | - | - |
| #Harbin Institute of Technology | CE | - | - | - | - |
| IBM | ACE | ACE | E | - | - |
| Janya Inc. | - | - | - | E | - |
| Language Computer Corporation | - | - | - | E | - |
| Lockheed Martin | E | - | E | E | E |
| New York University | C | - | - | - | - |
| Peking University | - | - | - | C | C |
| #Polytechnic University of Hong Kong | CE | - | - | C | - |
| SRA team #1 | E | E | - | - | - |
| SRA team #2 | E | E | - | - | - |
| #XIAMEN University | C | - | - | - | - |
# Sites that did not fulfill the requirement of attending the follow-up workshop.
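For readers who want to process the participation table programmatically, the following is a minimal sketch of decoding one row. The helper names `decode_languages` and `parse_row` are hypothetical, and the sketch assumes the row layout shown above: a site name, optionally prefixed with `#`, followed by one cell per task containing some combination of the letters A, C, and E, or `-` for no participation.

```python
LANGUAGES = {"A": "Arabic", "C": "Chinese", "E": "English"}
TASKS = ("entities", "relations", "events", "time", "values")


def decode_languages(cell):
    """Expand a participation cell such as 'ACE' or '-' into language names."""
    cell = cell.strip()
    if cell == "-":
        return []
    unknown = [ch for ch in cell if ch not in LANGUAGES]
    if unknown:
        raise ValueError(f"unexpected language code(s): {unknown}")
    return [LANGUAGES[ch] for ch in cell]


def parse_row(row, tasks=TASKS):
    """Parse one pipe-delimited row into (site, missed_workshop, {task: languages})."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    site, codes = cells[0], cells[1:]
    if len(codes) != len(tasks):
        raise ValueError("row does not match the expected task columns")
    # A leading '#' marks a site that did not attend the follow-up workshop.
    missed_workshop = site.startswith("#")
    languages = dict(zip(tasks, map(decode_languages, codes)))
    return site.lstrip("#"), missed_workshop, languages


site, missed, langs = parse_row("| BBN Technologies | ACE | ACE | CE | - | CE |")
print(site, missed, langs["events"])
```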
The tables below list the official results of the NIST 2005 Automatic Content Extraction evaluation. Scores for each site's primary system are shown. Systems are ordered by their "overall" value score.
Table (1a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 1a - Chinese - Entities | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| IBM | 69.2 | 70.5 | 69.6 | 65.0 |
| BBN Technologies | 68.8 | 67.9 | 70.1 | 67.1 |
| New York University | 65.7 | 64.3 | 69.9 | 65.7 |
| University of Colorado | 61.1 | 64.9 | 57.4 | 63.1 |
| Polytechnic University of Hong Kong | 49.4 | 51.3 | 50.2 | 42.4 |
| XIAMEN University | 47.6 | 44.8 | 51.0 | 44.0 |
| Harbin Institute of Technology | 43.8 | 44.1 | 48.0 | 30.1 |
| Basis Technology, Inc. | 3.8 | 3.0 | 4.7 | 2.8 |
| Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, no difference in system performance was found between IBM and BBN Technologies at the 5% significance level. | ||||
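The notes under the results tables refer to a Wilcoxon Signed Ranks test applied to paired document-level scores. The sketch below illustrates that kind of comparison in plain Python. It is an assumption-laden illustration, not NIST's procedure: the function name `wilcoxon_signed_rank` is hypothetical, the p-value uses the large-sample normal approximation without continuity or tie correction, and the per-document scores are invented for the example.

```python
import math


def wilcoxon_signed_rank(scores_a, scores_b):
    """Two-sided Wilcoxon signed-rank test on paired per-document scores.

    Returns (w, p): w is the smaller of the positive and negative rank sums,
    p is a normal-approximation p-value (reasonable only when the number of
    documents is not too small).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired document by document")
    # Documents on which both systems score the same carry no information.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, min(p, 1.0)


# Invented per-document value scores for two systems (not ACE05 data).
system_a = [71.0, 64.5, 80.2, 55.1, 69.9, 73.4, 60.0, 66.3, 77.8, 58.2]
system_b = [70.2, 66.0, 79.0, 57.3, 68.1, 74.9, 59.1, 67.5, 76.0, 60.4]
w, p = wilcoxon_signed_rank(system_a, system_b)
print(f"W = {w}, p = {p:.3f}, differ at 5% level: {p < 0.05}")
```

A p-value at or above 0.05 corresponds to the "no difference" wording used in the notes. A smaller p-value corresponds to a difference reported as significant.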
Table (1b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 1b - English - Entities | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| SRA team #1 | 71.9 | 72.7 | 77.1 | 72.8 | 62.9 | 61.5 | 67.6 |
| BBN Technologies | 71.7 | 71.8 | 75.4 | 72.2 | 67.7 | 59.7 | 71.6 |
| SRA team #2 | 71.3 | 67.2 | 77.3 | 73.1 | 59.3 | 60.7 | 69.0 |
| IBM | 69.6 | 61.7 | 76.2 | 72.0 | 57.8 | 60.5 | 66.3 |
| University of Colorado | 68.5 | 67.2 | 73.7 | 72.7 | 65.1 | 50.2 | 61.6 |
| Lockheed Martin | 57.4 | 58.7 | 63.3 | 60.0 | 48.5 | 41.7 | 50.6 |
| University of Amsterdam | 27.3 | 26.8 | 31.5 | 27.6 | 22.6 | 20.3 | 24.4 |
| Polytechnic University of Hong Kong | 20.8 | 20.3 | 31.7 | 25.1 | 6.6 | -10.8 | 13.1 |
| Harbin Institute of Technology | 15.2 | 9.1 | 26.9 | 15.3 | 10.8 | -12.9 | 14.7 |
| Basis Technology, Inc. | 4.0 | 1.4 | 9.6 | 3.6 | -3.8 | -3.3 | 2.2 |
| Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, no difference in system performance was found between SRA team #1, BBN Technologies, SRA team #2, and IBM at the 5% significance level. | |||||||
The ACE "Relation Detection and Recognition" task requires systems to identify the occurrences of a specified set of relations {Artifacts, Gen-Affiliation, Metonymy, Org-Affiliation, Part-Whole, Person-Social, Physical} in the source language documents. Complete descriptions of the ACE relations and relation attributes to be detected can be found in the ACE evaluation specification document.
Table (2a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 2a - Chinese - Relations | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| IBM | 26.8 | 24.4 | 28.6 | 26.6 |
| BBN Technologies | 22.7 | 20.6 | 24.6 | 21.8 |
| University of Colorado | 21.0 | 22.4 | 20.4 | 19.4 |
| Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, a difference in IBM's system performance was found to be statistically significant at the 5% level. | ||||
Table (2b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 2b - English - Relations | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| SRA team #2 | 25.2 | 12.3 | 32.6 | 27.3 | 16.7 | 16.9 | 20.5 |
| BBN Technologies | 25.1 | 16.0 | 28.3 | 26.2 | 35.6 | 19.6 | 17.0 |
| IBM | 23.8 | 8.8 | 33.1 | 25.1 | 19.1 | 13.9 | 15.9 |
| SRA team #1 | 23.4 | 11.8 | 29.4 | 24.7 | 21.3 | 17.3 | 18.2 |
| University of Colorado | 20.1 | 8.7 | 26.1 | 22.7 | 20.1 | 0.6 | 17.8 |
| Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, no difference in system performance was found between SRA team #2, BBN Technologies, and IBM at the 5% significance level. | |||||||
The ACE "Event Detection and Recognition" task requires systems to identify the occurrences of a specified set of events {Life, Movement, Transaction, Business, Conflict, Contact, Personnel, Justice} in the source language documents. Complete descriptions of the ACE events and event attributes to be detected can be found in the ACE evaluation specification document.
Table (3a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 3a - Chinese - Events | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| BBN Technologies | 10.2 | 11.2 | 10.8 | 2.0 |
Table (3b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 3b - English - Events | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| BBN Technologies | 14.4 | 6.2 | 12.3 | 17.8 | 15.2 | 13.1 | 15.4 |
| IBM | 6.7 | -3.5 | 8.3 | 10.4 | -5.5 | 3.5 | 3.2 |
| Lockheed Martin | 3.5 | 4.0 | 4.4 | 5.6 | -2.4 | -3.9 | 1.0 |
| University of Amsterdam | -8.6 | -13.4 | -8.4 | -6.9 | -11.9 | -8.8 | -10.4 |
| Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, a difference in BBN's system performance was found to be statistically significant at the 5% level. | |||||||
The ACE "Value Detection" task requires systems to identify the occurrences of a specified set of values {Contact-Info, Numeric, and when part of an event: Crime, Job-title, Sentence} in the source language documents. Complete descriptions of the ACE values and value attributes to be detected can be found in the ACE evaluation specification document.
Table (4a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 4a - Chinese - Values | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| Peking University | 49.7 | 48.7 | 42.1 | 71.2 |
| BBN Technologies | 45.7 | 47.4 | 38.3 | 63.8 |
Table (4b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 4b - English - Values | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| BBN Technologies | 34.8 | 26.6 | 12.6 | 43.4 | 25.0 | 48.7 | 38.5 |
| Lockheed Martin | 25.5 | 30.0 | 0.8 | 26.4 | 25.0 | 49.7 | 32.6 |
The ACE "TERN" task requires systems to identify the occurrences of a specified set of temporal expressions, along with specific attributes of those expressions {Value, Modifier, Anchor value, Anchor directionality, Set}, and, for the English data, to normalize the expressions. Complete descriptions of ACE TERN and the TERN attributes to be detected can be found in the ACE evaluation specification document.
Table (5a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 5a - Chinese - Temporal Expressions | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| Polytechnic University of Hong Kong | 83.7 | 81.8 | 84.3 | 86.2 |
| Peking University | 79.0 | 75.0 | 82.9 | 78.2 |
Table (5b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 5b - English - Temporal Expressions | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| Language Computer Corporation | 63.7 | 48.0 | 65.6 | 72.6 | 56.2 | 63.4 | 61.7 |
| Lockheed Martin | 56.2 | 39.8 | 62.6 | 53.1 | 58.6 | 55.8 | 62.8 |
| Janya Inc. | 54.8 | 40.6 | 59.8 | 62.7 | 37.3 | 52.0 | 53.6 |
| University of Amsterdam | 33.2 | 23.8 | 32.4 | 39.4 | 32.1 | 24.8 | 42.1 |
| Section for the ACE05 Mention Level Tasks |
The mention level tasks are designed to measure a system's ability to correctly identify all mentions of the ACE entities, relations, and events. The same set of types and attributes listed above applies to the mention level tasks.
The table below lists the sites that participated in one or more of the three ACE mention level tasks for this year's automatic content extraction evaluation. The letters 'A', 'C', and 'E' are used to identify which language(s) were processed for each task:
| Site | entity mentions | relation mentions | event mentions |
|---|---|---|---|
| BBN Technologies | ACE | ACE | - |
| #Basis Technology, Inc. | ACE | - | - |
| #Chinese Academy of Sciences | C | - | - |
| University of Colorado | CE | CE | - |
| #Harbin Institute of Technology | CE | - | - |
| Lockheed Martin | E | - | E |
| New York University | C | - | - |
| Peking University | C | - | - |
| #Polytechnic University of Hong Kong | CE | C | - |
| SRA team #1 | E | - | - |
| SRA team #2 | E | - | - |
| University of Amsterdam | E | - | E |
| #XIAMEN University | C | - | - |
# Sites that did not fulfill the requirement of attending the follow-up workshop.
The tables below list the official results of the NIST 2005 Automatic Content Extraction Evaluation. Scores for each site's primary system are shown. Systems are ordered by their "overall" value score.
Table (6a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 6a - Chinese - Entity Mentions | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| New York University | 79.1 | 78.4 | 78.3 | 82.9 |
| BBN Technologies | 78.8 | 79.2 | 78.9 | 77.9 |
| University of Colorado | 73.0 | 79.5 | 68.1 | 71.9 |
| Harbin Institute of Technology | 62.8 | 64.2 | 65.7 | 51.8 |
| Peking University | 62.2 | 61.2 | 62.9 | 62.3 |
| XIAMEN University | 61.3 | 58.5 | 64.1 | 60.3 |
| Chinese Academy of Sciences | 51.4 | 50.8 | 50.3 | 55.8 |
| Basis Technology, Inc. | 46.6 | 44.9 | 48.5 | 45.2 |
| Polytechnic University of Hong Kong | 43.4 | 45.6 | 42.3 | 41.4 |
Table (6b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 6b - English - Entity Mentions | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| BBN Technologies | 85.1 | 86.8 | 84.2 | 84.6 | 93.7 | 74.0 | 85.0 |
| SRA team #2 | 84.7 | 83.3 | 84.5 | 84.8 | 93.8 | 75.8 | 82.4 |
| SRA team #1 | 83.7 | 84.0 | 84.4 | 84.0 | 89.9 | 74.6 | 81.5 |
| University of Colorado | 82.9 | 85.1 | 82.4 | 85.2 | 91.3 | 66.3 | 79.7 |
| Lockheed Martin | 68.3 | 69.1 | 73.6 | 73.2 | 59.5 | 55.6 | 66.2 |
| University of Amsterdam | 45.6 | 44.3 | 38.8 | 42.2 | 74.1 | 36.3 | 41.7 |
| Basis Technology, Inc. | 30.5 | 30.4 | 31.7 | 41.9 | 1.2 | 29.8 | 37.7 |
| Polytechnic University of Hong Kong | 29.4 | 27.6 | 27.9 | 23.3 | 56.8 | 12.8 | 28.8 |
| Harbin Institute of Technology | 19.1 | 17.5 | 29.9 | 22.2 | 11.8 | -8.5 | 24.0 |
Table (7a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 7a - Chinese - Relation Mentions | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| BBN Technologies | 31.9 | 29.5 | 33.7 | 32.2 |
| University of Colorado | 30.2 | 31.7 | 28.3 | 33.2 |
| Peking University | 15.0 | 14.6 | 16.5 | 9.9 |
Table (7b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 7b - English - Relation Mentions | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| BBN Technologies | 36.8 | 27.3 | 37.1 | 39.1 | 44.4 | 30.3 | 38.8 |
| University of Colorado | 34.6 | 26.7 | 38.1 | 36.1 | 41.3 | 20.6 | 32.9 |
| Table 8a - Chinese - Event Mentions | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
Table (8b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 8b - English - Event Mentions | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| University of Amsterdam | 7.3 | 13.9 | 4.4 | 7.0 | 17.9 | 6.2 | 8.9 |
| Lockheed Martin | 2.8 | 0.0 | 4.6 | 4.3 | -17.4 | 0.8 | 0.2 |
| Section for the ACE05 Diagnostic Tasks |
The diagnostic tasks are offered to assist researchers. They are designed to measure a system's performance on various ACE tasks when the system is given partial ground truth information (commonly referred to as cheating experiments).
The table below lists the sites that participated in one or more of the three ACE diagnostic tasks for this year's automatic content extraction evaluation. The letters 'C' and 'E' are used to identify which language(s) were processed for each task:
| Site | entities given mentions | relations given entities | events given entities |
|---|---|---|---|
| University of Amsterdam | - | - | E |
| BBN Technologies | CE | CE | CE |
| University of Colorado | CE | CE | - |
| Language Computer Corporation | E | - | - |
| New York University | E | - | E |
The tables below list the official results of the NIST 2005 Automatic Content Extraction Evaluation. Scores for each site's primary system are shown. Systems are ordered by their "overall" value score.
Table (9a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 9a - Chinese - Entities Given Mentions | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| University of Colorado | 90.4 | 90.9 | 90.1 | 90.4 |
| BBN Technologies | 89.9 | 90.8 | 89.4 | 89.4 |
Table (9b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 9b - English - Entities Given Mentions | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| BBN Technologies | 88.9 | 86.4 | 90.4 | 89.1 | 82.5 | 89.4 | 89.5 |
| University of Colorado | 88.0 | 85.3 | 90.4 | 89.4 | 80.1 | 86.0 | 86.0 |
| New York University | 87.0 | 84.1 | 88.7 | 88.4 | 75.2 | 84.8 | 88.3 |
| Language Computer Corporation | 83.1 | 78.1 | 84.2 | 84.0 | 76.7 | 83.6 | 84.7 |
Table (10a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 10a - Chinese - Relations Given Entities, Values, and TIMEX2s | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| University of Colorado | 56.0 | 52.6 | 58.0 | 57.8 |
| BBN Technologies | 53.1 | 50.6 | 54.0 | 57.0 |
Table (10b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 10b - English - Relations Given Entities, Values, and TIMEX2s | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| BBN Technologies | 54.4 | 47.4 | 58.1 | 56.2 | 50.0 | 50.9 | 49.7 |
| University of Colorado | 50.8 | 44.4 | 55.6 | 50.0 | 57.5 | 42.4 | 46.0 |
Table (11a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 11a - Chinese - Events Given Entities, Values, and TIMEX2s | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| BBN Technologies | 25.0 | 30.7 | 23.7 | 9.1 |
Table (11b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 11b - English - Events Given Entities, Values, and TIMEX2s | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet Newsgroups | Weblogs |
| BBN Technologies | 32.7 | 37.0 | 28.2 | 34.9 | 28.6 | 32.4 | 35.4 |
| New York University | 29.7 | 34.2 | 26.3 | 32.4 | 31.4 | 24.3 | 30.4 |
| University of Amsterdam | 19.7 | 20.6 | 18.9 | 21.0 | 12.0 | 13.4 | 23.8 |