The NIST 2007 Automatic Content Extraction Evaluation (ACE07) was part of an ongoing series of evaluations dedicated to the development of technologies that automatically infer meaning from language data. NIST conducts these evaluations in order to support information extraction research and help advance the state of the art in information extraction technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities.
Disclaimer: These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers were generally from research systems, not commercially available products. Since ACE07 was an evaluation of research algorithms, the ACE07 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for scoring. The systems themselves were not evaluated. The data, protocols, and metrics employed in this evaluation were chosen to support Information Extraction research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems. For that reason, this should not be considered a product testing exercise.
The ACE07 evaluation consisted of five main tasks, three mention level tasks, and three diagnostic tasks (not reviewed here). These tasks required systems to process language data in documents and then to output, for each of these documents, information about the entities, values, temporal expressions, relations and events mentioned or discussed in them. These tasks were evaluated separately for each of the ACE languages (Arabic, Chinese, English, and Spanish).
View the official ACE evaluation specification document for a complete listing of the evaluation protocols and the list of language and task combinations evaluated (page 5, table 10).
ACE-07 reused the entire evaluation test set from ACE-05. In addition, new data was included that came from the REFLEX corpus (a three-way translated/annotated data set between English, Arabic, and Chinese sources). This added data originated from the ACE-05 data, so 10k words of Arabic were translated into both Chinese and English, annotated for ACE entities and temporal expressions, and used for the evaluation.
Source Data
There were four evaluation source sets, one for each language under test. The English sources contained approximately 70,000 words from the following domains:
Both the Arabic and Chinese source sets contained approximately 70,000 words (1.5 Chinese characters = 1 word) from the following domains:
All of the Spanish source data was taken from Newswire data and the test set size was approximately 50,000 words.
Reference Data
All of the ACE test data were fully annotated by the Linguistic Data Consortium.
Each ACE task is a composite task involving detection and one or more of recognition, clustering, and normalization. Multiple attributes are considered important and are measured individually; overall performance is measured using a value formula that applies weights to each attribute.
The value score is defined to be the sum of all values of all of the system's output tokens, normalized by the sum of the values of the reference data. The possible value of a system output token depends on how closely it matches that of the reference token to which it is mapped. A value score can range from a negative score up to 100%.
View the appendices of the official ACE evaluation specification document for a complete description of the ACE scoring formulas.
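The normalization described above can be sketched in a few lines. This is only an illustration of the "sum of system token values divided by sum of reference values" definition: the function name `value_score`, the per-token numbers, and the negative value assigned to a spurious output token are all assumptions made for the example, not the weights defined in the specification appendices.

```python
from typing import Iterable


def value_score(system_token_values: Iterable[float],
                reference_token_values: Iterable[float]) -> float:
    """Return the value score as a percentage.

    system_token_values: value credited to each system output token
        (assumed here to be negative for spurious output).
    reference_token_values: value of each reference token.
    """
    reference_total = sum(reference_token_values)
    if reference_total <= 0:
        raise ValueError("reference data must carry positive total value")
    return 100.0 * sum(system_token_values) / reference_total


# Hypothetical numbers: three reference tokens worth 1.0 each. The system
# matches one fully, matches one partially, and emits one false alarm.
reference = [1.0, 1.0, 1.0]
system = [1.0, 0.6, -0.75]
score = value_score(system, reference)
print(f"value score: {score:.1f}%")
```

Because spurious output subtracts value in this sketch, a system that produces mostly false alarms ends up with a negative score, which is consistent with the range stated above.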
The table below lists the sites that registered and processed the evaluation test data for one or more of the five main ACE tasks for the 2007 ACE evaluation. The letters 'A', 'C', 'E', and 'S' are used to identify which language(s) were processed at each site.
| Site | entities | relations | events | time | values | emd | rmd | vmd |
|---|---|---|---|---|---|---|---|---|
| BBN Technologies | AE | E | E | . | . | AE | E | E |
| # Chinese Academy of Science - Institute of Automation | . | . | . | C | . | C | . | . |
| # Chinese Academy of Science - Institute of Software | C | C | . | . | . | C | . | . |
| Fudan University | CE | . | . | . | . | CE | . | . |
| IBM | AES | . | . | E | . | AES | . | . |
| Language Computer Corporation | CE | C | . | . | . | ACE | CE | . |
| Lockheed Martin | CE | . | . | CE | . | CE | . | . |
| * Macquarie University | . | . | . | E | . | . | . | . |
| # Northeastern University of China | C | . | . | . | . | C | . | . |
| # Polytechnic University of Hong Kong | C | . | . | C | . | C | . | . |
| SAIC | E | . | . | . | . | E | . | . |
| SUNY - University of Albany | . | C | . | . | . | . | . | . |
| Technical University of Catalonia | . | . | . | . | . | E | E | . |
| University of Amsterdam | . | . | . | E | . | . | . | . |
| Universidad Carlos III de Madrid | . | . | . | S | . | . | . | . |
| # XIAMEN University | C | . | . | C | . | . | . | . |
# Sites with incomplete participation: failed to attend the evaluation workshop. Value scores from these systems are not included here.
* Site with excused absence from the evaluation workshop (medical).
The tables below list the official results of the NIST 2007 Automatic Content Extraction evaluation. Scores for each site's primary system are shown and are ordered by their "overall" value score. In some cases, after preliminary ACE scores were released to the participants, bug-fixed systems were submitted. Scores for these revised systems are shown at the bottom of each chart.
There were no participants in the Value task for any language.
Results for ACE Entities (EDR task)
The ACE "Entity Detection and Recognition" task requires systems to identify the occurrences of a specified set of entities {Persons, Organizations, Locations, Facilities, Geo-Politicals, Weapons, Vehicles} in the source language documents. Complete descriptions of ACE entities and entity attributes that are to be detected can be found in the ACE evaluation specification document.
Table (1a) lists the overall value score for the Arabic evaluation test set, and breaks out the value score for each of the three domains.
| Table 1a - Arabic - Entities | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| BBN Technologies | 48.8 | 51.9 | 49.4 | 42.1 |
| IBM | 45.4 | 49.4 | 46.6 | 34.6 |
| Note: The Wilcoxon Signed Ranks test comparing system performance at the document level finds that the difference in performance between these two systems is statistically significant at the 95% confidence level. This test was run over ALL documents. | ||||
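The significance notes in these tables rest on a paired comparison of per-document scores. As a rough sketch of how such a comparison works, the following implements a two-sided Wilcoxon signed-rank test with a normal approximation. The function name, the per-document scores, and the choice of a two-sided test without continuity or tie-variance correction are assumptions for illustration; the source does not state which variant NIST used.

```python
from math import sqrt
from statistics import NormalDist


def wilcoxon_signed_rank(scores_a, scores_b):
    """Two-sided Wilcoxon signed-rank test on paired per-document scores.

    Returns (w_plus, p_value). Uses the normal approximation with no
    continuity or tie-variance correction; zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, p_value


# Hypothetical per-document value scores for two systems on ten documents.
system_a = [51.2, 47.9, 55.0, 40.3, 62.1, 49.8, 38.4, 57.6, 44.0, 53.3]
system_b = [48.0, 45.1, 50.2, 39.9, 58.7, 44.4, 36.0, 51.5, 43.1, 49.6]
w_plus, p_value = wilcoxon_signed_rank(system_a, system_b)
significant = p_value < 0.05  # corresponds to the 95% confidence level
```

The normal approximation is coarse for small samples; with only a handful of documents an exact test would be the safer choice.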
Table (1b) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 1b - Chinese - Entities | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| Language Computer Corporation | 45.0 | 49.7 | 46.9 | 35.0 |
| Fudan University | 28.8 | 35.6 | 30.2 | 18.4 |
| Lockheed Martin | 26.9 | 30.3 | 26.1 | 25.7 |
| Note: The Wilcoxon Signed Ranks test comparing system performance at the document level finds that the difference in performance between each pair of these systems is statistically significant at the 95% confidence level. This test was run over ALL documents. | ||||
Table (1c) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 1c - English - Entities | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs |
| BBN Technologies | 56.3 | 44.7 | 65.4 | 58.1 | 49.2 | 39.2 | 52.7 |
| IBM | 52.7 | 48.7 | 65.9 | 52.8 | 45.4 | 44.0 | 45.8 |
| Lockheed Martin | 46.1 | 50.5 | 50.0 | 46.8 | 39.5 | 39.7 | 42.1 |
| Fudan University | 24.2 | 21.0 | 34.7 | 22.9 | 34.9 | 14.6 | 20.7 |
| #Language Computer Corporation | 20.1 | 25.2 | 47.6 | 13.0 | 8.7 | 19.7 | 16.9 |
| # SAIC | . | . | . | . | . | . | . |
| Note: The Wilcoxon Signed Ranks test comparing system performance at the document level finds that all differences in system performance are statistically significant at the 95% confidence level, except when comparing Fudan University to Language Computer Corporation, in which case no significant difference is found. This test was run over ALL documents. The Wilcoxon Signed Ranks test finds that Language Computer Corporation's revised submission lies between the Fudan University submission and the Lockheed Martin submission, in terms of statistical significance. | |||||||
| Corrected Submission | |||||||
| #Language Computer Corporation - revised | 35.8 | 25.2 | 47.6 | 39.3 | 8.7 | 19.7 | 27.3 |
# SAIC was a first-time participant. Positive value was not achieved.
# Bug in system caused original submission to have invalid byte offsets, resulting in a low score. A script corrected these offsets, resulting in the revised submission.
Table (1d) lists the overall value score for the Spanish evaluation test set (only the newswire domain was used).
| Table 1d - Spanish - Entities | |
|---|---|
| Site | Overall (newswire) |
| IBM | 51.0 |
The ACE "Relation Detection and Recognition" task requires systems to identify the occurrences of a specified set of relations {Artifacts, GEN-Affiliation, Metonymy, Org-Affiliation, Part-Whole, Person-Social, Physical} in the source language documents. Complete descriptions of ACE relations and relation attributes that were to be detected can be found in the ACE evaluation specification document.
Table (2a) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 2a - English - Relations | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs |
| BBN Technologies | 21.6 | 11.0 | 24.7 | 21.2 | 32.4 | 19.6 | 18.2 |
Table (2b) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 2b - Chinese - Relations | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| Language Computer Corporation | 17.6 | 16.8 | 18.7 | 15.7 |
| #SUNY - University of Albany | . | . | . | . |
#SUNY - University of Albany was a first-time participant. Positive value was not achieved.
The ACE "Event Detection and Recognition" task requires systems to identify the occurrences of a specified set of events {Life, Movement, Transaction, Business, Conflict, Contact, Personnel, Justice} in the source language documents. Complete descriptions of ACE events and event attributes that are to be detected can be found in the ACE evaluation specification document.
Table (3a) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 3a - English - Events | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs |
| BBN Technologies | 13.4 | 7.4 | 12.9 | 15.9 | 6.6 | 11.3 | 15.0 |
The ACE "TERN" task requires systems to identify the occurrences of a specified set of temporal expressions and specific attributes about the expressions {Value, Modifier, Anchor value, Anchor directionality, Set} and to normalize the expressions. Complete descriptions of ACE TERN and TERN attributes that were to be detected can be found in the ACE evaluation specification document.
Table (4a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score by domain.
| Table 4a - Chinese - TERN | |||
|---|---|---|---|
| Site | Overall | Newswire | Weblogs |
| Lockheed Martin | 4.0 | 3.4 | 5.1 |
| Corrected Submission | |||
| Lockheed Martin (revised) | 14.8 | 9.9 | 24.8 |
Table (4b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 4b - English - TERN | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs |
| Lockheed Martin | 61.6 | 44.2 | 68.4 | 67.4 | 52.6 | 63.1 | 51.4 |
| IBM | 59.3 | 48.2 | 68.6 | 60.9 | 60.2 | 58.2 | 52.9 |
| University of Amsterdam | 45.0 | 46.6 | 67.8 | 32.2 | 64.2 | 54.1 | 44.4 |
| Macquarie University | 24.2 | 20.9 | 43.4 | 21.9 | 38.7 | 26.6 | 11.0 |
| Note: The Wilcoxon Signed Ranks test, comparing system performance at the document level, finds no difference in system performance between Lockheed Martin and IBM, but all other comparisons found the differences to be statistically significant at the 95% confidence level. This test was run over ALL documents. The Wilcoxon Signed Ranks test finds that the University of Amsterdam's revised submission is no different than IBM's original submission, in terms of statistical significance. The University of Amsterdam's revised submission was found to be significantly different from Macquarie University's revised submission. | |||||||
| Corrected Submissions | |||||||
| # University of Amsterdam (revised) | 58.2 | 46.6 | 67.8 | 57.3 | 64.2 | 59.0 | 54.8 |
| # Macquarie University (revised) | 48.3 | 30.0 | 44.4 | 54.2 | 38.7 | 55.9 | 44.8 |
Table (4c) lists the overall value score for the Spanish evaluation test set (only the newswire domain was used).
| Table 4c - Spanish - TERN | |
|---|---|
| Site | Overall (newswire) |
| Universidad Carlos III de Madrid | 46.5 |
The mention level tasks are designed to measure a system's ability to correctly identify all mentions of the ACE entities, relations, and events. The same set of types and attributes listed above apply to the mention level tasks.
The tables below list the official results of the NIST 2007 Automatic Content Extraction Evaluation. Scores for each site's primary system are shown. Systems are ordered by their "overall" value score.
Table (5a) lists the overall value score for the Arabic evaluation test set, and breaks out the value score for each of the three domains.
| Table 5a - Arabic - Entity Mentions | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| BBN Technologies | 73.4 | 78.7 | 73.2 | 64.1 |
| IBM | 71.9 | 78.6 | 72.5 | 57.1 |
| Language Computer Corporation | 67.3 | 76.1 | 66.2 | 54.4 |
| Note: The Wilcoxon Signed Ranks test, comparing system performance at the document level, finds that each of the system differences is statistically significant at the 95% confidence level. This test was run over ALL documents. | ||||
Table (5b) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 5b - Chinese - Entity Mentions | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| Language Computer Corporation | 76.7 | 80.7 | 76.8 | 72.0 |
| Fudan University | 59.4 | 67.2 | 59.7 | 50.5 |
| Lockheed Martin | 42.0 | 48.3 | 40.8 | 38.1 |
| Note: The Wilcoxon Signed Ranks test, comparing system performance at the document level, finds that each of the system differences is statistically significant at the 95% confidence level. This test was run over ALL documents. | ||||
Table (5c) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 5c - English - Entity Mentions | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs |
| IBM | 82.9 | 87.0 | 85.4 | 82.8 | 93.8 | 72.5 | 77.3 |
| BBN Technologies | 81.2 | 83.6 | 82.5 | 80.6 | 92.7 | 70.9 | 78.5 |
| Technical University of Catalonia | 75.0 | 80.2 | 78.7 | 73.8 | 93.1 | 60.2 | 69.2 |
| Lockheed Martin | 67.3 | 71.3 | 69.7 | 65.8 | 88.2 | 55.0 | 61.7 |
| #Language Computer Corporation | 64.4 | 83.2 | 83.3 | 51.1 | 93.7 | 71.3 | 62.4 |
| Fudan University | 42.3 | 37.6 | 49.9 | 42.2 | 50.3 | 28.0 | 39.5 |
| # SAIC | . | . | . | . | . | . | . |
| Note: The Wilcoxon Signed Ranks test, comparing system performance at the document level, finds that most of the differences in system performance are statistically significant at the 95% confidence level. An exception exists with LCC. The LCC system is found to be better than the Lockheed Martin system and equivalent to the Technical University of Catalonia system. This test was run over ALL documents. The Wilcoxon Signed Ranks test finds that Language Computer Corporation's revised submission is no different than BBN's original submission, in terms of statistical significance. Comparisons with each of the other systems do find a difference. | |||||||
| Corrected Submission | |||||||
| #Language Computer Corporation - revised | 80.9 | 83.2 | 83.3 | 80.4 | 93.7 | 71.3 | 76.6 |
# SAIC was a first-time participant. Positive value was not achieved.
# Bug in system caused original submission to have invalid byte offsets resulting in low score. A script corrected these offsets resulting in the revised submission.
Table (5d) lists the overall value score for the Spanish evaluation test set (only the newswire domain was used).
| Table 5d - Spanish - Entity Mentions | |
|---|---|
| Site | Overall (newswire) |
| IBM | 78.7 |
Table (6a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.
| Table 6a - Chinese - Relation Mentions | ||||
|---|---|---|---|---|
| Site | Overall | Broadcast News | Newswire | Weblogs |
| Language Computer Corporation | 29.7 | 28.8 | 31.0 | 27.3 |
Table (6b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 6b - English - Relation Mentions | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs |
| BBN Technologies | 33.4 | 24.7 | 34.0 | 33.7 | 42.6 | 31.7 | 34.8 |
| Technical University of Catalonia | 33.1 | 24.1 | 38.2 | 33.2 | 43.6 | 20.5 | 27.8 |
| #Language Computer Corporation | 32.5 | 29.3 | 35.3 | 34.6 | 38.4 | 21.1 | 23.0 |
| Note: The Wilcoxon Signed Ranks test, comparing system performance at the document level, does not find the differences in system performance to be statistically significant. This test was run for ALL documents. # The Wilcoxon Signed Ranks test results did not change when using Language Computer Corporation's revised submission. | |||||||
| Corrected Submission | |||||||
| # Language Computer Corporation - revised | 32.5 | 25.5 | 42.3 | 41.0 | 41.2 | 54.5 | 4.6 |
# Bug in system caused original submission to have invalid byte offsets resulting in low score. A script corrected these offsets resulting in the revised submission.
Table (7a) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.
| Table 7a - English - Event Mentions | |||||||
|---|---|---|---|---|---|---|---|
| Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs |
| BBN Technologies | 24.1 | 18.2 | 25.7 | 25.4 | 22.1 | 19.1 | 24.8 |