RTE-4 GUIDELINES FOR PARTICIPANTS
The Recognizing Textual Entailment (RTE) challenge is an annual exercise that provides a framework for evaluating textual entailment systems and promotes international research in this area. In this evaluation exercise, systems must recognize whether one piece of text entails another.
We define textual entailment as a directional relationship between two text fragments, which we term the Text (T) and the Hypothesis (H). We say that:
For example, given assumed common background knowledge of the business news domain and the following text:
- H1.2 Overture was acquired by Yahoo
- H1.3 Overture was bought
- H1.4 Yahoo is an internet company
2) the truth of H cannot be judged on the basis of T.
- H1.6 Yahoo sold Overture
- H1.8 Overture shareholders will receive $4.75 cash and 0.6108 Yahoo stock for each of their shares.
Textual entailment recognition is the task of deciding, given a T-H pair, whether T entails H.
The three-way RTE task is to decide whether:
When T entails H
Following are some guidelines for deciding whether a Text entails a Hypothesis:
When T does not entail H: CONTRADICTION VS UNKNOWN
Following are some guidelines for determining whether T contradicts H, or the truth of H is unknown based on T:
H2: Jennifer Hawkins is Australia's 20-year-old beauty queen.
T3: In that aircraft accident, four people were killed: the pilot, who was wearing civilian clothes, and three other people who were wearing military uniforms.
H3: Four people were assassinated by the pilot.
H5: However, the documents leaked to ITV News suggest that Menezes, an electrician, walked casually into the subway station and was wearing a light denim jacket.
H6: Shapour Bakhtiar died in 1989.
T7: Five people were killed in another suicide bomb blast at a police station in the northern city of Mosul.
H7: Five people were killed and 20 others wounded in a car bomb explosion outside an Iraqi police station south of Baghdad.
H8: A woman passionately wanted to watch the soccer championship.
H9: 100 or more people lost their lives in a ferry sinking.
In other circumstances, it may be most reasonable to regard the two passages as describing different events. For instance, example T-H7 above was not marked as a contradiction, as it does not seem compelling to regard "another suicide bomb blast" and "a car bomb explosion" as referring to the same event.
(Many other examples can be found at http://nlp.stanford.edu/RTE3-pilot/, where the links to three-way annotated datasets from previous campaigns are provided. The current guidelines for contradiction annotation are based on the guidelines by Marie-Catherine de Marneffe and Christopher Manning, used for the evaluation of the pilot task at RTE-3; for the original version see http://nlp.stanford.edu/RTE3-pilot/contradictions.pdf.)
TEST SET FORMAT
The dataset of Text-Hypothesis pairs is collected by human annotators and consists of four subsets which correspond to different application settings: Information Extraction (IE), Information Retrieval (IR), Question Answering (QA), and Multi-Document Summarization (SUM).
The dataset is formatted as an XML file, as follows:
There are two tasks in this year's RTE challenge:
Teams can participate in either or both tasks. No partial submissions are allowed, i.e. each submission must cover the whole dataset. Each team is allowed to submit up to 6 runs (up to 3 runs for each task). This allows teams who attempt both 3-way and 2-way classification to optimize/train separately for each task. Teams that participate in the 3-way task and do not have a separate strategy for the 2-way task (other than automatically conflating CONTRADICTION and UNKNOWN to NO ENTAILMENT) should not submit separate runs for the 2-way task, because runs for the 3-way task will automatically be scored for both the 3-way task and the 2-way task.
Each run may optionally rank all the T-H pairs in the test set according to their entailment confidence (in decreasing order from the most certain entailment to the least certain). The more confident the system is that T entails H, the higher the pair is ranked. A perfect ranking would place all the pairs for which T entails H before all the pairs for which T does not entail H. Because the evaluation measure for confidence ranking applies only to the 2-way classification task, in three-way runs the pairs tagged as CONTRADICTION and UNKNOWN will be conflated and automatically re-tagged as NO ENTAILMENT for scoring purposes.
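The conflation of three-way labels for two-way scoring can be illustrated with a minimal Python sketch. The label strings come from these guidelines; the function name conflate_to_two_way is hypothetical and is not part of any official scoring tool.

```python
# Label strings as used in these guidelines.
THREE_WAY_LABELS = {"ENTAILMENT", "CONTRADICTION", "UNKNOWN"}


def conflate_to_two_way(label: str) -> str:
    """Map a three-way label to its two-way equivalent.

    CONTRADICTION and UNKNOWN both become NO ENTAILMENT;
    ENTAILMENT is left unchanged.
    """
    if label not in THREE_WAY_LABELS:
        raise ValueError(f"unexpected three-way label: {label!r}")
    return "ENTAILMENT" if label == "ENTAILMENT" else "NO ENTAILMENT"


three_way_run = ["ENTAILMENT", "UNKNOWN", "CONTRADICTION"]
two_way_view = [conflate_to_two_way(label) for label in three_way_run]
```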
Runs will be submitted using a password-protected online submission form on the RTE web page. The link to the submission form will be posted at the same time that the test data set is released. Only teams who have registered for the TAC 2008 RTE track and who have submitted the required Agreement Concerning Dissemination of TAC Results may access the test data and submit runs.
At the time of submission, each team will be asked to fill out the form stating:
NB: Analyses of the test set (either manual or automatic) should not influence in any way the design and tuning of systems that publish their results on the RTE-4 test set. We regard it as acceptable to run automatic knowledge acquisition methods (such as synonym collection) specifically for the lexical and syntactic constructs that will be present in the test set, as long as the methodology and procedures are general and not tuned specifically for the test data. In any case, participants are asked to report any process that was performed specifically for the test set.
RESULT SUBMISSION FORMAT
Results will be submitted as one file per run. Each submitted file must be a plain ASCII file with one line for each T-H pair in the test set, in the following format:
If the run includes confidence ranking, then the pairs in the file should be ordered by decreasing entailment confidence: the first pair should be the one for which the entailment is most certain, and the last pair should be the one for which the entailment is least likely. Thus, in a ranked run, all the pairs classified as ENTAILMENT are expected to appear before all the pairs that are classified as NO ENTAILMENT (for the two-way task) or CONTRADICTION or UNKNOWN (for the three-way task).
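As an illustration of the ordering requirement, the following Python sketch sorts a run by decreasing entailment confidence and checks that every ENTAILMENT pair precedes every other pair. It assumes that the system produces one numeric entailment score per pair, with higher meaning more confident that T entails H; the tuple layout and the function names are hypothetical, and the sketch does not produce the submission file itself.

```python
from typing import List, Tuple

# Hypothetical in-memory layout: (pair_id, label, entailment_score).
Judgment = Tuple[str, str, float]


def rank_run(judgments: List[Judgment]) -> List[Judgment]:
    """Order pairs from most certain entailment to least certain."""
    return sorted(judgments, key=lambda j: j[2], reverse=True)


def entailments_come_first(ranked: List[Judgment]) -> bool:
    """Return True if no ENTAILMENT pair follows a non-ENTAILMENT pair."""
    seen_non_entailment = False
    for _pair_id, label, _score in ranked:
        if label == "ENTAILMENT":
            if seen_non_entailment:
                return False
        else:
            seen_non_entailment = True
    return True


run = [
    ("1", "UNKNOWN", 0.20),
    ("2", "ENTAILMENT", 0.90),
    ("3", "CONTRADICTION", 0.05),
]
ranked = rank_run(run)
```

A run that fails the second check is still a possible ranking, but it means the system's labels and its confidence scores disagree about at least one pair.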
The evaluation of all submitted runs will be automatic. The judgments (classifications) returned by the system will be compared to those manually assigned by the human annotators (the Gold Standard). For the two-way task, a judgment of "NO ENTAILMENT" in a submitted run is considered to match either "CONTRADICTION" or "UNKNOWN" in the Gold Standard. The percentage of matching judgments will provide the accuracy of the run, i.e. the fraction of correct responses.
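The matching rule and the accuracy measure can be sketched in Python as follows. This is not the official scorer: the function names and the list-based input are assumptions made for illustration, and only the label strings and the matching rule are taken from the text above.

```python
from typing import Sequence


def to_two_way(label: str) -> str:
    """Conflate CONTRADICTION, UNKNOWN, and NO ENTAILMENT into NO ENTAILMENT."""
    return "ENTAILMENT" if label == "ENTAILMENT" else "NO ENTAILMENT"


def accuracy(system: Sequence[str], gold: Sequence[str], two_way: bool = False) -> float:
    """Fraction of pairs whose system judgment matches the Gold Standard.

    With two_way=True, a system judgment of NO ENTAILMENT (or CONTRADICTION
    or UNKNOWN in a three-way run) matches a gold label of either
    CONTRADICTION or UNKNOWN.
    """
    if len(system) != len(gold) or not gold:
        raise ValueError("runs must be non-empty and cover the whole test set")
    if two_way:
        system = [to_two_way(label) for label in system]
        gold = [to_two_way(label) for label in gold]
    correct = sum(1 for s, g in zip(system, gold) if s == g)
    return correct / len(gold)


gold = ["ENTAILMENT", "CONTRADICTION", "UNKNOWN", "ENTAILMENT"]
three_way_run = ["ENTAILMENT", "UNKNOWN", "UNKNOWN", "ENTAILMENT"]
two_way_run = ["ENTAILMENT", "NO ENTAILMENT", "NO ENTAILMENT", "NO ENTAILMENT"]
```

In this toy example the three-way run gets 3 of 4 pairs right under three-way scoring, but all 4 under two-way scoring, because its one error confuses CONTRADICTION with UNKNOWN.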
As a second measure, an Average Precision score will be computed for systems that output a confidence-ranked list of all test examples. This measure evaluates the ability of systems to rank all the T-H pairs in the test set according to their entailment confidence (in decreasing order from the most certain entailment to the least certain). The more confident the system is that T entails H, the higher the pair is ranked. A perfect ranking would place all the pairs for which T entails H before all the pairs for which T does not entail H. Average precision is a common evaluation measure for system rankings. It is computed as the average of the system's precision values at all points in the ranked list at which recall increases, that is, at all points in the ranked list for which the Gold Standard annotation is ENTAILMENT. More formally, it can be written as follows:
\[ AP = \frac{1}{R} \sum_{i=1}^{n} E(i) \cdot \frac{\#\mathrm{EntailmentUpTo}(i)}{i} \]
where n is the number of pairs in the test set, R is the total number of ENTAILMENT pairs in the Gold Standard, E(i) is 1 if the i-th ranked pair is marked ENTAILMENT in the Gold Standard and 0 otherwise, and #EntailmentUpTo(i) is the number of Gold Standard ENTAILMENT pairs among the first i ranked pairs.
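A minimal Python sketch of the computation described in words above: walk down the ranked list, and at each position whose Gold Standard label is ENTAILMENT, record the precision so far; the score is the mean of those precision values. The function name and the input layout (gold labels listed in the system's ranked order) are assumptions for illustration, as is the choice to return 0.0 when the Gold Standard contains no ENTAILMENT pair.

```python
from typing import Sequence


def average_precision(gold_in_ranked_order: Sequence[str]) -> float:
    """Average precision over a confidence-ranked list of T-H pairs.

    gold_in_ranked_order holds the Gold Standard label of each pair,
    listed in the order in which the system ranked the pairs
    (most certain entailment first).
    """
    entailments_so_far = 0
    precision_sum = 0.0
    for rank, label in enumerate(gold_in_ranked_order, start=1):
        if label == "ENTAILMENT":
            entailments_so_far += 1
            # Precision at this rank: gold entailments seen / pairs seen.
            precision_sum += entailments_so_far / rank
    if entailments_so_far == 0:
        return 0.0  # assumption: the score is taken as 0 when there is nothing to retrieve
    return precision_sum / entailments_so_far


perfect = average_precision(["ENTAILMENT", "ENTAILMENT", "UNKNOWN", "CONTRADICTION"])
imperfect = average_precision(["ENTAILMENT", "UNKNOWN", "ENTAILMENT", "CONTRADICTION"])
```

For the second list, precision is 1/1 at rank 1 and 2/3 at rank 3, so the score is (1 + 2/3) / 2 = 5/6; the first list is a perfect ranking and scores 1.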
Participating teams are requested to write a paper for the TAC 2008 proceedings that describes how the submitted runs were produced. For more details see the TAC 2008 guidelines for participants' papers.
Last updated: Monday, 18-Oct-2010 16:00:40 EDT