2008 TRECVid Event Detection Evaluation Plan

 

 

1.    Overview

This document presents the evaluation plan for event detection in surveillance video for TRECVid 2008. The goal of the evaluation is to build and assess systems that can detect instances of a variety of observable events in the airport surveillance domain. The video source data is a ~100-hour corpus of video surveillance data collected by the UK Home Office at London Gatwick International Airport.

 

Two event detection tasks will be supported: a retrospective event detection task run with complete reference annotations, and a “freestyle” experimental analysis track to permit participants to explore their own ideas with regard to the airport surveillance domain. 

 

Because this is an initial effort, the evaluation will be run as more of an experimental test-bed. To that end, we propose two changes to the typical evaluation paradigm. First, the entire source video corpus will be released early so that research can begin immediately. Participants will be on the honor system to keep the evaluation set blind. Second, two sets of events will be defined: a required set defined by NIST and the LDC, whose descriptions and annotations will be released quickly so that research can begin, and an optional, secondary set of events nominated by participants. The development resources (event definitions and annotations) for nominated events will be released later in the year. We hope these steps will accelerate research and knowledge sharing and permit faster evolution of the evaluation paradigm.

 

The following topics are discussed below:

·         Video source data

·         Evaluation tasks

·         Evaluation measures

·         Evaluation Infrastructure

·         Event definitions

·         Schedule

 

2.    Video Source Data

The source data will consist of 100 hours (10 days * 2 hours/day * 5 cameras), obtained from Gatwick Airport surveillance video data (courtesy of the UK Home Office).  The Linguistic Data Consortium will provide event annotations for the entire corpus according to the milestones listed in the schedule.

 

The 100-hour corpus will be divided into development and evaluation subsets. In particular, the first 5 days of the corpus will be used as the development subset (devset), and the second 5 days of the corpus will be used as the evaluation subset (evalset). 

 

Developers may use the devset in any manner to build their systems, including activities such as dividing it into internal test sets, jackknifed training, etc.  During the summer months, NIST will conduct a dry run evaluation using the devset as the video source.   While testing on the development data is a non-blind system test, the purpose of the dry run (to test the evaluation infrastructure) is most easily accomplished using the devset.

 

We will release the full corpus (devset + evalset) early in the evaluation cycle to give people the opportunity to preprocess the full corpus throughout the year. The evaluation set must not be inspected or mined for information until after the evalset annotations are released. The evalset restriction applies to both evaluation tasks. However, participants can run feature extraction programs on the evalset to prepare for the formal evaluation.

 

Allowable side information (i.e., “contextual” information) will include resources posted on the TRECVid Event Detection website as well as any annotations constructed by developers based on the devset. Participants may share devset annotations. No annotation of the evalset is permitted prior to the evaluation submission deadline.

 

3.    Evaluation tasks

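This proposal includes the following evaluation tasks, as introduced in the Overview:

·         Retrospective event detection: a task run with complete reference annotations.

·         Freestyle analysis: an experimental track that permits participants to explore their own ideas with regard to the airport surveillance domain.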

 

 

 

4.    Evaluation Infrastructure

Systems will be evaluated on how well they can detect event occurrences in the evaluation corpus.  The determination of correct detection will be based solely on the temporal similarity between the annotated reference event observations and the system-detected event observations.

 

System detection performance is measured as a tradeoff between two error types: missed detections (MD) and false alarms (FA).  The two error types will be combined into a single error measure using the Detection Cost Rate (DCR) model, which is a linear combination of the two errors.  The DCR model distills the needs of a hypothetical application into a set of predefined constant parameters that include the event priors and weights for each error type.  While the chosen constants have been motivated by discussions with the research and user communities, the single operation point characterized by the DCR model is a small window into the performance of an event detection system.  In addition to DCR measures, Detection Error Tradeoff (DET) curves will be produced to graphically depict the tradeoff of the two error types over a wide range of operational points. The DCR model and the DET curve are related: the DCR model defines an optimal point along the DET curve.
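As a rough illustration of how a linear cost model of this kind combines the two error types, the sketch below computes a miss probability and a false-alarm rate and adds them with a weight. The function name, the per-hour normalization of false alarms, and the weight `beta` are assumptions made for illustration; they stand in for the predefined constants of the DCR model, whose values are not given in this section.

```python
def detection_cost_rate(n_missed, n_reference, n_false_alarms, hours, beta):
    """Linear combination of the miss probability and the false-alarm rate.

    beta is a hypothetical weight standing in for the cost and prior
    constants of the DCR model; it is not a value taken from this plan.
    """
    if n_reference <= 0 or hours <= 0:
        raise ValueError("n_reference and hours must be positive")
    p_miss = n_missed / n_reference   # fraction of reference observations missed
    r_fa = n_false_alarms / hours     # false alarms per hour of video (assumed unit)
    return p_miss + beta * r_fa


# Illustrative numbers only: 10 of 40 reference observations missed,
# 20 false alarms over 50 hours, and an arbitrary weight of 0.005.
dcr = detection_cost_rate(10, 40, 20, 50.0, 0.005)
```

A lower value indicates better performance under this model, since both terms are error measures.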

 

The rest of this section defines the system output, followed by the three steps of the evaluation process: temporal alignment, Detection Error Tradeoff (DET) curve production, and DCR computation.

 

4.1. System Outputs

Systems will record observations of events in a ViPER-formatted XML file as described in the “TRECVid 2008 Event Detection: ViPER XML Representation of Events” document. Each event observation generated by a system will include the following items:

 

·         Start frame: The frame number indicating the beginning of the observation (the first frame in the video source file is frame #1.)

·         End frame: The frame number indicating the last frame of the observation.

·         Decision score: A numeric score indicating how likely it is that the event observation exists, with more positive values indicating more likely observations.

·         Actual Decision: A Boolean value indicating whether or not the event observation should be counted for the primary metric computation.
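The four items above can be pictured as a simple in-memory record. The sketch below is only an illustration of the fields and their constraints; the class name, the `event` field, and the validation rules are assumptions, and the actual submission format is the ViPER XML described in the referenced document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventObservation:
    event: str             # event name (hypothetical field, not one of the four items)
    start_frame: int       # first frame of the observation; numbering starts at 1
    end_frame: int         # last frame of the observation (inclusive)
    decision_score: float  # more positive means a more likely observation
    actual_decision: bool  # True if counted for the primary metric

    def __post_init__(self):
        # Reject spans that contradict the frame-numbering convention.
        if self.start_frame < 1:
            raise ValueError("frame numbers start at 1")
        if self.end_frame < self.start_frame:
            raise ValueError("end_frame must not precede start_frame")

    @property
    def n_frames(self) -> int:
        """Length of the observation in frames, counting both end points."""
        return self.end_frame - self.start_frame + 1


# "ExampleEvent" is a placeholder name, not an event defined by this plan.
obs = EventObservation("ExampleEvent", 120, 180, 0.73, True)
```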

 

The decision scores and actual decisions permit performance assessment over a wide range of operating points. The decision scores provide the information needed to construct the DET curve. In order to construct a fuller DET curve, a system must over-generate putative observations far beyond the optimal point for the system’s best DCR value. The actual decisions provide the mechanism for the system to indicate which putative observations to include in the DCR calculation: i.e., the putative observations whose actual decision is true.
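To make the role of over-generation concrete, the sketch below sweeps a threshold over the decision scores of one event and records the miss probability and false-alarm rate at each setting. It is a simplification under stated assumptions: each putative observation is already labeled as correct (mapped to a reference observation) or not, that labeling is held fixed across thresholds, and false alarms are normalized per hour. The function and variable names are hypothetical.

```python
def det_points(scored, n_reference, hours):
    """Return (threshold, p_miss, r_fa) for each distinct decision score.

    scored: list of (decision_score, is_correct) pairs, where is_correct
    says whether the putative observation was mapped to a reference
    observation. The mapping is assumed to be known and fixed.
    """
    points = []
    for threshold in sorted({score for score, _ in scored}, reverse=True):
        # Keep only the observations at or above the current threshold.
        kept = [ok for score, ok in scored if score >= threshold]
        n_correct = sum(kept)
        n_false_alarm = len(kept) - n_correct
        points.append((threshold,
                       1.0 - n_correct / n_reference,
                       n_false_alarm / hours))
    return points


# Illustrative data: five putative observations, four reference observations,
# two hours of video.
scored = [(0.9, True), (0.8, False), (0.6, True), (0.3, False), (0.1, True)]
points = det_points(scored, n_reference=4, hours=2.0)
```

Lowering the threshold admits more putative observations, so the miss probability can only fall while the false-alarm rate can only rise; a system that stops generating observations at its actual-decision threshold would yield only the first few points.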

 

Systems must ensure their decision scores have the following two characteristics. First, the values must form a non-uniform density function so that the relative evidential strength between two putative observations is discernible. Second, the density function must be consistent across events for a single system so that event-averaged measures using decision scores are meaningful.

 

Since the decision scores are consistent across events, the system must use a single threshold for differentiating true and false actual decisions.
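A minimal sketch of the single-threshold requirement follows. The event names, the scores, and the choice of `>=` for the comparison are assumptions for illustration only.

```python
def apply_threshold(scores_by_event, threshold):
    """Turn each event's decision scores into Boolean actual decisions,
    using the same threshold for every event."""
    return {event: [score >= threshold for score in scores]
            for event, scores in scores_by_event.items()}


# "EventA" and "EventB" are placeholder names, not events defined by this plan.
scores_by_event = {"EventA": [0.9, 0.2, 0.55], "EventB": [0.4, 0.7]}
decisions = apply_threshold(scores_by_event, threshold=0.5)
```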

 

4.2. Event Alignment

Event observations can occur at any time and for any duration. Therefore, in order to compare the output of a system to the reference annotations, an optimal one-to-one mapping is needed between the system and reference observations. The mapping is required because there is no pre-defined segmentation in the video. The alignment will be performed using the Hungarian solution to the bipartite graph matching problem [1], modeling event observations as nodes in the bipartite graph. The system observations are represented as one set of nodes, and the reference observations are represented as a second set of nodes. The kernel formulas below assume the mapping is performed for a single event (Ej) at a time.
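The idea of an optimal one-to-one mapping can be sketched for a single event as follows. This is not the Hungarian algorithm: it is an exhaustive search that is practical only for a handful of observations, used here because it is short and easy to check. The kernel, the number of shared frames, is likewise an assumption for illustration and is not the kernel defined by this plan.

```python
from itertools import permutations


def overlap(a, b):
    """Number of frames shared by two inclusive (start, end) frame spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def best_mapping(reference, system):
    """Exhaustively find the one-to-one mapping with maximal total overlap.

    Returns a list of (ref_index, sys_index) pairs. Pairs with no temporal
    overlap are left unmapped. Exponential in input size: illustration only.
    """
    # Pad with None so every reference observation has a slot to occupy.
    slots = list(range(len(system)))
    slots += [None] * max(0, len(reference) - len(system))
    best_score, best_pairs = -1, []
    for perm in permutations(slots, len(reference)):
        pairs = [(r, s) for r, s in enumerate(perm)
                 if s is not None and overlap(reference[r], system[s]) > 0]
        score = sum(overlap(reference[r], system[s]) for r, s in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs


# Illustrative frame spans for one event.
reference = [(100, 200), (400, 450)]
system = [(390, 460), (150, 260), (900, 950)]
mapping = best_mapping(reference, system)
```

Under this sketch, a reference observation left unmapped would count as a missed detection and a system observation left unmapped (index 2 above) would count as a false alarm.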

 





                        Where: