TDT3seg.pl User Manual

TDT3 Segmentation Task Scoring


Usage:

TDT3seg.pl -R Rootdir -I TDT3_seg_index <Options> Seg_system_output

The 'TDT3seg.pl' program scores the output generated by a TDT3 segmentation system. The program requires the directory path, 'Rootdir', to the LDC's TDT3 Test corpus. The corpus must be in the same structure as released by the LDC, with all file formats identical to their original form. The program uses the index file TDT3_seg_index, provided with the test corpus and described below, to load the appropriate data from the corpus and to verify the completeness of the Seg_system_output file.

Upon completion of the load, the segmentations are scored and a report is generated. The segmentation scoring function is fully described in The Topic Detection and Tracking (TDT3) Evaluation Plan (in Microsoft Word format). This version of TDT3seg.pl implements the scoring protocols described in that version of the evaluation plan.

The scoring of Mandarin ASR segmentation involves extra processing by the evaluation script. The RECIDs in the Mandarin ASR files correspond to words, whereas evaluation of story segmentation of Mandarin is defined in terms of characters. Thus, for this condition, the evaluation script reads in the tokenized text file and converts RECIDs to character offsets.
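
As a rough illustration of the conversion described above, the following Python sketch maps word-level RECIDs to character offsets by accumulating word lengths. The function name, the (recid, word) input form, and the 1-based character numbering are assumptions made for illustration; the way TDT3seg.pl actually reads the tokenized text file may differ.

```python
def recid_to_char_offsets(tokens):
    """Map each word RECID to the offset of its first character.

    `tokens` is a list of (recid, word) pairs in file order. This input
    form is hypothetical; the real script reads the tokenized text file.
    """
    offsets = {}
    next_char = 1  # assumption: characters are numbered from 1
    for recid, word in tokens:
        offsets[recid] = next_char
        next_char += len(word)  # each character advances the offset by one
    return offsets


# Four two-character words, numbered 1..4.
tokens = [(1, "中国"), (2, "政府"), (3, "今天"), (4, "宣布")]
offsets = recid_to_char_offsets(tokens)
print(offsets)
```

A word-level boundary at RECID 3 would then be scored at the character offset stored in offsets[3].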

The following <Options> are recognized by the program:

-C Cmiss:Cfa -> Set the cost of a missed detection and the cost of a false alarm to 'Cmiss' and 'Cfa' respectively. These numbers are used in the segmentation cost function. Default values are Cmiss=1.0 and Cfa=0.3.
-d -> print the loaded database and exit
-D Detail -> Write the internally organized evaluation corpus and pertinent statistics to 'Detail' for debugging purposes. This report, though voluminous, is intended to help researchers debug their internal versions of evaluation code.
-E SubsetFile -> Compute performance excluding source files in the subset definition file. NOTE: Only the first set defined in the subset definition file is used for the filter. All others are ignored.
-f size -> Set the evaluation frame size to 'size'. The defaults are 50 RECIDs (75 RECIDs for Mandarin) for RECID scoring and 15 seconds for TIME scoring.
-i -> Include all bounded regions of text in scoring. By default, evaluation frames within stories not marked as NEWS are not tallied to calculate performance.
-L -> Print the loaded TDT database and exit.
-p -> Print out precision and recall performance statistics. The analysis is printed only if the evaluation frame size is set to 1.
-P P(seg) -> Use P(seg) for the segmentation cost function. Default is 0.3.
-v num -> Set the verbose level to 'num'. Default 1.
==0 None, ==1 Normal, >5 Slight, >10 way too much, >15 not even funny
-r Report -> Write the summary report to 'Report' rather than STDOUT, the default.
-s -> Use all available speedups. Currently, the only speedup is to NOT use the 'nsgmls' parser and the 'SGMLS.pm' Perl library to read the TDT3 Corpus files.
Options that apply to the DET plots:
-d DETfile -> Create a DET plot in GNUplot format with the file root 'DETfile'. The program makes several files each with additional extensions. The file 'DETfile'.plt is a command file for GNUplot and can be printed using the command "gnuplot 'DETfile'.plt | lpr".
-t title -> Set the title line for the plot to 'title'.
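
The -C and -P options supply the parameters of the segmentation cost function. The Python sketch below shows how they might combine, assuming the weighted-sum form Cseg = Cmiss * P(miss) * P(seg) + Cfa * P(fa) * (1 - P(seg)); the Evaluation Plan remains the authoritative definition. The function names are hypothetical, and the sample P(miss) and P(fa) values are taken from the example report at the end of this manual.

```python
def parse_costs(arg):
    """Split a 'Cmiss:Cfa' string, as passed to -C, into two floats."""
    cmiss, cfa = arg.split(":")
    return float(cmiss), float(cfa)


def segmentation_cost(p_miss, p_fa, cmiss=1.0, cfa=0.3, p_seg=0.3):
    """Assumed weighted-sum cost; the defaults mirror those in this manual."""
    return cmiss * p_miss * p_seg + cfa * p_fa * (1.0 - p_seg)


cmiss, cfa = parse_costs("1.0:0.3")
cost = segmentation_cost(0.003682, 0.001540, cmiss, cfa, p_seg=0.3)
print(f"{cost:.6f}")
```

Under this assumed form, raising Cmiss or P(seg) weights missed boundaries more heavily, while raising Cfa or lowering P(seg) weights false alarms more heavily.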

Segmentation Task Index File Format

The format of the index file for the segmentation task is as follows. The first line in the index file is a header line. The line indicates the TDT3 task, 'SEGMENTATION' in this case, and the type of pointer used to mark segment changes. Each subsequent data record in the file identifies a source file to process. These records have only one field and are separated by newlines.

The BNF structure of the segmentation index file is:

<HEADER_LINE>
<SOURCE>
<SOURCE>
...

Where:

<HEADER_LINE> :== # SEGMENTATION <POINTER_TYPE> SRC=<SRC_COND> TEST:SL=<S_LANG> TEST:CL=<C_LANG>
<POINTER_TYPE> :== RECID | TIME
POINTER_TYPE is the type of boundaries to be output by the system. The possible values are RECID for text stream segmentation or TIME for audio segmentation.
<SRC_COND> :== bnasr | bnman
The source condition used to build the index file. The only sources defined by the evaluation spec are broadcast news ASR and manual transcripts.
<S_LANG> :== eng | man | mul
The source language of the broadcast source.
<C_LANG> :== nat | eng
The language into which the content has been transcribed and/or converted.
<SOURCE> :== TDT3 corpus filename with directory and extension names relative to the TDT3 root directory specified on the command line.
The following is an excerpt from a segmentation task index file.
# SEGMENTATION RECID SRC=bnasr TEST:SL=eng TEST:CL=nat
#
# Generated: on Wed Jul 15 11:44:27 EDT 1998
#    by command '/da2/TDT/TDT3eval_v0.2/TDT3BuildIndex.pl 
#                -R /da1/LDC_transcript_data/TDT/tdt_deliv_980708
#                -f indexes_devtest/flist.devtest 
#                -O indexes_devtest -a ccap'
#
tkntext/19980301_0553_0719_APW_ENG.tkn
tkntext/19980301_1014_1116_APW_ENG.tkn
tkntext/19980301_1403_1529_APW_ENG.tkn
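
To make the index format concrete, here is a minimal Python sketch of a reader for it. It is not the parser used by TDT3seg.pl; the function name and the returned dictionary keys are hypothetical, and treating later '#' lines as comments is an assumption based on the excerpt above.

```python
import re

# Header layout taken from the BNF above.
HEADER_RE = re.compile(
    r"^#\s*SEGMENTATION\s+(RECID|TIME)\s+SRC=(\S+)"
    r"\s+TEST:SL=(\S+)\s+TEST:CL=(\S+)\s*$"
)


def parse_index(text):
    """Return (header_fields, source_files) for an index file's contents."""
    lines = text.splitlines()
    match = HEADER_RE.match(lines[0]) if lines else None
    if match is None:
        raise ValueError("first line is not a SEGMENTATION header")
    keys = ("pointer_type", "src", "s_lang", "c_lang")
    header = dict(zip(keys, match.groups()))
    # Remaining lines: skip comments and blanks, keep one source per line.
    sources = [ln.strip() for ln in lines[1:]
               if ln.strip() and not ln.lstrip().startswith("#")]
    return header, sources


example = """# SEGMENTATION RECID SRC=bnasr TEST:SL=eng TEST:CL=nat
#
# Generated: on Wed Jul 15 11:44:27 EDT 1998
tkntext/19980301_0553_0719_APW_ENG.tkn
tkntext/19980301_1014_1116_APW_ENG.tkn
"""
header, sources = parse_index(example)
print(header["pointer_type"], len(sources))
```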

Segmentation Task System Output Format

Segmentation systems under evaluation must record segmentation decisions in an output file, one record for each hypothesized story boundary. The first record in this file is a header record containing three fields that specify information applying globally to the whole file. Each subsequent data record in the file identifies a hypothesized boundary. These records have two fields: the source file (copied from the index file) and the point at which a segment boundary occurs. Comment lines begin with the '#' character, and any text following a '#' is ignored. The exception to this rule is that the first comment line can optionally contain a long description of the system under test. This description will be included in the scoring report alongside the <SYSTEM> value described below. Blank lines after the initial comment line will be ignored.

The BNF structure of the segmentation system output file is:

<SYSTEM> <DEF_PERIOD> <POINTER_TYPE>
<SOURCE> <BOUNDARY>
<SOURCE> <BOUNDARY>
...

Where:

<SYSTEM> :== System is an alphanumeric character string that uniquely identifies the system being tested. (E.g., CDM_P05-8.v37)
<DEF_PERIOD> :== The deferral period before decisions are made. Permissible values are defined by the TDT3 test specification.
<POINTER_TYPE> :== RECID | TIME
POINTER_TYPE is the type of boundaries to be output by the system. The possible values are RECID for text stream segmentation or TIME for audio segmentation.
<SOURCE> :== TDT3 corpus filename with directory and extension names relative to the TDT3 root directory specified on the command line.
<BOUNDARY> :== Boundary is a hypothesized boundary. For text files, Boundary is the index number of the first word in the hypothesized segment, in the range {1, 2, . . .}. For audio files, Boundary is the time of the beginning of the segment {0.0, . . .}. (It isn't necessary to output the beginning of the first segment.) The hypothesized Boundary points must occur in chronological order.
The following is an excerpt from a segmentation system output file.
# Degenerate segmentation results, Correct, RECID
100Correct 100 RECID
tkntext/19980301_0553_0719_APW_ENG.tkn 1
tkntext/19980301_0553_0719_APW_ENG.tkn 167
tkntext/19980301_0553_0719_APW_ENG.tkn 690
tkntext/19980301_0553_0719_APW_ENG.tkn 1047
tkntext/19980301_0553_0719_APW_ENG.tkn 1198
tkntext/19980301_0553_0719_APW_ENG.tkn 1619
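
The rules above can be sketched as a small Python reader. This is an illustration only, not the parser used by TDT3seg.pl: the function name and return structure are hypothetical, the description is assumed to sit on the very first line of the file, and "chronological order" is interpreted as non-decreasing boundary values within each source file.

```python
def parse_system_output(text):
    """Parse system output text into (description, header, boundaries)."""
    description = None
    header = None
    boundaries = {}  # source file -> boundary values, in file order
    for i, raw in enumerate(text.splitlines()):
        if i == 0 and raw.startswith("#"):
            description = raw[1:].strip()  # optional long description
            continue
        line = raw.split("#", 1)[0].strip()  # text after '#' is ignored
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        if header is None:
            system, def_period, pointer_type = fields
            header = {"system": system, "def_period": def_period,
                      "pointer_type": pointer_type}
            continue
        source, value = fields
        # RECID boundaries are word indexes; TIME boundaries are seconds.
        point = int(value) if header["pointer_type"] == "RECID" else float(value)
        seen = boundaries.setdefault(source, [])
        if seen and point < seen[-1]:
            raise ValueError(f"boundaries out of order in {source}")
        seen.append(point)
    return description, header, boundaries


example = """# Degenerate segmentation results, Correct, RECID
100Correct 100 RECID
tkntext/19980301_0553_0719_APW_ENG.tkn 1
tkntext/19980301_0553_0719_APW_ENG.tkn 167
tkntext/19980301_0553_0719_APW_ENG.tkn 690
"""
description, header, boundaries = parse_system_output(example)
print(header["system"], boundaries)
```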

Example Output Report

-------------------------------------------------------------------------------
------------------  TDT Segmentation Task Performance Report  -----------------


LDC TDT Corpus Root Dir: ../../..
Index File:              ../indexes_devtest/seg_man.ndx
System Output File:      seg_man.seg
Omit Non-NEWS stories:   FALSE
Pointer Type:            RECID
Deferral Period:         100
Evaluation Frame Size:   50

Segmentation Performance Calculations:
    System Identifier:   Errors 'Degenerate segmentation results, With errors, RECID'
    Number Source Files:  381


    P(miss) = ( 1905 / 517319 ) = 0.003682 
    P(fa)   = ( 1905 / 1237223 ) = 0.001540

--------------- End of TDT Segmentation Task Performance Report  --------------
-------------------------------------------------------------------------------
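
For post-processing, the P(miss) and P(fa) lines of a report like the one above can be picked out with a regular expression. The Python sketch below assumes the line layout shown in this example report; the function name is hypothetical, and the layout may differ between versions of the script.

```python
import re

# Matches lines such as: P(miss) = ( 1905 / 517319 ) = 0.003682
RATE_RE = re.compile(
    r"P\((miss|fa)\)\s*=\s*\(\s*(\d+)\s*/\s*(\d+)\s*\)\s*=\s*([0-9.]+)"
)


def read_rates(report_text):
    """Return {name: (numerator, denominator, reported_value)}."""
    rates = {}
    for name, num, den, reported in RATE_RE.findall(report_text):
        rates[name] = (int(num), int(den), float(reported))
    return rates


report = """
    P(miss) = ( 1905 / 517319 ) = 0.003682
    P(fa)   = ( 1905 / 1237223 ) = 0.001540
"""
rates = read_rates(report)
for name, (num, den, reported) in rates.items():
    # Recompute the ratio from the printed counts as a sanity check.
    print(name, round(num / den, 6), reported)
```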