TDT3BuildIndex.pl builds the TDT3 index files for an evaluation. The script reads a list of source filenames from 'SourceList'. The list should contain one document source filename per line, without extensions; for an example, see doc/example_indexes/index.flist. The script then reads the boundary file information from the TDT3 corpus root directory, 'Rootdir', and writes the index files into 'OutputDir'.
The directory doc/example_indexes contains the output generated as an illustrative example. (See the Script Output Description section below for more details.)
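The source list format described above (one source name per line, no extension) can be produced with a few lines of shell. This is only a sketch: the helper name `make_source_list` and the idea of deriving the list from the files in a single directory are assumptions for illustration, not part of the evaluation software.

```shell
#!/bin/sh
# Hypothetical helper (not part of the TDT3 software): print one source name
# per line, with the last extension removed, for every regular file in the
# directory given as $1. Duplicate names (same source, several extensions)
# are collapsed by sort -u.
make_source_list() {
    dir=$1
    for path in "$dir"/*; do
        [ -f "$path" ] || continue        # skip directories and unmatched globs
        name=${path##*/}                  # drop the directory part
        printf '%s\n' "${name%.*}"        # drop the last extension, if any
    done | sort -u
}
```

A call such as `make_source_list /path/to/sources > index.flist` would then write a list in the expected one-name-per-line form; the directory path here is a placeholder.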
The following Options are recognized by the program:
| -a [ccap|fdch] | -> | For the ABC shows, use either the closed captioning transcripts (ccap) or the Federal Documents Clearing House transcripts (fdch) in the index files. The default is closed captioning. |
| -l LNKfile | -> | Use the hand generated database of link judgements in 'LNKfile' rather than randomly generating a set of link story pairs. |
| -L LNKfile | -> | Write the link database used to generate the link index files. If the '-l' option is used, this is essentially a copy of the input database; otherwise it is built from the randomly selected links. |
| -n Nn | -> | The maximum number of certified off-topic stories to write into the tracking index files. |
| -r RANDOMSEED | -> | A seed number to supply to the random number generator. If omitted, the seed is set to a number based on the current time and process id (a trick from the Perl manual). |
| -s | -> | Use all available speedups. Currently, the only speedup is to read the TDT3 corpus files without using the 'nsgmls' parser and the 'SGMLS.pm' Perl library. |
| -S EXT[:EXT]* | -> | Specify the extension to use for the ASR transcripts. Fallback extensions can be given by separating extensions with colons. For example, the string 'as1:asr' makes the program look for .as1 transcripts and use them if found. If there is no .as1 transcript, the program attempts to locate a .asr transcript and uses it if found. The default value is 'asr'. |
| -t Nt | -> | The maximum number of training stories per topic for the tracking indexes. The default is 16. |
| -T Topic_regexp | -> | Restrict the topics for which the index files are created using the Perl regular expression 'Topic_regexp'. The default is to use all occurring annotated topics. There are a number of macro names for defined topic sets that may be used in place of regular expressions, they are: |
| -v num | -> | Set the verbose level to 'num'. The default is 1. Levels: ==0 none, ==1 normal, >5 slight, >10 way too much, >15 not even funny. |
| -y YEAR | -> | Build index files for the evaluation year 'YEAR'. The default is 1999. Possible values are 1999 and 2000. |
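The fallback behaviour described for the '-S' option can be illustrated with a small shell function. This is a sketch of the lookup order only, assuming transcripts are named by appending `.EXT` to a base path; the function name `find_transcript` is hypothetical and the script's actual implementation may differ.

```shell
#!/bin/sh
# Hypothetical illustration of the '-S EXT[:EXT]*' lookup order.
# find_transcript BASE EXTLIST
#   EXTLIST is a colon-separated list such as 'as1:asr'.
#   Prints the first existing BASE.EXT and returns 0;
#   returns 1 if no candidate file exists.
find_transcript() {
    base=$1
    rest=$2
    while [ -n "$rest" ]; do
        ext=${rest%%:*}               # first extension still in the list
        case $rest in
            *:*) rest=${rest#*:} ;;   # more extensions remain after it
            *)   rest= ;;             # that was the last extension
        esac
        if [ -f "$base.$ext" ]; then
            printf '%s\n' "$base.$ext"
            return 0
        fi
    done
    return 1
}
```

With 'as1:asr', the function prefers `BASE.as1` and only reports `BASE.asr` when the .as1 file is missing, which mirrors the order given in the option description.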
In order to explain the output of this script, the evaluation software includes a complete example of the script's execution. The command executed is in the Bourne shell script doc/example_indexes/create_indexes.sh. The script has two steps. The first generates the list of source filenames, doc/example_indexes/index.flist, for which the index files are built. Notice that this particular list covers the June data of 1998. The second step executes TDT2BuildIndex.pl. The command line arguments are defined above in the Options section.
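The first step, restricting the source list to one month of data, can be sketched in shell as follows. The sketch assumes that source names begin with a YYYYMMDD date stamp (the sample names in the usage note are made up for illustration), and the function name `filter_by_month` is hypothetical; the actual create_indexes.sh may build its list differently.

```shell
#!/bin/sh
# Hypothetical sketch of step one: keep only the source names on standard
# input whose leading date stamp starts with the YYYYMM prefix given as $1.
# Assumes names begin with a YYYYMMDD date, which is an assumption here.
filter_by_month() {
    grep "^$1"
}
```

For example, `printf '%s\n' 19980531_X 19980601_Y | filter_by_month 199806` keeps only `19980601_Y`; redirecting the output to a file gives a list in the one-name-per-line form that the index builder reads.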
After execution, the output directory, doc/example_indexes, is filled with all sorts of goodies. The most important file is the HTML readme file doc/example_indexes/index.htm, which has links to all the files and explanations of their contents.
There are 7 sections in the readme file: