The 2008 NIST Speaker Recognition Evaluation Results

Date of Release: Wednesday, August 6, 2008

The goal of the NIST Speaker Recognition Evaluation (SRE) series is to contribute to the direction of research efforts and the calibration of technical capabilities of text independent speaker recognition. The overarching objective of the evaluations has always been to drive the technology forward, to measure the state-of-the-art, and to find the most promising algorithmic approaches.

NIST has been coordinating Speaker Recognition Evaluations since 1996. Since then over 50 research sites have participated in our evaluations. Each year new researchers in industry and universities are encouraged to participate. Collaboration between universities and industries is also welcomed.

NIST maintains a general mailing list for the Speaker Recognition Evaluation. Relevant evaluation information is posted to the list. If you would like to join the list or have any questions for NIST related to our speaker recognition evaluation, please email us at speaker_poc@nist.gov.

The 2008 evaluation was administered as outlined in the official SRE08 evaluation plan. The sections below are taken from the evaluation plan; please see it for more detail. We have also made available a brief presentation.


Disclaimer

These results are not to be construed, or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial SR products were generally from research systems, not commercially available products. Since SRE-08 was an evaluation of research algorithms, the SRE-08 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

The data, protocols, and metrics employed in this evaluation were chosen to support SR research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain, or in the amount of data used to build a system, can greatly influence system performance, and changes in the task protocols could reveal different performance strengths and weaknesses for these same systems.

For these reasons, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.

Evaluation Task

The 2008 speaker recognition evaluation is limited to the broadly defined task of speaker detection. This has been NIST's basic speaker recognition task over the past twelve years. The task is to determine whether a specified speaker is speaking during a given segment of conversational speech.

Evaluation Tests

The speaker detection task for 2008 is divided into 13 distinct and separate tests. Each of these tests involves one of six training conditions and one of four test conditions. One of these tests is designated as the core test. For each test, there is also an optional unsupervised adaptation condition.

Evaluation Conditions

Training Conditions

The training segments in the 2008 evaluation consist of continuous conversational excerpts. As in recent evaluations, there is no prior removal of intervals of silence. Also, except for summed channel telephone conversations and long interview segments as described below, two separate conversation channels are provided (to aid systems in echo cancellation, dialog analysis, etc.). For all such two-channel segments, the primary channel containing the target speaker to be recognized is identified.

The six training conditions included involve target speakers defined by the following training data:

1. 10-sec: A two-channel excerpt from a telephone conversation estimated to contain approximately 10 seconds of speech of the target on its designated side. (An energy-based automatic speech detector was used to estimate the duration of actual speech in the chosen excerpts.)

2. short2: Either a two-channel telephone conversational excerpt of approximately five minutes total duration, with the target speaker channel designated, or a microphone-recorded conversational segment of approximately three minutes total duration involving the target speaker and an interviewer. For the interview segments, most of the speech is generally spoken by the target speaker, and for consistency across the condition, a second zeroed-out channel is included.

3. 3conv: Three two-channel telephone conversation excerpts involving the target speaker on their designated sides.

4. 8conv: Eight two-channel telephone conversation excerpts involving the target speaker on their designated sides.

5. long: A single channel microphone recorded conversational segment of eight minutes or longer duration involving the target speaker and an interviewer. Most of the speech is generally spoken by the target speaker.

6. 3summed: Three summed-channel telephone conversational excerpts, formed by sample-by-sample summing of their two sides. Each of these conversations includes both the target speaker and another speaker. These three non-target speakers are all distinct.

Test Conditions

The test segments in the 2008 evaluation consist of continuous conversational excerpts. As in recent evaluations, there is no prior removal of intervals of silence. Also, except for summed channel telephone conversations and long interview segments as described below, two separate conversation channels are provided (to aid systems in echo cancellation, dialog analysis, etc.). For all such two-channel segments, the primary channel containing the target speaker to be recognized is identified.

The four test segment conditions to be included are the following:

1. 10-sec: A two-channel excerpt from a telephone conversation estimated to contain approximately 10 seconds of speech of the putative target speaker on its designated side. (An energy-based automatic speech detector was used to estimate the duration of actual speech in the chosen excerpts.)

2. short3: Either (a) a two-channel telephone conversational excerpt of approximately five minutes total duration, with the putative target speaker channel designated, (b) a similar telephone conversation but with the putative target channel being a (simultaneously recorded) microphone channel, or (c) a microphone-recorded conversational segment of approximately three minutes total duration involving the putative target speaker and an interviewer. For the interview segments, most of the speech is generally spoken by the target speaker, and for consistency across the condition, a second zeroed-out channel is included.

3. long: A single channel microphone recorded conversational segment of eight minutes or longer duration involving the putative target speaker and an interviewer. Most of the speech is generally spoken by the target speaker.

4. summed: A summed-channel telephone conversation formed by sample-by-sample summing of its two sides.
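The sample-by-sample summing used to form the summed-channel segments can be sketched as follows (assuming the two sides are equal-length sequences of 16-bit PCM samples; the clipping behavior is an illustrative assumption):

```python
def sum_channels(side_a, side_b):
    """Form a summed-channel signal by sample-by-sample addition of two
    conversation sides, clipping to the signed 16-bit range to avoid overflow."""
    if len(side_a) != len(side_b):
        raise ValueError("channels must be the same length")
    return [max(-32768, min(32767, a + b)) for a, b in zip(side_a, side_b)]
```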

The matrix of training and test segment condition combinations is given in the evaluation plan. Note that only 13 (out of 24) condition combinations are included in this year's evaluation. Each test consists of a sequence of trials, where each trial consists of a target speaker, defined by the training data provided, and a test segment. The core test for the 2008 evaluation, marked "required" in the evaluation plan's matrix, pairs short2 training with short3 test segments. All participants submitted results for this test.

Common Evaluation Condition

In each evaluation NIST has specified a common evaluation condition, a subset of trials in the core test that satisfy additional constraints, in order to better foster technical interactions and technology comparisons among sites. The performance results on these trials are treated as the basic official evaluation outcome.

Because of the broader scope of the 2008 evaluation and the multiple types of audio data included in the core test, several common evaluation conditions are specified. At the same time, it is not appropriate to examine performance results over all trials of the core test lumped together. The common conditions to be considered include the following subsets of all of the core test trials:

1) All trials involving only interview speech in training and test

2) All trials involving interview speech from the same microphone type in training and test

3) All trials involving interview speech from different microphone types in training and test

4) All trials involving interview training speech and telephone test speech

5) All trials involving telephone training speech and noninterview microphone test speech

6) All trials involving only telephone speech in training and test

7) All trials involving only English language telephone speech in training and test

8) All trials involving only English language telephone speech spoken by a native U.S. English speaker in training and test


Evaluation Rules

Participants were given a set of rules to follow during the evaluation. The rules were created to ensure the quality of the evaluation and can be found in the evaluation plan.


Performance Measurement

There is a single basic cost model for measuring speaker detection performance, used for all speaker detection tests. For each test, a detection cost function is computed over the sequence of trials provided. This detection cost function is defined as a weighted sum of miss and false alarm error probabilities:

    C_Det = C_Miss × P_Miss|Target × P_Target + C_FalseAlarm × P_FalseAlarm|NonTarget × (1 − P_Target)

with C_Miss = 10, C_FalseAlarm = 1, and P_Target = 0.01.
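With the SRE08 parameter values (C_Miss = 10, C_FalseAlarm = 1, P_Target = 0.01), the weighted sum of miss and false alarm probabilities can be sketched as:

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost function: weighted sum of miss and false-alarm error
    probabilities, using the SRE08 cost and prior parameter values."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
```

For example, a system that misses every target trial but never false-alarms incurs a cost of 10 × 1.0 × 0.01 = 0.1, while one that false-alarms on every non-target trial incurs 1 × 1.0 × 0.99 = 0.99.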

A further type of scoring and graphical presentation was performed on submissions whose scores were declared to represent log likelihood ratios. A log likelihood ratio (llr) based cost function, which is not application specific and may be given an information theoretic interpretation, is defined as follows:

    C_llr = (1 / (2 log 2)) × [ (1/N_TT) Σ over target trials of log(1 + 1/LR) + (1/N_NT) Σ over non-target trials of log(1 + LR) ]

where LR is the likelihood ratio for a given trial, and N_TT and N_NT are the numbers of target and non-target trials, respectively.
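The llr-based cost (the standard Cllr, measured in bits) can be computed from natural-log likelihood-ratio scores as in this sketch:

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost: average information loss (in bits) over
    target and non-target trials, weighted equally. Inputs are natural-log
    likelihood ratios, so 1/LR becomes exp(-llr) and LR becomes exp(llr)."""
    c_target = sum(math.log2(1.0 + math.exp(-s)) for s in target_llrs) / len(target_llrs)
    c_nontarget = sum(math.log2(1.0 + math.exp(s)) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (c_target + c_nontarget)
```

A system that outputs llr = 0 (no information) for every trial scores Cllr = 1 bit, and a well-calibrated, highly discriminative system approaches 0.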

More information on the performance measurement can be found in the evaluation plan.

Participation

The 2008 Speaker Recognition Evaluation saw record participation, both with regard to the number of participants and the number of systems. A total of 46 single or collaborative groups offered submissions, representing, coincidentally, 46 different participating sites. 107 systems were submitted for various tests, creating a total of 246 test/system combinations. For a list of participants, please see our brief presentation.

Results Representation

Detection Error Tradeoff (DET) curves, a linearized version of the ROC curve, are used to show all operating points as the likelihood threshold is varied. Two special operating points, (a) the system decision point and (b) the optimal decision point, are plotted on each curve. More information on the DET curve can be found in Martin, A. F. et al., "The DET Curve in Assessment of Detection Task Performance", Proc. Eurospeech '97, Rhodes, Greece, September 1997, Vol. 4, pp. 1899-1903.
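The linearization applies a normal-deviate (probit) warping to each error probability, so that curves from normally distributed scores plot as straight lines. A minimal sketch using the standard normal inverse CDF (Python 3.8+ `statistics` module; the function name here is illustrative):

```python
from statistics import NormalDist

def det_point(p_miss, p_fa):
    """Map (miss, false-alarm) probabilities onto the normal-deviate scale
    used for DET plot coordinates: x is the false-alarm axis, y the miss axis."""
    inv = NormalDist().inv_cdf  # standard normal inverse CDF (probit)
    return inv(p_fa), inv(p_miss)
```

An equal-error point of 50%/50% maps to the origin, and lower error rates map to increasingly negative deviates, which is why DET axes are labeled with probabilities at nonlinear spacings.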


Evaluation Results

The graphs in the following file show the results for all sites' primary systems for each test and condition. Note that the cross on the DET curve indicates the system decision point while the circle indicates the optimal decision point. The width and height of the cross indicate a 95% confidence interval. SRE08 DET Plots


Created September 26, 2017, Updated September 26, 2017