Disclaimer
These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial speaker recognition (SR) products were generally from research systems, not commercially available products. Since SRE10 was an evaluation of research algorithms, the SRE10 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.
The data, protocols, and metrics employed in this evaluation were chosen to support SR research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain, or in the amount of data used to build a system, can greatly influence system performance, and changes in the task protocols could reveal different performance strengths and weaknesses for these same systems.
For these reasons, this evaluation should not be interpreted as a product testing exercise, and its results should not be used to draw conclusions about which commercial products are best suited to a particular application.
The goal of the NIST Speaker Recognition Evaluation (SRE) series is to contribute to the direction of research efforts and the calibration of technical capabilities of text-independent speaker recognition. The overarching objective of the evaluations has always been to drive the technology forward, to measure the state of the art, and to find the most promising algorithmic approaches.
NIST has been coordinating Speaker Recognition Evaluations since 1996. Since then, more than 75 research sites have participated in our evaluations. Each year, new researchers in industry and universities are encouraged to participate, and collaboration between universities and industry is also welcomed.
NIST maintains a general mailing list for the Speaker Recognition Evaluation, to which relevant evaluation information is posted. If you would like to join the list or have any questions for NIST related to our speaker recognition evaluation, please e-mail us at speaker_poc [at] nist.gov.
The 2010 evaluation was administered as outlined in the official SRE10 evaluation plan. The sections below are taken from the evaluation plan; please see it for more detail. We have also made available a brief presentation.
The 2010 speaker recognition evaluation is limited to the broadly defined task of speaker detection, which has been NIST's basic speaker recognition task over the past twelve years. The task is to determine whether a specified speaker is speaking during a given segment of conversational speech.
The speaker detection task for 2010 is divided into nine distinct tests, each involving one of four training conditions and one of three test conditions. One of these tests is designated as the core test.
The training segments in the 2010 evaluation consist of continuous conversational excerpts. As in recent evaluations, there is no prior removal of intervals of silence. Also, except for summed channel telephone conversations and long interview segments as described below, two separate conversation channels are provided (to aid systems in echo cancellation, dialog analysis, etc.). For all such two-channel segments, the primary channel containing the target speaker to be recognized is identified.
The four training conditions, each defining the target speakers by a different type and amount of training data, are the 10sec, core, 8conv, and 8summed conditions shown as the rows of the matrix below.
The test segments in the 2010 evaluation likewise consist of continuous conversational excerpts with no prior removal of intervals of silence. As with the training data, two separate conversation channels are provided except for summed channel telephone conversations and long interview segments, and the primary channel containing the target speaker to be recognized is identified.
The three test segment conditions are the 10sec, core, and summed conditions shown as the columns of the matrix below.
The matrix of training and test segment condition combinations is shown below. Note that only 9 of the 12 possible condition combinations are included in this year's evaluation. Each test consists of a sequence of trials, where each trial consists of a target speaker, defined by the training data provided, and a test segment. The cell labeled "required" is the core test for the 2010 evaluation; all participants submitted results for this test.
| Training Condition | Test: 10sec | Test: core | Test: summed |
|---|---|---|---|
| 10sec | optional | | |
| core | optional | required | optional |
| 8conv | optional | optional | optional |
| 8summed | | optional | optional |
In each evaluation NIST has specified a common evaluation condition, a subset of trials in the core test that satisfy additional constraints, in order to better foster technical interactions and technology comparisons among sites. The performance results on these trials are treated as the basic official evaluation outcome.
Because of the multiple types of training and test conditions in the 2010 core test, and the likely disparity in the numbers of trials of different types, it is not appropriate to simply pool all trials as a primary indicator of overall performance. Rather, the common conditions used in 2010 as primary performance indicators are defined subsets of the core test trials; these subsets are specified in the evaluation plan.
Participants were given a set of rules to follow during the evaluation. The rules were created to ensure the quality of the evaluation and can be found in the evaluation plan.
A single basic cost model for measuring speaker detection performance has been used in all previous NIST speaker recognition evaluations. In 2010, however, a new set of parameter values was used to compute the detection cost over the test trials for two of the test conditions, including the core condition. The parameter values used in previous evaluations were retained for the other conditions.
CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1-PTarget)
New parameter values (used for the two test conditions noted above, including the core condition):

| CMiss | CFalseAlarm | PTarget |
|---|---|---|
| 1 | 1 | 0.001 |

Parameter values from previous evaluations (used for the other conditions):

| CMiss | CFalseAlarm | PTarget |
|---|---|---|
| 10 | 1 | 0.01 |
More information on the performance measurement can be found in the evaluation plan.
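As an illustration of how the detection cost is computed, here is a minimal Python sketch that applies the formula above with each of the two parameter sets. It is our own illustrative code, not the official NIST scoring software, and the miss and false-alarm rates used in it are hypothetical.

```python
# Illustrative sketch of the detection cost computation (not the official
# NIST scoring tools). Variable names mirror the formula above.

def detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """C_Det = C_Miss*P_Miss|Target*P_Target + C_FA*P_FA|NonTarget*(1 - P_Target)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# Parameter sets from the tables above.
NEW_PARAMS = {"c_miss": 1, "c_fa": 1, "p_target": 0.001}   # core and one other test condition
OLD_PARAMS = {"c_miss": 10, "c_fa": 1, "p_target": 0.01}   # remaining test conditions

if __name__ == "__main__":
    p_miss, p_fa = 0.10, 0.02   # hypothetical miss / false-alarm rates
    for name, params in (("new", NEW_PARAMS), ("old", OLD_PARAMS)):
        c_det = detection_cost(p_miss, p_fa, **params)
        # Costs are commonly reported normalized by the best cost achievable
        # without processing the data: min(C_Miss*P_Target, C_FA*(1 - P_Target)).
        c_default = min(params["c_miss"] * params["p_target"],
                        params["c_fa"] * (1.0 - params["p_target"]))
        print(f"{name} parameters: C_Det = {c_det:.6f}, C_Norm = {c_det / c_default:.3f}")
```

Note how the low target prior in the new parameter set pushes the operating point of interest toward very low false-alarm rates.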
Detection Error Tradeoff (DET) curves, a linearized version of ROC curves, are used to show all operating points as the likelihood threshold is varied. Two special operating points — (a) the system decision point and (b) the optimal decision point — are plotted on the curve. More information on the DET curve can be found in a paper by Martin, A. F. et al., "The DET Curve in Assessment of Detection Task Performance", Proc. Eurospeech '97, Rhodes, Greece, September 1997, Vol. 4, pp. 1899-1903.
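As a rough sketch of how such a curve can be produced from raw trial scores (illustrative code with synthetic scores, not the plotting software used in the evaluation): the DET curve is the miss rate plotted against the false-alarm rate as the decision threshold is swept, with both axes warped by the inverse of the standard normal CDF.

```python
# Minimal DET-curve sketch (illustrative only; not the official NIST tools).
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def det_points(target_scores, nontarget_scores):
    """Miss and false-alarm rates as the decision threshold sweeps the scores."""
    tar = np.sort(target_scores)
    non = np.sort(nontarget_scores)
    thresholds = np.sort(np.concatenate([tar, non]))
    p_miss = np.searchsorted(tar, thresholds, side="left") / tar.size      # P(score < t | target)
    p_fa = 1.0 - np.searchsorted(non, thresholds, side="left") / non.size  # P(score >= t | non-target)
    return p_miss, p_fa

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic detection scores: targets score higher on average than non-targets.
    tar_scores = rng.normal(2.0, 1.0, 2000)
    non_scores = rng.normal(0.0, 1.0, 20000)
    p_miss, p_fa = det_points(tar_scores, non_scores)
    # Probit (inverse normal CDF) transform of both axes linearizes the curve;
    # clip away exact 0 and 1 so the transform stays finite.
    eps = 1e-6
    plt.plot(norm.ppf(np.clip(p_fa, eps, 1 - eps)),
             norm.ppf(np.clip(p_miss, eps, 1 - eps)))
    plt.xlabel("False alarm probability (normal deviate scale)")
    plt.ylabel("Miss probability (normal deviate scale)")
    plt.title("DET curve from synthetic scores")
    plt.show()
```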
Evaluation results and an additional summary of the evaluation are available in two publications:
Martin, Alvin F. / Greenberg, Craig S. (2010): "The NIST 2010 speaker recognition evaluation", In INTERSPEECH-2010, 2726-2729. [paper]
Greenberg, Craig S. / Martin, Alvin F. / Barr, Bradford N. / Doddington, George R. (2011): "Report on performance results in the NIST 2010 speaker recognition evaluation", In INTERSPEECH-2011, 261-264. [paper]