
Disclaimer

These results are not to be construed, or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial SR products were generally from research systems, not commercially available products. Since SRE10 was an evaluation of research algorithms, the SRE10 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

The data, protocols, and metrics employed in this evaluation were chosen to support SR research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.

Because of the above reasons, this should not be interpreted as a product testing exercise and the results should not be used to make conclusions regarding which commercial products are best for a particular application.


 

The goal of the NIST Speaker Recognition Evaluation (SRE) series is to contribute to the direction of research efforts and the calibration of technical capabilities of text-independent speaker recognition. The overarching objective of the evaluations has always been to drive the technology forward, to measure the state of the art, and to find the most promising algorithmic approaches.

NIST has been coordinating Speaker Recognition Evaluations since 1996. Since then over 75 research sites have participated in our evaluations. Each year new researchers in industry and universities are encouraged to participate. Collaboration between universities and industries is also welcomed.

NIST maintains a general mailing list for the Speaker Recognition Evaluation. Relevant evaluation information is posted to the list. If you would like to join the list or have any questions for NIST related to our speaker recognition evaluation, please e-mail us at speaker_poc@nist.gov.

The 2010 evaluation was administered as outlined in the official SRE10 evaluation plan. The sections below are taken from the evaluation plan; please see it for more detail. We have also made available a brief presentation.

Evaluation Task

The year 2010 speaker recognition evaluation is limited to the broadly defined task of speaker detection. This has been NIST's basic speaker recognition task over the past twelve years. The task is to determine whether a specified speaker is speaking during a given segment of conversational speech.  

Evaluation Tests

The speaker detection task for 2010 is divided into nine distinct and separate tests. Each of these tests involves one of four training conditions and one of three test conditions. One of these tests is designated as the core test.  

Evaluation Conditions

Training Conditions

The training segments in the 2010 evaluation consist of continuous conversational excerpts. As in recent evaluations, there is no prior removal of intervals of silence. Also, except for summed channel telephone conversations and long interview segments as described below, two separate conversation channels are provided (to aid systems in echo cancellation, dialog analysis, etc.). For all such two-channel segments, the primary channel containing the target speaker to be recognized is identified.

The four training conditions included involve target speakers defined by the following training data:  

  1. 10-sec: A two-channel excerpt from a telephone conversation estimated to contain approximately 10 seconds of speech of the target on its designated side. (An energy-based automatic speech detector will be used to estimate the duration of actual speech in the chosen excerpts.)
  2. core: Either one two-channel telephone conversational excerpt of approximately five minutes total duration, with the target speaker channel designated, or a microphone-recorded conversational segment of three or eight minutes total duration involving the interviewee (target speaker) and an interviewer. In the former case the designated channel may be either a telephone channel or a room microphone channel; the other channel will always be a telephone channel. In the latter case the designated microphone channel will be the A channel, and most of the speech will generally be spoken by the interviewee, while the B channel will be that of the interviewer's head-mounted close-talking microphone, with some level of speech spectrum noise added to mask any residual speech of the target speaker in it.
  3. 8conv: Eight two-channel telephone conversation excerpts involving the target speaker on their designated sides.
  4. 8summed: Eight summed-channel excerpts from telephone conversations of approximately five minutes total duration formed by sample-by-sample summing of their two sides. Each of these conversations will include both the target speaker and another speaker. These eight non-target speakers will all be distinct.

Test Conditions

The test segments in the 2010 evaluation consist of continuous conversational excerpts. As in recent evaluations, there is no prior removal of intervals of silence. Also, except for summed channel telephone conversations and long interview segments as described below, two separate conversation channels are provided (to aid systems in echo cancellation, dialog analysis, etc.). For all such two-channel segments, the primary channel containing the target speaker to be recognized is identified.

The three test segment conditions to be included are the following:  

  1. 10-sec: A two-channel excerpt from a telephone conversation estimated to contain approximately 10 seconds of speech of the putative target speaker on its designated side. (An energy-based automatic speech detector will be used to estimate the duration of actual speech in the chosen excerpts.)
  2. core: Either one two-channel telephone conversational excerpt of approximately five minutes total duration, with the target speaker channel designated, or a microphone-recorded conversational segment of three or eight minutes total duration involving the interviewee (speaker of interest) and an interviewer. In the former case the designated channel may be either a telephone channel or a room microphone channel; the other channel will always be a telephone channel. In the latter case the designated microphone channel will be the A channel, and most of the speech will generally be spoken by the interviewee, while the B channel will be that of the interviewer's head-mounted close-talking microphone, with some level of speech spectrum noise added to mask any residual speech of the target speaker in it.
  3. summed: A summed-channel telephone conversation of approximately five minutes total duration formed by sample-by-sample summing of its two sides.
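
As a rough illustration of what "sample-by-sample summing" of two conversation sides means, the following Python sketch adds two equal-length sample sequences. The function name `sum_channels`, the use of plain integer lists, and the clipping to the 16-bit linear PCM range are assumptions made for this example; they are not details taken from the evaluation plan.

```python
def sum_channels(side_a, side_b, lo=-32768, hi=32767):
    """Sum two conversation sides sample by sample.

    Assumes both sides are equal-length sequences of integer samples.
    Clipping to [lo, hi] (16-bit linear PCM) is an illustrative assumption.
    """
    if len(side_a) != len(side_b):
        raise ValueError("both sides must contain the same number of samples")
    return [max(lo, min(hi, a + b)) for a, b in zip(side_a, side_b)]


# Toy samples, not real speech data.
summed = sum_channels([100, -200, 30000], [50, -100, 10000])
print(summed)  # [150, -300, 32767]
```

After summing, the two speakers can no longer be separated by channel, which is why a system working on summed segments has to decide for itself which portions belong to the speaker of interest.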

The matrix of training and test segment condition combinations is shown below. Note that only 9 of the 12 possible condition combinations are included in this year's evaluation. Each test consists of a sequence of trials, where each trial consists of a target speaker, defined by the training data provided, and a test segment. The shaded box labeled "required" is the core test for the 2010 evaluation. All participants submitted results for this test.

 

Matrix of training and test segment conditions. The shaded entry is the required core test condition.

                      Test Segment Condition
Training Condition    10sec       core        summed
10sec                 optional    -           -
core                  optional    required    optional
8conv                 optional    optional    optional
8summed               -           optional    optional
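
The matrix above can also be written down as data, which makes it easy to confirm that it defines nine tests with a single required one. This is only a sketch: the dictionary layout and the names `TESTS` and `list_tests` are chosen for illustration and do not come from the evaluation plan.

```python
# Training condition -> {test segment condition: status}, copied from the matrix above.
TESTS = {
    "10sec": {"10sec": "optional"},
    "core": {"10sec": "optional", "core": "required", "summed": "optional"},
    "8conv": {"10sec": "optional", "core": "optional", "summed": "optional"},
    "8summed": {"core": "optional", "summed": "optional"},
}


def list_tests(matrix):
    """Return (training, test, status) triples for every included combination."""
    return [
        (train, test, status)
        for train, row in matrix.items()
        for test, status in row.items()
    ]


tests = list_tests(TESTS)
required = [(train, test) for train, test, status in tests if status == "required"]
print(len(tests), required)  # 9 [('core', 'core')]
```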

 

Common Evaluation Conditions

In each evaluation NIST has specified a common evaluation condition, a subset of trials in the core test that satisfy additional constraints, in order to better foster technical interactions and technology comparisons among sites. The performance results on these trials are treated as the basic official evaluation outcome.

Because of the multiple types of training and test conditions in the 2010 core test, and the likely disparity in the numbers of trials of different types, it is not appropriate to simply pool all trials as a primary indicator of overall performance. Rather, the common conditions considered in 2010 as primary performance indicators will include the following subsets of all of the core test trials:  

  1. All trials involving interview speech from the same microphone in training and test.
  2. All trials involving interview speech from different microphones in training and test.
  3. All trials involving interview training speech and normal vocal effort conversational telephone test speech.
  4. All trials involving interview training speech and normal vocal effort conversational telephone test speech recorded over a room microphone channel.
  5. All different-number trials involving normal vocal effort conversational telephone speech in training and test.
  6. All telephone channel trials involving normal vocal effort conversational telephone speech in training and high vocal effort conversational telephone speech in test.
  7. All room microphone channel trials involving normal vocal effort conversational telephone speech in training and high vocal effort conversational telephone speech in test.
  8. All telephone channel trials involving normal vocal effort conversational telephone speech in training and low vocal effort conversational telephone speech in test.
  9. All room microphone channel trials involving normal vocal effort conversational telephone speech in training and low vocal effort conversational telephone speech in test.
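
One way to read the nine common conditions is as a mapping from a trial's recording metadata to a condition number. The sketch below assumes a hypothetical trial schema (the field names `style`, `mic`, `effort`, `channel`, and `different_number` are invented for this example and are not a NIST file format). It also assumes that condition 3 refers to test speech recorded over a telephone channel, and that a trial belongs to at most one condition.

```python
def common_condition(trial):
    """Map one core-test trial to its common condition (1-9), or None.

    Hypothetical schema (illustrative only):
      train/test "style":  "interview" or "phonecall"
      train/test "mic":    microphone identifier (interview speech)
      train/test "effort": "normal", "high" or "low" (telephone speech)
      test "channel":      "phone" or "mic"
      "different_number":  True if training and test calls used different numbers
    """
    train, test = trial["train"], trial["test"]
    if train["style"] == "interview":
        if test["style"] == "interview":
            return 1 if train["mic"] == test["mic"] else 2
        if test["effort"] == "normal":
            return 3 if test["channel"] == "phone" else 4
        return None
    # From here on, training is conversational telephone speech.
    if train["effort"] != "normal" or test["style"] != "phonecall":
        return None
    if test["effort"] == "normal":
        return 5 if trial["different_number"] else None
    if test["effort"] == "high":
        return 6 if test["channel"] == "phone" else 7
    if test["effort"] == "low":
        return 8 if test["channel"] == "phone" else 9
    return None


example = {
    "train": {"style": "phonecall", "effort": "normal"},
    "test": {"style": "phonecall", "effort": "high", "channel": "mic"},
    "different_number": True,
}
print(common_condition(example))  # 7
```

Trials that map to `None` in this sketch would still be part of the core test; they simply fall outside the subsets used as primary performance indicators.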

Evaluation Rules

Participants were given a set of rules to follow during the evaluation. The rules were created to ensure the quality of the evaluation and can be found in the evaluation plan.

Performance Measure

A single basic cost model for measuring speaker detection performance has been used in all previous NIST speaker recognition evaluations. In 2010, however, for two of the test conditions, including the core condition, a new set of parameter values was used to compute the detection cost over the test trials. The old parameter values used in previous evaluations were used for the other conditions.

C_Det = C_Miss × P_Miss|Target × P_Target + C_FalseAlarm × P_FalseAlarm|NonTarget × (1 − P_Target)

 

Speaker Detection Cost Model Parameters for the core and 8conv/core test conditions.

C_Miss    C_FalseAlarm    P_Target
1         1               0.0001

Speaker Detection Cost Model Parameters to be computed for all test conditions.

C_Miss    C_FalseAlarm    P_Target
10        1               0.01
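
To make the cost model concrete, here is a minimal Python sketch that evaluates the detection cost for a given pair of error rates under both parameter sets. The parameter values are copied from the tables above; the function names, the dictionary keys `new` and `old`, and the example error rates (10% misses, 1% false alarms) are illustrative assumptions, not results from the evaluation.

```python
def error_rates(target_decisions, nontarget_decisions):
    """Compute (P_Miss|Target, P_FalseAlarm|NonTarget) from accept/reject decisions.

    Each argument is a list of booleans, True meaning the system accepted the trial.
    """
    p_miss = sum(not d for d in target_decisions) / len(target_decisions)
    p_fa = sum(bool(d) for d in nontarget_decisions) / len(nontarget_decisions)
    return p_miss, p_fa


def detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """Detection cost: C_Miss*P_Miss*P_Target + C_FalseAlarm*P_FA*(1 - P_Target)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


# Parameter sets as listed in the tables above.
PARAMS = {
    "new": {"c_miss": 1, "c_fa": 1, "p_target": 0.0001},
    "old": {"c_miss": 10, "c_fa": 1, "p_target": 0.01},
}

# Illustrative error rates only.
p_miss, p_fa = 0.10, 0.01
for name, params in PARAMS.items():
    print(name, detection_cost(p_miss, p_fa, **params))
```

Because the target prior in the first parameter set is so small, its cost is dominated almost entirely by the false alarm rate, whereas the second set gives misses and false alarms comparable weight in this example.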

 

Cost Function for SRE10

More information on the performance measurement can be found in the evaluation plan. 

Results Representation

Detection Error Tradeoff (DET) curves, a linearized version of ROC curves, are used to show all operating points as the likelihood threshold is varied. Two special operating points are plotted on the curve: (a) the system decision point and (b) the optimal decision point. More information on the DET curve can be found in Martin, A. F. et al., "The DET Curve in Assessment of Detection Task Performance", Proc. Eurospeech '97, Rhodes, Greece, September 1997, Vol. 4, pp. 1899-1903.
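
The following Python sketch shows, under simplifying assumptions, how the points of a DET curve can be obtained from system scores: the decision threshold is swept over all observed scores, the miss and false alarm rates are computed at each threshold, and both rates are mapped to a normal deviate (probit) scale for plotting. It also picks the threshold with the lowest detection cost, which corresponds to the optimal decision point described above. The function names, the toy scores, and the clipping constant `eps` are assumptions for illustration; this is not NIST's scoring software.

```python
from statistics import NormalDist


def det_points(target_scores, nontarget_scores):
    """Return (threshold, p_miss, p_fa) for each observed score used as threshold.

    A trial is accepted when its score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, p_miss, p_fa))
    return points


def probit(p, eps=1e-6):
    """Map a probability to the normal deviate scale used on DET axes.

    Probabilities are clipped to [eps, 1 - eps] so that 0 and 1 stay finite.
    """
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)


def min_cost_point(points, c_miss, c_fa, p_target):
    """Return the (threshold, p_miss, p_fa) point with the lowest detection cost."""
    return min(
        points,
        key=lambda pt: c_miss * pt[1] * p_target + c_fa * pt[2] * (1.0 - p_target),
    )


# Toy scores, not evaluation data.
targets = [2.1, 1.4, 0.3, 1.9, -0.2]
nontargets = [-1.5, -0.7, 0.5, -2.0, -0.1, -1.1]

points = det_points(targets, nontargets)
curve = [(probit(p_fa), probit(p_miss)) for _, p_miss, p_fa in points]
best = min_cost_point(points, c_miss=10, c_fa=1, p_target=0.01)
print(best)
```

The system decision point, by contrast, is fixed by the hard decisions a system actually submitted, so it cannot be derived from the scores alone.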

Evaluation Results

The graphs below show the results for all sites' primary systems for each test and condition. The common conditions reference the conditions found in the Common Evaluation Condition section.  

 

DET plots available by test and common condition (thumbnail images not reproduced):

Test               Common conditions
10sec-10sec        5
8conv-10sec        5
8conv-core         5
8conv-coreext      5, 6, 8
8summed-core       5
8summed-summed     5
core-10sec         5
core-core          1, 2, 3, 4, 5, 6, 7, 8, 9
coreext-coreext    1, 2, 3, 4, 5, 6, 7, 8, 9
core-summed        5