Date of Release: Tuesday, January 10, 2013
The 2011 NIST Language Recognition Evaluation (LRE11) was the fifth in the evaluation series that began in 1996 to evaluate human-language recognition technology. NIST conducts these evaluations in order to support language recognition (LR) research and help advance the state-of-the-art in language recognition. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official LRE11 evaluation plan. More information is available in a brief presentation.
These results are not to be construed, or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial LR products were generally from research systems, not commercially available products. Since LRE11 was an evaluation of research algorithms, the LRE11 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.
The data, protocols, and metrics employed in this evaluation were chosen to support LR research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.
Because of the above reasons, this should not be interpreted as a product testing exercise and the results should not be used to make conclusions regarding which commercial products are best for a particular application.
The 2011 NIST language recognition evaluation task was language detection in the context of a fixed pair of languages: Given a segment of speech and a specified language pair (i.e., two of the possible target languages of interest), the task was to decide which of these two languages is in fact spoken in the given segment, based on an automated analysis of the data contained in the segment.
There were 24 target languages included in LRE11. Table 1 lists the included target languages.
Table 1: The 24 target languages.
Arabic Iraqi Arabic Levantine Arabic Maghrebi Arabic MSA Bengali Czech Dari English (American) English (Indian) Farsi/Persian Hindi Lao Mandarin Panjabi Pashto Polish Russian Slovak Spanish Tamil Thai Turkish Ukranian Urdu
Participants in the evaluation were given conversational telephone speech data from past LREs to train their systems and narrowband broadcast data from multiple broadcast sources.
The evaluation data contained narrowband broadcast speech from multiple broadcast sources. All evaluation data was provided in 16-bit 8 KHz format. Table 2 shows the language statistics. Some languages have more data than others due to data availability. Participants could also augment this data with additional data obtained independently. However, any additional data that sites utilized must be publicly available and the source for the data had to be documented in their system descriptions.
Table 2: Counts of telephone segments, broadcast segments, and broadcast sources per language.
Number of Telephone Segments
Number of Broadcast Segments
Number of Broadcast Sources
Each of the target languages included 100 or more test segments of each of the three test durations. All segments were in 16-bit linear pcm format, and segments derived from conversational telephone speech were not distinguished from segments derived from narrow band speech.
Participants were given a set of rules to follow during the evaluation. The rules were created to ensure the quality of the evaluation and can be found in the evaluation plan.
Performance was measured for each target language pair. For each pair L1/L2, the miss probabilities for L1 and for L2 over all segments in either language were determined. In addition, these probabilities were combined into a single number representing the cost performance of a system for distinguishing the two languages (see Figure 1). An overall performance measure for each system was computed as an average cost for those target language pairs presenting the greatest challenge to the system.
Let N be the number of target languages included in the evaluation. For each segment duration (30-second, 10-second, or 3-second), a system’s overall performance measure was based on the N target language pairs for which the minimum cost operating points for 30-second segments are greatest. For each segment duration, the performance measure was then the mean of the actual decision operating point cost function values over these N pairs.
Thus calibration errors were not considered in choosing which N cost function pairs to average, but did contribute to the system’s overall performance measure.
Figure 1: Evaluation metric measuring system performance for a language pair. CL1, CL2 and PL1 are application model parameters, which may be viewed as the costs of a miss for L1 and L2, respectively, and PL1 as the prior probability for L1 with respect to this language pair
A diverse group of organizations from three continents participated in the evaluation. Table 3 lists these organizations.
Table 3: Participating organizations in LRE11.
Universidad Autonoma de Madrid
L2F-Spoken Language Systems Lab INESC-ID Lisboa
University of the Basque Country
Brno, Czech Republic
Brno University of Technology
Brno, Czech Republic
University of Texas at Dallas
Richardson, Texas, USA
University of the Basque Country#13;
University of Zaragoza
Chinese Academy of Sciences
iFlyTek Speech Lab, EEIS University of Science and Technology of China
HeFei, AnHui, China
Institute for Infocomm Research
Indian Institute of Technology, Kharagpur
Spoken Language Systems Lab INESC-ID Lisboa
CNRS-LIMSI (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur)
Laboratorie Informatique D'Avignon
MIT Lincoln Laboratory
Lexington, MA, USA
MIT Computer Science and Artificial Intelligence Laboratory
Cambridge, MA, USA
National Taipei University of Technology
Tsinghua University Department of Electrical Engineering
Cirencester, United Kingdom
Swansea, United Kingdom
The results are presented without attribution to the participating organizations in accordance with a policy established for NIST Language Recognition Evaluations.
What follows are the overall scores for primary systems in LRE11 (Figure 2), the language pairs most often contributing to the overall scores for top primary systems (Figure 3), and a language confusability matrix for a leading primary system (Figure 4).
Figure 2: The overall performance measure for the primary systems in LRE11. Each of the 17 stacked-bars represents the performance for one primary system. The royal blue bars show performance on 30-second segments, the royal blue and cyan bars combined show performance on 10-second segments, and the royal blue, cyan, and white bars combined show performance on 3-second segments.
Language Pairs Contributing to Overall Scores
Figure 3: The language pairs most often contributing to the overall primary measure of the leading six primary systems.
Language Pair Confusability Matrix
Figure 4: The language confusability matrix for a leading system on 3 second segments. The darker the cell for a language pair, the higher the cost the system had for that pair.