Omid Sadjadi, Craig Greenberg, Elliot Singer, Douglas A. Reynolds, Lisa Mason, Jaime Hernandez-Cordero
In 2018, the U.S. national institute of standards and technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). SRE18 was organized in a similar manner to SRE16, focusing on speaker detection over conversational telephony speech (CTS) collected outside north America. SRE18 also featured several new aspects including: two new data domains, namely voice over internet protocol (VoIP) and audio extracted from amateur online videos (AfV), as well as a new language (Tunisian Arabic). A total of 78 organizations (forming 48 teams) from academia and industry participated in SRE18 and submitted 129 valid system outputs under fixed and open training conditions first introduced in SRE16. This paper presents an overview of the evaluation and several analyses of system performance for all primary conditions in SRE18. The evaluation results suggest 1) speaker recognition on AfV was more challenging than on telephony data, 2) speaker representations (aka embeddings) extracted using end-to-end neural network frameworks were most effective, 3) top performing systems exhibited similar performance, and 4) greatest performance improvements were largely due to data augmentation, use of extended and more complex models for data representation, as well as effective use of the provided development sets.
, Greenberg, C.
, Singer, E.
, Reynolds, D.
, Mason, L.
and Hernandez-Cordero, J.
The 2018 NIST Speaker Recognition Evaluation, INTERSPEECH, Graz, AT, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=927673
(Accessed September 24, 2023)