The 2019 NIST Audio-Visual Speaker Recognition Evaluation

Seyed Omid Sadjadi; Craig S. Greenberg; Elliot Singer; Douglas A. Reynolds; Lisa Mason; Jaime Hernandez-Cordero

Published

May 18, 2020

Abstract

In 2019, the U.S. National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). SRE19 had two components: 1) a leaderboard-style Challenge using unexposed conversational telephone speech (CTS) data from the Call My Net 2 (CMN2) corpus, and 2) an audio-visual (AV) evaluation using video material extracted from the unexposed portions of the Video Annotation for Speech Technology (VAST) corpus. This paper presents an overview of the Audio-Visual SRE19, including the task, the performance metric, the data, the evaluation protocol, the results, and system performance analyses. The Audio-Visual SRE19 was organized in a similar manner to the audio from video (AfV) track in SRE18, except that it offered only the open training condition. In addition, instead of extracting and releasing only the AfV data, unexposed multimedia data from the VAST corpus was used to support the Audio-Visual SRE19. It featured two core evaluation tracks, audio-only and audio-visual, as well as an optional visual-only track. A total of 26 organizations (forming 14 teams) from academia and industry participated in the Audio-Visual SRE19 and submitted 102 valid system outputs. Evaluation results indicate that: 1) audio-only speaker recognition improved notably on the challenging amateur online video domain, owing to the use of more complex neural network architectures (e.g., ResNet) along with soft margin losses; 2) state-of-the-art speaker and face recognition technologies provide comparable person recognition performance on the amateur online video domain; and 3) audio-visual fusion results in remarkable performance gains (greater than 85% relative) over the audio-only or face-only systems.

Conference Dates

November 1-5, 2020

Conference Location

Tokyo

Conference Title

The Speaker and Language Recognition Workshop: Odyssey 2020

Pub Type

Conferences

Keywords

audio-visual fusion, face recognition, NIST SRE, person recognition, speaker recognition

Video analytics, Statistical analysis, Image and signal processing, Human language technology and Artificial intelligence

Citation

Sadjadi, S., Greenberg, C., Singer, E., Reynolds, D., Mason, L. and Hernandez-Cordero, J. (2020), The 2019 NIST Audio-Visual Speaker Recognition Evaluation, The Speaker and Language Recognition Workshop: Odyssey 2020, Tokyo, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=929541 (Accessed December 30, 2025)

Created May 18, 2020, Updated March 27, 2020
