Skip to main content
U.S. flag

An official website of the United States government

Dot gov

Official websites use .gov
A .gov website belongs to an official government organization in the United States.

Https

Secure .gov websites use HTTPS
A lock ( ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

The 2019 NIST Audio-Visual Speaker Recognition Evaluation

Published

Author(s)

Seyed Omid Sadjadi, Craig S. Greenberg, Elliot Singer, Douglas A. Reynolds, Lisa Mason, Jaime Hernandez-Cordero

Abstract

In 2019, the U.S. National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). There were two components to SRE19: 1) a leaderboard style Challenge using unexposed conversational telephone speech (CTS) data from the Call My Net 2 (CMN2) corpus, and 2) an Audio-visual (AV) evaluation using video material extracted from the unexposed portions of the Video Annotation for Speech Technology (VAST) corpus. This paper presents an overview of the Audio-Visual SRE19 including the task, the performance metric, data, and the evaluation protocol, results and system performance analyses. The Audio-Visual SRE19 was organized in a similar manner to the audio from video (AfV) track in SRE18, except it offered only the open training condition. In addition, instead of extracting and releasing only the AfV data, unexposed multimedia data from the VAST corpus was used to support the Audio-Visual SRE19. It featured two core evaluation tracks, namely audio only and audio-visual tracks, as well as an optional visual only track. A total of 26 organizations (forming 14 teams) from academia and industry participated in the Audio-Visual SRE19 and submitted 102 valid system outputs. Evaluation results indicate: 1) notable performance improvements for the audio only speaker recognition on the challenging amateur online video domain due to the use of more complex neural network architectures (e.g., ResNet) along with soft margin losses, 2) state-of-the-art speaker and face recognition technologies provide comparable person recognition performance on the amateur online video domain, and 3) audio-visual fusion results in remarkable performance gains (greater than 85% relative) over the audio only or face only systems.
Conference Dates
November 1-5, 2020
Conference Location
Tokyo, -1
Conference Title
The Speaker and Language Recognition Workshop: Odyssey 2020

Keywords

audio-visual fusion, face recognition, NIST SRE, person recognition, speaker recognition
Created May 18, 2020, Updated March 27, 2020