The 2019 NIST Audio-Visual Speaker Recognition Evaluation
Seyed Omid Sadjadi, Craig S. Greenberg, Elliot Singer, Douglas A. Reynolds, Lisa Mason, Jaime Hernandez-Cordero
In 2019, the U.S. National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). There were two components to SRE19: 1) a leaderboard style Challenge using unexposed conversational telephone speech (CTS) data from the Call My Net 2 (CMN2) corpus, and 2) an Audio-visual (AV) evaluation using video material extracted from the unexposed portions of the Video Annotation for Speech Technology (VAST) corpus. This paper presents an overview of the Audio-Visual SRE19 including the task, the performance metric, data, and the evaluation protocol, results and system performance analyses. The Audio-Visual SRE19 was organized in a similar manner to the audio from video (AfV) track in SRE18, except it offered only the open training condition. In addition, instead of extracting and releasing only the AfV data, unexposed multimedia data from the VAST corpus was used to support the Audio-Visual SRE19. It featured two core evaluation tracks, namely audio only and audio-visual tracks, as well as an optional visual only track. A total of 26 organizations (forming 14 teams) from academia and industry participated in the Audio-Visual SRE19 and submitted 102 valid system outputs. Evaluation results indicate: 1) notable performance improvements for the audio only speaker recognition on the challenging amateur online video domain due to the use of more complex neural network architectures (e.g., ResNet) along with soft margin losses, 2) state-of-the-art speaker and face recognition technologies provide comparable person recognition performance on the amateur online video domain, and 3) audio-visual fusion results in remarkable performance gains (greater than 85% relative) over the audio only or face only systems.
November 1-5, 2020
The Speaker and Language Recognition Workshop: Odyssey 2020
, Greenberg, C.
, Singer, E.
, Olson, D.
, Mason, L.
and Hernandez-Cordero, J.
The 2019 NIST Audio-Visual Speaker Recognition Evaluation, The Speaker and Language Recognition Workshop: Odyssey 2020, Tokyo, -1, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=929541
(Accessed September 24, 2021)