MULTI-SPEAKER SPEECH RECOGNITION FACILITATED BY LANGUAGE MODELS

Organization Name

NVIDIA Corporation

Inventor(s)

Taejin Park of San Jose CA (US)

Kunal Dhawan of San Jose CA (US)

Nithin Rao Koluguri of Milpitas CA (US)

Jagadeesh Balam of Campbell CA (US)

This abstract first appeared for US patent application 20250078842 titled 'MULTI-SPEAKER SPEECH RECOGNITION FACILITATED BY LANGUAGE MODELS'.

Original Abstract Submitted

Disclosed are apparatuses, systems, and techniques that leverage one or more language models (LMs), such as large language models (LLMs), for efficient multi-speaker speech recognition. The techniques include processing, using a speaker diarization model, an audio feature to generate a first association of the audio feature with one or more prospective speakers, the audio feature being representative of one or more spoken words. The techniques further include providing, to an LM, a first prompt requesting the LM to identify a second association of the one or more spoken words with the one or more prospective speakers and receiving, from the LM, a first response identifying the second association of the one or more spoken words with the one or more prospective speakers. The techniques further include determining, using the first association and the second association, one or more speakers that produced the one or more spoken words.
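
The flow described in the abstract can be illustrated with a minimal sketch. It is not the method of the application: the abstract does not say how the two associations are represented or combined, so this example assumes each association is a mapping from speaker label to probability and combines them with a weighted average. The names `build_prompt`, `combine`, `assign_speakers`, and `stub_lm`, the prompt wording, and the `weight` parameter are all hypothetical, and the LM is replaced by a stub that returns fixed scores.

```python
from typing import Callable, Dict, List

# Assumption: an "association" is a map from speaker label to probability.
Association = Dict[str, float]


def build_prompt(context: str, word: str, speakers: List[str]) -> str:
    """Format a hypothetical prompt asking the LM to score each speaker."""
    return (
        f"Transcript so far: {context}\n"
        f"Next word: {word}\n"
        f"For each of {', '.join(speakers)}, give the probability that "
        "this speaker said the next word."
    )


def combine(first: Association, second: Association, weight: float = 0.5) -> str:
    """Return the speaker with the highest weighted average of both scores.

    Weighted averaging is an assumed fusion rule, chosen for illustration.
    """
    speakers = sorted(set(first) | set(second))
    fused = {
        s: weight * first.get(s, 0.0) + (1.0 - weight) * second.get(s, 0.0)
        for s in speakers
    }
    return max(speakers, key=lambda s: fused[s])


def assign_speakers(
    words: List[str],
    diarization: List[Association],
    lm: Callable[[str], Association],
    weight: float = 0.5,
) -> List[str]:
    """Label each word using the diarization scores and the LM scores."""
    speakers = sorted({s for assoc in diarization for s in assoc})
    labels: List[str] = []
    context = ""
    for word, first in zip(words, diarization):
        # Second association: ask the LM (here any callable) about this word.
        second = lm(build_prompt(context, word, speakers))
        speaker = combine(first, second, weight)
        labels.append(speaker)
        context += f" [{speaker}] {word}"
    return labels


def stub_lm(prompt: str) -> Association:
    """Stand-in for a real LM call; returns fixed, made-up scores."""
    if "Next word: hi" in prompt:
        return {"spk0": 0.2, "spk1": 0.8}
    return {"spk0": 0.5, "spk1": 0.5}


if __name__ == "__main__":
    words = ["hello", "hi"]
    # Made-up diarization output: one association per word.
    diarization = [
        {"spk0": 0.9, "spk1": 0.1},
        {"spk0": 0.55, "spk1": 0.45},
    ]
    print(assign_speakers(words, diarization, stub_lm))  # ['spk0', 'spk1']
```

In this toy run the diarization scores alone would give both words to `spk0`, while the stubbed LM scores shift the second word to `spk1`, which shows how a second association can change the final decision. A real system would replace `stub_lm` with an actual LM call and would need to parse the LM response into scores.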