17850617. Multi-Talker Audio Stream Separation, Transcription and Diarization simplified abstract (Amazon Technologies, Inc.)

Multi-Talker Audio Stream Separation, Transcription and Diarization

Organization Name

Amazon Technologies, Inc.

Inventor(s)

Masahito Togami of San Jose CA (US)

Ritwik Giri of Mountain View CA (US)

Michael Mark Goodwin of Scotts Valley CA (US)

Arvindh Krishnaswamy of Palo Alto CA (US)

Siddhartha Shankara Rao of Seattle WA (US)

Multi-Talker Audio Stream Separation, Transcription and Diarization - A simplified explanation of the abstract

This abstract first appeared for US patent application 17850617 titled 'Multi-Talker Audio Stream Separation, Transcription and Diarization'.

Simplified Explanation

The abstract describes a system that separates a multi-talker audio stream into single-talker streams using personalized noise suppression driven by talker embedding vectors, then transcribes and merges the results. Here is a simplified explanation of the abstract, followed by a code sketch of the pipeline:

  • Talker embedding vectors are derived for different talkers in an audio stream based on their voice characteristics.
  • Personalized noise suppression models are applied to the audio stream using the talker embedding vectors.
  • Single-talker audio streams are generated using the personalized noise suppression models.
  • Single-talker transcriptions are created and merged into a multi-talker output transcription.
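
A minimal sketch of this pipeline in Python is shown below. The helper names (derive_talker_embeddings, suppress_noise_for_talker, transcribe) are hypothetical placeholders for the components the abstract describes, not APIs from the patent or any particular library; only the overall flow follows the abstract.

    # Sketch of the described pipeline; the three callables passed in are
    # hypothetical placeholders, not real library APIs.
    from dataclasses import dataclass


    @dataclass
    class Segment:
        talker_id: int   # index of the talker this segment belongs to
        start_s: float   # segment start time in seconds
        end_s: float     # segment end time in seconds
        text: str        # transcribed text for this segment


    def separate_and_transcribe(audio, sample_rate,
                                derive_talker_embeddings,
                                suppress_noise_for_talker,
                                transcribe):
        """Per-talker noise suppression, transcription, and merging."""
        # 1. One embedding vector per talker (via pre-enrollment or clustering).
        embeddings = derive_talker_embeddings(audio, sample_rate)

        segments = []
        for talker_id, embedding in enumerate(embeddings):
            # 2. Run a personalized noise-suppression instance conditioned on
            #    this talker's embedding to obtain a single-talker stream.
            single_talker_audio = suppress_noise_for_talker(audio, embedding)

            # 3. Transcribe the single-talker stream; assume the recognizer
            #    yields (start_s, end_s, text) tuples.
            for start_s, end_s, text in transcribe(single_talker_audio, sample_rate):
                segments.append(Segment(talker_id, start_s, end_s, text))

        # 4. Merge the single-talker transcriptions into one time-ordered,
        #    talker-labeled (diarized) multi-talker transcription.
        return sorted(segments, key=lambda s: s.start_s)

Sorting the per-talker segments by start time is just one plausible way to realize the merging step; the abstract does not specify how the merge is performed.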

Potential Applications

  • Improved speech recognition in multi-talker environments
  • Enhanced audio quality in conference calls or group discussions

Problems Solved

  • Addressing background noise in audio streams with multiple talkers
  • Enhancing speech intelligibility in complex audio environments

Benefits

  • Personalized noise suppression for each talker
  • Clearer and more accurate transcriptions in multi-talker scenarios

Potential Commercial Applications

  • Audio conferencing systems
  • Transcription services for meetings or interviews

Possible Prior Art

One example of possible prior art is the use of speaker recognition technology in audio processing to differentiate between multiple speakers in a conversation. Such technology has been used in applications such as call center analytics and security systems.

Unanswered Questions

1. How does the system handle overlapping speech from multiple talkers in the audio stream?

The abstract does not provide details on how the system distinguishes and processes overlapping speech from different talkers.

2. What is the computational overhead of generating and applying multiple instances of personalized noise suppression models in real-time applications?

The abstract does not mention the computational resources required to run multiple instances of the personalized noise suppression model on the input audio stream.


Original Abstract Submitted

A plurality of talker embedding vectors may be derived that correspond to a plurality of talkers in an input audio stream. Each talker embedding vector may represent respective voice characteristics of a respective talker. The talker embedding vectors may be generated based on, for example, a pre-enrollment process or a cluster-based embedding vector derivation process. A plurality of instances of a personalized noise suppression model may be executed on the input audio stream. Each instance of the personalized noise suppression model may employ a respective talker embedding vector. A plurality of single-talker audio streams may be generated by the plurality of instances of the personalized noise suppression model. A plurality of single-talker transcriptions may be generated based on the plurality of single-talker audio streams. The plurality of single-talker transcriptions may be merged into a multi-talker output transcription.
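
The cluster-based embedding vector derivation mentioned above could look roughly like the following Python sketch. The frame-level speaker encoder is a hypothetical callable (the abstract does not name one), plain k-means is used only for illustration, and the number of talkers is assumed to be known; the patent does not specify the clustering method.

    # Sketch of a cluster-based talker-embedding derivation, assuming a
    # hypothetical frame_encoder that maps a short audio frame to a vector.
    import numpy as np
    from sklearn.cluster import KMeans


    def derive_talker_embeddings_by_clustering(audio, sample_rate, frame_encoder,
                                               num_talkers, frame_s=1.0):
        """Return one centroid embedding per talker from an unlabeled stream."""
        hop = int(frame_s * sample_rate)
        # Slice the stream into fixed-length frames and embed each frame.
        frames = [audio[i:i + hop] for i in range(0, len(audio) - hop + 1, hop)]
        frame_embeddings = np.stack([frame_encoder(frame) for frame in frames])

        # Cluster the frame embeddings; each cluster is assumed to collect the
        # frames dominated by one talker, so its centroid can serve as that
        # talker's embedding vector.
        kmeans = KMeans(n_clusters=num_talkers, n_init=10).fit(frame_embeddings)
        return kmeans.cluster_centers_

In the pre-enrollment alternative mentioned in the abstract, each talker's embedding would presumably be computed in advance from a known recording of that talker rather than derived by clustering the input stream.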