US Patent Application 17828240. CONDITIONAL FACTORIZATION FOR JOINTLY MODELING CODE-SWITCHED AND MONOLINGUAL ASR simplified abstract

From WikiPatents


Organization Name

Tencent America LLC

Inventor(s)

Chunlei Zhang of Bellevue WA (US)

Brian Yan of Palo Alto CA (US)

Dong Yu of Palo Alto CA (US)

CONDITIONAL FACTORIZATION FOR JOINTLY MODELING CODE-SWITCHED AND MONOLINGUAL ASR - A simplified explanation of the abstract

This abstract first appeared for US patent application 17828240, titled 'CONDITIONAL FACTORIZATION FOR JOINTLY MODELING CODE-SWITCHED AND MONOLINGUAL ASR'.

Simplified Explanation

This patent application describes a method, apparatus, and computer-readable medium for automatic speech recognition using conditional factorization for bilingual code-switched and monolingual speech.

  • The approach involves receiving an audio observation sequence that contains audio in either a first language or a second language.
  • The audio observation sequence is then mapped into two separate sequences of hidden representations using encoders specific to each language.
  • A label-to-frame sequence is generated based on the hidden representations from both languages using a joint neural network model.
  • This method allows for accurate speech recognition in bilingual code-switched and monolingual speech scenarios.
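The dataflow described above (one shared audio input, two language-specific encoders, one joint network producing a label-to-frame sequence) can be sketched as follows. This is a minimal illustrative NumPy sketch, not the patented implementation: all dimensions, weight matrices, and the tanh/argmax choices are hypothetical stand-ins for trained neural network components.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(frames, weights):
    # Hypothetical encoder: map each audio frame into a hidden
    # representation via a linear projection and tanh nonlinearity.
    return np.tanh(frames @ weights)

# Audio observation sequence: 10 frames of 8-dim features (made-up sizes).
frames = rng.standard_normal((10, 8))

W_lang1 = rng.standard_normal((8, 16))  # encoder weights, first language
W_lang2 = rng.standard_normal((8, 16))  # encoder weights, second language

# The SAME observation sequence is mapped by both encoders,
# yielding two separate sequences of hidden representations.
h1 = encoder(frames, W_lang1)
h2 = encoder(frames, W_lang2)

# Joint network (here just a linear layer over the concatenated hidden
# sequences): emits one label per frame, i.e. a label-to-frame sequence.
W_joint = rng.standard_normal((32, 5))  # 5 output labels, hypothetical
logits = np.concatenate([h1, h2], axis=-1) @ W_joint
label_to_frame = logits.argmax(axis=-1)
```

In an actual system each encoder and the joint model would be trained networks, and the label-to-frame alignment would typically come from a sequence-level criterion rather than a per-frame argmax; the sketch only shows how the two encoders' outputs are conditionally combined.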


Original Abstract Submitted

A method, apparatus, and non-transitory computer-readable medium for automatic speech recognition using conditional factorization for bilingual code-switched and monolingual speech may include receiving an audio observation sequence comprising a plurality of frames, the audio observation sequence including audio in a first language or a second language. The approach may further include mapping the audio observation sequence into a first sequence of hidden representations, the mapping being generated by a first encoder corresponding to the first language and mapping the audio observation sequence into a second sequence of hidden representations, the mapping being generated by a second encoder corresponding to the second language. The approach may further include generating a label-to-frame sequence based on the first sequence of hidden representations and the second sequence of hidden representations, using a joint neural network based model.