Intel Corporation (20250014590). MULTIMODAL LARGE LANGUAGE MODEL WITH AUDIO TRIGGER
This abstract first appeared for US patent application 20250014590 titled 'MULTIMODAL LARGE LANGUAGE MODEL WITH AUDIO TRIGGER'.
Original Abstract Submitted
Systems and methods to trigger LLM inference based on the presence of relevant audio, such as a keyword or sound event of interest. A detection head receives acoustic embeddings from an audio encoder and determines whether the audio stream includes relevant sounds (e.g., a selected audio trigger). When the audio stream does not include relevant sounds, multimodal LLM inference is bypassed, thereby saving power and protecting privacy. When the detection head detects relevant sounds in the audio stream, the acoustic embeddings from the audio encoder are transmitted to the multimodal LLM, which proceeds to perform inference on the acoustic embeddings. The audio encoder and/or detection head can be offloaded in hardware and implemented before the multimodal LLM in the hardware pipeline, while the multimodal LLM can be implemented in a neural processing unit.
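The gating flow described in the abstract can be illustrated with a minimal Python sketch. It is not the method claimed in the application: the encoder (`toy_audio_encoder`), the logistic `detection_head`, the weights, bias, and threshold, and the `llm` callable are all hypothetical stand-ins chosen only to show the control flow, in which the detection head scores the encoder's embeddings and the multimodal LLM is invoked only when the score crosses a threshold.

```python
import math
from typing import Callable, List, Optional, Sequence

Embedding = List[float]


def toy_audio_encoder(frame: Sequence[float]) -> Embedding:
    """Hypothetical stand-in for the audio encoder.

    Returns a 2-dimensional "embedding": mean absolute amplitude and mean energy.
    A real encoder would be a trained neural network.
    """
    if not frame:
        raise ValueError("frame must contain at least one sample")
    n = len(frame)
    mean_abs = sum(abs(x) for x in frame) / n
    energy = sum(x * x for x in frame) / n
    return [mean_abs, energy]


def detection_head(embedding: Embedding,
                   weights: Sequence[float],
                   bias: float) -> float:
    """Hypothetical detection head: a logistic score in (0, 1)."""
    z = sum(w * e for w, e in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gated_inference(frame: Sequence[float],
                    llm: Callable[[Embedding], str],
                    weights: Sequence[float] = (4.0, 4.0),  # assumed values
                    bias: float = -2.0,                      # assumed value
                    threshold: float = 0.5) -> Optional[str]:
    """Run the LLM only when the detection head flags relevant audio."""
    embedding = toy_audio_encoder(frame)
    if detection_head(embedding, weights, bias) < threshold:
        # No trigger detected: bypass LLM inference entirely.
        return None
    # Trigger detected: forward the same acoustic embeddings to the LLM.
    return llm(embedding)
```

In this sketch a silent frame returns `None` without ever calling `llm`, while a sufficiently loud frame forwards the encoder's embeddings to it. Reusing the embeddings for both detection and inference mirrors the abstract's description, in which the encoder output feeds the detection head first and the multimodal LLM second.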