Microsoft Technology Licensing, LLC (20240338860). TEXT AND IMAGE GENERATION FOR CREATION OF IMAGERY FROM AUDIBLE INPUT simplified abstract


TEXT AND IMAGE GENERATION FOR CREATION OF IMAGERY FROM AUDIBLE INPUT

Organization Name

Microsoft Technology Licensing, LLC

Inventor(s)

Alexander Ian Pfister Trzyna of Seattle WA (US)

TEXT AND IMAGE GENERATION FOR CREATION OF IMAGERY FROM AUDIBLE INPUT - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240338860, titled 'TEXT AND IMAGE GENERATION FOR CREATION OF IMAGERY FROM AUDIBLE INPUT'.

The patent application describes systems and methods for using an artificial intelligence model to generate live images based on audio transcription.

  • The system converts a live audio stream into a text transcript using speech-to-text conversion.
  • A segment of the transcript is used to generate a language model prompt requesting summarization.
  • The summarization is received from a large language model and used to generate a prompt for an image of the summarization.
  • A text-to-image model generates an image from that prompt, and the image is displayed on a screen.
  • Images continue to be generated and displayed for as long as the live audio stream is received.
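
The steps above can be sketched as a simple loop. This is a minimal illustration, not the implementation described in the application: the function names (`run_pipeline`, `build_summary_prompt`, `build_image_prompt`), the `Frame` record, and the prompt wording are all hypothetical, and the speech-to-text, LLM, and text-to-image models are passed in as plain callables because the abstract does not name any particular model or API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Frame:
    """One pass through the pipeline (hypothetical record type)."""
    segment: str        # transcript segment produced by speech-to-text
    summary: str        # summarization returned by the LLM
    image_prompt: str   # second prompt, sent to the text-to-image model
    image: bytes        # image data returned by the text-to-image model


def build_summary_prompt(segment: str) -> str:
    # First LM prompt: transcript segment plus a request for summarization.
    # The wording is an assumption; the abstract does not give a template.
    return (
        "Summarize the following transcript segment in one or two sentences:"
        f"\n\n{segment}"
    )


def build_image_prompt(summary: str) -> str:
    # Second LM prompt: the summarization plus a request for an image of it.
    return f"Generate an image depicting: {summary}"


def run_pipeline(
    audio_chunks: Iterable[bytes],
    speech_to_text: Callable[[bytes], str],
    summarize: Callable[[str], str],
    text_to_image: Callable[[str], bytes],
    display: Callable[[bytes], None],
) -> Iterator[Frame]:
    """Generate and display one image per non-empty transcript segment."""
    for chunk in audio_chunks:
        segment = speech_to_text(chunk)
        if not segment.strip():
            continue  # nothing was said in this chunk, so skip it
        summary = summarize(build_summary_prompt(segment))
        image_prompt = build_image_prompt(summary)
        image = text_to_image(image_prompt)
        display(image)
        yield Frame(segment, summary, image_prompt, image)
```

Because `run_pipeline` is a generator, it produces images for as long as `audio_chunks` keeps yielding data, which mirrors the "continues as the live audio stream is received" behavior. Real model clients would be supplied in place of the callables.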

Potential Applications:

  • Real-time image generation for live events, lectures, or conversations
  • Accessibility tools for the hearing impaired
  • Educational tools for visual learners

Problems Solved:

  • Providing visual representation of audio content in real-time
  • Enhancing user experience by combining audio and visual elements

Benefits:

  • Improved understanding and retention of audio content
  • Enhanced accessibility for a wider range of users
  • Efficient generation of visual content from audio sources

Commercial Applications:

Title: Real-time Image Generation System for Audio Transcription

This technology could be used in live events, online education platforms, video conferencing tools, and accessibility software.

Questions about the technology:

  1. How does the system ensure accuracy in generating images based on audio content?
  2. What are the potential limitations of using AI models for live image generation?


Original Abstract Submitted

Systems and methods for using an artificial intelligence (AI) model for providing live image generation based on audio transcription. An image generation system and method convert a live audio stream, such as a conversation, speech, lecture, etc., into a live text transcript using speech-to-text conversion. A segment of the live text transcript is extracted and included in a first language model (LM) prompt. The first LM prompt includes a request for summarization of the transcript segment. The first LM prompt is provided to a large language model (LLM), and a summarization is received in response. A second LM prompt is generated including the summarization and a request for an image of the summarization. The second LM prompt is provided to a text-to-image model, and an image is received in response. The image is displayed on a display screen. Images continue to be generated and displayed as the live audio stream is received.
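
The abstract says a segment of the live transcript is extracted but does not say how the segment boundaries are chosen. As one possible illustration, the sketch below buffers incoming transcript text and releases a segment once a fixed number of words has accumulated. The class name `TranscriptBuffer`, the `segment_words` parameter, and the fixed word-count rule are assumptions made for this example, not details from the application.

```python
from typing import List, Optional


class TranscriptBuffer:
    """Accumulates live transcript text and releases fixed-size segments.

    Hypothetical helper: the word-count threshold is an assumed rule for
    deciding when enough text exists to summarize.
    """

    def __init__(self, segment_words: int = 40) -> None:
        if segment_words <= 0:
            raise ValueError("segment_words must be positive")
        self.segment_words = segment_words
        self._words: List[str] = []

    def append(self, text: str) -> None:
        # Add newly transcribed text, split on whitespace.
        self._words.extend(text.split())

    def pop_segment(self) -> Optional[str]:
        # Return the oldest full segment, or None if too few words so far.
        if len(self._words) < self.segment_words:
            return None
        segment = self._words[: self.segment_words]
        del self._words[: self.segment_words]
        return " ".join(segment)
```

A caller would append each new piece of speech-to-text output and, whenever `pop_segment()` returns a string, place that string in the first LM prompt. Other boundary rules, such as a time window or a pause in speech, would fit the same interface.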