Reducing Latency by Processing Parts of a Language Model Query in Parallel

Organization Name

Microsoft Technology Licensing, LLC

Inventor(s)

Sayan Dev Pathak of Kirkland WA US

Osama Abuelsorour of Menlo Park CA US

Christopher Hakan Basoglu of Everett WA US

Harini Kesavamoorthy of Bellevue WA US

Girish Milind Mahajan of Redmond WA US

Salman Mohammad Quazi of Mountain View CA US

Valeriy Viktorovich Kirshin of Kirkland WA US


This abstract first appeared for US patent application 18385408, titled 'Reducing Latency by Processing Parts of a Language Model Query in Parallel'.

Original Abstract Submitted

A technique partitions a user's original query into plural smaller component queries, each of which has a common part and an instance-specific part. The technique distributes the component queries to plural processor instances of a processor. The plural processor instances transform the respective component queries into query-component responses by acting in parallel, independent of each other. The technique generates a final response based on the query-component responses, e.g., by assembling the component-query responses into the final response. The technique reduces latency because the processor instances work on parts of the user's original query at the same time, rather than as a single stream of consecutive tokens. The plural processor instances have access to a shared cache memory, and utilize relevant data that has been computed in response to previous queries.
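The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: `run_model`, `SharedPrefixCache`, `partition`, and `answer_in_parallel` are hypothetical names, the model call is a placeholder string function, and a thread pool stands in for the plural processor instances. The sketch only shows the shape of the idea: component queries share a common part, differ in an instance-specific part, run concurrently, reuse cached work on the common part, and are assembled in order.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def run_model(prefix_state: str, specific_part: str) -> str:
    """Hypothetical stand-in for one processor instance of a language model."""
    return f"[{prefix_state}] response to: {specific_part}"


class SharedPrefixCache:
    """Illustrative shared cache: work done on a common part is computed once."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self._lock = Lock()
        self.misses = 0  # number of times the common part had to be computed

    def get_or_compute(self, common_part: str) -> str:
        with self._lock:
            if common_part not in self._store:
                self.misses += 1
                # Placeholder for the expensive processing of the common part.
                self._store[common_part] = f"state:{len(common_part)}"
            return self._store[common_part]


def partition(common_part: str, items: list[str]) -> list[tuple[str, str]]:
    """Build component queries: (common part, instance-specific part)."""
    return [(common_part, item) for item in items]


def answer_in_parallel(common_part: str, items: list[str],
                       cache: SharedPrefixCache, max_workers: int = 4) -> str:
    components = partition(common_part, items)

    def process(component: tuple[str, str]) -> str:
        common, specific = component
        state = cache.get_or_compute(common)  # reuse previously computed data
        return run_model(state, specific)

    # Each component query is handled independently; map() keeps input order,
    # so the responses can be assembled directly into the final response.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        responses = list(pool.map(process, components))
    return "\n".join(responses)


if __name__ == "__main__":
    cache = SharedPrefixCache()
    print(answer_in_parallel("Summarize each review in one sentence.",
                             ["Review A", "Review B", "Review C"], cache))
```

In this sketch the latency benefit comes from the component queries running at the same time rather than one after another, and the cache ensures the common part is processed only once no matter how many instances need it.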


