Patent Application 18297288 - Mitigating Speech Collision by Predicting - Rejection
Title: Mitigating Speech Collision by Predicting Speaking Intent for Participants
Application Information
- Invention Title: Mitigating Speech Collision by Predicting Speaking Intent for Participants
- Application Number: 18297288
- Submission Date: 2025-05-21T00:00:00.000Z
- Effective Filing Date: 2023-04-07T00:00:00.000Z
- Filing Date: 2023-04-07T00:00:00.000Z
- National Class: 704
- National Sub-Class: 231000
- Examiner Employee Number: 90155
- Art Unit: 2655
- Tech Center: 2600
Rejection Summary
- 102 Rejections: 2
- 103 Rejections: 0
Cited Patents
The following patents were cited in the rejection:
Office Action Text
DETAILED ACTION
Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
2. Claim 13 is objected to because of the following informalities: grammatical error. In line 2, Claim 13 is directed to “makinga”. This should be amended to “making a”. Appropriate correction is required.
Claim Rejections - 35 USC § 101
3. 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
4. Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.
Claim 1 recites “1. A computer-implemented method, comprising: obtaining, by a participant computing device comprising one or more processor devices, sensor data from one or more sensors of the participant computing device, wherein the participant computing device and one or more other participant computing devices are connected to a teleconference orchestrated by a teleconference computing system; based at least in part on the sensor data, determining, by the participant computing device, that a participant associated with the participant computing device intends to speak to other participants of the teleconference; and providing, by the participant computing device, information indicating that the participant intends to speak to one or more of: the teleconference computing system; or at least one of the one or more other participant computing devices.” The limitations recited in Claim 1, as drafted, cover a mental process. More specifically, the underlying abstract idea revolves around what happens once a human sends a message to a host of a teleconference and lets the host know that one participant wants to talk. The human could listen to the participant and/or see the participant’s mouth moving, determine that the participant wants to talk, and then send a message to a host of a teleconference to let the host know that one participant wants to talk.
Claim 10 recites “10. A participant computing device, comprising: one or more processors; one or more sensors; one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the participant computing device to perform operations, the operations comprising: connecting to a teleconference orchestrated by a teleconference computing system, wherein the participant computing device is associated with a participant of the teleconference; receiving information indicating that a second participant of the teleconference intends to speak, wherein the second participant is associated with a second participant computing device that is connected to the teleconference, wherein the information indicating that the participant of the teleconference intends to speak is determined based at least in part on sensor data captured at second participant computing device; and responsive to the information indicating that the second participant intends to speak, performing one or more actions to indicate, to the participant associated with the participant computing device, that some other participant of the teleconference intends to speak.” The limitations recited in Claim 10, as drafted, cover a mental process. More specifically, the underlying abstract idea revolves around what happens once a human sends a message to a host of a teleconference and lets the host know that one remote participant wants to talk. The human could listen to the remote participant and/or see the remote participant’s mouth moving, determine that the remote participant wants to talk, and then send a message to a host of a teleconference to let the host know that one remote participant wants to talk.
Claim 17 recites “17. One or more non-transitory computer-readable media that store instructions that, when executed by one or more processors of a teleconference computing system, cause the teleconference computing system to perform operations, the operations comprising: receiving, from a participant computing device, speaking intent information from a participant computing device of a plurality of participant computing devices connected to a teleconference orchestrated by the teleconference computing system, wherein the speaking intent information indicates that a participant associated with the participant computing device intends to speak; making an evaluation of one or more indication criteria based on the speaking intent information; and based on the evaluation, instructing a second participant computing device of the plurality of participant computing devices connected to the teleconference to perform one or more actions to indicate, to a second participant associated with the second participant computing device, that some other participant of the teleconference intends to speak.” The limitations recited in Claim 17, as drafted, cover a mental process. More specifically, the underlying abstract idea revolves around what happens once a human sends a message to a host of a teleconference and lets the host know that one remote participant wants to talk. The human could listen to the remote participant and/or see the remote participant’s mouth moving, determine that the remote participant wants to talk, and then send a message to a host of a teleconference to let the host know that one remote participant wants to talk.
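For reference, the flow recited in the independent claims (obtain sensor data, determine speaking intent, provide an indication) can be illustrated with a minimal sketch. The sketch below is hypothetical: the names, thresholds, and heuristic are illustrative assumptions only and are not part of the claims, the specification, or the cited references.

```python
# Hypothetical sketch of the recited participant-device flow:
# obtain sensor data, determine speaking intent, provide an indication.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class SensorFrame:
    mouth_open: bool      # e.g., derived from a camera frame
    audio_energy: float   # e.g., derived from a microphone sample


def determine_speaking_intent(frames: Iterable[SensorFrame]) -> bool:
    """Illustrative heuristic: repeated mouth movement plus nonzero audio
    energy is treated as an intent to speak."""
    frames = list(frames)
    if not frames:
        return False
    mouth_movements = sum(1 for f in frames if f.mouth_open)
    avg_energy = sum(f.audio_energy for f in frames) / len(frames)
    return mouth_movements >= 3 and avg_energy > 0.2


def provide_speaking_intent(send: Callable[[dict], None]) -> None:
    """Provide information indicating that the participant intends to speak
    to the teleconference system and/or other participant devices."""
    send({"event": "speaking_intent", "participant": "local"})


# Usage sketch (the transport `send` callable is assumed to exist):
# if determine_speaking_intent(captured_frames):
#     provide_speaking_intent(session.send)
```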
The judicial exception is not integrated into a practical application. In particular, the claims recite the additional limitations of a computing device, a teleconference computing system, a processor, a sensor, and non-transitory computer-readable media. The additional element(s) or combination of elements such as a computing device, a teleconference computing system, a processor, a sensor, and non-transitory computer-readable media in the claim(s) other than the abstract idea per se amount(s) to no more than (i) mere instructions to implement the idea on a computer, and/or (ii) recitation of generic computer structure that serves to perform generic computer functions that are well-understood, routine, and conventional activities previously known to the pertinent industry. Viewed as a whole, these additional claim element(s) do not provide meaningful limitation(s) to transform the abstract idea into a patent eligible application of the abstract idea such that the claim(s) amounts to significantly more than the abstract idea itself. Therefore, the claim(s) are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter. There is further no improvement to the computing device other than letting one participant in the conference know that another participant wants to talk. The mere recitation of a computing device, a teleconference computing system, a processor, a sensor, non-transitory computer-readable media and/or the like is akin to adding the words “apply it” and/or “use it” with a computer in conjunction with the abstract idea. Paragraph [0004] discloses “[0004] One example aspect of the present disclosure is directed to a participant computing device. The participant computing device includes one or more processors, one or more sensors, and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the participant computing device to perform operations. The operations include obtaining sensor data from the one or more sensors of the participant computing device, wherein the participant computing device and one or more other participant computing devices are connected to a teleconference orchestrated by a teleconference computing system. The operations include, based at least in part on the sensor data, determining that a participant associated with the participant computing device intends to speak to other participants of the teleconference. The operations include providing information indicating that the participant intends to speak to one or more of the teleconference computing system or at least one of the one or more other participant computing devices.” As filed in the specification, the computer is listed as a general-purpose computer and is mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The dependent claims further do not remedy the issues noted above. More specifically, Claim 2 recites a list of elements. As stated previously, the mere recitation of a computing device, a teleconference computing system, a processor, a sensor, non-transitory computer-readable media and/or the like is akin to adding the words “apply it” and/or “use it” with a computer in conjunction with the abstract idea. There is no additional limitation presented. Claim 3 recites processing the sensor data to determine whether the participant intends to speak. This reads on a human who could look at the camera or listen to the participant via a loudspeaker to determine whether the participant intends to speak. The machine learning model is generic in nature and merely stands in for the human mind in an otherwise mental process. There is no additional limitation presented. Claim 4 recites using the user’s gesture to determine whether the participant intends to speak. This reads on a human who could look at the camera, see that the participant’s mouth is moving, and determine that the participant intends to speak. There is no additional limitation presented. Claim 5 recites determining whether the participant intends to speak based on obtaining that the conversation has ended and the participant’s gesture. This reads on a human who observes that the conversation has ended and that the participant’s mouth is moving, and so concludes that the participant intends to speak. There is no additional limitation presented. Claim 6 recites receiving the information of the remote participant and determining whether the participant intends to speak. This reads on a human who could see the remote participant (i.e., via a camera) or listen to the remote participant (i.e., via a loudspeaker) and determine that the remote participant intends to speak. There is no additional limitation presented. Claims 7-9 recite how to alert that the participant intends to speak. Sending a haptic signal or giving somebody a call to alert them to something is a mental process. There is no additional limitation presented. Claims 11-13 recite similar features as Claims 7-9, respectively. Claim 14 recites features similar to Claim 1. Claim 15 recites similar features as Claim 2. Claim 16 recites similar features as Claim 4. Claim 18 recites indication criteria in determining whether the participant intends to speak. Using past actions or a degree of certainty is a mental process. There is no additional limitation presented. Claim 19 recites the degree of speaking priority. Comparing the degree of speaking priority is a mental process. There is no additional limitation presented. Claim 20 recites comparing priority metrics between the participant devices. Comparing priority metrics between the participant devices is a mental process. There is no additional limitation presented. For at least the reasons provided supra, claims 1-20 are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter.
Claim Rejections - 35 USC § 112
5. The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
6. Claims 10-16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
Claim 10 recites “the participant” in the limitation of “receiving information indicating that a second participant of the teleconference intends to speak, wherein the second participant is associated with a second participant computing device that is connected to the teleconference, wherein the information indicating that the participant of the teleconference intends to speak is determined based at least in part on sensor data captured at second participant computing device; and”. Examiner notes that “the participant” in this limitation has to be “the second participant” because the wherein clause as recited, “wherein the information indicating that the participant of the teleconference intends to speak is determined based at least in part on sensor data captured at second participant computing device;”, is a further limitation of “receiving information indicating that a second participant of the teleconference intends to speak, wherein the second participant is associated with a second participant computing device that is connected to the teleconference,”. The previous limitation recites receiving information indicating that a second participant intends to speak, not a participant. Claims 11-16 depend directly or indirectly on Claim 10. Thus, Claims 11-16 are rejected on the same ground as Claim 10.
Claim Rejections - 35 USC § 102
7. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.
8. Claims 1-2 are rejected under 35 U.S.C. 102(a) (1) as being anticipated by Desai et al. (US 2022/0398064 A1, hereinafter Desai et al. “8064”.)
With respect to Claim 1, Desai et al. “8064” disclose A computer-implemented method, comprising: obtaining, by a participant computing device comprising one or more processor devices, sensor data from one or more sensors of the participant computing device, wherein the participant computing device and one or more other participant computing devices are connected to a teleconference orchestrated by a teleconference computing system (Desai et al. “8064” Fig. 2, [0032] communication device 100 can be used as a first, local electronic device for a local participant in a communication session, [0010] The electronic device includes an image capturing device, at least one user interface device, including at least one microphone, and a controller. The controller is communicatively coupled to the image capturing device and to the at least one user interface device. During the communication session with one or more second electronic devices, the controller monitors a status of the at least one microphone during the communication session. While the microphone is muted, the controller monitors an image stream received from the image capturing device for movements by the local participant in the communication session.
The controller autonomously generates a prompt to unmute the microphone in response to determining that the microphone is muted while identifying at least one of a speaking movement of a mouth of the local participant or a gesture by the local participant associated with unmuting the microphone); based at least in part on the sensor data, determining, by the participant computing device, that a participant associated with the participant computing device intends to speak to other participants of the teleconference (Desai et al. “8064” [0011] According to aspects of the present disclosure, an electronic device, locally managed by a controller, automatically intervenes on behalf of a local participant in a communication session that is using the electronic device. The communication session can be audio only or can include video. In particular, the electronic device can determine, based on visually monitoring movements of the local participant during the communication session, that the local participant is attempting to speak to other remote participants who are using respective second communication devices, [0013] the electronic device improves a video communication session by automatically detecting situations in which the local participant(s) present certain gestures or movements of the mouth that visually indicate an attempt by that participant to speak. See paragraph [0019].); and providing, by the participant computing device, information indicating that the participant intends to speak (Desai et al. “8064” [0045] Automated alert of a detected attempt by non-presenting participant 201 to speak can be provided to other participants via chat box 325 as chat entry 327) to one or more of: the teleconference computing system; or at least one of the one or more other participant computing devices (Desai et al. “8064” [0014] Having a visually triggered indication, a participant can more intuitively speak or gesture to trigger an automatic alert to the presenting participant. One or more types of alerts can be triggered, [0050] method 400 further includes transmitting a raised hand indication to at least one second electronic device to alert the associated second participant that the local participant is desirous of speaking or has started speaking (block 427).) With respect to Claim 2, Desai et al. “8064” disclose wherein the one or more sensors of the participant computing device comprise at least one of: a camera; a microphone (Desai et al. “8064” Fig. 1 element 104 Microphone) a button; a touch surface; an Inertial Measurement Unit (IMU); a gyroscope; or an accelerometer. 9. Claims 10-13, 17 are rejected under 35 U.S.C. 102(a) (1) as being anticipated by Desai et al. (US 2022/0400022 A1, hereinafter Desai et al. “0022”). With respect to Claim 10, Desai et al. “0022” disclose A participant computing device, comprising: one or more processors (Desai et al. “0022” [0023] processor, [0020] It is appreciated that the second electronic device can be similarly configured and/or provide similar functionality as communication device 100); one or more sensors (Desai et al. “0022” Fig. 1 Image capturing device(s) 102, Microphone 104); one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the participant computing device to perform operations (Desai et al. 
“0022” paragraphs [0029-0030], the operations comprising: connecting to a teleconference orchestrated by a teleconference computing system, wherein the participant computing device is associated with a participant of the teleconference (Desai et al. “0022”Fig. 2A); receiving information indicating that a second participant of the teleconference intends to speak, wherein the second participant is associated with a second participant computing device that is connected to the teleconference (Desai et al. “0022” [0011] A controller configures the electronic device to receive and monitor one or more image streams originating respectively at the one or more second electronic devices during a video communication session. The controller identifies, in a particular image stream of the one or more image streams, at least one of a speaking movement of a mouth of a particular participant or a gesture by the particular participant to present an audio input via the video communication session, [0043] the electronic device that receives and monitors an image stream from communication device 100a can be network server 210 (FIG. 2A), communication device 100b, communication device 100c, or another electronic device that autonomously operates in support of the communication session), wherein the information indicating that the participant of the teleconference intends to speak is determined based at least in part on sensor data captured at second participant computing device (Desai et al. “0022” [0020] controller 101 monitors, during a communication session with one or more second communication devices 107a-107b, an image stream received from one or more second communication devices 107a107b for specific movements and/or gestures by a remote participant in the communication session using image recognition engine 109. See paragraph [0036]); and responsive to the information indicating that the second participant intends to speak, performing one or more actions to indicate, to the participant associated with the participant computing device, that some other participant of the teleconference intends to speak (Desai et al. “0022” [0020] In response to controller 101 identifying at least one of a speaking movement of a mouth of the local participant or a gesture by the local participant that is associated with the remote participant attempting to speak, while the remote participant's device is muted, controller 101 autonomously generates and presents an alert on user interface 108 that the remote participant is attempting to speak.) With respect to Claim 11, Desai et al. “0022” disclose wherein performing the one or more actions comprises causing playback of audio with an audio output device associated with the participant computing device, wherein the audio indicates to the participant that some other participant intends to speak (Desai et al. “0022” [0049] In one or more embodiments, method 400 includes automatically triggering, on the user interface, a hand raised status for the particular participant as the alert that the participant is attempting/desiring to speak (block 424). In one or more embodiments, method 400 includes presenting one or more of an audible output, a visual output, and a haptic output via the at least one user interface device as a notification to the presenting participant (block 426).) With respect to Claim 12, Desai et al. 
“0022” disclose wherein performing the one or more actions comprises generating a haptic feedback signal for one or more haptic feedback devices associated with the participant computing device, wherein the haptic feedback signal indicates that some other participant intends to speak (Desai et al. “0022” [0049] In one or more embodiments, method 400 includes automatically triggering, on the user interface, a hand raised status for the particular participant as the alert that the participant is attempting/desiring to speak (block 424). In one or more embodiments, method 400 includes presenting one or more of an audible output, a visual output, and a haptic output via the at least one user interface device as a notification to the presenting participant (block 426).) With respect to Claim 13, Desai et al. “0022” disclose wherein performing the one or more actions comprises makinga modification to an interface of an application that facilitates participation in the teleconference, wherein the interface of the application is displayed within a display device associated with the participant computing device, and wherein the modification indicates that some other participant intends to speak (Desai et al. “0022” [0049] In one or more embodiments, method 400 includes automatically triggering, on the user interface, a hand raised status for the particular participant as the alert that the participant is attempting/desiring to speak (block 424). In one or more embodiments, method 400 includes presenting one or more of an audible output, a visual output, and a haptic output via the at least one user interface device as a notification to the presenting participant (block 426).) With respect to Claim 17, Desai et al. “0022” disclose One or more non-transitory computer-readable media that store instructions that, when executed by one or more processors of a teleconference computing system, cause the teleconference computing system to perform operations (Desai et al. “0022” paragraph [0029], Fig. 2A), the operations comprising: receiving, from a participant computing device, speaking intent information from a participant computing device of a plurality of participant computing devices connected to a teleconference orchestrated by the teleconference computing system, wherein the speaking intent information indicates that a participant associated with the participant computing device intends to speak (Desai et al. “0022” [0011] A controller configures the electronic device to receive and monitor one or more image streams originating respectively at the one or more second electronic devices during a video communication session. The controller identifies, in a particular image stream of the one or more image streams, at least one of a speaking movement of a mouth of a particular participant or a gesture by the particular participant to present an audio input via the video communication session. See paragraph [0043]); making an evaluation of one or more indication criteria based on the speaking intent information (Desai et al. “0022” [0036] FIG. 2B depicts a diagram of example pre-defined movement data 122 stored in device memory 106 of presenter communication device 100b (FIG. 2A). In an example, pre-defined movement data 122 can include mouth movement data 251. Mouth movement data 251 can include first and second mouth closed recognition images 253a-253b that depicts a closed mouth that can be compared to an image live stream. 
Pre-defined movement data 122 can include first and second mouth open recognition images 255a-255b that depicts an opened mouth that can be compared to the image live stream. Mouth movement sequence 257 defines a minimum number of opening and closing of the mouth during a defined time period that indicates speaking. In one or more embodiments, audio confirmation pattern 259, if available, can be matched to the mouth movement sequence 257 to confirm an attempt to speak by non-presenting participant 201 (FIG. 2A). See paragraph [0040]); and based on the evaluation, instructing a second participant computing device of the plurality of participant computing devices connected to the teleconference to perform one or more actions to indicate, to a second participant associated with the second participant computing device, that some other participant of the teleconference intends to speak (Desai et al. “0022” [0011] The controller identifies, in a particular image stream of the one or more image streams, at least one of a speaking movement of a mouth of a particular participant or a gesture by the particular participant to present an audio input via the video communication session. A user interface presented on at least one user interface device of the electronic device can present an alert that the particular participant is attempting to speak to other participants in the video communication session.) Claim Rejections - 35 USC § 103 10. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. 11. Claims 3-4 are rejected under 35 U.S.C.103 as being unpatentable over Desai et al. (US 2022/0398064 A1) in view of Lenke et al. (US 2020/0110572 A.) With respect to Claim 3, Desai et al. disclose all the limitations of Claim 2 upon which Claim 3 depends. Desai et al. “8064” fail to explicitly teach wherein determining that the participant associated with the participant computing device intends to speak comprises: processing, by the participant computing device, the sensor data with a machine-learned speaking intent model to obtain a speaking intent output indicating that the participant associated with the participant computing device intends to speak. However, Lenke et al. teach wherein determining that the participant associated with the participant computing device intends to speak comprises: processing, by the participant computing device, the sensor data with a machine-learned speaking intent model to obtain a speaking intent output indicating that the participant associated with the participant computing device intends to speak (Lenke et al. [0035] The system can determine a direction or an orientation of the user's face (whether it is towards a computer screen or facing a door behind them, or to the side), which can be data included in the algorithm or model to determine whether the speech is intended for the conference. 
The speech may be at a certain volume, or particular words might be used that relate to the conference or not, and thus the content of the speech may be used to determine user intent to be part of the conference. The user intent can be inferred by the system based on the audio/visual/other data received and evaluated. The system can use machine learning to recognize that when a user is staring out the window and talking, they often do that as part of the conference session 250. Based on such a determination, and by the system distinguishing between talking to the conference and background noise or side speech, the component 220 can automatically unmute 224 the device 202, such that the speech provided by the user 208 will be heard by other users in the communication session 250, or mute the device 202. See paragraphs [0034] and [0037].) Desai et al. “8064” and Lenke et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of determining whether a local participant in a teleconference wants to speak as taught by Desai et al. “8064”, using the teaching of machine learning as taught by Lenke et al. for the benefit of determining whether the participant in the teleconference intends to speak to other participants (Lenke et al. [0035] The system can determine a direction or an orientation of the user's face (whether it is towards a computer screen or facing a door behind them, or to the side), which can be data included in the algorithm or model to determine whether the speech is intended for the conference. The speech may be at a certain volume, or particular words might be used that relate to the conference or not, and thus the content of the speech may be used to determine user intent to be part of the conference. The user intent can be inferred by the system based on the audio/visual/other data received and evaluated. The system can use machine learning to recognize that when a user is staring out the window and talking, they often do that as part of the conference session 250. Based on such a determination, and by the system distinguishing between talking to the conference and background noise or side speech, the component 220 can automatically unmute 224 the device 202, such that the speech provided by the user 208 will be heard by other users in the communication session 250, or mute the device 202.) With respect to Claim 4, Desai et al. “8064” in view of Lenke et al. teach wherein processing the sensor data with the machine-learned speaking intent model comprises processing, by the participant computing device, the sensor data with the machine-learned speaking intent model to obtain a speaking intent output indicating performance of a pre-configured speaking intent gesture by the participant (Lenke et al. [0035] The system can determine a direction or an orientation of the user's face (whether it is towards a computer screen or facing a door behind them, or to the side), which can be data included in the algorithm or model to determine whether the speech is intended for the conference. The speech may be at a certain volume, or particular words might be used that relate to the conference or not, and thus the content of the speech may be used to determine user intent to be part of the conference. The user intent can be inferred by the system based on the audio/visual/other data received and evaluated. The system can use machine learning to recognize that when a user is staring out the window and talking, they often do that as part of the conference session 250. Based on such a determination, and by the system distinguishing between talking to the conference and background noise or side speech, the component 220 can automatically unmute 224 the device 202, such that the speech provided by the user 208 will be heard by other users in the communication session 250, or mute the device 202. See paragraph [0036].)
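As context for the machine-learned speaking intent model discussed for Claims 3-4, the following is a minimal, hypothetical sketch of a classifier over simple sensor-derived features. The feature names and weights are illustrative assumptions; this is not the claimed model or the implementation of Lenke et al.

```python
# Hypothetical speaking-intent classifier over sensor-derived features.
# Feature names and weights are illustrative; a real model would be trained.
import math
from dataclasses import dataclass


@dataclass
class IntentFeatures:
    facing_screen: float   # 0.0-1.0, e.g., from head-pose estimation
    speech_volume: float   # normalized microphone level
    topic_overlap: float   # fraction of spoken words matching the agenda


WEIGHTS = {"facing_screen": 2.0, "speech_volume": 1.5, "topic_overlap": 2.5}
BIAS = -3.0


def speaking_intent_score(f: IntentFeatures) -> float:
    """Logistic score in [0, 1]; higher means the utterance is more likely
    directed at the conference than side speech or background noise."""
    z = (WEIGHTS["facing_screen"] * f.facing_screen
         + WEIGHTS["speech_volume"] * f.speech_volume
         + WEIGHTS["topic_overlap"] * f.topic_overlap
         + BIAS)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    sample = IntentFeatures(facing_screen=0.9, speech_volume=0.7, topic_overlap=0.6)
    print("directed at the conference:", speaking_intent_score(sample) > 0.5)
```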
12. Claims 6-9 are rejected under 35 U.S.C.103 as being unpatentable over Desai et al. (US 2022/0398064 A1) in view of Desai et al. (US 2022/0400022 A1, hereinafter Desai et al. “0022”.) With respect to Claim 6, Desai et al. “8064” fail to explicitly teach wherein the method further comprises: receiving, by the participant computing device, information indicating that a second participant associated with one of the one or more other participant computing devices intends to speak; and responsive to the information indicating that the second participant intends to speak, performing, by the participant computing device, one or more actions to indicate, to the participant associated with the participant computing device, that some other participant of the teleconference intends to speak. However, Desai et al. “0022” teach wherein the method further comprises: receiving, by the participant computing device, information indicating that a second participant associated with one of the one or more other participant computing devices intends to speak (Desai et al. “0022” [0011] The electronic device includes a network interface device that enables the electronic device to communicatively communicate in a video communication session with at least one second electronic device. A controller configures the electronic device to receive and monitor one or more image streams originating respectively at the one or more second electronic devices during a video communication session. The controller identifies, in a particular image stream of the one or more image streams, at least one of a speaking movement of a mouth of a particular participant or a gesture by the particular participant to present an audio input via the video communication session, [0020] controller 101 monitors, during a communication session with one or more second communication devices 107a-107b, an image stream received from one or more second communication devices 107a/107b for specific movements and/or gestures by a remote participant in the communication session using image recognition engine 109, Claim 1); and responsive to the information indicating that the second participant intends to speak, performing, by the participant computing device, one or more actions to indicate, to the participant associated with the participant computing device, that some other participant of the teleconference intends to speak (Desai et al. “0022” [0011] A user interface presented on at least one user interface device of the electronic device can present an alert that the particular participant is attempting to speak to other participants in the video communication session, Claim 1.) Desai et al. “8064” and Desai et al. “0022” are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of determining whether a local participant in a teleconference desires to speak as taught by Desai et al. “8064”, using the teaching of monitoring the hand and mouth of a remote participant in the teleconference as taught by Desai et al. “0022” for the benefit of determining whether a remote participant in the teleconference desires to speak (Desai et al. “0022” [0011] The electronic device includes a network interface device that enables the electronic device to communicatively communicate in a video communication session with at least one second electronic device. A controller configures the electronic device to receive and monitor one or more image streams originating respectively at the one or more second electronic devices during a video communication session. The controller identifies, in a particular image stream of the one or more image streams, at least one of a speaking movement of a mouth of a particular participant or a gesture by the particular participant to present an audio input via the video communication session. A user interface presented on at least one user interface device of the electronic device can present an alert that the particular participant is attempting to speak to other participants in the video communication session.) With respect to Claim 7, Desai et al. “8064” in view of Desai et al. “0022” teach wherein performing the one or more actions comprises causing, by the participant computing device, playback of audio with an audio output device associated with the participant computing device, wherein the audio indicates to the participant that some other participant intends to speak (Desai et al. “0022” [0049] In one or more embodiments, method 400 includes presenting one or more of an audible output, a visual output, and a haptic output via the at least one user interface device as a notification to the presenting participant (block 426).) With respect to Claim 8, Desai et al. “8064” in view of Desai et al. “0022” teach wherein performing the one or more actions comprises generating, by the participant computing device, a haptic feedback signal for one or more haptic feedback devices associated with the participant computing device, wherein the haptic feedback signal indicates that some other participant intends to speak (Desai et al. “0022” [0049] In one or more embodiments, method 400 includes presenting one or more of an audible output, a visual output, and a haptic output via the at least one user interface device as a notification to the presenting participant (block 426).) With respect to Claim 9, Desai et al. “8064” in view of Desai et al. “0022” teach wherein performing the one or more actions comprises making, by the participant computing device, a modification to an interface of an application that facilitates participation in the teleconference, wherein the interface of the application is displayed within a display device associated with the participant computing device, and wherein the modification indicates that some other participant intends to speak (Desai et al. “0022” [0011] The electronic device includes a network interface device that enables the electronic device to communicatively communicate in a video communication session with at least one second electronic device.
A controller configures the electronic device to receive and monitor one or more image streams originating respectively at the one or more second electronic devices during a video communication session. The controller identifies, in a particular image stream of the one or more image streams, at least one of a speaking movement of a mouth of a particular participant or a gesture by the particular participant to present an audio input via the video communication session. A user interface presented on at least one user interface device of the electronic device can present an alert that the particular participant is attempting to speak to other participants in the video communication session.) 13. Claims 14-15 are rejected under 35 U.S.C.103 as being unpatentable over Desai et al. (US 2022/0400022 A1, hereinafter Desai et al. “0022”) in view of Desai et al. (US 2022/0398064 A1.) With respect to Claim 14, Desai et al. “0022” fail to explicitly teach wherein the operations further comprise: obtaining sensor data from the one or more sensors of the participant computing device; based at least in part on the sensor data, determining that the participant associated with the participant computing device intends to speak to the other participants of the teleconference; and providing information indicating that the participant intends to speak to one or more of: the teleconference computing system; or at least one of the one or more other participant computing devices. However, Desai et al. “8064” teach wherein the operations further comprise: obtaining sensor data from the one or more sensors of the participant computing device (Desai et al. “8064” Fig. 2, [0032] communication device 100 can be used as a first, local electronic device for a local participant in a communication session, [0010] The electronic device includes an image capturing device, at least one user interface device, including at least one microphone, and a controller. The controller is communicatively coupled to the image capturing device and to the at least one user interface device. During the communication session with one or more second electronic devices, the controller monitors a status of the at least one microphone during the communication session. While the microphone is muted, the controller monitors an image stream received from the image capturing device for movements by the local participant in the communication session. The controller autonomously generates a prompt to unmute the microphone in response to determining that the microphone is muted while identifying at least one of a speaking movement of a mouth of the local participant or a gesture by the local participant associated with unmuting the microphone); based at least in part on the sensor data, determining that the participant associated with the participant computing device intends to speak to the other participants of the teleconference (Desai et al. “8064” [0011] According to aspects of the present disclosure, an electronic device, locally managed by a controller, automatically intervenes on behalf of a local participant in a communication session that is using the electronic device. The communication session can be audio only or can include video. 
In particular, the electronic device can determine, based on visually monitoring movements of the local participant during the communication session, that the local participant is attempting to speak to other remote participants who are using respective second communication devices, [0013] the electronic device improves a video communication session by automatically detecting situations in which the local participant(s) present certain gestures or movements of the mouth that visually indicate an attempt by that participant to speak. See paragraph [0019]); and providing information indicating that the participant intends to speak (Desai et al. “8064” [0045] Automated alert of a detected attempt by non-presenting participant 201 to speak can be provided to other participants via chat box 325 as chat entry 327) to one or more of: the teleconference computing system; or at least one of the one or more other participant computing devices (Desai et al. “8064” [0014] Having a visually triggered indication, a participant can more intuitively speak or gesture to trigger an automatic alert to the presenting participant. One or more types of alerts can be triggered, [0050] method 400 further includes transmitting a raised hand indication to at least one second electronic device to alert the associated second participant that the local participant is desirous of speaking or has started speaking (block 427).) Desai et al. “0022” and Desai et al. “8064” are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of determining whether a remote participant in a teleconference desires to speak as taught by Desai et al. “0022”, using the teaching of monitoring the hand and mouth of a local participant in the teleconference as taught by Desai et al. “8064” for the benefit of determining whether a local participant in the teleconference desires to speak (Desai et al. “8064” [0014] Having a visually triggered indication, a participant can more intuitively speak or gesture to trigger an automatic alert to the presenting participant. One or more types of alerts can be triggered, [0050] method 400 further includes transmitting a raised hand indication to at least one second electronic device to alert the associated second participant that the local participant is desirous of speaking or has started speaking (block 427).) With respect to Claim 15, Desai et al. “0022” in view of Desai et al. “8064” teach wherein the one or more sensors of the participant computing device comprise at least one of: a camera (Desai et al. “8064” Fig. 1 Image capturing device(s)); a microphone (Desai et al. “8064” Microphone); a button; a touch surface; an Inertial Measurement Unit (IMU); a gyroscope; or an accelerometer.
14. Claim 16 is rejected under 35 U.S.C.103 as being unpatentable over Desai et al. (US 2022/0400022 A1, hereinafter Desai et al. “0022”) in view of Desai et al. (US 2022/0398064 A1, hereinafter “8064”) and Lenke et al. (US 2020/0110572 A1.) With respect to Claim 16, Desai et al. “0022” in view of Desai et al. “8064” teach all the limitations of Claim 15 upon which Claim 16 depends. Desai et al. “0022” in view of Desai et al. “8064” fail to explicitly teach wherein determining that the participant associated with the participant computing device intends to speak comprises processing the sensor data with the machine-learned speaking intent model to obtain a speaking intent output indicating performance of a pre-configured speaking intent gesture by the participant. However, Lenke et al. teach wherein determining that the participant associated with the participant computing device intends to speak comprises processing the sensor data with the machine-learned speaking intent model to obtain a speaking intent output indicating performance of a pre-configured speaking intent gesture by the participant (Lenke et al. [0035] The system can determine a direction or an orientation of the user's face (whether it is towards a computer screen or facing a door behind them, or to the side), which can be data included in the algorithm or model to determine whether the speech is intended for the conference. The speech may be at a certain volume, or particular words might be used that relate to the conference or not, and thus the content of the speech may be used to determine user intent to be part of the conference. The user intent can be inferred by the system based on the audio/visual/other data received and evaluated. The system can use machine learning to recognize that when a user is staring out the window and talking, they often do that as part of the conference session 250. Based on such a determination, and by the system distinguishing between talking to the conference and background noise or side speech, the component 220 can automatically unmute 224 the device 202, such that the speech provided by the user 208 will be heard by other users in the communication session 250, or mute the device 202. See paragraphs [0034], [0036] and [0037].) Desai et al. “0022”, Desai et al. “8064” and Lenke et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of determining whether a remote participant in a teleconference desires to speak as taught by Desai et al. “0022”, using the teaching of monitoring the hand and mouth of a local participant in the teleconference as taught by Desai et al. “8064” for the benefit of determining whether a local participant in the teleconference desires to speak, using the teaching of machine learning as taught by Lenke et al. for the benefit of determining whether the participant in the teleconference intends to speak to other participants (Lenke et al. [0035] The system can determine a direction or an orientation of the user's face (whether it is towards a computer screen or facing a door behind them, or to the side), which can be data included in the algorithm or model to determine whether the speech is intended for the conference. The speech may be at a certain volume, or particular words might be used that relate to the conference or not, and thus the content of the speech may be used to determine user intent to be part of the conference. The user intent can be inferred by the system based on the audio/visual/other data received and evaluated. The system can use machine learning to recognize that when a user is staring out the window and talking, they often do that as part of the conference session 250.
Based on such a determination, and by the system distinguishing between talking to the conference and background noise or side speech, the component 220 can automatically unmute 224 the device 202, such that the speech provided by the user 208 will be heard by other users in the communication session 250, or mute the device 202.)
15. Claims 18-19 are rejected under 35 U.S.C.103 as being unpatentable over Desai et al. (US 2022/0400022 A1, hereinafter Desai et al. “0022”) in view of Deng et al. (US 2023/0208898 A1.) With respect to Claim 18, Desai et al. “0022” teach all the limitations of Claim 17 upon which Claim 18 depends. Desai et al. “0022” fail to explicitly teach wherein the one or more indication criteria comprise at least one of: a number of times that a speaking intent has been previously indicated for the participant associated with the participant computing device; a degree of certainty associated with the speaking intent information; or a connection quality associated with a connection of the participant computing device to the teleconference; or a number of other participant computing devices of the plurality of participant computing devices that have also provided speaking intent information to the teleconference computing system. However, Deng et al. teach wherein the one or more indication criteria comprise at least one of: a number of times that a speaking intent has been previously indicated for the participant associated with the participant computing device; a degree of certainty associated with the speaking intent information (Deng et al. [0005] The user interface format can also include an ordered queue based on a time stamp and a system priority recommendation based on respective a participant's familiarity to the topic, historical or potential impact on the effectiveness of a meeting having a particular topic, or individual participation score. The time stamp can be based on the timing of any type of input indicating an intent to speak, e.g., a user input on an input device or a gesture captured by video camera or an audio device. The input can include, but is not limited to, video data showing a person raising a hand raise, video data showing a movement indicating a person's intent to speak, audio data containing spoken words or a vocal request to speak, etc.); or a connection quality associated with a connection of the participant computing device to the teleconference; or a number of other participant computing devices of the plurality of participant computing devices that have also provided speaking intent information to the teleconference computing system. Desai et al. “0022” and Deng et al. are analogous art because they are from a similar field of endeavor in the Signal Processing techniques and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of determining whether a remote participant in a teleconference desires to speak as taught by Desai et al. “0022”, using the teaching of the confidence level as taught by Deng et al. for the benefit of determining whether the participant intends to speak (Deng et al. [0005] The user interface format can also include an ordered queue based on a time stamp and a system priority recommendation based on respective a participant's familiarity to the topic, historical or potential impact on the effectiveness of a meeting having a particular topic, or individual participation score. The time stamp can be based on the timing of any type of input indicating an intent to speak, e.g., a user input on an input device or a gesture captured by video camera or an audio device. The input can include, but is not limited to, video data showing a person raising a hand raise, video data showing a movement indicating a person's intent to speak, audio data containing spoken words or a vocal request to speak, etc.)
With respect to Claim 19, Desai et al. “0022” teach all the limitations of Claim 17 upon which Claim 19 depends. Desai et al. “0022” fail to explicitly teach wherein receiving the speaking intent information from the participant computing device further comprises receiving additional speaking intent information from a third participant computing device of the plurality of participant computing devices, wherein the speaking intent information indicates that a third participant associated with the third participant computing device intends to speak; and wherein the one or more indication criteria comprises a priority criteria indicative of a degree of speaking priority for the participant computing device and the third participant computing device. However, Deng et al. teach wherein receiving the speaking intent information from the participant computing device further comprises receiving additional speaking intent information from a third participant computing device of the plurality of participant computing devices, wherein the speaking intent information indicates that a third participant associated with the third participant computing device intends to speak (Deng et al. [0005] In some configurations a system can generate a user interface format that includes a time ordered queue with identifications of system recommendations of a priority for each recommended speaker. The user interface format can also include an ordered queue based on a time stamp and a system priority recommendation based on respective a participant's familiarity to the topic, historical or potential impact on the effectiveness of a meeting having a particular topic, or individual participation score. The time stamp can be based on the timing of any type of input indicating an intent to speak, e.g., a user input on an input device or a gesture captured by video camera or an audio device. The input can include, but is not limited to, video data showing a person raising a hand raise, video data showing a movement indicating a person's intent to speak, audio data containing spoken words or a vocal request to speak, etc.); and wherein the one or more indication criteria comprises a priority criteria indicative of a degree of speaking priority for the participant computing device and the third participant computing device (Deng et al. [0086] The top recommendation 203D can include the top system recommended speaker. It can be one of the three candidates in the time ordered queue, or a candidate who raised his/her hand later than the first three candidates, or someone who didn't raise his/her hand. The top recommendation speaker is chosen from the waiting queue with the highest recommendation score. The icon for the top recommendation can also be labeled in a similar fashion explained before. With a up arrow from bottom to top of the icon to illustrate recommendation strength is the highest.)
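As context for the indication-criteria and priority-metric limitations of Claims 17-20, the following is a minimal, hypothetical sketch of evaluating competing speaking-intent requests and selecting the highest-priority device. The criteria names and weighting are illustrative assumptions, not the claimed evaluation or the recommendation scoring of Deng et al.

```python
# Hypothetical evaluation of indication criteria and priority metrics for
# competing speaking-intent requests; names and weighting are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class IntentRequest:
    device_id: str
    timestamp: float         # when the intent was signaled (seconds)
    certainty: float         # degree of certainty of the intent signal, 0-1
    prior_indications: int   # times intent was previously indicated


def priority_metric(req: IntentRequest, now: float) -> float:
    """Illustrative priority: more certain, longer-waiting, and less
    frequently indicated requests score higher."""
    waiting_bonus = min(now - req.timestamp, 60.0) / 60.0
    repeat_penalty = 0.1 * req.prior_indications
    return req.certainty + waiting_bonus - repeat_penalty


def select_device_to_indicate(requests: List[IntentRequest], now: float) -> str:
    """Select the participant device whose speaking intent should be
    indicated to the other participant devices."""
    return max(requests, key=lambda r: priority_metric(r, now)).device_id


if __name__ == "__main__":
    requests = [
        IntentRequest("device-A", timestamp=10.0, certainty=0.6, prior_indications=2),
        IntentRequest("device-B", timestamp=25.0, certainty=0.9, prior_indications=0),
    ]
    print(select_device_to_indicate(requests, now=40.0))  # prints "device-B"
```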
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of determining whether a remote participant in a teleconference desires to speak, as taught by Desai et al. “0022”, using the teaching of the confidence level as taught by Deng et al., for the benefit of determining the priority for each recommended speaker (Deng et al. [0005] In some configurations a system can generate a user interface format that includes a time ordered queue with identifications of system recommendations of a priority for each recommended speaker. The user interface format can also include an ordered queue based on a time stamp and a system priority recommendation based on respective a participant's familiarity to the topic, historical or potential impact on the effectiveness of a meeting having a particular topic, or individual participation score. The time stamp can be based on the timing of any type of input indicating an intent to speak, e.g., a user input on an input device or a gesture captured by video camera or an audio device. The input can include, but is not limited to, video data showing a person raising a hand raise, video data showing a movement indicating a person's intent to speak, audio data containing spoken words or a vocal request to speak, etc.)

With respect to Claim 20, Desai et al. “0022” in view of Deng et al. teach wherein making the evaluation of one or more indication criteria based on the speaking intent information comprises: determining a priority metric for the participant computing device based on the evaluation of the one or more indication criteria for the participant computing device (Deng et al. [0086] The top recommendation 203D can include the top system recommended speaker. It can be one of the three candidates in the time ordered queue, or a candidate who raised his/her hand later than the first three candidates, or someone who didn't raise his/her hand. The top recommendation speaker is chosen from the waiting queue with the highest recommendation score. The icon for the top recommendation can also be labeled in a similar fashion explained before. With a up arrow from bottom to top of the icon to illustrate recommendation strength is the highest. See paragraphs [0085 and 0092]); determining a priority metric for the third participant computing device based on an evaluation of the one or more indication criteria for the third participant computing device (Deng et al. [0086] The top recommendation 203D can include the top system recommended speaker. It can be one of the three candidates in the time ordered queue, or a candidate who raised his/her hand later than the first three candidates, or someone who didn't raise his/her hand. The top recommendation speaker is chosen from the waiting queue with the highest recommendation score. The icon for the top recommendation can also be labeled in a similar fashion explained before. With a up arrow from bottom to top of the icon to illustrate recommendation strength is the highest. See paragraphs [0085 and 0092]); and based on the priority metric for the participant computing device and the priority metric for the third participant computing device, selecting the participant computing device for indication of speaking intent (Deng et al. [0036] A dynamic meeting moderation system uses these determined values and recommendations to assist a meeting organizer, which can be an active speaker, in moderating the meeting.
This enables a system to dynamically encourage remote participation. For example, even if a moderator can't see every aspect of remote user, the techniques disclosed herein can raise awareness of remote speaker's intention to speak, e.g., flash the intended speaker's image, increase his/her volume settings automatically to an appropriate level etc. The system can also send a notification to the meeting organizer to actively moderate the conversation flow to allow remote speaker to chime in. The system can also provide dynamic interactive enhancement. For example, the system can present an ordered list of names to assist the meeting organizer to find the right online speaker. In large online meetings, when a question is raised to the audience, the system can display an ordered list of names, e.g., a queue, to the people who asked the question. The person who raised the question can use the list to seek answers, [0086] The top recommendation 203D can include the top system recommended speaker. It can be one of the three candidates in the time ordered queue, or a candidate who raised his/her hand later than the first three candidates, or someone who didn't raise his/her hand. The top recommendation speaker is chosen from the waiting queue with the highest recommendation score. The icon for the top recommendation can also be labeled in a similar fashion explained before. With a up arrow from bottom to top of the icon to illustrate recommendation strength is the highest. See paragraphs [0085 and 0092].)
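For illustration only, the priority-metric evaluation and speaker selection described in the Claim 19 and Claim 20 mappings above can be read as follows: each participant computing device that has provided speaking intent information is scored against the recited indication criteria (for example, a degree of certainty, a count of prior intent indications, and connection quality), and the device with the highest metric is selected for indication of speaking intent. The short Python sketch below is a hypothetical reading of that logic; it is not taken from the claims, Desai et al. “0022”, or Deng et al., and every name, weight, and value in it is an assumption made only to keep the example concrete.

# Hypothetical sketch only -- not drawn from the claims, Desai et al. "0022", or Deng et al.
# It illustrates scoring each device that has provided speaking intent information against
# several indication criteria and selecting the highest-scoring device.
from dataclasses import dataclass

@dataclass
class IntentSignal:
    device_id: str
    confidence: float          # degree of certainty associated with the speaking intent (0..1)
    prior_indications: int     # number of times speaking intent was previously indicated
    connection_quality: float  # quality of the device's connection to the teleconference (0..1)

def priority_metric(signal):
    """Combine the indication criteria into one priority metric (weights are assumptions)."""
    return (0.6 * signal.confidence
            + 0.3 * signal.connection_quality
            - 0.1 * min(signal.prior_indications, 5) / 5)

def select_device(signals):
    """Return the device whose speaking intent should be indicated, or None if no signals."""
    if not signals:
        return None
    return max(signals, key=priority_metric).device_id

# Example: speaking intent information received from two participant computing devices.
queue = [
    IntentSignal("device-A", confidence=0.9, prior_indications=1, connection_quality=0.8),
    IntentSignal("device-B", confidence=0.7, prior_indications=0, connection_quality=0.95),
]
print(select_device(queue))  # "device-A" under the assumed weights (0.76 vs. 0.705)

Neither the claims nor the cited Deng et al. paragraphs specify how the criteria are combined; the linear weighting above is purely an assumption used to make the selection step concrete.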
Allowable Subject Matter

16. Claim 5 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and if the 101 abstract-idea rejection is overcome. Claim 5 stands rejected under 35 U.S.C. 101 (abstract idea), and for the application to pass to allowance this rejection needs to be overcome. Any amendment made to overcome the 101 rejection that results in a change in scope will require further search and/or consideration in order to determine its allowability. The following is a statement of reasons for the indication of allowable subject matter: the prior art fails to teach the following elements in combination with the other recited elements in the claim. “receiving, by the participant computing device, audio data comprising audio captured by at least one of the one or more other participant computing devices; and processing, by the participant computing device, the audio data with a machine-learned speech recognition model to obtain a speech recognition output indicating whether a conversation between participants has ended; and wherein determining that the participant associated with the participant computing device intends to speak comprises determining, by the participant computing device, that the participant associated with the participant computing device intends to speak based on the speech recognition output and the speaking intent output.” as recited in Claim 5.

Conclusion

17. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892.
a. Jorasch et al. (US 2021/0399911 A1). In this reference, Jorasch et al. disclose a method and a system for meeting management.
b. Sanaullah et al. (US 2015/0085064 A1). In this reference, Sanaullah et al. disclose a method and a system for managing teleconference participant mute state.
c. Kanevsky et al. (US 2008/0167868 A1). In this reference, Kanevsky et al. disclose a method and a system for controlling microphones for speech recognition applications.

18. Any inquiry concerning this communication or earlier communications from the examiner should be directed to THUYKHANH LE, whose telephone number is (571) 272-6429. The examiner can normally be reached Mon-Fri, 9am-5pm. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew C. Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/THUYKHANH LE/
Primary Examiner, Art Unit 2655