
Patent Application 18304064 - Joint Segmenting and Automatic Speech Recognition - Rejection

From WikiPatents

Application Information

  • Invention Title: Joint Segmenting and Automatic Speech Recognition
  • Application Number: 18304064
  • Submission Date: 2025-05-22
  • Effective Filing Date: 2023-04-20
  • Filing Date: 2023-04-20
  • National Class: 704
  • National Sub-Class: 240000
  • Examiner Employee Number: 99867
  • Art Unit: 2659
  • Tech Center: 2600

Rejection Summary

  • 102 Rejections: 1
  • 103 Rejections: 4

Cited Patents

The following patents were cited in the rejection:

Office Action Text


    DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
	The information disclosure statement (IDS) submitted on November 28, 2023, is being considered by the Examiner.

Drawings
1. The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because:
Reference character “324” in Fig. 3 has been used to designate both the output of “Head Average” module 322 and the output of “Projection” layer 326. The Examiner notes that in paragraph 0040 of the Specification, the output of “Projection” layer 326 is referred to with reference character “328”.  
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA  as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to final Office action, see 37 CFR 1.113(c). A request for reconsideration while not provided for in 37 CFR 1.113(c) may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

2. Claims 1, 7-12, and 18-22 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 17, 19-23, 31, and 33-37 of copending Application No. 18/512,110 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the co-pending application are narrower in scope than those of the instant application.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Instant Application 18/304,064
Claim 1: 

A joint segmenting and automated speech recognition (ASR) model comprising: 

an encoder configured to: receive, as input, a sequence of acoustic frames characterizing one or more utterances; and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and 


a decoder configured to: receive, as input, the higher order feature representation generated by the encoder at each of the plurality of output steps; and generate, at each of the plurality of output steps: a probability distribution over possible speech recognition hypotheses; and an indication of whether the corresponding output step corresponds to an end of speech segment, 

wherein the joint segmenting and ASR model is trained on a set of training samples, each training sample in the set of training samples comprising: 

audio data characterizing a spoken utterance; and a corresponding transcription of the spoken utterance, the corresponding transcription having an end of speech segment ground truth token inserted into the corresponding transcription automatically based on a set of heuristic-based rules and exceptions applied to the training sample.

Claim 7:
The joint segmenting and ASR model of claim 1, wherein the end of speech segment ground truth token is inserted into the corresponding transcription automatically without any human annotation.

Claim 8:
The joint segmenting and ASR model of claim 1, wherein the set of heuristic-based rules and exceptions applied to each training sample in the set of training samples comprises: inserting the ground truth end of speech segment token at the end of the corresponding transcription; and inserting the ground truth end of speech segment token into the corresponding transcription at a location aligned with a non-speech segment of the audio data having a duration that satisfies a threshold duration unless: the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a lengthened word; or the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a filler word.

Claim 9:
The joint segmenting and ASR model of claim 8, wherein the word in the spoken utterance is identified as the lengthened word when a phoneme duration of the word satisfy a standard deviation threshold.

Claim 10: 
The joint segmenting and ASR model of claim 8, wherein, after training the joint segmenting and ASR model, the decoder is configured to emit the indication that the corresponding output step corresponds to the end of speech segment sooner than identifying a number of consecutive non-speech acoustic frames in the sequence of acoustic frames that satisfy the threshold duration.

Claim 11:
The joint segmenting and ASR model of claim 1, wherein the joint segmenting and ASR model is trained to maximize a probability of emitting the end of speech segment ground truth label.

Claim 12:
A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:

receiving a sequence of acoustic frames characterizing one or more utterances; and at each of a plurality of output steps:

generating, by an encoder of a joint segmenting and automated speech recognition (ASR) model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and 

generating, by a decoder of the joint segmenting and ASR model: a probability distribution over possible speech recognition hypotheses; and an indication of whether the corresponding output step corresponds to an end of speech segment, 

wherein the joint segmenting and ASR model is trained on a set of training samples, each training sample in the set of training samples comprising: 


audio data characterizing a spoken utterance; and a corresponding transcription of the spoken utterance, the corresponding transcription having an end of speech segment ground truth token inserted into the corresponding transcription automatically based on a set of heuristic-based rules and exceptions applied to the training sample.

Claim 18:
The computer-implemented method of claim 12, wherein the end of speech segment ground truth token is inserted into the corresponding transcription automatically without any human annotation. 

Claim 19:
The computer-implemented method of claim 12, wherein the set of heuristic-based rules and exceptions applied to each training sample in the set of training samples comprises: inserting the ground truth end of speech segment token at the end of the corresponding transcription; and inserting the ground truth end of speech segment token into the corresponding transcription at a location aligned with a non-speech segment of the audio data having a duration that satisfies a threshold duration unless: the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a lengthened word; or the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a filler word.

Claim 20:
The computer-implemented method of claim 19, wherein the word in the spoken utterance is identified as the lengthened word when a phoneme duration of the word satisfy a standard deviation threshold.

Claim 21:
The computer-implemented method of claim 19, wherein, after training the joint segmenting and ASR model, the operations further comprise emitting, by the decoder, the indication that the corresponding output step corresponds to the end of speech segment sooner than identifying a number of consecutive non-speech acoustic frames in the sequence of acoustic frames that satisfy the threshold duration.

Claim 22:
The computer-implemented method of claim 12, wherein the joint segmenting and ASR model is trained to maximize a probability of emitting the end of speech segment ground truth label.

Co-pending Application 18/512,110

Claim 1:

A unified end-to-end segmenter and two-pass automatic speech recognition (ASR) model comprising: 

a first encoder configured to: receive, as input, a sequence of acoustic frames, the sequence of acoustic frames characterizing an utterance; and generate, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames;

a first decoder configured to: receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps; generate, at each of the plurality of output steps: a first probability distribution over possible speech recognition hypotheses; and an indication of whether the corresponding output step corresponds to an end of speech segment;

and at a corresponding output step among the plurality of output steps that corresponds to the end of speech segment, emit an end of speech timestamp; a second encoder configured to: receive, as input: the first higher order feature representation generated by the first encoder at each of the plurality of output steps; and the end of speech timestamp emitted by the first decoder; and generate, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature representation; and a second decoder configured to: receive, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps; and generate, at each of the plurality of output steps, a second probability distribution over possible speech recognition hypotheses.

Claim 17: The model of claim 1, wherein the unified end-to-end segmenter and two-pass ASR model is trained on a set of training samples, each training sample in the set of training samples comprising:
audio data characterizing a spoken utterance; and a corresponding transcription of the spoken utterance, the corresponding transcription having an end of speech ground truth label inserted into the corresponding transcription automatically based on a set of heuristic-based rules and exceptions applied to the training sample.

Claim 19:
The model of claim 17, wherein the end of speech ground truth label is inserted into the corresponding transcription automatically without any human annotation.


Claim 20:
The model of claim 17, wherein the set of heuristic-based rules and exceptions applied to each training sample in the set of training samples comprise: inserting the end of speech ground truth label at the end of the corresponding transcription; and inserting the end of speech ground truth label into the corresponding transcription at a location aligned with a non-speech segment of the audio data having a duration that satisfies a threshold duration unless: the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a lengthened word; or the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a filler word.


Claim 21:
The model of claim 20, wherein the word in the spoken utterance is identified as the lengthened word when a phoneme duration of the word satisfies a standard deviation threshold.

Claim 22:
The model of claim 20, wherein, after training the unified end-to-end segmenter and two-pass ASR model, the first decoder is configured to emit the indication that the corresponding output step corresponds to the end of speech segment sooner than identifying a number of consecutive non-speech acoustic frames in the sequence of acoustic frames that satisfy the threshold duration.

Claim 23:
The model of claim 17, wherein the unified end-to-end segmenter and two-pass ASR model is trained to maximize a probability of emitting the end of speech ground truth label.

Claim 24:
A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:

receiving a sequence of acoustic frames characterizing an utterance; at each of a plurality of output steps: 



generating, by a first encoder of a unified end-to-end segmenter and two- pass ASR model, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; and 

generating, by a first decoder of the unified end-to-end segmenter and two-pass ASR model, based on the first higher order feature representation generated by the first encoder for the corresponding acoustic frame: a first probability distribution over possible speech recognition hypotheses; and an indication of whether the corresponding output step corresponds to an end of speech segment; 


and at a corresponding output step among the plurality of output steps that corresponds to the end of speech segment, emitting, by the first decoder, an end of speech timestamp; and at each of the plurality of output steps: generating, by a second encoder of the unified end-to-end segmenter and two-pass ASR model, based on the first higher order feature representations generated by the first encoder, a second higher order feature representation, wherein, at the corresponding output step among the plurality of output steps that corresponds to the end of speech segment, the second higher order feature representation generated by the second encoder is further based on the end of speech timestamp emitted by the first decoder; and generating, by a second decoder of the unified end-to-end segmenter and two-pass ASR model, based on the second higher order feature representation generated by the second encoder at the corresponding output step, a second probability distribution over possible speech recognition hypotheses.

Claim 31: The computer-implemented method of claim 24, wherein the unified end-to-end segmenter and two-pass ASR model is trained on a set of training samples, each training sample in the set of training samples comprising:

audio data characterizing a spoken utterance; and a corresponding transcription of the spoken utterance, the corresponding transcription having an end of speech ground truth label inserted into the corresponding transcription automatically based on a set of heuristic-based rules and exceptions applied to the training sample.

Claim 33:
The computer-implemented method of claim 31, wherein the end of speech ground truth label is inserted into the corresponding transcription automatically without any human annotation.

Claim 34:
The computer-implemented method of claim 31, wherein the set of heuristic-based rules and exceptions applied to each training sample in the set of training samples comprise: inserting the end of speech ground truth label at the end of the corresponding transcription; and inserting the end of speech ground truth label into the corresponding transcription at a location aligned with a non-speech segment of the audio data having a duration that satisfies a threshold duration unless: the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a lengthened word; or the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a filler word.


Claim 35:
The computer-implemented method of claim 34, wherein the word in the spoken utterance is identified as the lengthened word when a phoneme duration of the word satisfy a standard deviation threshold.

Claim 36:
The computer-implemented method of claim 34, wherein, after training the unified end-to-end segmenter and two-pass ASR model, generating the indication that the corresponding output step corresponds to the end of speech segment by the first decoder comprises emitting the indication that the corresponding output step corresponds to the end of speech segment sooner than identifying a number of consecutive non-speech acoustic frames in the sequence of acoustic frames that satisfy the threshold duration.

Claim 37:
The computer-implemented method of claim 31, wherein the unified end-to-end segmenter and two-pass ASR model is trained to maximize a probability of emitting the end of speech ground truth label.


Claim Objections
3. Claims 8-9 and 19-20 are objected to because of the following informalities: 
In claims 8 and 19, “the ground truth end of speech segment token” should instead be “the end of speech segment ground truth token” to match claim language introduced in independent claims 1 and 12 respectively.
In claims 9 and 20, “when a phoneme duration of the word satisfy a standard deviation threshold” should instead be “satisfies”.
Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


4. Claims 1-22 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

	Regarding claim 1, “A joint segmenting and automated speech recognition (ASR) model” is recited, which is directed to one of the four statutory categories of invention (machine). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes and mathematical concepts, which fall into the abstract idea category.
	The following limitations, under their broadest reasonable interpretation, recite mental processes and mathematical concepts:
receive, as input, a sequence of acoustic frames characterizing one or more utterances: a person listens to audio including utterances, and writes down a sequence of acoustic frames, each frame containing data reflecting characteristics of the utterances, using pen and paper.
generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames: a person takes the acoustic frames, and writes down a higher order feature of each acoustic frame in the sequence, using pen and paper.
receive, as input, the higher order feature representation generated…at each of the plurality of output steps: a person obtains the higher order feature for a plurality of output steps
generate, at each of the plurality of output steps: a probability distribution over possible speech recognition hypotheses; and an indication of whether the corresponding output step corresponds to an end of speech segment: a person can use each written higher order feature, and determine a probability for what speech was said and determine if end of speech was reached.
…trained on a set of training samples, each training sample in the set of training samples comprising: audio data characterizing a spoken utterance; and a corresponding transcription of the spoken utterance, the corresponding transcription having an end of speech segment ground truth token inserted into the corresponding transcription …based on a set of heuristic-based rules and exceptions applied to the training sample: a person uses training data to learn how to predict speech hypotheses and end of speech segments, and uses training data consisting of listening to audio data of an utterance, and a corresponding transcript where end of speech segments have been placed according to heuristic-based rules and exceptions.

Claim 1 does not contain any additional limitations which integrate the judicial exception into a practical application. The only additional limitations are “an encoder configured to”, “a decoder configured to”, “the joint segmenting and ASR model”, and “…inserted into the corresponding transcription automatically…”. These limitations are recited at a high level of generality, and amount to mere instructions to implement the judicial exception using a generic computer. Mere instructions to implement the judicial exception using a generic computer do not integrate the judicial exception into a practical application. Accordingly, the claim is directed to an abstract idea.
Claim 1 does not contain any additional limitations which amount to significantly more than the judicial exception. As discussed above, the additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Mere instructions to implement the judicial exception using a generic computer do not amount to significantly more than the judicial exception. Therefore, claim 1 is not patent eligible.

Regarding dependent claims 2-11, “The joint segmenting and ASR model” is recited, which is directed to one of the four statutory categories of invention (machine). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes, which fall into the abstract idea category.
	The following limitations, under their broadest reasonable interpretation, recite mental processes:

	Claim 2:
receive, as input, a sequence of non-blank symbols output by a final softmax layer: a person can obtain a sequence of non-blank symbols output from a softmax layer
generate a hidden representation: a person can use the sequence to encode the sequence into a hidden representation.
receive, as input, the hidden representation generated…at each of the plurality of output steps and the higher order feature representation generated…at each of the plurality of output steps: a person uses the hidden representation and the higher order features
generate, at each of the plurality of output steps, the indication of whether the corresponding output step corresponds to an end of speech segment: a person can use the hidden representation and higher order features to make a decision as to whether the particular output step corresponds to an end of speech segment
receive, as input, the hidden representation generated …at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps: a person uses the hidden representation and higher order features they obtained
generate, at each of the plurality of steps, the probability distribution over possible speech recognition hypotheses: a person determines a probability of different speech hypotheses for what they heard
Claim 2 contains the additional limitations: “a prediction network”, “a first joint network”, “by the prediction network”, “by the encoder”, “a second joint network”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 3: 
wherein, at each of the plurality of output steps: the sequence of previous non-blank symbols received as input…comprises a sequence of N previous non-blank symbols output by the final softmax layer: a person can obtain a sequence of non-blank symbols output from a softmax layer
generate the hidden representation by: for each non-blank symbol of the sequence of N previous non-blank symbols, generate a respective embedding; and generating an average embedding by averaging the respective embeddings, the average embedding comprising the hidden representation: a person can use the sequence to encode each symbol into a particular embedding vector, and obtain an average embedding by averaging the respective embeddings, such as by taking the mean of each dimension, to obtain a hidden representation.
Claim 3 contains the additional limitations: “receive at the prediction network” and “the prediction network is configured to”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.
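
For illustration only, the embedding-averaging operation paraphrased above can be sketched in a few lines of Python. The array shapes, vocabulary size, and variable names below are the editor's assumptions and are not taken from the application:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM, N = 32, 8, 4                         # hypothetical sizes
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))  # embedding look-up table

def hidden_representation(prev_nonblank_symbols):
    """Embed each of the N previous non-blank symbols and average the
    respective embeddings (per-dimension mean) to obtain the hidden
    representation."""
    embeddings = embedding_table[np.asarray(prev_nonblank_symbols)]  # (N, EMBED_DIM)
    return embeddings.mean(axis=0)                                   # (EMBED_DIM,)

# Example with N = 4 previous non-blank symbol ids.
print(hidden_representation([5, 17, 3, 9]).shape)  # (8,)
```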

	Claim 4:
Claim 4 contains the additional limitations: “wherein the prediction network comprises a V2 embedding look-up table”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 5:
wherein a training process trains…on the set of training samples by: a person uses training data
training, during a first stage, the second joint network to learn how to predict the corresponding transcription of the spoken utterance characterized by the audio data of each training sample: a person uses the training data to learn how to write a corresponding transcription for each sample
after training…during a second stage: initializing, … with the same parameters as the trained …; and using the end of speech segment ground truth token inserted into the corresponding transcription of the spoken utterance characterized by the audio data of each training sample: a person uses parameters learned in first stage, and uses end of speech segment ground truth tokens inserted into the transcription.
Claim 5 contains the additional limitations: “the joint segmenting and ASR model”, “the second joint network”, and “the first joint network”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 6: 
Claim 6 contains the additional limitations: “wherein the encoder comprises a causal encoder comprising a stack of conformer layers or transformer layers”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 7: 
wherein the end of speech segment ground truth token is inserted into the corresponding transcription: a person can place a ground truth token into a written transcription using pen and paper.
Claim 7 contains the additional limitations: “automatically without any human annotation”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 8: 
inserting the ground truth end of speech segment token at the end of the corresponding transcription: a person can place a token at the end of the written transcription using pen and paper.
inserting the ground truth end of speech segment token into the corresponding transcription at a location aligned with a non-speech segment of the audio data having a duration that satisfies a threshold duration: a person can insert further ground truth tokens into a transcript if at the corresponding time they hear non speech segment longer than a threshold.
Unless: the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a lengthened word; or the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a filler word: a person chooses not to place the token if the non-speech segment is preceded by a lengthened or filler word.
Claim 8 contains no additional limitations.

Claim 9:
wherein the word in the spoken utterance is identified as the lengthened word when a phoneme duration of the word satisfy a standard deviation threshold: a person determines a phoneme duration is longer than a standard deviation threshold, and determines that the word is lengthened.
Claim 9 contains no additional limitations.
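
For illustration only, the heuristic paraphrased in the claim 8 and claim 9 analyses above can be sketched as follows. The token string, data structure, threshold values, and function names are editorial assumptions and do not come from the application or the cited art:

```python
from dataclasses import dataclass

EOS_TOKEN = "</s>"  # hypothetical end of speech segment ground truth token

def is_lengthened(phoneme_durations, mean_dur, std_dur, k=2.0):
    # Claim 9 analogue: treat a word as lengthened when any phoneme duration
    # exceeds a standard-deviation threshold (k is an assumed multiplier).
    return any(d > mean_dur + k * std_dur for d in phoneme_durations)

@dataclass
class Word:
    text: str
    is_filler: bool        # e.g. "um", "uh" (hypothetical labeling)
    lengthened: bool       # e.g. from is_lengthened() above
    gap_after_sec: float   # duration of non-speech audio following the word

def insert_eos_tokens(words, gap_threshold_sec=0.5):
    """Insert the token at non-speech gaps meeting the threshold duration,
    except after lengthened or filler words, and always at the end of the
    transcription."""
    out = []
    for w in words:
        out.append(w.text)
        if w.gap_after_sec >= gap_threshold_sec and not (w.is_filler or w.lengthened):
            out.append(EOS_TOKEN)
    if not out or out[-1] != EOS_TOKEN:
        out.append(EOS_TOKEN)
    return " ".join(out)

# The long gap after the filler "um" does not trigger a token.
words = [Word("i", False, False, 0.1), Word("um", True, False, 0.9),
         Word("agree", False, False, 1.2), Word("thanks", False, False, 0.0)]
print(insert_eos_tokens(words))  # i um agree </s> thanks </s>
```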

Claim 10:
after training…emit the indication that the corresponding output step corresponds to the end of speech segment sooner than identifying a number of consecutive non-speech acoustic frames in the sequence of acoustic frames that satisfy the threshold duration: a person can indicate that a segment is end of speech sooner than identifying a consecutive number of non-speech frames satisfies a threshold duration.
Claim 10 contains the additional limitations: “the joint segmenting and ASR model” and “the decoder”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.
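
For contrast, a conventional endpointer of the kind the claim is compared against can be sketched as a counter over consecutive non-speech frames; a decoder trained to emit an end of speech token can produce its indication before such a counter reaches its threshold. The frame values and threshold below are assumed for illustration and are not from the application:

```python
def counter_endpointer(frame_is_speech, threshold_frames=30):
    """A conventional endpointer: fire only after `threshold_frames`
    consecutive non-speech frames (assumed value)."""
    run = 0
    for i, is_speech in enumerate(frame_is_speech):
        run = 0 if is_speech else run + 1
        if run >= threshold_frames:
            return i
    return None

# A decoder trained with end of speech tokens can emit its indication at or
# near the start of the non-speech region, i.e. before this counter fires.
frames = [True] * 50 + [False] * 40
print(counter_endpointer(frames))  # 79 (30th consecutive non-speech frame)
```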

Claim 11:
trained to maximize a probability of emitting the end of speech segment ground truth label: maximizing a probability amounts to a mathematical concept.
Claim 11 contains no additional limitations.
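
In practice, maximizing the probability of emitting a ground truth label is commonly implemented as minimizing a negative log-likelihood; the toy distribution below is an editorial illustration only and is not taken from the application:

```python
import numpy as np

# Hypothetical decoder output at the step whose ground truth is the end of
# speech token: a probability distribution over {"a", "b", "</s>", "<blank>"}.
probs = np.array([0.1, 0.2, 0.6, 0.1])
eos_index = 2

loss = -np.log(probs[eos_index])  # minimizing this maximizes P("</s>")
print(round(float(loss), 3))      # 0.511
```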

	Claims 2-11 do not contain any additional elements which integrate the judicial exception into a practical application. As discussed above, the additional limitations of claims 2-11 amount to mere instructions to implement the judicial exception using a generic computer. Mere instructions to implement the judicial exception using a generic computer do not integrate the judicial exception into a practical application. Accordingly, claims 2-11 are directed to an abstract idea.
	Claims 2-11 do not contain any additional elements which amount to significantly more than the judicial exception. As discussed above, the additional limitations of claims 2-11 amount to mere instructions to implement the judicial exception using a generic computer. Mere instructions to implement the judicial exception using a generic computer do not amount to significantly more than the judicial exception. Therefore, claims 2-11 are not patent eligible.

	Regarding claim 12, “A computer-implemented method” is recited, which is directed to one of the four statutory categories of invention (process). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes and mathematical concepts, which fall into the abstract idea category.
	The following limitations, under their broadest reasonable interpretation, recite mental processes and mathematical concepts:
receiving a sequence of acoustic frames characterizing one or more utterances: a person listens to audio including utterances, and writes down a sequence of acoustic frames, each frame containing data reflecting characteristics of the utterances, using pen and paper.
at each of a plurality of output steps: generating …a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames: a person takes the acoustic frames, and writes down a higher order feature of each acoustic frame in the sequence, using pen and paper.
generating…a probability distribution over possible speech recognition hypotheses; and an indication of whether the corresponding output step corresponds to an end of speech segment: a person can use each written higher order feature, and determine a probability for what speech was said and determine if end of speech was reached.
…trained on a set of training samples, each training sample in the set of training samples comprising: audio data characterizing a spoken utterance; and a corresponding transcription of the spoken utterance, the corresponding transcription having an end of speech segment ground truth token inserted into the corresponding transcription …based on a set of heuristic-based rules and exceptions applied to the training sample: a person uses training data to learn how to predict speech hypotheses and end of speech segments, and uses training data consisting of listening to audio data of an utterance, and a corresponding transcript where end of speech segments have been placed according to heuristic-based rules and exceptions.

Claim 12 does not contain any additional limitations which integrate the judicial exception into a practical application. The only additional limitations are “by an encoder of a joint segmenting and automated speech recognition (ASR) model”, “by a decoder of the joint segmenting and ASR model”, and “…inserted into the corresponding transcription automatically…”. These limitations are recited at a high level of generality, and amount to mere instructions to implement the judicial exception using a generic computer. Mere instructions to implement the judicial exception using a generic computer do not integrate the judicial exception into a practical application. Accordingly, the claim is directed to an abstract idea.
Claim 12 does not contain any additional limitations which amount to significantly more than the judicial exception. As discussed above, the additional limitations amount to mere instructions to implement the judicial exception using a generic computer. Mere instructions to implement the judicial exception using a generic computer do not amount to significantly more than the judicial exception. Therefore, claim 12 is not patent eligible.

Regarding dependent claims 13-22, “The computer-implemented method” is recited, which is directed to one of the four statutory categories of invention (process). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes, which fall into the abstract idea category.
	The following limitations, under their broadest reasonable interpretation, recite mental processes:

	Claim 13:
generating…a hidden representation based on a sequence of non-blank symbols output by a final softmax layer: a person can use a sequence of non-blank symbols output by a softmax layer to encode the sequence into a hidden representation.
generating the indication of whether the corresponding output step corresponds to an end of speech segment comprises generating, using … the indication of whether the corresponding output step corresponds to the end of speech segment based on the hidden representation generated … each of the plurality of output steps and the higher order feature representation generated … at each of the plurality of output steps: a person can use the hidden representation and higher order features to make a decision as to whether the particular output step corresponds to an end of speech segment
generating the probability distribution over possible speech recognition hypotheses comprises generating … the probability distribution over possible speech recognition hypotheses based on the hidden representation generated … at each of the plurality of output steps and the higher order feature representation generated …at each of the plurality of output steps: a person determines a probability of different speech hypotheses for what they heard using higher order features and hidden representations.
Claim 13 contains the additional limitations: “using a prediction network of the decoder”, “using a first joint network of the decoder”, “by the prediction network”, “by the encoder”, “using a second joint network of the decoder”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 14: 
the sequence of previous non-blank symbols received as input…comprises a sequence of N previous non-blank symbols output by the final softmax layer: a person can obtain a sequence of non-blank symbols output from a softmax layer
generating the hidden representation … comprises: for each non-blank symbol of the sequence of N previous non-blank symbols, generating a respective embedding; and generating an average embedding by averaging the respective embeddings, the average embedding comprising the hidden representation: a person can use the sequence to encode each symbol into a particular embedding vector, and obtain an average embedding by averaging the respective embeddings, such as by taking the mean of each dimension, to obtain a hidden representation.
Claim 14 contains the additional limitations: “received as input at the prediction network” and “using the prediction network”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

	Claim 15:
Claim 15 contains the additional limitations: “wherein the prediction network comprises a V2 embedding look-up table”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 16:
training, during a first stage, … to learn how to predict the corresponding transcription of the spoken utterance characterized by the audio data of each training sample: a person uses the training data to learn how to write a corresponding transcription for each sample
after training…during a second stage: initializing, … with the same parameters as the trained …; and using the end of speech segment ground truth token inserted into the corresponding transcription of the spoken utterance characterized by the audio data of each training sample: a person uses parameters learned in first stage, and uses end of speech segment ground truth tokens inserted into the transcription.
Claim 16 contains the additional limitations: “trains the joint segmenting and ASR model”, “the second joint network”, and “the first joint network”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 17: 
Claim 17 contains the additional limitations: “wherein the encoder comprises a causal encoder comprising a stack of conformer layers or transformer layers”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 18: 
wherein the end of speech segment ground truth token is inserted into the corresponding transcription: a person can place a ground truth token into a written transcription using pen and paper.
Claim 18 contains the additional limitations: “automatically without any human annotation”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 19:
inserting the ground truth end of speech segment token at the end of the corresponding transcription: a person can place a token at the end of the written transcription using pen and paper.
inserting the ground truth end of speech segment token into the corresponding transcription at a location aligned with a non-speech segment of the audio data having a duration that satisfies a threshold duration: a person can insert further ground truth tokens into a transcript if at the corresponding time they hear non speech segment longer than a threshold.
unless: the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a lengthened word; or the non-speech segment of the audio data follows a word in the spoken utterance that is identified as a filler word: a person chooses not to place the token if the non-speech segment is preceded by a lengthened or filler word.
Claim 19 contains no additional limitations.

Claim 20:
wherein the word in the spoken utterance is identified as the lengthened word when a phoneme duration of the word satisfy a standard deviation threshold: a person determines a phoneme duration is longer than a standard deviation threshold, and determines that the word is lengthened.
Claim 20 contains no additional limitations.

Claim 21:
after training…emitting… the indication that the corresponding output step corresponds to the end of speech segment sooner than identifying a number of consecutive non-speech acoustic frames in the sequence of acoustic frames that satisfy the threshold duration: a person can indicate that a segment is end of speech sooner than identifying a consecutive number of non-speech frames satisfies a threshold duration.
Claim 21 contains the additional limitations: “the joint segmenting and ASR model” and “by the decoder”. These limitations are recited broadly and amount to mere instructions to implement the judicial exception using a generic computer.

Claim 22:
trained to maximize a probability of emitting the end of speech segment ground truth label: maximizing a probability amounts to a mathematical concept.
Claim 22 contains no additional limitations.

	Claims 13-22 do not contain any additional elements which integrate the judicial exception into a practical application. As discussed above, the additional limitations of claims 13-22 amount to mere instructions to implement the judicial exception using a generic computer. Mere instructions to implement the judicial exception using a generic computer do not integrate the judicial exception into a practical application. Accordingly, claims 13-22 are directed to an abstract idea.
	Claims 13-22 do not contain any additional elements which amount to significantly more than the judicial exception. As discussed above, the additional limitations of claims 13-22 amount to mere instructions to implement the judicial exception using a generic computer. Mere instructions to implement the judicial exception using a generic computer do not amount to significantly more than the judicial exception. Therefore, claims 13-22 are not patent eligible.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


5. Claims 1, 7, 11-12, 18, and 22 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Chang et al. (PGPUB No. 2020/0335091, hereinafter Chang).

	Regarding claim 1, Chang discloses A joint segmenting and automated speech recognition (ASR) model (Fig. 1, 140; para. 0037 “FIG. 1 illustrates an example speech recognizer 100 that uses a joint ASR and endpointing model 140”) comprising: an encoder (Fig. 4, 410) configured to: receive, as input, a sequence of acoustic frames characterizing one or more utterances (para. 0040 “The feature extraction module 130 processes the audio data 125 by identifying audio features 135 representing the acoustic characteristics of the audio data 125. For example, the feature extraction module 130 produces audio feature vectors for different time windows of audio, often referred to as frames. The series of feature vectors can then serve as input to various models. The audio feature vectors contain information on the characteristics of the audio data 125, such as mel-frequency ceptral coefficients (MFCCs). The audio features may indicate any of various factors, such as the pitch, loudness, frequency, and energy of audio. The audio features 135 are provided as input the joint model 140”; para. 0059 “FIG. 4 illustrates the architecture for an RNN-T of the joint model 400. In the architecture, the encoder 410 is analogous to an acoustic model that receives acoustic feature vectors”); and generate, at each of a plurality of output steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames (para. 0059 “For each combination of acoustic frame input t and label u, the encoder 410 outputs h.sub.i”); and a decoder (Fig. 1, components 140 and 150; Fig. 4, components 420, 430, and 440) configured to: receive, as input, the higher order feature representation generated by the encoder at each of the plurality of output steps (para. 0059 “For each combination of acoustic frame input t and label u, the encoder 410 outputs h.sub.i and the prediction outputs p.sub.u are passed to a joint network 430”); and generate, at each of the plurality of output steps: a probability distribution over possible speech recognition hypotheses (para. 0059 “For each combination of acoustic frame input t and label u, the encoder 410 outputs h.sub.i and the prediction outputs p.sub.u are passed to a joint network 430 to compute output logits n fed in a soft max layer 440 which defines a probability distribution over the set of output targets. Hence, the RNN-T is often described as an end-to-end model because it can be configured to directly output graphemes directly without the aid of an additional external language model.”); and an indication of whether the corresponding output step corresponds to an end of speech segment (para. 0038 “The speech recognizer 100 may trigger endpoint detection 160 responsive to receiving an endpoint signal from either one of the joint model 140 or the EOQ endpointer 150, whichever occurs first. The endpoint signal corresponds to an endpoint indication output by the joint model 140 or the EOQ endpointer 150 that indicates an end of an utterance 120. In some examples, the endpoint indication (e.g., endpoint signal) may include an endpoint token 175 in the transcription 165 selected by the beam search 145.”),wherein the joint segmenting and ASR model is trained on a set of training samples (para. 0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236”), each training sample in the set of training samples comprising: audio data characterizing a spoken utterance (para. 
0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236 that each include a training utterance 220”); and a corresponding transcription of the spoken utterance (para. 0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236 that each include … a corresponding transcription 220 for the training utterance 220”), the corresponding transcription having an end of speech segment ground truth token inserted into the corresponding transcription automatically based on a set of heuristic-based rules and exceptions applied to the training sample (para. 0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236 that each include a training utterance 220, a corresponding transcription 220 for the training utterance 220, and a sequence of reference output labels 222 for the corresponding transcription 220. The training process 200 may generate the training data 235 by collecting acoustic data 210 from spoken utterances. … The training utterances 211 are anonymized and transcribed by a transcription process 215 to produce the corresponding text transcriptions 220. The transcription process 215 may be performed by a trained speech recognition system or manually by a human. … As a result, each training sample 236 can include audio data for the training utterance 211, the corresponding reference transcription 220, and a sequence of reference output labels 222.”; para. 0051 “The output label set 265 includes linguistic units, graphemes in the example, as well as an endpoint token </s> 275.”).
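
As an editorial aside to make the mapping above easier to follow: the cited passages describe a decoder whose output vocabulary includes an endpoint token alongside the linguistic units, so a single softmax yields both the speech recognition distribution and the end-of-segment indication. The vocabulary and logits in the sketch below are invented for illustration and are not taken from Chang or from the application:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical per-step output vocabulary: graphemes plus an endpoint token.
vocab = ["a", "b", "c", "<blank>", "</s>"]
logits = np.array([0.2, 1.5, 0.1, 0.3, 2.0])  # made-up logits for one output step

dist = softmax(logits)                 # probability distribution over hypotheses
eos_prob = dist[vocab.index("</s>")]   # end-of-segment indication from the same output
print(vocab[int(np.argmax(dist))], round(float(eos_prob), 3))  # </s> 0.475
```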

	Regarding claim 7, Chang discloses wherein the end of speech segment ground truth token is inserted into the corresponding transcription automatically without any human annotation (para. 0051 “The output label set 265 includes linguistic units, graphemes in the example, as well as an endpoint token </s> 275.”; para. 0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236 that each include a training utterance 220, a corresponding transcription 220 for the training utterance 220, and a sequence of reference output labels 222 for the corresponding transcription 220. The training process 200 may generate the training data 235 by collecting acoustic data 210 from spoken utterances. … The training utterances 211 are anonymized and transcribed by a transcription process 215 to produce the corresponding text transcriptions 220. The transcription process 215 may be performed by a trained speech recognition system or manually by a human. … As a result, each training sample 236 can include audio data for the training utterance 211, the corresponding reference transcription 220, and a sequence of reference output labels 222.”).

Regarding claim 11, Chang discloses wherein the joint segmenting and ASR model is trained to maximize the probability of emitting the end of speech segment ground truth label (para. 0051 “During the training process 200, a training module 270 adjusts parameters of the joint model 140. For instance, the training module 270 may feed feature vectors associated with one training utterance 211 at time as input to the joint model 140 and the joint model 140 may generate/predict, as output, different sets of output scores 260... The output label set 265 includes linguistic units, graphemes in the example, as well as an endpoint token </s> 275. The output scores 260 respectively represent the relatively likelihood of the corresponding symbol should be added to the decoded sequence representing the utterance...The training module 270 is configured to compare the predicted output labels 265 and associated output scores 260 with the reference output labels 222 for the corresponding reference transcription 211 and adjust parameters of the joint model 140, e.g., neural network weights, to improve the accuracy of the predictions.”).

Regarding claim 12, Chang discloses A computer-implemented method executed on data processing hardware (para. 0074, Fig. 7) that causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames characterizing one or more utterances (para. 0040 “The feature extraction module 130 processes the audio data 125 by identifying audio features 135 representing the acoustic characteristics of the audio data 125. For example, the feature extraction module 130 produces audio feature vectors for different time windows of audio, often referred to as frames. The series of feature vectors can then serve as input to various models. The audio feature vectors contain information on the characteristics of the audio data 125, such as mel-frequency ceptral coefficients (MFCCs). The audio features may indicate any of various factors, such as the pitch, loudness, frequency, and energy of audio. The audio features 135 are provided as input the joint model 140”; para. 0059 “FIG. 4 illustrates the architecture for an RNN-T of the joint model 400. In the architecture, the encoder 410 is analogous to an acoustic model that receives acoustic feature vectors”;); and at each of a plurality of output steps: generating, by an encoder of a joint segmenting and automated speech recognition (ASR) model (Fig. 4, 410), a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames (para. 0059 “For each combination of acoustic frame input t and label u, the encoder 410 outputs h.sub.i”); and generating, by a decoder of the joint segmenting and ASR model (Fig. 1, components 140 and 150; Fig. 4, components 420, 430, and 440): a probability distribution over possible speech recognition hypotheses (para. 0059 “For each combination of acoustic frame input t and label u, the encoder 410 outputs h.sub.i and the prediction outputs p.sub.u are passed to a joint network 430 to compute output logits n fed in a soft max layer 440 which defines a probability distribution over the set of output targets. Hence, the RNN-T is often described as an end-to-end model because it can be configured to directly output graphemes directly without the aid of an additional external language model.”); and an indication of whether the corresponding output step corresponds to an end of speech segment (para. 0038 “The speech recognizer 100 may trigger endpoint detection 160 responsive to receiving an endpoint signal from either one of the joint model 140 or the EOQ endpointer 150, whichever occurs first. The endpoint signal corresponds to an endpoint indication output by the joint model 140 or the EOQ endpointer 150 that indicates an end of an utterance 120. In some examples, the endpoint indication (e.g., endpoint signal) may include an endpoint token 175 in the transcription 165 selected by the beam search 145.”), wherein the joint segmenting and ASR model is trained on a set of training samples (para. 0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236”), each training sample in the set of training samples comprising: audio data characterizing a spoken utterance (para. 0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236 that each include a training utterance 220”); and a corresponding transcription of the spoken utterance (para. 
0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236 that each include … a corresponding transcription 220 for the training utterance 220”), the corresponding transcription having an end of speech segment ground truth token inserted into the corresponding transcription automatically based on a set of heuristic-based rules and exceptions applied to the training sample (para. 0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236 that each include a training utterance 220, a corresponding transcription 220 for the training utterance 220, and a sequence of reference output labels 222 for the corresponding transcription 220. The training process 200 may generate the training data 235 by collecting acoustic data 210 from spoken utterances. … The training utterances 211 are anonymized and transcribed by a transcription process 215 to produce the corresponding text transcriptions 220. The transcription process 215 may be performed by a trained speech recognition system or manually by a human. … As a result, each training sample 236 can include audio data for the training utterance 211, the corresponding reference transcription 220, and a sequence of reference output labels 222.”; para. 0051 “The output label set 265 includes linguistic units, graphemes in the example, as well as an endpoint token </s> 275.”).

Regarding claim 18, Chang discloses wherein the end of speech segment ground truth token is inserted into the corresponding transcription automatically without any human annotation (para. 0051 “The output label set 265 includes linguistic units, graphemes in the example, as well as an endpoint token </s> 275.”; para. 0050 “The training process 200 trains the joint ASR and endpointing model 140 on training data 235 including a plurality of training samples 236 that each include a training utterance 220, a corresponding transcription 220 for the training utterance 220, and a sequence of reference output labels 222 for the corresponding transcription 220. The training process 200 may generate the training data 235 by collecting acoustic data 210 from spoken utterances. … The training utterances 211 are anonymized and transcribed by a transcription process 215 to produce the corresponding text transcriptions 220. The transcription process 215 may be performed by a trained speech recognition system or manually by a human. … As a result, each training sample 236 can include audio data for the training utterance 211, the corresponding reference transcription 220, and a sequence of reference output labels 222.”).

Regarding claim 22, Chang discloses wherein the joint segmenting and ASR model is trained to maximize a probability of emitting the end of speech segment ground truth label (para. 0051 “During the training process 200, a training module 270 adjusts parameters of the joint model 140. For instance, the training module 270 may feed feature vectors associated with one training utterance 211 at time as input to the joint model 140 and the joint model 140 may generate/predict, as output, different sets of output scores 260... The output label set 265 includes linguistic units, graphemes in the example, as well as an endpoint token </s> 275. The output scores 260 respectively represent the relatively likelihood of the corresponding symbol should be added to the decoded sequence representing the utterance...The training module 270 is configured to compare the predicted output labels 265 and associated output scores 260 with the reference output labels 222 for the corresponding reference transcription 211 and adjust parameters of the joint model 140, e.g., neural network weights, to improve the accuracy of the predictions.”).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

6. Claims 2 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Chang in view of Li et al. (NPL Towards Fast and Accurate Streaming End-to-end ASR, hereinafter Li).

Regarding claim 2, Chang discloses wherein the decoder comprises: a prediction network (Fig. 4, 420) configured to, at each of the plurality of output steps: receive, as input, a sequence of non-blank symbols output by a final softmax layer (Fig. 4, prediction network “420” receives yu,i-1, derived from softmax layer “440” output; para. 0059 “the prediction network 420 acts as a language model that accepts the previous grapheme label prediction as input…the RNN-T is often described as an end-to-end model because it can be configured to directly output graphemes”; para. 0060 “The ground-truth label sequence of length U is denoted as y.sub.1, y.sub.2, . . . , y.sub.u where y.sub.u ∈ S (S is the set of grapheme symbols).”; para. 0061 “The probability of seeing some label in an alignment ŷ.sub.t conditioned on the acoustic features up to time t and the history of non-blank labels, y.sub.1, . . . y.sub.u(t−1), emitted so far.”); and generate a hidden representation (para. 0059 “the prediction network 420 acts as a language model that…computes an output vector p.sub.u”);…a second joint network (Fig. 4, joint network “430”) configured to: receive, as input, the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps (para. 0059 “the encoder 410 outputs h.sub.i and the prediction outputs p.sub.u are passed to a joint network 430…”); and generate, at each of the plurality of output steps, the probability distribution over possible speech recognition hypotheses (para. 0059 “For each combination of acoustic frame input t and label u, the encoder 410 outputs h.sub.i and the prediction outputs p.sub.u are passed to a joint network 430 to compute output logits n fed in a soft max layer 440 which defines a probability distribution over the set of output targets.”).
Chang does not specifically disclose a first joint network configured to: receive, as input, the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps; and generate, at each of the plurality of output steps, the indication of whether the corresponding output step corresponds to an end of speech segment.
Li teaches a first joint network (pg. 2, Fig. 1, “Joint Network”) configured to: receive, as input, the hidden representation generated by the prediction network at each of the plurality of output steps (pg. 2, Fig. 1, output of “Prediction Network” passed to “Joint Network”) and the higher order feature representation generated by the encoder at each of the plurality of output steps (pg. 2, Fig. 1, output of “RNN-T Encoder” passed to “Joint Network”; pg. 2 section 2. 1st para. “Let us denote the input acoustic frames as X…and T is the number of frames in X. Each acoustic frame Xt is first passed through the RNN-T encoder”); and generate, at each of the plurality of output steps, the indication of whether the corresponding output step corresponds to an end of speech segment (pg. 2, section 2, 1st para. “In this work, RNN-T is trained to directly predict word piece token sequence Y, where the last label Yu is the special token </s>”, section 2.1, 1st para. “…the endpointing decision is made jointly with the model rather than with a separate endpointer.”).
Chang and Li are considered to be analogous to the claimed invention as
they both are in the same field of automated speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang to incorporate the teachings of Li in order to include a first joint network which generates an indication of whether the corresponding output step corresponds to an end of speech segment. Doing so would be beneficial, as joint optimization of endpointing allows for useful information captured from the ASR model to be shared with the endpointer, leading to better endpointing decisions (pg. 1, section 1, 4th para. “Information captured by ASR models is not shared to the endpointer, which may be useful for making endpointing decision. It would be better to optimize the endpointer and ASR models together.”).
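As an illustration of the combination described above, the following hedged sketch (assuming a PyTorch-style model; the names joint_asr and joint_eos are hypothetical) shows the encoder output and prediction-network output feeding two joint networks, one producing the recognition distribution and one producing the end-of-segment indication:

```python
# Hedged sketch of two joint networks sharing one encoder/prediction network
# (hypothetical names; not the cited references' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadJoint(nn.Module):
    def __init__(self, hidden=320, vocab_size=75):
        super().__init__()
        self.joint_asr = nn.Linear(2 * hidden, vocab_size)  # "second" joint network: hypotheses
        self.joint_eos = nn.Linear(2 * hidden, 1)            # "first" joint network: endpointing

    def forward(self, h_t, p_u):
        # h_t: higher order feature from the encoder; p_u: prediction-network hidden state
        z = torch.cat([h_t, p_u], dim=-1)
        asr_log_probs = F.log_softmax(self.joint_asr(z), dim=-1)       # recognition hypotheses
        eos_indication = torch.sigmoid(self.joint_eos(z)).squeeze(-1)  # end-of-segment probability
        return asr_log_probs, eos_indication
```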

	Regarding claim 13, Chang discloses the operations further comprise, at each of the plurality of output steps, generating, using a prediction network of the decoder, a hidden representation based on a sequence of non-blank symbols output by a final softmax layer (Fig. 4, 420; Fig. 4, prediction network “420” receives yu,i-1, derived from softmax layer “440” output; para. 0059 “the prediction network 420 acts as a language model that accepts the previous grapheme label prediction as input…the RNN-T is often described as an end-to-end model because it can be configured to directly output graphemes”; para. 0060 “The ground-truth label sequence of length U is denoted as y.sub.1, y.sub.2, . . . , y.sub.u where y.sub.u ∈ S (S is the set of grapheme symbols).”; para. 0061 “The probability of seeing some label in an alignment ŷ.sub.t conditioned on the acoustic features up to time t and the history of non-blank labels, y.sub.1, . . . y.sub.u(t−1), emitted so far.”; para. 0059 “the prediction network 420 acts as a language model that…computes an output vector p.sub.u”); and generating the probability distribution over possible speech recognition hypotheses comprises generating, using a second joint network of the decoder (Fig. 4, joint network “430”), the probability distribution over possible speech recognition hypothesis based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps (para. 0059 “the encoder 410 outputs h.sub.i and the prediction outputs p.sub.u are passed to a joint network 430…”; para. 0059 “For each combination of acoustic frame input t and label u, the encoder 410 outputs h.sub.i and the prediction outputs p.sub.u are passed to a joint network 430 to compute output logits n fed in a soft max layer 440 which defines a probability distribution over the set of output targets.”).
	Chang does not specifically disclose generating the indication of whether the corresponding output step corresponds to the end of speech segment comprises generating, using a first joint network of the decoder, the indication of whether the corresponding output step corresponds to the end of speech segment based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps.
	Li teaches generating the indication of whether the corresponding output step corresponds to the end of speech segment comprises generating, using a first joint network of the decoder (pg. 2, Fig. 1, “Joint Network”), the indication of whether the corresponding output step corresponds to the end of speech segment based on the hidden representation generated by the prediction network at each of the plurality of output steps and the higher order feature representation generated by the encoder at each of the plurality of output steps (pg. 2, Fig. 1, output of “Prediction Network” passed to “Joint Network”; pg. 2, Fig. 1, output of “RNN-T Encoder” passed to “Joint Network”; pg. 2 section 2. 1st para. “Let us denote the input acoustic frames as X…and T is the number of frames in X. Each acoustic frame Xt is first passed through the RNN-T encoder”; pg. 2, section 2, 1st para. “In this work, RNN-T is trained to directly predict word piece token sequence Y, where the last label Yu is the special token </s>”, section 2.1, 1st para. “…the endpointing decision is made jointly with the model rather than with a separate endpointer.”).
Chang and Li are considered to be analogous to the claimed invention as
they both are in the same field of automated speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang to incorporate the teachings of Li in order to include a first joint network which generates an indication of whether the corresponding output step corresponds to an end of speech segment. Doing so would be beneficial, as joint optimization of endpointing allows for useful information captured from the ASR model to be shared with the endpointer, leading to better endpointing decisions (pg. 1, section 1, 4th para. “Information captured by ASR models is not shared to the endpointer, which may be useful for making endpointing decision. It would be better to optimize the endpointer and ASR models together.”).

	7. Claims 3 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Chang in view of Li, and in further view of Liu et al. (US Patent 12,002,451, hereinafter Liu).

	Regarding claim 3, Chang in view of Li discloses the sequence of previous non-blank symbols received as input at the prediction network comprises a sequence of N previous non-blank symbols output by the final softmax layer (Chang, Fig. 4, input sequence “yn,i-1”, softmax layer “440”; para. “…the prediction network 420 acts as a language model that accepts the previous grapheme label prediction as input,”; para. 0059 “output logits n fed in a soft max layer 440 which defines a probability distribution over the set of output targets. Hence, the RNN-T is often described as an end-to-end model because it can be configured to directly output graphemes directly”); and the prediction network is configured to generate the hidden representation by: for each non-blank symbol of the sequence of N previous non-blank symbols, generating a respective embedding (Fig. 4, output of prediction network “pui”; para. 0059 “while the prediction network 420 acts as a language model that accepts the previous grapheme label prediction as input, and computes an output vector p.sub.u. For each combination of acoustic frame input t and label u, the encoder 410 outputs h.sub.i and the prediction outputs p.sub.u are passed to a joint network 430”) and the hidden representation (para. 0059 “the prediction network 420 acts as a language model that…computes an output vector p.sub.u”).
	Chang in view of Li does not specifically disclose generating an average embedding by averaging the respective embeddings, the average embedding comprising the hidden representation.
	Liu teaches generating an average embedding by averaging the respective embeddings, the average embedding comprising the hidden representation (Fig. 9; Col. 17 Lines 28-40 “As shown in FIG. 9, a combiner component 980 may process the weight data 976 and the audio encoding data 162 to determine updated audio encoding data 982. In the example embodiment shown in FIG. 9, the combiner component 980 may integrate the contextual information, from the user profile data for the user 105, into the audio encoding data 162. The updated audio encoding data 982 may be the audio encoding data 162 with certain portions being weighted higher than other portions, where the higher weighted portions may correspond to one or more words from the user profile data. The combiner component 980 may be similar to the combiner component 180 described above in relation to FIG. 1.”).
Chang, Li, and Liu are considered to be analogous to the claimed invention as
they are all in the same field of automatic speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang in view of Li to incorporate the teachings of Liu in order to generate an average embedding by averaging respective embeddings. Doing so would be beneficial, as this would allow the ASR model to determine the relevance of context data to a spoken input from a user (Col. 3 Lines 5-23).
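A minimal sketch of the claimed averaging step follows (illustrative only; the class name and dimensions are assumptions and the code is not drawn from Liu): each of the N previous non-blank symbols is embedded, and the embeddings are averaged to form the hidden representation.

```python
# Minimal sketch of the claimed averaging (hypothetical class; not Liu's code):
# the hidden representation is the mean of the embeddings of the N previous
# non-blank symbols.
import torch
import torch.nn as nn

class AveragingPredictionNetwork(nn.Module):
    def __init__(self, vocab_size=75, embed_dim=320):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, prev_symbols):
        # prev_symbols: (batch, N) indices of the N previous non-blank symbols
        per_symbol = self.embed(prev_symbols)     # (batch, N, embed_dim), one embedding each
        return per_symbol.mean(dim=1)             # average embedding = hidden representation
```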
	
	Regarding claim 14, Chang in view of Li discloses the sequence of previous non-blank symbols received as input at the prediction network comprises a sequence of N previous non-blank symbols output by the final softmax layer (Chang, Fig. 4, input sequence “yn,i-1”, softmax layer “440”; para. “…the prediction network 420 acts as a language model that accepts the previous grapheme label prediction as input,”; para. 0059 “output logits n fed in a soft max layer 440 which defines a probability distribution over the set of output targets. Hence, the RNN-T is often described as an end-to-end model because it can be configured to directly output graphemes directly”); and generating the hidden representation using the prediction network comprises generating the hidden representation by: for each non-blank symbol of the sequence of N previous non-blank symbols, generating a respective embedding (Fig. 4, output of prediction network “pui”; para. 0059 “while the prediction network 420 acts as a language model that accepts the previous grapheme label prediction as input, and computes an output vector p.sub.u. For each combination of acoustic frame input t and label u, the encoder 410 outputs h.sub.i and the prediction outputs p.sub.u are passed to a joint network 430”) and the hidden representation (para. 0059 “the prediction network 420 acts as a language model that…computes an output vector p.sub.u”).
	Chang in view of Li does not specifically disclose generating an average embedding by averaging the respective embeddings, the average embedding comprising the hidden representation.
	Liu teaches generating an average embedding by averaging the respective embeddings, the average embedding comprising the hidden representation (Fig. 9; Col. 17 Lines 28-40 “As shown in FIG. 9, a combiner component 980 may process the weight data 976 and the audio encoding data 162 to determine updated audio encoding data 982. In the example embodiment shown in FIG. 9, the combiner component 980 may integrate the contextual information, from the user profile data for the user 105, into the audio encoding data 162. The updated audio encoding data 982 may be the audio encoding data 162 with certain portions being weighted higher than other portions, where the higher weighted portions may correspond to one or more words from the user profile data. The combiner component 980 may be similar to the combiner component 180 described above in relation to FIG. 1.”).
Chang, Li, and Liu are considered to be analogous to the claimed invention as
they are all in the same field of automatic speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang in view of Li to incorporate the teachings of Liu in order to generate an average embedding by averaging respective embeddings. Doing so would be beneficial, as this would allow the ASR model to determine the relevance of context data to a spoken input from a user (Col. 3 Lines 5-23).

8. Claims 4 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Chang in view of Li, and in further view of Xu et al. (NPL LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition, hereinafter Xu) and Variani et al. (NPL Hybrid Autoregressive Transducer (HAT), hereinafter Variani).

	Regarding claim 4, Chang in view of Li does not specifically disclose wherein the prediction network comprises a V2 embedding look-up table.
	Xu teaches wherein the prediction network comprises a … embedding look-up table (Fig. 2(b); pg. 5, section 2.6 “Input/Output Module”: “To enable the Transformer model to support ASR and TTS, we need different input and output modules for speech and text…For the ASR model: 1) The input module of the encoder consists of multiple convolutional layers, which reduce the length of the speech sequence; 2) The input module of the decoder is a character/phoneme embedding lookup table; 3) The output module of the decoder consists of a linear layer and a softmax function, where the linear layer shares the same weights with the character/phoneme embedding lookup table in the decoder input module.”)
Chang, Li, and Xu are considered to be analogous to the claimed invention as
they are all in the same field of automatic speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang in view of Li to incorporate the teachings of Xu in order to use a prediction network comprising an embedding lookup table. Doing so would be beneficial, as it would enable ASR for low-resource languages with low data costs (Xu, Abstract).
	Chang in view of Li and further in view of Xu does not specifically disclose [wherein the prediction network comprises] a V2 embedding [look-up table].
	Variani teaches a V2 embedding (Table 2; pg. 5, para. below Table 2: “Since the finite context of 2 is sufficient to perform as well as an infinite context, one can simply replace all the expensive RNN kernels in the decoder network with a |V|2 embedding vector corresponding to all the possible permutations of a finite context of 2.”).
Chang, Li, Xu, and Variani are considered to be analogous to the claimed invention as they are all in the same field of automatic speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang in view of Li and further in view of Xu to incorporate the teachings of Variani in order to specifically use a V2 embedding for the embedding lookup table. Doing so would be beneficial, as this would reduce total training and inference costs (Variani, pg. 5, para. below Table 2: “In other words, trading computation with memory, which can significantly reduce total training and inference cost.”).
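The |V|^2 ("V2") lookup described by Variani can be sketched as follows (a hedged illustration; the class name and sizes are assumptions): with a finite label context of two, the prediction network reduces to a table with one embedding per ordered pair of previous labels.

```python
# Hedged sketch of a |V|^2 ("V2") embedding look-up table (assumed names/sizes):
# with a finite label context of 2, an embedding row exists for every ordered
# pair of previous labels, replacing the RNN in the prediction network.
import torch
import torch.nn as nn

class V2EmbeddingPredictionNetwork(nn.Module):
    def __init__(self, vocab_size=75, embed_dim=320):
        super().__init__()
        self.vocab_size = vocab_size
        self.table = nn.Embedding(vocab_size * vocab_size, embed_dim)  # |V|^2 rows

    def forward(self, prev_label, prev_prev_label):
        # encode the ordered context pair (label before last, last label) as one index
        idx = prev_prev_label * self.vocab_size + prev_label
        return self.table(idx)    # a table lookup stands in for the RNN computation
```

Trading the RNN computation for memory in this way is the cost/benefit identified in the Variani passage quoted above.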

	Regarding claim 15, Chang in view of Li does not specifically disclose wherein the prediction network comprises a V2 embedding look-up table.
	Xu teaches wherein the prediction network comprises a … embedding look-up table (Fig. 2(b); pg. 5, section 2.6 “Input/Output Module”: “To enable the Transformer model to support ASR and TTS, we need different input and output modules for speech and text…For the ASR model: 1) The input module of the encoder consists of multiple convolutional layers, which reduce the length of the speech sequence; 2) The input module of the decoder is a character/phoneme embedding lookup table; 3) The output module of the decoder consists of a linear layer and a softmax function, where the linear layer shares the same weights with the character/phoneme embedding lookup table in the decoder input module.”)
Chang, Li, and Xu are considered to be analogous to the claimed invention as
they are all in the same field of automatic speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang in view of Li to incorporate the teachings of Xu in order to use a prediction network comprising an embedding lookup table. Doing so would be beneficial, as it would enable ASR for low-resource languages with low data costs (Xu, Abstract).
	Chang in view of Li and further in view of Xu does not specifically disclose [wherein the prediction network comprises] a V2 embedding [look-up table].
	Variani teaches a V2 embedding (Table 2; pg. 5, para. below Table 2: “Since the finite context of 2 is sufficient to perform as well as an infinite context, one can simply replace all the expensive RNN kernels in the decoder network with a |V|2 embedding vector corresponding to all the possible permutations of a finite context of 2.”).
Chang, Li, Xu, and Variani are considered to be analogous to the claimed invention as they are all in the same field of automatic speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang in view of Li and further in view of Xu to incorporate the teachings of Variani in order to specifically use a V2 embedding for the embedding lookup table. Doing so would be beneficial, as this would reduce total training and inference costs (Variani, pg. 5, para. below Table 2: “In other words, trading computation with memory, which can significantly reduce total training and inference cost.”).

9. Claims 5 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Chang in view of Li, and in further view of Ni et al. (US 2024/0346374, hereinafter Ni).

Regarding claim 5, Chang in view of Li discloses wherein a training process trains the joint segmenting and ASR model on the set of training samples (Chang, Fig. 2) by: training, during a first stage, the second joint network to learn how to predict the corresponding transcription of the spoken utterance characterized by the audio data of each training sample (Chang, para. 0051 “During the training process 200, a training module 270 adjusts parameters of the joint model 140…The training module 270 is configured to compare the predicted output labels 265 and associated output scores 260 with the reference output labels 222 for the corresponding reference transcription 211 and adjust parameters of the joint model 140, e.g., neural network weights, to improve the accuracy of the predictions. The training to improve prediction accuracy for the endpoint token </s> 275 can be done jointly with and at the same time as training for output labels for linguistic units. This process of adjusting the model parameters can repeat for many different training samples 236 to train the joint model 140 to make accurate predictions for both speech decoding and endpointing.”). Chang in view of Li further discloses during a second stage: … using the end of speech segment ground truth token inserted into the corresponding transcription of the spoken utterance characterized by the audio data of each training sample (Li, end of speech ground truth tokens within transcription are used for training the first joint network (Fig. 1, “Joint Network”) within RNN-T model; pg. 2, section 2 “In this work, RNN-T is trained to directly predict word piece token sequence y, where the last label yu is the special token </s>.”; pg. 2, section 2.1 “Specifically, during training for every input frame… and every label…RNN-T computes a UxT matrix…which is used in the training loss computation.”; pg. 1, 2nd col. 3rd para. “First, we introduce penalties for emitting </s> too early or late in training…These penalties are applied to the </s> token, where the ground truth is obtained from a forced alignment between the transcript and audio signals.”).
Chang and Li are considered to be analogous to the claimed invention as
they both are in the same field of automated speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang to incorporate the teachings of Li in order to include a first joint network which generates an indication of whether the corresponding output step corresponds to an end of speech segment. Doing so would be beneficial, as joint optimization of endpointing allows for useful information captured from the ASR model to be shared with the endpointer, leading to better endpointing decisions (pg. 1, section 1, 4th para. “Information captured by ASR models is not shared to the endpointer, which may be useful for making endpointing decision. It would be better to optimize the endpointer and ASR models together.”). Additionally, it would have been obvious to use the end of speech segment ground truth token inserted into the corresponding transcription of the spoken utterance. Doing so would be beneficial, as using ground truth tokens for end of speech allows for penalties which introduce a training loss for predicting the end of speech either too early or too late, preventing both potential deletion errors and increased latency (Li, pg. 2, section 2.1).
Chang in view of Li does not specifically disclose after training the second joint network, during a second stage: initializing, the first joint network with the same parameters as the trained second joint network.
Ni teaches a first and second machine learning model, wherein after training the second joint network, during a second stage: initializing, the first joint network with the same parameters as the trained second joint network (para. 0135 “The first machine learning model as shown in FIG. 3a and the second machine learning model as shown in FIG. 3b have parts having the same structure. Therefore, the parts having the same structure may use the same model parameters. In this way, when the first machine learning model is trained, at least a part of the model parameters of the second machine learning model is used, and when the second machine learning model is trained, at least a part of the model parameters of the first machine learning model is used. [0135] For example, the first model training terminal in the model training system performs a round of model training to obtain the model parameters of the first machine learning model, and then the model parameters of the first machine learning model are transferred to the second model training terminal in the model training system. The second model training terminal performs a round of model training of the second machine learning model using the model parameters of the first machine learning model.”).
Chang, Li, and Ni are considered to be analogous to the claimed invention as
Chang and Li are in the same field of automatic speech recognition, and Ni is in the related field of machine learning model training. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang in view of Li to incorporate the teachings of Ni in order to initialize the first joint network with the same parameters as the trained second joint network model during a second stage after the training of the second joint network model. Doing so would be beneficial, as reusing model parameters would reduce the computational time spent training the model during the second stage.
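The two-stage initialization reasoned about above can be sketched as follows (assuming PyTorch modules; joint_second and joint_first are hypothetical names and shapes): after stage one trains the second joint network on transcription prediction, the first joint network begins stage two from a copy of those trained parameters.

```python
# Hedged sketch (assumed names and shapes): stage one trains the second joint
# network; stage two initializes the first joint network from its parameters.
import copy
import torch.nn as nn

hidden, vocab_size = 320, 75
joint_second = nn.Linear(2 * hidden, vocab_size)   # trained during the first stage on ASR
# ... first-stage training of joint_second would occur here ...

# Second stage: start the endpointing joint network from the trained parameters,
# then fine-tune it to predict the end-of-speech-segment token.
joint_first = copy.deepcopy(joint_second)          # same structure, same starting weights
```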

Regarding claim 16, Chang in view of Li discloses wherein a training process trains the joint segmenting and ASR model on the set of training samples (Chang, Fig. 2) by: training, during a first stage, the second joint network to learn how to predict the corresponding transcription of the spoken utterance characterized by the audio data of each training sample (Chang, para. 0051 “During the training process 200, a training module 270 adjusts parameters of the joint model 140…The training module 270 is configured to compare the predicted output labels 265 and associated output scores 260 with the reference output labels 222 for the corresponding reference transcription 211 and adjust parameters of the joint model 140, e.g., neural network weights, to improve the accuracy of the predictions. The training to improve prediction accuracy for the endpoint token </s> 275 can be done jointly with and at the same time as training for output labels for linguistic units. This process of adjusting the model parameters can repeat for many different training samples 236 to train the joint model 140 to make accurate predictions for both speech decoding and endpointing.”). Chang in view of Li further discloses during a second stage: … using the end of speech segment ground truth token inserted into the corresponding transcription of the spoken utterance characterized by the audio data of each training sample (Li, end of speech ground truth tokens within transcription are used for training the first joint network (Fig. 1, “Joint Network”) within RNN-T model; pg. 2, section 2 “In this work, RNN-T is trained to directly predict word piece token sequence y, where the last label yu is the special token </s>.”; pg. 2, section 2.1 “Specifically, during training for every input frame… and every label…RNN-T computes a UxT matrix…which is used in the training loss computation.”; pg. 1, 2nd col. 3rd para. “First, we introduce penalties for emitting </s> too early or late in training…These penalties are applied to the </s> token, where the ground truth is obtained from a forced alignment between the transcript and audio signals.”).
Chang and Li are considered to be analogous to the claimed invention as
they both are in the same field of automated speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang to incorporate the teachings of Li in order to include a first joint network which generates an indication of whether the corresponding output step corresponds to an end of speech segment. Doing so would be beneficial, as joint optimization of endpointing allows for useful information captured from the ASR model to be shared with the endpointer, leading to better endpointing decisions (pg. 1, section 1, 4th para. “Information captured by ASR models is not shared to the endpointer, which may be useful for making endpointing decision. It would be better to optimize the endpointer and ASR models together.”). Additionally, it would have been obvious to use the end of speech segment ground truth token inserted into the corresponding transcription of the spoken utterance. Doing so would be beneficial, as using ground truth tokens for end of speech allows for penalties which introduce a training loss for predicting the end of speech either too early or too late, preventing both potential deletion errors and increased latency (Li, pg. 2, section 2.1).
Chang in view of Li does not specifically disclose after training the second joint network, during a second stage: initializing, the first joint network with the same parameters as the trained second joint network.
Ni teaches a first and second machine learning model, wherein after training the second joint network, during a second stage: initializing, the first joint network with the same parameters as the trained second joint network (para. 0135 “The first machine learning model as shown in FIG. 3a and the second machine learning model as shown in FIG. 3b have parts having the same structure. Therefore, the parts having the same structure may use the same model parameters. In this way, when the first machine learning model is trained, at least a part of the model parameters of the second machine learning model is used, and when the second machine learning model is trained, at least a part of the model parameters of the first machine learning model is used. [0135] For example, the first model training terminal in the model training system performs a round of model training to obtain the model parameters of the first machine learning model, and then the model parameters of the first machine learning model are transferred to the second model training terminal in the model training system. The second model training terminal performs a round of model training of the second machine learning model using the model parameters of the first machine learning model.”).
Chang, Li, and Ni are considered to be analogous to the claimed invention as
Chang and Li are in the same field of automatic speech recognition, and Ni is in the related field of machine learning model training. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang in view of Li to incorporate the teachings of Ni in order to initialize the first joint network with the same parameters as the trained second joint network model during a second stage after the training of the second joint network model. Doing so would be beneficial, as reusing model parameters would reduce the computational time spent training the model during the second stage.

10. Claims 6 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Chang in view of Narayanan et al. (NPL Cascaded Encoders for Unifying Streaming and Non-streaming ASR, hereinafter Narayanan).

	Regarding claim 6, Chang does not specifically disclose wherein the encoder comprises a causal encoder comprising a stack of conformer layers or transformer layers.
	Narayanan teaches wherein the encoder (Fig. 1, “Causal Encoder” together with “Non-Causal Encoder”; pg. 2, section 2.1, 1st para. “…the proposed cascaded encoders model consists of both causal and non-causal layers.”) comprises a causal encoder (Fig. 1, “Causal Encoder”; pg. 2, section 2.1, 1st para. “The input features, x, are first passed to a causal encoder, which transforms the features to a higher-level representation…”) comprising a stack of conformer layers or transformer layers (pg. 2, section 2.2, “Apart from the widely used LSTM model, we also present results using the conformer architecture in Sec. 4…In this work, we examine using conformer layers to implement either the causal encoder or the non-causal encoder or both…”; pg. 3, section 3.2, 3rd para. “When using conformers as the causal encoder in C-T, we use 17 layers. Each layer has 512-units, and uses 8 attention heads and a convolutional kernel of size 15…the output features are stacked and subsampled by a factor of 2 after the 4th layer…”).
Chang and Narayanan are considered to be analogous to the claimed invention as they both are in the same field of automatic speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang to incorporate the teachings of Narayanan in order to specifically use an encoder comprising a causal encoder comprising a stack of conformer layers. Doing so would be beneficial, as conformer-based encoders outperform LSTM-based models in causal modes (pg. 4, section 4.2, 2nd para. “Overall, conformer-based cascaded encoders outperform LSTM-based models in both causal and non-causal mode.”).
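A causal encoder built from a stack of transformer layers (one of the two alternatives recited in the claim) can be sketched as follows; this is illustrative only, the layer count and width are taken from the Narayanan passage quoted above, and the substitution of torch.nn.TransformerEncoder for conformer layers is an assumption.

```python
# Hedged sketch of a causal encoder as a stack of transformer layers (layer count
# and width borrowed from the quoted Narayanan passage; using TransformerEncoder
# in place of conformer layers is an assumption).
import torch
import torch.nn as nn

d_model, n_layers, T = 512, 17, 100
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
causal_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

frames = torch.randn(2, T, d_model)  # acoustic frames already projected to d_model
# mask out future frames so each output step depends only on current and past frames
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
higher_order_features = causal_encoder(frames, mask=causal_mask)   # (2, T, d_model)
```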
	
	Regarding claim 17, Chang does not specifically disclose wherein the encoder comprises a causal encoder comprising a stack of conformer layers or transformer layers.
	Narayanan teaches wherein the encoder (Fig. 1, “Causal Encoder” together with “Non-Causal Encoder”; pg. 2, section 2.1, 1st para. “…the proposed cascaded encoders model consists of both causal and non-causal layers.”) comprises a causal encoder (Fig. 1, “Causal Encoder”; pg. 2, section 2.1, 1st para. “The input features, x, are first passed to a causal encoder, which transforms the features to a higher-level representation…”) comprising a stack of conformer layers or transformer layers (pg. 2, section 2.2, “Apart from the widely used LSTM model, we also present results using the conformer architecture in Sec. 4…In this work, we examine using conformer layers to implement either the causal encoder or the non-causal encoder or both…”; pg. 3, section 3.2, 3rd para. “When using conformers as the causal encoder in C-T, we use 17 layers. Each layer has 512-units, and uses 8 attention heads and a convolutional kernel of size 15…the output features are stacked and subsampled by a factor of 2 after the 4th layer…”).
Chang and Narayanan are considered to be analogous to the claimed invention as they both are in the same field of automatic speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Chang to incorporate the teachings of Narayanan in order to specifically use an encoder comprising a causal encoder comprising a stack of conformer layers. Doing so would be beneficial, as conformer-based encoders outperform LSTM-based models in causal modes (pg. 4, section 4.2, 2nd para. “Overall, conformer-based cascaded encoders outperform LSTM-based models in both causal and non-causal mode.”).

Allowable Subject Matter
11. Claims 8-10 and 19-21 would be allowable if rewritten or amended in independent form including all of the limitations of the base claim and any intervening claims and if rewritten or amended to overcome the rejection under 35 U.S.C. § 101.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Maas et al. (US Patent 12,211,517) discloses a method for predicting potential endpoints in speech (Fig. 3).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON whose telephone number is (703)756-1601. The examiner can normally be reached M-F 8:00AM-5:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached at (571)-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/C.D.H./           Examiner, Art Unit 2659                                                                                                                                                                                             
/PIERRE LOUIS DESIR/           Supervisory Patent Examiner, Art Unit 2659