Patent Application 18303296 - Detecting Unintended Memorization in Language-Model-Fused ASR Systems - Rejection
Title: Detecting Unintended Memorization in Language-Model-Fused ASR Systems
Application Information
- Invention Title: Detecting Unintended Memorization in Language-Model-Fused ASR Systems
- Application Number: 18303296
- Submission Date: 2025-05-23
- Effective Filing Date: 2023-04-19
- Filing Date: 2023-04-19
- National Class: 704
- National Sub-Class: 232000
- Examiner Employee Number: 99179
- Art Unit: 2654
- Tech Center: 2600
Rejection Summary
- 102 Rejections: 0
- 103 Rejections: 9
Cited Patents
No patents were cited in this rejection.
Office Action Text
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Claims 1-30 are pending. Claims 1 and 16 are independent. This application was published as US 20230335126. Apparent priority is 19 April 2022.

The instant application is directed to a method of detecting memorization of unintended (i.e., sensitive) information by a language model by inserting "canary" text into the training dataset and testing the system with synthesized audio of the canary text.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2, 6, 10, 16-17, 21, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Iwama et al. ("Automated Testing of Basic Recognition Capability for Speech Recognition Systems") in view of Parikh et al. ("Canary Extraction in Natural Language Understanding Models").

Regarding claim 1, Iwama discloses:

1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising: (Pg. 13, Section I discloses a software engineering project, which implies the method is computer-implemented.)

training an external language model on the corpus of training text samples and the set of canary text samples inserted into the corpus of training text samples; ("Language Model," Fig. 1. Iwama does not specifically disclose training the language model; Pg. 23, para 3 mentions updating a language model.)

for each canary text sample in the set of canary text samples: generating, using a text-to-speech (TTS) system, a corresponding synthetic speech utterance; and ("(b) synthesizing a set of audio data files from each test sentence by using multiple text-to-speech (TTS) synthesizers with different speech characteristics," Pg. 14, para 3, (II))

generating, using a trained automatic speech recognition (ASR) model configured to receive the corresponding synthetic speech utterance as input, an initial transcription for the corresponding synthetic speech utterance; ("It transforms this speech into several candidate sentences with scores such as [She is the flower to his house: 0.8] and [She is the flour to his house: 0.1]," Pg. 14, Section A.)
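For orientation, the portion of the claimed pipeline mapped so far (canary insertion, TTS synthesis, first-pass ASR transcription) can be sketched as follows. This is a minimal illustration only; make_canary, tts_synthesize, and asr_transcribe are hypothetical stand-ins, not the applicant's or any cited reference's implementation:

```python
# Hypothetical sketch of the claim 1 pipeline up to the initial transcription.
import random
import string

def make_canary(n_tokens: int = 6) -> str:
    """Fixed-length sequence of random alphanumeric characters separated by
    spaces (the canary pattern later recited in claim 6)."""
    alphabet = string.ascii_lowercase + string.digits
    return " ".join(random.choice(alphabet) for _ in range(n_tokens))

corpus = ["the weather is nice today", "please call me back tomorrow"]
canaries = [make_canary() for _ in range(3)]
augmented_corpus = corpus + canaries  # canaries inserted into the training text

def tts_synthesize(text: str) -> bytes:   # placeholder for a real TTS system
    return text.encode("utf-8")

def asr_transcribe(audio: bytes) -> str:  # placeholder first-pass ASR model
    return audio.decode("utf-8")

initial_transcriptions = {c: asr_transcribe(tts_synthesize(c)) for c in canaries}
```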
rescoring, using the external language model trained on the corpus of training text samples and the set of canary text samples inserted into the corpus of training text samples, the initial transcription generated for each corresponding synthetic speech utterance; ("… a language model, which expresses the grammar of the recognition text (i.e., word sequences) and the probability of each text matching the grammar. … The decoder actually computes the word sequence ŵ, given by the following equation, which maximizes the posteriori probability P_A(v|x)P_D(x|w)P_L(w) by simultaneously considering the three models at the same time," Pg. 15, para 1)

determining a word error rate (WER) of the external language model based on the rescored initial transcriptions and the canary text samples; and ("For example, word error rate WER, which is a commonly used metric for the performance of speech recognition, can be calculated for each test sentence w by WER(w) = (S(w) + D(w) + I(w)) / N(w), (5)" Pg. 16, Col 2, para 4)

detecting memorization of the canary text samples by the external language model based on the WER of the external language model. (As discussed for the previous limitation, Iwama teaches using WER to measure performance.)

Iwama does not disclose: training an external language model on canary text samples inserted into the training corpus, or detecting memorization of the canary samples based on the WER. Parikh discloses:

inserting a set of canary text samples into a corpus of training text samples; ("In this work, we start with inserting potentially sensitive target utterances called 'canaries' along with their corresponding output labels into the training data." Pg. 1, para 3)

training an external language model on the corpus of training text samples and the set of canary text samples inserted into the corpus of training text samples; ("We use this augmented dataset to train an NLU model fθ," Pg. 1, para 3) and

detecting memorization of the canary text samples by the external language model based on the WER of the external language model. ("A successful attack on fθ reconstructs all the tokens of an inserted canary." Pg. 1, para 3; one of ordinary skill in the art would understand that a low word error rate means the model accurately recognized (reconstructed) the input speech.)

Iwama and Parikh are considered analogous art to the claimed invention because they disclose methods for testing machine learning systems. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Iwama to detect potentially sensitive target utterances as taught by Parikh. Doing so would have been beneficial to ensure that private information is not leaked. (Parikh, pg. 1, para 2)
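Iwama's equation (5) is straightforward to express in code. A minimal sketch, counting substitutions, deletions, and insertions via a standard word-level Levenshtein alignment (the helper name wer is ours, not Iwama's):

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER(w) = (S(w) + D(w) + I(w)) / N(w), per Iwama's equation (5)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution / match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Iwama's own example pair differs by one substitution out of seven words.
assert wer("she is the flower to his house",
           "she is the flour to his house") == 1 / 7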
Regarding claim 2, Iwama discloses:

2. The computer-implemented method of claim 1, wherein a lower WER of the external language model corresponds to an increased memorization of the canary text samples by the external language model. ("Note that this test includes checking for recognition robustness and accuracy because it verifies that various positive audio data can be correctly recognized for each test sentence. It can actually produce data for evaluating recognition robustness/accuracy. For example, word error rate WER, which is a commonly used metric for the performance of speech recognition, can be calculated for each test sentence w by WER(w) = (S(w) + D(w) + I(w)) / N(w), (5) where S(w) / D(w) / I(w) are the numbers of wrong substitution / deletion / insertion words of the recognition results dec(ext(dw)) compared with correct sentence w …. N(w) is the number of words in the correct sentence." Pg. 16, Col 2, para 4; as shown in equation (5), a lower WER corresponds to better accuracy.)

Iwama does not disclose detecting memorization of the canary samples. Parikh discloses: detecting memorization of the canary text samples. ("A successful attack on fθ reconstructs all the tokens of an inserted canary." Pg. 1, para 3; "Average Accuracy (Acc): Fraction of the trials where the attack correctly reconstructs the entire canary sequence in the correct order. A higher Accuracy indicates better reconstruction. Accuracy is 1 if we can reconstruct all n tokens in each of the 10 trials." Pg. 3, Section 4.3) See claim 1 for the motivation statement.

Regarding claim 6, Iwama does not disclose the additional limitations. Parikh discloses:

6. The computer-implemented method of claim 1, wherein each canary text sample in the set of canary text samples comprises a fixed-length sequence of random alphanumeric characters each separated by a space. (Table 1 discloses a pin code as the canary pattern, which is a random numeric code of length "n".)

Parikh does not specifically disclose that the characters include letters and can be separated by a space. Official notice is taken that passwords can contain alphanumeric characters and spaces. Parikh mentions detecting sensitive information such as PIN numbers, and it would have been obvious to one of ordinary skill in the art that passwords would also be sensitive information worth protecting. Therefore, it would have been obvious to one of ordinary skill in the art to use a password as the canary phrase, which could have included spaces between each alphanumeric character. This combination falls under combining prior art elements according to known methods to yield predictable results, or simple substitution of one known element for another to obtain predictable results. See MPEP 2141; KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding claim 10, Iwama discloses:

10. The computer-implemented method of claim 1, wherein the operations further comprise integrating the trained external language model with the trained ASR model, the trained external language model configured to rescore probability distributions over possible speech recognition hypotheses predicted by the trained ASR model. ("… its process consists of three models: an acoustic model, which involves stochastic mapping from phonemes to the time series of feature vectors, a lexicon, which contains information on stochastic mapping from words to phoneme strings (indicating the pronunciations of each word), and a language model, which expresses the grammar of the recognition text (i.e., word sequences) and the probability of each text matching the grammar. … The decoder actually computes the word sequence ŵ, given by the following equation, which maximizes the posteriori probability P_A(v|x)P_D(x|w)P_L(w) by simultaneously considering the three models at the same time," Pg. 15, para 1)
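The rescoring arrangement recited in claim 10, an external LM re-weighting the first-pass hypothesis distribution, is commonly realized as a log-linear combination of first-pass and LM scores. A minimal sketch under that assumption; lm_score and the interpolation weight are hypothetical placeholders, not the claimed or cited implementation:

```python
import math

def lm_score(sentence: str) -> float:
    # Hypothetical stand-in: a trained external LM would return log P_L(w).
    return -len(sentence.split())

def rescore(nbest, lm_weight=0.5):
    """Re-rank first-pass hypotheses by first-pass log-probability plus
    weighted external-LM log-score (shallow-fusion-style rescoring)."""
    scored = [(hyp, math.log(p) + lm_weight * lm_score(hyp)) for hyp, p in nbest]
    return max(scored, key=lambda x: x[1])[0]

# Iwama's example n-best list of candidate sentences with scores.
nbest = [("she is the flower to his house", 0.8),
         ("she is the flour to his house", 0.1)]
print(rescore(nbest))
```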
Claim 16 is a system claim with limitations corresponding to the limitations of claim 1 and is rejected under similar rationale. Additionally, the "data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions" of the claim is taught by Iwama. ("This section first outlines the processing and components of ASR systems from the software engineering perspective and then introduces and formalizes the test viewpoint for checking the basic recognition capabilities of ASR." Pg. 14, Section II; Pg. 14, Section A further describes that the speech is digitized, which implies a processor and memory.)

Claim 17 is a system claim with limitations corresponding to the limitations of claim 2 and is rejected under similar rationale.

Claim 21 is a system claim with limitations corresponding to the limitations of claim 6 and is rejected under similar rationale.

Claim 25 is a system claim with limitations corresponding to the limitations of claim 10 and is rejected under similar rationale.

Claims 3 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Iwama in view of Parikh, and further in view of Song et al. ("Auditing Data Provenance in Text-Generation Models").

Regarding claim 3, as discussed for claim 1, Iwama in view of Parikh discloses:

3. The computer-implemented method of claim 1, wherein the operations further comprise: inserting a set of extraneous text samples into a second corpus of training text samples; training a second external language model on the second corpus of training text samples and the set of extraneous text samples inserted into the second corpus of training text samples; for each canary text sample in the set of canary text samples, receiving the initial transcription generated by the trained ASR model for the corresponding synthetic speech utterance; rescoring, using the second external language model trained on the second corpus of training text samples and the set of extraneous text samples inserted into the second corpus of training text samples, the initial transcription generated for each corresponding synthetic speech utterance; determining a second WER of the second external language model based on the initial transcriptions rescored by the second external language model and the canary text samples; and detecting memorization of the canary text samples by the external language model by comparing the WER of the external language model and the second WER of the second external language model.

Iwama and Parikh do not disclose training a second language model on a second corpus containing extraneous text samples, or comparing the first and second language models. Song discloses:

inserting a set of extraneous text samples into a second corpus of training text samples; ("Our shadow training technique is inspired by [28], but one essential distinction is that in our case the shadow-training data does not need to be drawn from the same distribution as the training data of the target model. In Section 4.3, we show that public sources can be used for Dref and the loss in audit accuracy is negligible when Dtrain and Dref are drawn from different domains." Pg. 3, col 2, para 1)

training a second external language model on the second corpus of training text samples and the set of extraneous text samples inserted into the second corpus of training text samples; (Fig. 1 shows training of shadow models f'1 to f'k.)

comparing the WER of the external language model and the second WER of the second external language model. ("The auditor then queries the shadow models with Dref,u for each u in Uref and labels the resulting outputs as 'member' if u was part of the shadow's training data, 'non-member' otherwise. The next step is to use these labeled predictions to train a binary membership classifier." Pg. 3, col 2, para 2; see also Fig. 1. The audit model compares the shadow model and the target model to determine whether the sample was in the audit data. Determining that the sample was in the audit data would indicate that it was memorized by the target model.)

Iwama, Parikh, and Song are considered analogous art to the claimed invention because they disclose methods for testing machine learning systems. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Iwama in view of Parikh to use a shadow model as taught by Song. Doing so would have been beneficial to detect memorization effectively when only black-box access to the model is available. (Song, pg. 2, para 2)

Claim 18 is a system claim with limitations corresponding to the limitations of claim 3 and is rejected under similar rationale.
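Claim 3's comparison amounts to evaluating the canary set under two language models, one trained with the canaries and one trained on extraneous control text, and flagging memorization when the first model's WER is markedly lower. A minimal sketch; the margin threshold and the example numbers are illustrative assumptions only:

```python
def detect_memorization(wer_target: float, wer_control: float,
                        margin: float = 0.2) -> bool:
    """Flag memorization when the canary-trained LM rescored the canaries
    substantially better (lower WER) than the control LM trained on
    extraneous text samples."""
    return (wer_control - wer_target) > margin

# Illustrative numbers only: the canary-trained LM recovers the canaries
# almost exactly, while the control LM does not.
print(detect_memorization(wer_target=0.05, wer_control=0.60))  # True
```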
Claims 4 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Iwama in view of Parikh, and further in view of Chen et al. ("Understanding Gradient Clipping in Private SGD: A Geometric Perspective").

Regarding claim 4, Iwama does not disclose the additional limitations. Parikh discloses methods of mitigating memorization (Section 6), but does not disclose gradient clipping. Chen discloses:

4. The computer-implemented method of claim 1, wherein the operations further comprise, when training the external language model, mitigating the detected memorization of the canary text samples by the external language model by applying per-sample gradient clipping by clipping a gradient ("To protect the private information of individual citizens, many machine learning systems now train their models subject to the constraint of differential privacy … To achieve this formal privacy guarantee, one of the most popular training methods, especially for deep learning, is differentially private stochastic gradient descent (DP-SGD)," Pg. 1, Section 1) from a prescribed number of the canary text samples. ("In practice, the bounded l2-sensitivity is ensured by gradient clipping [Abadi et al., 2016b] that shrinks an individual gradient whenever its l2 norm exceeds certain threshold c." Pg. 2, Section 1; the samples that exceed the l2 threshold can be considered a prescribed number of samples.)

Iwama, Parikh, and Chen are considered analogous art to the claimed invention because they disclose methods for testing machine learning systems. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Iwama in view of Parikh to use gradient clipping as taught by Chen. Doing so would have been beneficial to "protect the private information of individual citizens." (Chen, pg. 1, Section 1)

Claim 19 is a system claim with limitations corresponding to the limitations of claim 4 and is rejected under similar rationale.
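The per-sample clipping Chen describes, shrinking an individual gradient whenever its l2 norm exceeds a threshold c as in DP-SGD, is a few lines of NumPy. A minimal sketch; the gradient values are illustrative:

```python
import numpy as np

def clip_per_sample(grads: np.ndarray, c: float) -> np.ndarray:
    """Scale each per-sample gradient g to g * min(1, c / ||g||_2),
    bounding every sample's contribution to the averaged update."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return grads * scale

per_sample_grads = np.array([[3.0, 4.0],    # norm 5.0 -> rescaled to norm 1.0
                             [0.3, 0.4]])   # norm 0.5 -> left unchanged
update = clip_per_sample(per_sample_grads, c=1.0).mean(axis=0)
```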
Claims 5 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Iwama in view of Parikh, and further in view of Srinivasan et al. ("Looking Enhances Listening: Recovering Missing Speech Using Images").

Regarding claim 5, Iwama in view of Parikh discloses:

5. The computer-implemented method of claim 1, wherein the operations further comprise, for each canary text sample in the set of canary text samples: adding noise to a suffix portion of the corresponding synthetic speech utterance without adding any noise to a prefix portion of the corresponding synthetic speech utterance; and determining, using a classifier, that the corresponding canary text sample was used to train the external language model based on the rescored initial transcription generated for the corresponding synthetic speech utterance matching the corresponding canary text sample. (Disclosed in the discussion of claim 1.)

Iwama does not specifically disclose masking the suffix portion of canary text samples. Parikh further discloses:

adding noise to a suffix portion of the corresponding synthetic speech utterance without adding any noise to a prefix portion of the corresponding synthetic speech utterance; and ("The attack takes the form of text completion, where the adversary provides the start of a canary sentence (e.g., 'my pin code is') and tries to reconstruct the remaining, private tokens of an inserted canary (e.g., a sequence of 4 digit tokens)." Pg. 1, Section 1; Parikh discloses that the prefix is provided; however, the suffix is simply removed rather than masked with noise.)

Parikh does not specifically disclose that the suffix of the sentence is masked with noise. Srinivasan discloses:

adding noise to a suffix portion of the corresponding synthetic speech utterance without adding any noise to a prefix portion of the corresponding synthetic speech utterance; ("White Noise Masking: We substitute the masked word with white noise. This more realistic scenario is an approximation to a noisy-ASR problem where speech is corrupted by some stochastic signal." Pg. 2, Section 2.2.2)

Iwama, Parikh, and Srinivasan are considered analogous art to the claimed invention because they disclose methods for evaluating machine learning systems. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Iwama in view of Parikh to mask the suffix of the sentence to be reconstructed with noise, as taught by Srinivasan. Doing so would have been beneficial to create a more realistic scenario. (Srinivasan, pg. 2, Section 2.2)

Claim 20 is a system claim with limitations corresponding to the limitations of claim 5 and is rejected under similar rationale.
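The suffix-only masking recited in claim 5 can be illustrated directly on an audio array: leave the prefix samples untouched and overwrite the suffix with white noise, in the spirit of Srinivasan's white-noise masking. A minimal sketch with an assumed split point:

```python
import numpy as np

def mask_suffix_with_noise(audio: np.ndarray, split: int,
                           noise_std: float = 0.1,
                           rng=np.random.default_rng(0)) -> np.ndarray:
    """Return a copy whose prefix audio[:split] is unchanged and whose
    suffix audio[split:] is replaced by Gaussian white noise."""
    out = audio.copy()
    out[split:] = rng.normal(0.0, noise_std, size=len(audio) - split)
    return out

utterance = np.zeros(16000)                             # 1 s of audio at 16 kHz
masked = mask_suffix_with_noise(utterance, split=8000)  # mask the second half
```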
Claims 7 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Iwama in view of Parikh, and further in view of Brownlee ("Random Oversampling and Undersampling for Imbalanced Classification").

Regarding claim 7, Iwama does not specifically disclose the additional limitations. Parikh discloses:

7. The computer-implemented method of claim 1, wherein inserting the set of canary text samples into the corpus of training text samples comprises: inserting each canary text sample of a first portion of canary text samples in the set of canary text samples a single time into the corpus of training text samples; and (Pg. 8, Table 4: "Reconstruction metrics for inserted utterances appearing only once in the training data, i.e., R = 1.") inserting each canary text sample of a second portion of canary text samples in the set of canary text samples two or more times into the corpus of training text samples, (Table 5 shows different sets of canary text inserted different numbers of times in different experiments) the second portion of canary text samples including different canary text samples than the first portion of canary text samples. (Table 5 shows different sets of canary samples.)

Parikh does not specifically disclose that the second portion of text samples is inserted two or more times into the same corpus of training samples as the first portion. Brownlee discloses: inserting each canary text sample of a second portion of canary text samples in the set of canary text samples two or more times into the corpus of training text samples. ("duplicate examples from the minority class, called oversampling," Pg. 1, para 3)

Iwama, Parikh, and Brownlee are considered analogous art to the claimed invention because they disclose methods for machine learning systems. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Iwama in view of Parikh to oversample minority classes in the canary dataset, as taught by Brownlee. Doing so would have been beneficial to address data imbalance. (Brownlee, pg. 1, para 3)

Claim 22 is a system claim with limitations corresponding to the limitations of claim 7 and is rejected under similar rationale.
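The insertion scheme of claim 7, one portion of canaries inserted once and a disjoint portion inserted R or more times in the manner of Brownlee's oversampling by duplication, reduces to list construction. A minimal sketch with illustrative canary strings:

```python
# Disjoint canary portions with different insertion counts (cf. Parikh's R).
training_text = ["the weather is nice today", "please call me back tomorrow"]
once = ["canary a1 b2", "canary c3 d4"]   # first portion, inserted a single time
repeated = ["canary e5 f6"]               # second portion, disjoint from the first
R = 4                                     # insertion count, R >= 2

augmented = training_text + once + repeated * R   # duplication = oversampling
```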
Claims 8-9 and 23-24 are rejected under 35 U.S.C. 103 as being unpatentable over Iwama in view of Parikh, and further in view of Lee et al. ("Adaptable Multi-Domain Language Model for Transformer ASR").

Regarding claim 8, Iwama does not specifically disclose the additional limitations; neither does Parikh. Lee discloses:

8. The computer-implemented method of claim 1, wherein the external language model comprises an external neural language model. (Fig. 1 shows an external Transformer LM; a Transformer LM is a type of neural LM.)

Iwama, Parikh, and Lee are considered analogous art to the claimed invention because they disclose machine learning systems for natural language. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Iwama in view of Parikh to use the external Transformer LM as taught by Lee. Doing so would have been beneficial to take advantage of a Transformer's computation and parallelism. (Lee, pg. 1, Section I)

Regarding claim 9, Iwama does not specifically disclose the additional limitations; neither does Parikh. Lee discloses:

9. The computer-implemented method of claim 8, wherein the external neural language model comprises a stack of transformer layers or Conformer layers. (Fig. 2 shows that the Transformer LM has a stack of transformer layers.) See the motivation statement for claim 8.

Claim 23 is a system claim with limitations corresponding to the limitations of claim 8 and is rejected under similar rationale.

Claim 24 is a system claim with limitations corresponding to the limitations of claim 9 and is rejected under similar rationale.

Claims 11-12 and 26-27 are rejected under 35 U.S.C. 103 as being unpatentable over Iwama in view of Parikh, and further in view of Gowda et al. (US Patent No. 11,302,331).

Regarding claim 11, Iwama discloses:

11. The computer-implemented method of claim 1, wherein the trained ASR model comprises: a first encoder configured to: receive, as input, a sequence of acoustic frames; (Fig. 1 shows a "Feature Extractor" (encoder) which takes in audio "sequences" (acoustic frames).) and generate, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; (Fig. 1 shows that the output of the Feature Extractor is a "Time series of feature vectors of input speech," which is a higher order feature representation.) a second encoder configured to: receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps; and generate, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature frame; (not explicitly disclosed) and a decoder configured to: receive, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps; ("The recognizer component in Fig. 1 is also called a decoder." Pg. 14, last line) and generate, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses. ("It transforms this speech into several candidate sentences with scores such as [She is the flower to his house: 0.8] and [She is the flour to his house: 0.1]." Pg. 14, Section II.A)

Iwama does not disclose a second encoder; neither does Parikh. Gowda discloses:

a first encoder (Column 4, lines 26-30, "The end-to-end ASR model may have a structure including a recurrent network, and may include an encoder for encoding a speech input and a decoder for estimating a character string from an output value of the encoder.") configured to: receive, as input, a sequence of acoustic frames (Column 4, lines 31-34, "The encoder included in the electronic device 1000 may determine acoustic information about a phonetic feature represented by a user's speech, by encoding an audio signal including a speech input of the user."; Column 15, lines 56-57, "In operation S920, the electronic device 1000 may split an audio signal in units of frames."); and generate, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames (Column 16, lines 50-53, "According to an embodiment of the disclosure, the first encoder 1010 may receive and encode the feature values of the plurality of frames of the audio signal."; the feature values encoded by the first encoder read on the first higher order feature representation);

a second encoder (Column 13, lines 1-4, "For example, when the ASR model 540 includes the first encoder 502 and the second encoder 504, the electronic device 1000 may obtain the character string at the second level from the output value of the second encoder 504.") configured to: receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps (Column 18, line 65 - Column 19, line 3, "According to an embodiment of the disclosure, the second encoder 1040 may encode the feature values of the plurality of frames of the audio signal, based on the output value of the first encoder 1010 that encodes the audio signal such that the first ASR model 310 outputs the character string at the first level."; the output value of the first encoder reads on the first higher order feature representation generated by the first encoder);
and generate, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature frame (Column 18, line 65 - Column 19, line 3, cited above; the feature values encoded by the second encoder, based on the output value of the first encoder, read on the second higher order feature representation for a corresponding first higher order feature frame);

and a decoder (Column 4, lines 26-30, "The end-to-end ASR model may have a structure including a recurrent network, and may include an encoder for encoding a speech input and a decoder for estimating a character string from an output value of the encoder.") configured to: receive, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps (Column 19, lines 11-14, "The second encoder 1040 may transmit, to the second decoder 1070, an output value 1042 of the second encoder 1040 generated by encoding the feature values of the plurality of frames of the audio signal 1002."; the output value of the second encoder reads on the second higher order feature representation generated by the second encoder); and generate, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses (Column 21, lines 56-62, "According to an embodiment of the disclosure, the decision layer 1064 may determine probability values that the second and third context vectors output by the soft max layer 1062 correspond to certain labels within the decision layer 1064, and may output a character string at a second level corresponding to a label representing a highest probability value."; Column 6, lines 21-26, "However, like the above-described first ASR model 110, the projection layer and the soft max layer may be included in the decoder 122, and the electronic device 1000 may obtain the character string at the second level directly from the output value of the decoder 122 including the projection layer and the soft max layer."; determining the probabilities that the context vectors output by the softmax layer correspond to certain labels, and outputting the character string for the label with the highest probability, reads on generating a probability distribution over possible speech recognition hypotheses).

Iwama, Parikh, and Gowda are considered analogous art to the claimed invention because they disclose machine learning systems for natural language. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Iwama in view of Parikh to use a second encoder as taught by Gowda. Doing so would have been beneficial to encode the audio signal so that the features of the user are better reflected. (Gowda, col. 19, lines 30-39)
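The cascaded arrangement mapped above, a first encoder feeding a second encoder feeding a decoder that emits a per-step distribution over hypotheses, can be sketched shape-wise in NumPy. The layer weights and dimensions here are random placeholders, not Gowda's model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d1, d2, vocab = 50, 80, 256, 256, 1000   # assumed sizes

W1 = rng.normal(size=(d_in, d1))   # first encoder
W2 = rng.normal(size=(d1, d2))     # second encoder, consumes encoder-1 output
Wd = rng.normal(size=(d2, vocab))  # decoder projection to hypothesis scores

frames = rng.normal(size=(T, d_in))          # sequence of acoustic frames
h1 = np.tanh(frames @ W1)                    # first higher order representation
h2 = np.tanh(h1 @ W2)                        # second higher order representation
logits = h2 @ Wd
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)    # per-step distribution over tokens
```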
Regarding claim 12, Iwama discloses:

12. The computer-implemented method of claim 11, wherein the decoder or another decoder is further configured to: receive, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps; (Fig. 1 shows that the Recognizer (decoder) receives the higher order feature representation) and generate, at each of the plurality of output steps, a second probability distribution over possible speech recognition hypotheses. ("It transforms this speech into several candidate sentences with scores such as [She is the flower to his house: 0.8] and [She is the flour to his house: 0.1]." Pg. 14, Section II.A; in this example, the second candidate sentence has a second probability distribution.)

Claim 26 is a system claim with limitations corresponding to the limitations of claim 11 and is rejected under similar rationale.

Claim 27 is a system claim with limitations corresponding to the limitations of claim 12 and is rejected under similar rationale.

Claims 14 and 29 are rejected under 35 U.S.C. 103 as being unpatentable over Iwama in view of Parikh, and further in view of Moritz et al. ("Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition").

Regarding claim 14, Iwama and Parikh do not disclose the additional limitations. Moritz discloses:

14. The computer-implemented method of claim 11, wherein: the first encoder comprises a causal encoder (Fig. 1 shows "Causal Self-Attention" on the left) comprising an initial stack of Conformer layers; ("The conformer encoder is composed of e = 1, . . . , E conformer blocks …" Pg. 2, para 5) and the second encoder comprises a non-causal encoder (Fig. 1 shows "Non-Causal Self-Attention" on the right side) comprising a final stack of Conformer layers ("The conformer encoder is composed of e = 1, . . . , E conformer blocks …" Pg. 2, para 5) overlain on the initial stack of Conformer layers. (Fig. 1 shows that the "causal frames" and "non-causal frames" overlap.)

Iwama, Parikh, and Moritz are considered analogous art to the claimed invention because they disclose machine learning systems for natural language. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Iwama in view of Parikh to use a dual causal/non-causal architecture as disclosed by Moritz. Doing so would have been beneficial to achieve state-of-the-art results. (Moritz, pg. 1, Section 1, last para)

Claim 29 is a system claim with limitations corresponding to the limitations of claim 14 and is rejected under similar rationale.
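The causal/non-causal distinction Moritz relies on is a property of the self-attention mask: a causal encoder lets frame t attend only to frames at or before t, while a non-causal encoder may also look ahead. A minimal mask sketch; the look-ahead width is an illustrative assumption:

```python
import numpy as np

def attention_mask(T: int, lookahead: int) -> np.ndarray:
    """mask[t, s] is True where frame t may attend to frame s.
    lookahead=0 gives a causal mask; lookahead>0 is non-causal."""
    t = np.arange(T)[:, None]
    s = np.arange(T)[None, :]
    return s <= t + lookahead

causal = attention_mask(T=6, lookahead=0)       # first (streaming) encoder
non_causal = attention_mask(T=6, lookahead=2)   # second encoder sees 2 future frames
```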
Claims 15 and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Iwama in view of Parikh, and further in view of Variani et al. ("Hybrid Autoregressive Transducer (HAT)").

Regarding claim 15, Iwama and Parikh do not disclose the additional limitations. Variani discloses:

15. The computer-implemented method of claim 11, wherein the first encoder and the second encoder of the ASR model are trained using Hybrid Autoregressive Transducer Factorization to facilitate an integration of the external language model trained on text-only data comprising the corpus of training text samples and the set of canary text samples inserted into the corpus of training text samples. (See Pg. 3, Section 3, "Hybrid Autoregressive Transducer"; see also Pg. 1, Section 1, para 2: "The probabilistic factorization of ASR allows a modular design that provides practical advantages for training and inference.")

Iwama, Parikh, and Variani are considered analogous art to the claimed invention because they disclose machine learning systems for natural language. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Iwama in view of Parikh to use a Hybrid Autoregressive Transducer model as disclosed by Variani. Doing so would have been beneficial to decide whether the external language model is beneficial or not. (Variani, pg. 1, Abstract)

Claim 30 is a system claim with limitations corresponding to the limitations of claim 15 and is rejected under similar rationale.

Allowable Subject Matter

Claims 13 and 28 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. No prior art was found that teaches all the additional limitations of claims 13 and 28.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JON C MEIS, whose telephone number is (703) 756-1566. The examiner can normally be reached Monday - Thursday, 8:30 am - 5:30 pm EST.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Hai Phan, can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/JON CHRISTOPHER MEIS/
Examiner, Art Unit 2654

/HAI PHAN/
Supervisory Patent Examiner, Art Unit 2654