Patent Application 18078782 - FEDERATED KNOWLEDGE DISTILLATION ON AN ENCODER - Rejection

From WikiPatents

Title: FEDERATED KNOWLEDGE DISTILLATION ON AN ENCODER OF A GLOBAL ASR MODEL AND/OR AN ENCODER OF A CLIENT ASR MODEL

Application Information

  • Invention Title: FEDERATED KNOWLEDGE DISTILLATION ON AN ENCODER OF A GLOBAL ASR MODEL AND/OR AN ENCODER OF A CLIENT ASR MODEL
  • Application Number: 18078782
  • Submission Date: 2025-04-08
  • Effective Filing Date: 2022-12-09
  • Filing Date: 2022-12-09
  • National Class: 704
  • National Sub-Class: 232000
  • Examiner Employee Number: 100879
  • Art Unit: 2653
  • Tech Center: 2600

Rejection Summary

  • 102 Rejections: 0
  • 103 Rejections: 3

Cited Patents

No patents were cited in this rejection.

Office Action Text


    DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This communication is in response to the application filed on 09 December 2022. Claims 1-20 are pending and have been examined.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09 December 2022 is being considered by the examiner.

Drawings
The drawings are objected to because "PROTIONS" should read as "PORTIONS" in reference character 512.  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Specification
The disclosure is objected to because of the following informalities: 
In paragraph 0034, line 1-2, "may include may include" should read as "may include."
In paragraph 0086, line 3, a spacing/indentation error is present.  
Appropriate correction is required.

Claim Objections
Claim 12 is objected to because of the following informalities:
 In the 3rd line, "joining network" should be "joint network."  
Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4, 6-11, 13, 15-17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over WO 2022019885 A1, hereinafter referred to as Beaufays et al., in view of US 20220309340 A1, hereinafter referred to as Leal et al., and further in view of US 20150161994 A1, hereinafter referred to as Tang et al.

Regarding claim 1, Beaufays discloses a method implemented by one or more processors (“Various implementations can include a non-transitory computer readable storage medium storing instructions executable by one or more processors,” Beaufays et al. para [0102].), the method comprising: 

distilling information from a global automatic speech recognition (“ASR”) model to generate a client ASR model (“Turning back to FIG. 1A, an update distribution engine 166 can, responsive to one or more of the conditions being satisfied for the client device 110 or one or more of the plurality of additional client devices 170, transmit the combined machine learning model(s) 106 to the client device 110 and/or one or more of the plurality of additional client devices 170.,” Beaufays et al. para [0043].);

for each of a plurality of training instances in the set of training instances and until one or more conditions are satisfied: selecting a given training instance, wherein the given training instance includes an instance of audio data capturing a spoken utterance (“In some of those implementations, in generating a gradient at a client device, the client device can: detect, via corresponding microphone(s), audio data that captures a spoken utterance of a corresponding user of the client device…,” Beaufays et al. para [0003]); 

processing the instance of audio data capturing the spoken utterance using the client ASR model to generate one or more predicted coefficients corresponding to the given training instance (“, …process the audio data (e.g., features thereof), using a local machine learning model that includes ML model layers that correspond to global ML model layers and that are used in generating an encoding of the audio data, to generate predicted output(s); and generate, using unsupervised (or self-supervised) learning, the gradient based on the predicted output(s),” Beaufays et al. para [0003]); and 

updating one or more portions of the client encoder based on comparing the loss and the one or more predicted coefficients (“In some implementations, the system may also transmit, to the client device, the updated global machine learning model layers. The client device may then use the updated global machine learning model layers in generating encodings of further sensor data generated at the client device, thereby updating the local machine learning model used in generating the predicted output,” Beaufays et al. para [0073]). 
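
For illustration only, the client-side step described above (processing detected audio with a local model to generate predicted outputs and deriving a gradient from them) might look like the following sketch. The model, the self-supervised reconstruction loss, and all shapes are assumptions made for illustration, not the method of the cited reference.

```python
# Hypothetical sketch: a client computes a gradient from its local model's
# predicted outputs using a simple self-supervised (reconstruction) loss.
import torch
import torch.nn as nn

local_model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
audio_features = torch.randn(1, 100, 80)   # stand-in for detected audio data

predicted = local_model(audio_features)    # predicted output(s)
loss = nn.functional.mse_loss(predicted, audio_features)  # self-supervised target
loss.backward()

# Per-parameter gradients that a client device could transmit to the remote system.
gradient = [p.grad.clone() for p in local_model.parameters()]
```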

However, Beaufays et al. does not disclose the use of a client encoder, prediction model, and joint network; the use of a global encoder, prediction model, and joint network; or a global ASR model producing global output. Leal et al. teaches a knowledge distillation method for automatic speech recognition (ASR) systems.

Continuing claim 1, Leal et al. teaches wherein the client ASR model includes a client encoder, the prediction model, and the joint network (“Generally speaking, when the adaptive model 200 is a particular type of speech recognition model, both the teacher model 210 and the student model (i.e., the adaptive model 200) are the same type of model for purposes of distillation,” Leal et al. para [0031] and Leal et al. Fig. 3 shows the encoder 310, the prediction network 330, and the joint network 320); 

wherein the global ASR model includes a global encoder, a prediction model, and a joint network (“Generally speaking, when the adaptive model 200 is a particular type of speech recognition model, both the teacher model 210 and the student model (i.e., the adaptive model 200) are the same type of model for purposes of distillation,” Leal et al. para [0031] and Leal et al. Fig. 3 shows the encoder 310, the prediction network 330, and the joint network 320); 

processing the instance of audio data capturing the spoken utterance using the global ASR model to generate global output (“Training one or more teacher automatic speech recognition (ASR) models using the plurality of teacher training examples, each teacher ASR model configured to output a respective textual representation of a respective audio input,” Leal et al. Fig. 4 reference character 404). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include Leal’s disclosure of client/global encoders, prediction models, and joint networks, as these are indicative of recurrent neural network transducers (RNN-Ts). It would have also been obvious to include Leal’s disclosure of a teacher ASR model producing output to Beaufays’s global model. RNN-Ts allow for the continuous processing of inputs while simultaneously streaming outputs, which one of ordinary skill in the art could add to Beaufays’s disclosure to allow for better real-time communication. Leal’s disclosure of a teacher model producing global output would achieve a more effective measure of the loss between the client and global models and therefore allow better training.
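
For illustration only, the encoder/prediction-network/joint-network composition referenced in the rationale above could be sketched as a minimal RNN-T-style module. Layer types, dimensions, and names are assumptions for illustration, not the structure of any cited reference.

```python
# Hypothetical sketch of an RNN-T-style model: encoder + prediction network + joint network.
import torch
import torch.nn as nn

class TinyRNNT(nn.Module):
    def __init__(self, n_mels=80, vocab=128, enc_dim=256, pred_dim=256, joint_dim=256):
        super().__init__()
        # Encoder: consumes streaming audio features.
        self.encoder = nn.LSTM(n_mels, enc_dim, num_layers=2, batch_first=True)
        # Prediction network: consumes previously emitted (non-blank) tokens.
        self.embed = nn.Embedding(vocab, pred_dim)
        self.prediction = nn.LSTM(pred_dim, pred_dim, batch_first=True)
        # Joint network: combines encoder and prediction states into token logits.
        self.joint = nn.Sequential(
            nn.Linear(enc_dim + pred_dim, joint_dim), nn.Tanh(),
            nn.Linear(joint_dim, vocab),
        )

    def forward(self, feats, tokens):
        enc_out, _ = self.encoder(feats)                    # (B, T, enc_dim)
        pred_out, _ = self.prediction(self.embed(tokens))   # (B, U, pred_dim)
        # Broadcast over the (T, U) lattice used by the RNN-T loss.
        t = enc_out.unsqueeze(2).expand(-1, -1, pred_out.size(1), -1)
        u = pred_out.unsqueeze(1).expand(-1, enc_out.size(1), -1, -1)
        return self.joint(torch.cat([t, u], dim=-1))        # (B, T, U, vocab)
```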

Neither Beaufays et al. nor Leal et al. disclose an ASR model that is processed using principal component analysis (PCA); or generating a loss based on the global output, the one or more coefficients, the mean vector for the set of training instances, and the set of principal directions for the set of training instances. 

Tang et al. teaches a method for ASR that utilizes PCA. Continuing claim 1, Tang et al. teaches wherein distilling the global ASR model to generate the client ASR model comprises: processing a set of training instances using principal component analysis (“PCA”) to generate (a) a mean vector for the set of training instances and (b) a set of principal directions for the set of training instances (“The dimensionality of the speaker data may be reduced using, for example, principal component analysis (PCA) prior to applying the speaker data as input to the DNN,” Tang et al. para [0004]); 

generating a loss based on the global output, the one or more coefficients, the mean vector for the set of training instances, and the set of principal directions for the set of training instances (“In some examples, each of the one or more teacher ASR models and the multi-lingual student ASR models include a recurrent neural network-transducer (RNN-T) architecture. In these examples, the tunable distillation loss weight may include a decreasing function based on an RNN-T loss corresponding to the one or more teacher ASR models,” Leal et al. para [0010] and “The dimensionality of the speaker data may be reduced using, for example, principal component analysis (PCA) prior to applying the speaker data as input to the DNN,” Tang et al. para [0004]). PCA is performed by finding the mean vector and then identifying the principal components of maximum variance. This use of PCA by Tang et al., combined with the teachings of Beaufays et al. and Leal et al., compresses the transferable data/knowledge and therefore leads to a simpler RNN-T with fewer input nodes compared to when PCA is not used. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include this dimension reduction technique in order to simplify their ASR models. Leal et al. teaches a loss function based on the RNN-T loss corresponding to remote ASR models. Tang et al. teaches performing PCA, which results in mean vectors and a set of principal directions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the tunable distillation loss function with the mean vectors and set of principal directions generated by PCA to provide a more representative loss value in order to then update the associated models.
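
For illustration only, the PCA step (a mean vector and a set of principal directions computed from a set of training representations) and a loss comparing client-predicted coefficients against the projection of the global model's output could be sketched as follows. The function names and the simple squared-error loss are assumptions, not the claimed or cited formulation.

```python
# Hypothetical sketch: PCA mean vector + principal directions, and a coefficient-space loss.
import numpy as np

def fit_pca(X, k):
    """X: (N, D) training representations; returns (mean vector, top-k principal directions)."""
    mean = X.mean(axis=0)                                   # (D,)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]                                     # (D,), (k, D)

def to_coefficients(x, mean, directions):
    """Project a representation onto the principal directions."""
    return (x - mean) @ directions.T                        # (k,)

def distillation_loss(client_coeffs, global_output, mean, directions):
    """Squared error between client coefficients and the projection of the global output."""
    return float(np.mean((client_coeffs - to_coefficients(global_output, mean, directions)) ** 2))

# Stand-in data to show usage.
X = np.random.randn(1000, 256)
mean, dirs = fit_pca(X, k=32)
global_out = np.random.randn(256)
client_coeffs = to_coefficients(global_out, mean, dirs) + 0.01 * np.random.randn(32)
print(distillation_loss(client_coeffs, global_out, mean, dirs))
```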

Regarding claim 2, Beaufays et al., as modified by Leal et al. and Tang et al., discloses the method of claim 1, further comprising: processing a further instance of audio data capturing a further spoken utterance using the global ASR model to generate a global text representation of the spoken utterance (“Accordingly, in this example, the global ML model layers of the speech recognition model can process training instance input (e.g., audio data correspond to speech) to generate a feature representation corresponding to the training instance input, and the additional layer(s) of the given speech recognition model can process the feature representation corresponding to the speech to generate predicted output(s) (e.g., predicted phoneme(s), predicted token(s), and/or recognized text corresponding to the speech of the training instance input,” Beaufays et al. para [0010]); and 

updating one or more portions of the client encoder based on comparing the global text representation of the spoken utterance and the client text representation of the spoken utterance (“The client device can then replace the local machine learning model with the global machine learning model, or replace the weights of the local machine learning model with the updated weights of the global machine learning model, thereby updating the local machine learning model,” Beaufays et al. para [0001]).

Beaufays et al. does not disclose using the client ASR model to generate a text representation of the spoken utterance. Leal teaches a knowledge distillation method for automatic speech recognition (ASR) systems. 

Continuing claim 2, Leal et al. teaches processing the further instance of audio data using the client ASR model to generate a client text representation of the spoken utterance (“A speech recognition system 140 receives audio data 14 as an input and transcribes that audio signal into a transcription 142 as an output using an adaptive automatic speech recognition (ASR) model 200 (also referred to as the adaptive model 200),” Leal et al. para [0024]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these and include a client-side generation of a text representation of the spoken utterance. This would benefit the training of both the client and global models by allowing better calculations of loss between the two models.

Regarding claim 4, Beaufays et al., as modified by Leal et al. and Tang et al., discloses a method wherein the global ASR model is a recurrent neural network transformer and wherein the client ASR model is an additional RNN-T. Leal et al. teaches the method of claim 1, wherein the global ASR model is a recurrent neural network transformer ("RNN-T") and wherein the client ASR model is an additional RNN-T (“In some examples, each of the one or more teacher ASR models and the multi-lingual student ASR models include a recurrent neural network-transducer (RNN-T) architecture,” Leal et al. para [0006]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include recurrent neural network transducers (RNN-Ts). RNN-Ts allow for the continuous processing of inputs while simultaneously streaming outputs, which one of ordinary skill in the art could add to Beaufays’s disclosure to allow for better real-time communication.

Regarding claim 6, Beaufays et al., as modified by Leal et al. and Tang et al., discloses the method of claim 1, wherein the client ASR model is stored locally at a client device and wherein the global ASR model is stored at a server remote from the client device (“Each of the additional gradients are generated locally at the corresponding one of the plurality of client devices based on: unsupervised learning, at the corresponding one of the plurality of client devices, that is based on processing, using the respective local machine learning model stored locally on the corresponding one of the plurality of client devices, additional audio data that captures an additional spoken utterance, and the respective local machine learning model includes the respective portion used in generating the encoding of the additional audio data,” Beaufays et al. para [0093] and “The method further includes further updating, based on the received additional gradients, weights of the global machine learning model layers stored remotely at the remote system,” Beaufays et al. para [0093]).

Regarding claim 7, Beaufays et al., as modified by Leal et al. and Tang et al., discloses the method of claim 6, wherein storage of the global encoder takes a first value of memory, wherein storage of the client encoder takes a second value of memory, and wherein the first value of memory is greater than the second value of memory (“These peripheral devices may include a storage subsystem 624, including, for example, a memory subsystem 625 and a file storage subsystem 626, user interface output devices 620, user interface input devices 622, and a network interface subsystem 616,” Beaufays et al. para [0075]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention that storage of a global encoder would require a larger value than that of a client encoder, and therefore should take the first value of memory while the client encoder takes the second value of memory.
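
For illustration only, the relationship asserted above (a larger server-side global encoder versus a smaller on-device client encoder) can be made concrete by counting parameters; the layer sizes below are arbitrary assumptions.

```python
# Hypothetical sketch: compare memory footprints of a global and a client encoder.
import torch.nn as nn

def param_bytes(module, bytes_per_param=4):  # assuming float32 weights
    return sum(p.numel() for p in module.parameters()) * bytes_per_param

global_encoder = nn.LSTM(80, 1024, num_layers=6, batch_first=True)  # server-side
client_encoder = nn.LSTM(80, 256, num_layers=2, batch_first=True)   # on-device

first_value = param_bytes(global_encoder)    # memory taken by the global encoder
second_value = param_bytes(client_encoder)   # memory taken by the client encoder
assert first_value > second_value
print(first_value, second_value)
```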

Regarding claim 8, Beaufays et al., as modified by Leal et al. and Tang et al., discloses the method of claim 1, wherein the global ASR model is initially trained using a set of non-private training data (“The combined model can optionally be pre-trained, using proxy data, prior to its utilization in federated learning,” Beaufays et al. para [0001]).

Regarding claim 9, Beaufays et al., as modified by Leal et al. and Tang et al., discloses the method of claim 1 further comprising: distilling information from the client ASR model to the global ASR model (“The ML model layers can be trained at a remote system based on gradient(s) that are generated locally at client device(s) using unsupervised (or self-supervised) learning techniques, and that are transmitted to the remote system,” Beaufays et al. para [0002]); 

wherein distilling information from the client ASR model to the global ASR model comprises: processing an additional instance of audio data capturing an additional spoken utterance using the client ASR model to generate one or more additional predicted coefficients (“In some of those implementations, in generating a gradient at a client device, the client device can: detect, via corresponding microphone(s), audio data that captures a spoken utterance of a corresponding user of the client device; process the audio data(e.g., features thereof), using a local machine learning model that includes ML model layers that correspond to global ML model layers and that are used in generating an encoding of the audio data, to generate predicted output(s); and generate, using unsupervised (or self-supervised) learning, the gradient based on the predicted output(s),” Beaufays et al. para [0003]); and

updating one or more portions of the global encoder based on comparing the additional loss and the one or more additional predicted coefficients (“Update, based on the received gradients and/or the additional gradients, weights of global machine learning model layers stored remotely at the remote system,” Beaufays et al. Fig. 4 reference character 456).

Beaufays et al., as modified by Leal et al. and Tang et al., does not disclose generating additional global output based on processing the one or more additional predicted coefficients, the mean vector, and the set of principal directions using the global ASR model, nor generating an additional loss based on the one or more additional predicted coefficients, the additional global loss, the mean vector, and the set of principal directions.

Continuing claim 9, Leal et al. and Tang et al. together teach generating additional global output based on processing the one or more additional predicted coefficients, the mean vector, and the set of principal directions using the global ASR model (“Training one or more teacher automatic speech recognition (ASR) models using the plurality of teacher training examples, each teacher ASR model configured to output a respective textual representation of a respective audio input,” Leal et al. Fig. 4 reference character 404 and “The dimensionality of the speaker data may be reduced using, for example, principal component analysis (PCA) prior to applying the speaker data as input to the DNN,” Tang et al. para [0004]); 

generating an additional loss based on the one or more additional predicted coefficients, the additional global loss, the mean vector, and the set of principal directions (“In some examples, each of the one or more teacher ASR models and the multi-lingual student ASR models include a recurrent neural network-transducer (RNN-T) architecture. In these examples, the tunable distillation loss weight may include a decreasing function based on an RNN-T loss corresponding to the one or more teacher ASR models,” Leal et al. para [0010] and “The dimensionality of the speaker data may be reduced using, for example, principal component analysis (PCA) prior to applying the speaker data as input to the DNN,” Tang et al. para [0004]). Leal et al. teaches a remote teacher ASR model that is configured to generate output, as well as a loss function based on the RNN-T loss corresponding to remote ASR models. Tang et al. teaches performing PCA, which results in mean vectors and a set of principal directions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include both the remote ASR model that produces global output with the mean vectors and set of principal directions generated by PCA to provide a more representative loss value in order to then update the global model. It also would have been obvious to include both the tunable distillation loss function with the mean vectors and set of principal directions generated by PCA to provide additional, more representative loss values in order to then update the associated global model.
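
For illustration only, the direction described for claim 9 (the global side consuming client-predicted coefficients together with the mean vector and principal directions) could reconstruct an approximate representation by inverting the PCA projection and then score a loss against the global output. Names and shapes are assumptions for illustration.

```python
# Hypothetical sketch: reconstruct a representation from coefficients, then score a loss.
import numpy as np

def from_coefficients(coeffs, mean, directions):
    """Inverse of the PCA projection: coefficients (k,) -> approximate representation (D,)."""
    return mean + coeffs @ directions        # directions: (k, D)

def additional_loss(additional_coeffs, additional_global_output, mean, directions):
    """Squared error between the reconstruction and the global model's output."""
    recon = from_coefficients(additional_coeffs, mean, directions)
    return float(np.mean((recon - additional_global_output) ** 2))
```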

Regarding claim 10, Beaufays et al. discloses a method implemented by one or more processors (“Various implementations can include a non-transitory computer readable storage medium storing instructions executable by one or more processors,” Beaufays et al. para [0102].); 

the method comprising: distilling information from a client automatic speech recognition (“ASR”) model to a global ASR model (“Transmit, to a remote system and from the client device, the generated gradient to cause the remote system to utilize the generated gradient to update weights of global machine learning model layers stored remotely at the remote system,” Beaufays et al. Fig. 3 reference character 360 and “In some implementations, the combined machine learning model is an automatic speech recognition (ASR) model, and the at least one prediction includes a plurality of predicted phonemes, or a plurality of predicted tokens, that correspond to the further spoken utterance,” Beaufays et al. para [0086]); and 

wherein distilling information from the client ASR model to the global ASR model comprises: processing an instance of audio data capturing a spoken utterance using the client ASR model to generate one or more predicted coefficients corresponding to the spoken utterance (“Detect audio data that captures a spoken utterance via one or more microphone(s) of the client device,” Beaufays et al. Fig. 3 reference character 352A and “Process, using a local machine learning model stored locally at the client device, the sensor data to generate predicted output,” Beaufays et al. Fig. 3 reference character 354); and 

updating one or more portions of the global encoder based on comparing the loss and the one or more predicted coefficients (“Update, based on the received gradients and/or the additional gradients, weights of global machine learning model layers stored remotely at the remote system,” Beaufays et al. Fig. 4 reference character 456).
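
For illustration only, the server-side update described above (updating global weights based on gradients received from client devices) is commonly implemented as an averaged gradient step; the dictionary layout and plain SGD step below are assumptions, not the cited method.

```python
# Hypothetical sketch: average client gradients and apply them to global weights.
import numpy as np

def federated_update(global_weights, client_gradients, lr=0.1):
    """global_weights: dict name -> array; client_gradients: list of dicts with the same keys."""
    for name in global_weights:
        avg_grad = np.mean([g[name] for g in client_gradients], axis=0)
        global_weights[name] = global_weights[name] - lr * avg_grad
    return global_weights
```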

Beaufays et al. does not disclose the usage of a client encoder, prediction model, and joint network or the usage of a global encoder, prediction model, and joint network. Leal et al. teaches a knowledge distillation method for automatic speech recognition (ASR) systems. 

Continuing claim 10, Leal et al. teaches wherein the client ASR model includes a client encoder, the prediction model, and the joint network (“Generally speaking, when the adaptive model 200 is a particular type of speech recognition model, both the teacher model 210 and the student model (i.e., the adaptive model 200) are the same type of model for purposes of distillation,” Leal et al. para [0031] and Leal et al. Fig. 3 shows the encoder 310, the prediction network 330, and the joint network 320);

wherein the global ASR model includes a global encoder, a prediction model, and a joint network (“Generally speaking, when the adaptive model 200 is a particular type of speech recognition model, both the teacher model 210 and the student model (i.e., the adaptive model 200) are the same type of model for purposes of distillation,” Leal et al. para [0031] and Leal et al. Fig. 3 shows the encoder 310, the prediction network 330, and the joint network 320). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include Leal’s disclosure of client/global encoders, prediction models, and joint networks, as these are indicative of recurrent neural network transducers (RNN-Ts). RNN-Ts allow for the continuous processing of inputs while simultaneously streaming outputs, which one of ordinary skill in the art could add to Beaufays’s disclosure to allow for better real-time communication.

Beaufays et al. does not disclose generating global output based on the processing of the one or more predicted coefficients, the mean vector of the global ASR model, and the set of principal directions of the global ASR model using the global ASR model, where the mean vector of the global ASR model and the set of principal directions for the global ASR model are generated using principal component analysis (PCA), nor generating a loss based on the one or more predicted coefficients, the global loss, the mean vector of the global ASR model, and the set of principal directions of the global ASR model.

Continuing claim 10, Leal et al. and Tang et al. together teach generating global output based on processing the one or more predicted coefficients, a mean vector of the global ASR model, a set of principal directions of the global ASR model using the global ASR model, where the mean vector of the global ASR model and the set of principal directions for the global ASR model are generated using principal component analysis ("PCA") (“Training one or more teacher automatic speech recognition (ASR) models using the plurality of teacher training examples, each teacher ASR model configured to output a respective textual representation of a respective audio input,” Leal et al. Fig. 4 reference character 404 and “The dimensionality of the speaker data may be reduced using, for example, principal component analysis (PCA) prior to applying the speaker data as input to the DNN,” Tang et al. para [0004]); 

generating a loss based on the one or more predicted coefficients, the global loss, the mean vector of the global ASR model, and the set of principal directions of the global ASR model (“In some examples, each of the one or more teacher ASR models and the multi-lingual student ASR models include a recurrent neural network-transducer (RNN-T) architecture. In these examples, the tunable distillation loss weight may include a decreasing function based on an RNN-T loss corresponding to the one or more teacher ASR models,” Leal et al. para [0010] and “The dimensionality of the speaker data may be reduced using, for example, principal component analysis (PCA) prior to applying the speaker data as input to the DNN,” Tang et al. para [0004]). Leal et al. teaches a remote teacher ASR model that is configured to generate output, as well as a loss function based on the RNN-T loss corresponding to remote ASR models. Tang et al. teaches performing PCA, which results in mean vectors and a set of principal directions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include both the remote ASR model that produces global output with the mean vectors and set of principal directions generated by PCA to provide a more representative loss value in order to then update the global model. It also would have been obvious to include both the tunable distillation loss function with the mean vectors and set of principal directions generated by PCA to provide additional, more representative loss values in order to then update the associated global model.

Regarding claim 11, Beaufays et al., as modified by Leal et al. and Tang et al., discloses the method of claim 10, further comprising: distilling information from an additional client ASR model to the global ASR model (“The ML model layers can be trained at a remote system based on gradient(s) that are generated locally at client device(s) using unsupervised (or self-supervised) learning techniques, and that are transmitted to the remote system,” Beaufays et al. para [0002]); 

wherein distilling information from the additional client ASR model to the global ASR model comprises: processing an additional instance of audio data capturing an additional spoken utterance using the additional client ASR model to generate one or more additional predicted coefficients corresponding to the additional spoken utterance (“In some of those implementations, in generating a gradient at a client device, the client device can: detect, via corresponding microphone(s), audio data that captures a spoken utterance of a corresponding user of the client device; process the audio data(e.g., features thereof), using a local machine learning model that includes ML model layers that correspond to global ML model layers and that are used in generating an encoding of the audio data, to generate predicted output(s); and generate, using unsupervised (or self-supervised) learning, the gradient based on the predicted output(s),” Beaufays et al. para [0003]); and 

updating one or more portions of the global encoder based on comparing the additional loss and the one or more additional predicted coefficients (“Update, based on the received gradients and/or the additional gradients, weights of global machine learning model layers stored remotely at the remote system,” Beaufays et al. Fig. 4 reference character 456).

Beaufays et al. does not disclose generating additional global output based on the processing of one or more additional predicted coefficients, the mean vector of the global ASR model, and the set of principal directions of the global ASR model using the global ASR model, nor generating an additional loss based on the one or more additional predicted coefficients, the global loss, the mean vector of the global ASR model, and the set of principal directions of the global ASR model.

Continuing claim 11, Leal et al. and Tang et al. together teach generating additional global output based on processing the one or more additional predicted coefficients, the mean vector of the global ASR model, and the set of principal directions of the global ASR model using the global ASR model (“Training one or more teacher automatic speech recognition (ASR) models using the plurality of teacher training examples, each teacher ASR model configured to output a respective textual representation of a respective audio input,” Leal et al. Fig. 4 reference character 404 and “The dimensionality of the speaker data may be reduced using, for example, principal component analysis (PCA) prior to applying the speaker data as input to the DNN,” Tang et al. para [0004]); 

generating an additional loss based on the one or more additional predicted coefficients, the global loss, the mean vector of the global ASR model, and the set of principal directions of the global ASR model (“In some examples, each of the one or more teacher ASR models and the multi-lingual student ASR models include a recurrent neural network-transducer (RNN-T) architecture. In these examples, the tunable distillation loss weight may include a decreasing function based on an RNN-T loss corresponding to the one or more teacher ASR models,” Leal et al. para [0010] and “The dimensionality of the speaker data may be reduced using, for example, principal component analysis (PCA) prior to applying the speaker data as input to the DNN,” Tang et al. para [0004]). Leal et al. teaches a remote teacher ASR model that is configured to generate output, as well as a loss function based on the RNN-T loss corresponding to remote ASR models. Tang et al. teaches performing PCA, which results in mean vectors and a set of principal directions. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include both the remote ASR model that produces global output with the mean vectors and set of principal directions generated by PCA to provide a more representative loss value in order to then update the global model. It also would have been obvious to include both the tunable distillation loss function with the mean vectors and set of principal directions generated by PCA to provide additional, more representative loss values in order to then update the associated global model.

Regarding claim 13, Beaufays et al., as modified by Leal et al. and Tang et al., does not disclose a method wherein the global ASR model is a recurrent neural network transformer and wherein the client ASR model is an additional RNN-T. Leal et al. teaches a knowledge distillation method for automatic speech recognition (ASR) systems. Leal et al. teaches the method of claim 10, wherein the global ASR model is a recurrent neural network transformer ("RNN-T") and wherein the client ASR model is an additional RNN-T (“In some examples, each of the one or more teacher ASR models and the multi-lingual student ASR models include a recurrent neural network-transducer (RNN-T) architecture,” Leal et al. para [0006]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include recurrent neural network transducers (RNN-Ts). RNN-Ts allow for the continuous processing of inputs while simultaneously streaming outputs, which one of ordinary skill in the art could add to Beaufays’s disclosure to allow for better real-time communication.

Regarding claim 15, Beaufays et al., as modified by Leal et al. and Tang et al., discloses the method of claim 10, wherein the client ASR model is stored locally at a client device and wherein the global ASR model is stored at a server remote from the client device (“Each of the additional gradients are generated locally at the corresponding one of the plurality of client devices based on: unsupervised learning, at the corresponding one of the plurality of client devices, that is based on processing, using the respective local machine learning model stored locally on the corresponding one of the plurality of client devices, additional audio data that captures an additional spoken utterance, and the respective local machine learning model includes the respective portion used in generating the encoding of the additional audio data,” Beaufays et al. para [0093] and “The method further includes further updating, based on the received additional gradients, weights of the global machine learning model layers stored remotely at the remote system,” Beaufays et al. para [0093]).

As to claim 16, system claim 16 and method claim 1 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 16 is similarly rejected under the same rationale as applied above with respect to the method claim.

As to claim 17, system claim 17 and method claim 2 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 17 is similarly rejected under the same rationale as applied above with respect to the method claim.

As to claim 19, system claim 19 and method claim 4 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 19 is similarly rejected under the same rationale as applied above with respect to the method claim.

Claims 3, 12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over WO 2022019885 A1, hereinafter referred to as Beaufays et al., in view of US 20220309340 A1, hereinafter referred to as Leal et al., further in view of US 20150161994 A1, hereinafter referred to as Tang et al., and further in view of US 20220383887 A1, hereinafter referred to as Wang et al.

Regarding claim 3, neither Beaufays et al., Leal et al., nor Tang et al. discloses freezing a joint network of a client ASR model and freezing a prediction network of the client ASR model. Wang et al. teaches a system and method for generating and operating a speech enhancement model. Wang et al. teaches the method of claim 2, wherein updating the one or more portions of the client encoder based on comparing the global text representation of the spoken utterance and the client text representation of the spoken utterance comprises: freezing the joint network of the client ASR model and freezing the prediction network of the client ASR model (“The updating engine 154 is also configured to freeze a set of internal layers of the automatic speech recognition model prior to updating the speech enhancement model,” Wang et al. para [0047]). Freezing internal layers (such as the joint network and prediction network) of the client ASR model would keep the parameters of such layers from being updated during the comparison of the global and client text representation of the spoken utterance. This prevents overfitting of the data and improves model stability. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Wang’s disclosure of freezing ASR layers with Beaufays’s remote and client ASR models.
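
For illustration only, freezing the joint network and the prediction network so that only the client encoder is updated can be expressed by disabling gradients on those submodules; the toy model below is an assumption made for illustration.

```python
# Hypothetical sketch: freeze the joint and prediction networks; train only the encoder.
import torch
import torch.nn as nn

class TinyClientASR(nn.Module):
    def __init__(self, feat_dim=80, vocab=128, dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, dim, batch_first=True)
        self.prediction = nn.LSTM(dim, dim, batch_first=True)
        self.joint = nn.Linear(2 * dim, vocab)

model = TinyClientASR()
for module in (model.joint, model.prediction):   # frozen layers
    for p in module.parameters():
        p.requires_grad_(False)

# Only encoder parameters remain trainable for the distillation update.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3)
```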

Regarding claim 12, Beaufays et al., as modified by Leal et al., Tang et al., and Wang et al., discloses the method of claim 10, wherein updating the one or more portions of the client encoder based on comparing the global text representation of the spoken utterance and the client text representation of the spoken utterance comprises: freezing the joint network of the client ASR model and freezing the prediction network of the client ASR model (“The updating engine 154 is also configured to freeze a set of internal layers of the automatic speech recognition model prior to updating the speech enhancement model,” Wang et al. para [0047]). Freezing internal layers (such as the joint network and prediction network) of the client ASR model would keep the parameters of such layers from being updated during the comparison of the global and client text representation of the spoken utterance. This prevents overfitting of the data and improves model stability. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Wang’s disclosure of freezing ASR layers with Beaufays’s remote and client ASR models.

As to claim 18, system claim 18 and method claim 3 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 18 is similarly rejected under the same rationale as applied above with respect to the method claim.

Claims 5, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over WO 2022019885 A1, hereinafter referred to as Beaufays et al., in view of US 20220309340 A1, hereinafter referred to as Leal et al., further in view of US 20150161994 A1, hereinafter referred to as Tang et al., further in view of US 20220383887 A1, hereinafter referred to as Wang et al., and further in view of “Bregman Divergence-Based Regularization for Transfer Subspace Learning,” hereinafter referred to as Si Si et al.

Regarding claim 5, neither Beaufays et al., Leal et al., Tang et al., nor Wang et al. discloses processing the set of training instances using Bregman PCA. Si Si et al. teaches using Bregman divergence between the distribution of training and testing samples in order to boost performance of popular subspace learning algorithms. Si Si et al. teaches the method of claim 1, wherein processing the set of training instances using PCA to generate (a) the mean vector for the set of training instances and (b) the set of principal directions for the set of training instances comprises: processing the set of training instances using Bregman PCA to generate (a) the mean vector for the set of training instances and (b) the set of principal directions for the set of training instances (“…, the new regularization minimizes Bregman divergence between the distribution of training samples and that of testing samples in the selected subspace, so it boosts the performance when training and testing samples are not independent and identically distributed. To test the effectiveness of the proposed regularization, we introduce it to popular subspace learning algorithms, e.g., principal components analysis (PCA) for cross-domain face modeling…,” Si Si et al. pg. 1 Abstract). Traditional PCA utilizes a squared loss function to minimize distance between data points. Bregman divergences, or distances, perform better with data that is not real-valued. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Bregman PCA in lieu of traditional PCA, as this would allow for a more accurate loss representation between the client ASR model and the global ASR model.
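
For illustration only, a Bregman divergence has the general form D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>; choosing phi(x) = ||x||^2 recovers the squared Euclidean loss implicit in ordinary PCA, while other generators give alternatives such as generalized KL. The sketch below shows only this general form and is not the cited regularization.

```python
# Hypothetical sketch: a generic Bregman divergence with two example generators.
import numpy as np

def bregman(x, y, phi, grad_phi):
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(v) = ||v||^2  ->  D(x, y) = ||x - y||^2 (ordinary PCA's squared-error loss)
sq, grad_sq = lambda v: float(np.dot(v, v)), lambda v: 2.0 * v

# phi(v) = sum v_i log v_i  ->  generalized KL divergence (requires x, y > 0)
negent, grad_negent = lambda v: float(np.sum(v * np.log(v))), lambda v: np.log(v) + 1.0

x, y = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(bregman(x, y, sq, grad_sq))          # equals ||x - y||^2
print(bregman(x, y, negent, grad_negent))  # generalized KL(x || y)
```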

Regarding claim 14, neither Beaufays et al., Leal et al., Tang et al., nor Wang et al. discloses generating the mean vector of the global ASR model and the set of principal directions of the global ASR model using Bregman PCA. Si Si et al. teaches the method of claim 10, wherein the mean vector of the global ASR model and the set of principal directions of the global ASR model are generated using Bregman PCA (“…, the new regularization minimizes Bregman divergence between the distribution of training samples and that of testing samples in the selected subspace, so it boosts the performance when training and testing samples are not independent and identically distributed. To test the effectiveness of the proposed regularization, we introduce it to popular subspace learning algorithms, e.g., principal components analysis (PCA) for cross-domain face modeling…,” Si Si et al. pg. 1 Abstract). Traditional PCA utilizes a squared loss function to minimize distance between data points. Bregman divergences, or distances, perform better with data that is not real-valued. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Bregman PCA in lieu of traditional PCA, as this would allow for a more accurate loss representation between the client ASR model and the global ASR model.

As to claim 20, system claim 20 and method claim 5 are related as method and system of using same, with each claimed element’s function corresponding to the method step. Accordingly claim 20 is similarly rejected under the same rationale as applied above with respect to the method claim.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ADAM MICHAEL WEAVER whose telephone number is (571)272-7062. The examiner can normally be reached Monday-Friday, 8AM-5PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Joshua Schwartz can be reached on 571-270-7494. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ADAM MICHAEL WEAVER/Examiner, Art Unit 4167                                                                                                                                                                                                         



/ANNE L THOMAS-HOMESCU/Primary Examiner, Art Unit 2656                                                                                                                                                                                                        