
Patent Application 16695629 - TRAINING APPROACH DETERMINATION FOR LARGE DEEP LEARNING MODELS - Rejection

From WikiPatents


Title: TRAINING APPROACH DETERMINATION FOR LARGE DEEP LEARNING MODELS

Application Information

  • Invention Title: TRAINING APPROACH DETERMINATION FOR LARGE DEEP LEARNING MODELS
  • Application Number: 16695629
  • Submission Date: 2025-05-21
  • Effective Filing Date: 2019-11-26
  • Filing Date: 2019-11-26
  • National Class: 706
  • National Sub-Class: 025000
  • Examiner Employee Number: 97442
  • Art Unit: 3686
  • Tech Center: 3600

Rejection Summary

  • 102 Rejections: 0
  • 103 Rejections: 2

Cited Patents

No patents were cited in this rejection.

Office Action Text


    DETAILED ACTION
This is responsive to amendments filed on 02/28/2025 in which claims 1-20 are presented for examination; claims 1, 10, and 15 have been amended.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4 and 6-20 are rejected under 35 U.S.C. 103 as being unpatentable over Liberty in view of Shoeybi et al. ("MEGATRON-LM: TRAINING MULTI-BILLION PARAMETER LANGUAGE MODELS USING MODEL PARALLELISM", September 17, 2019), in view of Venkatraman et al. (US 20190370009 A1), and further in view of Song et al. ("Retraining Strategy-Based Domain Adaption Network for Intelligent Fault Diagnosis", 31 October 2019).

Regarding claim 1, Liberty teaches a computer-implemented method (see, Fig. 1) comprising: 
creating, by one or more computer processors, a large model predictor trained with a rule-based database containing one or more sets of chain-based rules dictating a respective training approach (Fig. 5:

    [media_image1.png: Liberty, Fig. 5 (greyscale figure reproduced from the reference)]

Note: instant specification, paragraph 0022 discloses "Program 150 trains a large model predictor (step 202). In an embodiment, program 150 creates LMP 154 (a large model predictor) as a rule-based database containing one or more sets of chain-based rules (e.g., configuration and training approach pairs) dictating which features and categories result in an optimal training approach." Note: here, the chain-based rules are configuration and training approach pairs. As can be seen in Fig. 5 above, the type of model is identified in the request, different configurations for that model are tried, and one or more model and configuration pairs are recommended based on performance. Thus, we have model and configuration pairs that can be used to execute the ML training job (step 525). Also, note that there can be multiple model and configuration pairs, and they are stored in a database (see Fig. 3, ML training job metadata store 120)),
wherein the chain-based rules comprise configuration and training approach pairs, wherein the large model predictor is a neural network; (Fig. 3:

    [media_image2.png: Liberty, Fig. 3 (greyscale figure reproduced from the reference)]

Note: in Fig. 3 above, we can see multiple configurations (model size, instance type, number of instances, etc.) and training approaches (algorithm type). Also, note that the training approach can also be the determination of the number of instances to be used, as that will result in approaches such as parallelism, large model support, etc.
Also, col,**** line ****  (“(45)The job type characteristics 328 include a model size 302 and an algorithm type 304. The model size 302 can indicate how the large of a model is to be used for training (e.g., a size of ML training logic, a size of a container encapsulating ML training logic, etc.). In some embodiments, this value is provided explicitly by a user (e.g., in a drop-down user interface (UI) element allowing the user to select between a variety of choices, and/or provided in an API call), and in some embodiments the value is inferred. In some embodiments, the algorithm type 304 identifies a particular type of ML algorithm to be trained, e.g., a linear classifier, neural network (e.g., convolutional neural network (CNN)), a particular Apache MXNet algorithm, an XGBoost classifier, a support vector machine, etc. Similarly, in some embodiments, this value is provided explicitly by a user and in some embodiments the value is inferred.”); 
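
Illustrative sketch (editorial aid only, not evidence of record; all names and values are hypothetical): the reading above of "chain-based rules" as configuration and training approach pairs kept in a database could be expressed as a simple rule lookup, for example:

    # Hypothetical sketch of a rule-based "large model predictor": an ordered
    # chain of (predicate over configuration features) -> training approach pairs.
    RULES = [
        (lambda f: f["model_size_gb"] > f["gpu_memory_gb"], "model_parallelism"),
        (lambda f: f["num_gpus"] > 1, "data_parallelism"),
        (lambda f: True, "single_gpu_training"),  # default rule at the end of the chain
    ]

    def recommend_training_approach(features: dict) -> str:
        """Walk the chain of rules and return the first matching approach."""
        for predicate, approach in RULES:
            if predicate(features):
                return approach
        return "single_gpu_training"

    # Example: a 40 GB model on 16 GB GPUs is routed to model parallelism.
    print(recommend_training_approach(
        {"model_size_gb": 40, "gpu_memory_gb": 16, "num_gpus": 4}))
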
monitoring, by one or more computer processors, a model training service for a training of a deep learning model (Col 20, lines 36-44, “In some embodiments, the model training system 124 includes a ML model evaluator 1028. The ML model evaluator 1028 can monitor virtual machine instances 1022 as machine learning models are being trained, obtaining the generated model data and processing the obtained model data to generate model metrics.” Also, see Fig. 5));
responsive to the training of the deep learning model, identifying, by one or more computer processors, one or more model characteristics associated with the deep learning model (Fig. 5 teaches that all of the identification of configurations takes place in response to the ML model to be trained (step 505).
Liberty, Col 18, lines 59-67, Col 19, lines 1-10, “In some embodiments, executing the executable instructions causes the virtual machine instance 1022 (e.g., the ML training container 1030) to generate model data. For example, the ML training container 1030 generates model data and stores the model data in a file system of the ML training container 1030. The model data includes characteristics of the machine learning model being trained, such as a number of layers in the machine learning model, hyperparameters of the machine learning model, coefficients of the machine learning model, weights of the machine learning model, and/or the like. In particular, the generated model data includes values for the characteristics that define a machine learning model being trained. In some embodiments, executing the executable instructions causes a modification to the ML training container 1030 such that the model data is written to the top container layer of the ML training container 1030 and/or the container image(s) that forms a portion of the ML training container 1030 is modified to include the model data.”  
Note: here, model characteristics associated with the learning model are generated (identified), such as the number of layers);
identifying, by one or more computer processors, one or more system configurations associated with a system training the deep learning model (Liberty, Col 2, lines 50-67, “In some embodiments, characteristics of a desired machine learning (ML) training job can be automatically determined and used to provide configuration recommendations to a user and/or select optimal configurations for the ML training job according to some desired performance characteristic. In some embodiments, at least a portion of a ML training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric can be measured for each of the plurality of the executions, and used (at least in part) with a desired performance characteristic to generate one or more recommended resource configurations for the ML training job. The one or more recommended resource configurations can be provided to the user as recommendations, one of which could be selected by the user to be utilized for the training job. Alternatively or additionally, the ML training job can be executed using one of the one or more recommended resource configurations.” Note:  here, different system configuration are tried, and optimal configuration for the learning model is determined.);
 determining, by one or more computer processors, the training approach for the deep learning model utilizing the trained large model predictor fed with the one or more identified model characteristics and the one or more identified system configurations comprising:(Liberty, Col 6, lines 57-67, Col 7, lines 1-3, “In some embodiments, some or all of these techniques can be performed ahead of time for particular ML algorithms (e.g., containers including ML code) that may be used by multiple users of the ML service 118, so that suggested configurations and execution characteristics can be presented when a user wishes to use one of these algorithms.
In some embodiments, some or all of these techniques can be performed responsive to a user request involving a particular algorithm (e.g., a container, which may be a customer container provided/created by the customer). Thus, before the user uses that container, or perhaps after the user uses that container for an ML task, suggested configurations and execution characteristics can be generated and presented to the user."  Note: here, different techniques (approaches) can be performed so the suggested configurations and execution characteristics can be presented to the user. Also note that when one is trying different configurations to select the optimal configuration, one is using different approaches to determine the optimal approach; also, these configurations are specific to model characteristics.);
verifying that the training approach conforms with the one or more identified  system specifications (Col 11 , line 50, “Thus, in some embodiments, the training control system 122 can monitor the execution and progress of the current utilized resources (e.g., the small VM), determine that the performance/progress of the ML job is not satisfactory (e.g., is greater than or less than some defined threshold), and add additional resources such as additional virtual machines (to execute additional containers for the ML task) to “scale up” the task…” Note: here the training approach is being verified to see if it conforms with system specification (available resources), as if the training approach does not conform, then additional resources can be added.),
and training, by one or more computer processors, the deep learning model utilizing the verified training approach (Liberty, Col 2, lines 50-67, “In some embodiments, characteristics of a desired machine learning (ML) training job can be automatically determined and used to provide configuration recommendations to a user and/or select optimal configurations for the ML training job according to some desired performance characteristic. In some embodiments, at least a portion of a ML training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric can be measured for each of the plurality of the executions, and used (at least in part) with a desired performance characteristic to generate one or more recommended resource configurations for the ML training job. The one or more recommended resource configurations can be provided to the user as recommendations, one of which could be selected by the user to be utilized for the training job. Alternatively or additionally, the ML training job can be executed using one of the one or more recommended resource configurations.” Note:  here, training job is executed by using one or more recommended resource configuration. Also see Fig. 3, that teaches ML training that utilizes different training approaches).
Liberty does not explicitly teach:
responsive to insufficient graphics processing unit (GPU) memory based on a determined largest tensor size for a layer, requiring, by one or more computer processors, model parallelism;
restricting, by one or more computer processors, one or more other applications from utilizing the GPU memory until a training of the deep learning model is complete;
logging, by one or more computer processors, results associated with the trained deep learning model into the rule-based database;
and retraining, by one or more computer processors, the large model predictor with the logged results.
Shoeybi teaches:
responsive to insufficient graphics processing unit (GPU) memory based on a determined largest tensor size for a layer, requiring, by one or more computer processors, model parallelism (Abstract, "However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors."
Pg. 3, section 2.3, “By increasing the minibatch size proportionally to the number of available workers, one observes near linear scaling in training data throughput.”)
It would have been obvious for a person of ordinary skill in the art to apply the model parallelism teaching of Shoeybi into the teachings of Liberty at the time the application was filed in order to train large models that would otherwise be constrained by available resources such as memory (Abstract, "However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors.")
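
Illustrative sketch (editorial aid only; names and sizes are hypothetical): the claimed trigger, requiring model parallelism when the largest per-layer tensor exceeds available GPU memory, could be checked as follows:

    # Hypothetical sketch: flag model parallelism when the largest per-layer
    # tensor would not fit in the free GPU memory.
    def largest_tensor_bytes(layer_shapes, bytes_per_element=4):
        """Size in bytes of the largest layer tensor (fp32 by default)."""
        sizes = []
        for shape in layer_shapes:
            n = 1
            for dim in shape:
                n *= dim
            sizes.append(n * bytes_per_element)
        return max(sizes)

    def requires_model_parallelism(layer_shapes, free_gpu_memory_bytes):
        return largest_tensor_bytes(layer_shapes) > free_gpu_memory_bytes

    # Example: a 50,000 x 8,192 fp32 tensor (about 1.5 GiB) against 1 GiB of free memory.
    print(requires_model_parallelism([(50_000, 8_192), (8_192, 8_192)], 1 * 1024**3))  # True
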
Liberty as modified by Shoeybi does not explicitly teach:
restricting, by one or more computer processors, one or more other applications from utilizing the GPU memory until a training of the deep learning model is complete.
logging, by one or more computer processors, results associated with the trained deep learning model into the rule-based database;
and retraining, by one or more computer processors, the large model predictor with the logged results.
Venkatraman teaches restricting, by one or more computer processors, one or more other applications from utilizing the GPU memory until a training of the deep learning model is complete (Para 0051, “…Alternatively, the application can be terminated, and the allocated memory of the application can be reclaimed. Swapping or terminating idle, suspended, and background applications is performed at block 630 to avoid the need to terminate one or more active foreground applications at block 640 if logic 600 determines at block 635 that the available memory remains low. The determination of whether to swap or terminate an application can be made based on a variety of factors.”
Also, para “[0066] As shown at block 904, after some period of time an application launcher can receive an indication to launch the swapped application. Upon receipt of such indication, the process 900 includes to copy compressed application memory and state from non-volatile storage to system memory, as shown at block 906. The process 900 additionally includes to resume execution of the application based on the stored application state, as shown at block 908.”)
It would have been obvious for a person of ordinary skill in the art to apply memory management teaching of Venkatraman into the teachings of Liberty as modified by Shoeybi at the time the application was filed in order to manage the memory usage (“[0038] In one embodiment, the operating environment 201 includes a memory usage management module 215 to manage memory usage of the operating environment 201. The memory usage management module 215 can perform various operations to increase the amount of physical memory available to executing applications based on the current state of system memory usage. The various operations can include terminating applications to reclaim memory space and, according to embodiments described herein, swap applications and application memory to the mass storage devices 221…”)
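
Illustrative sketch (editorial aid only; the reservation mechanism below is hypothetical and not drawn from any reference): restricting other applications from the GPU memory until training completes could be modeled as an exclusive reservation held for the duration of the training run:

    # Hypothetical sketch: hold an exclusive "GPU memory" reservation while a
    # training run is in progress so cooperating workloads cannot claim it.
    import threading
    from contextlib import contextmanager

    _gpu_memory_lock = threading.Lock()

    @contextmanager
    def exclusive_gpu_memory():
        """Block other (cooperating) workloads from the GPU until training ends."""
        _gpu_memory_lock.acquire()
        try:
            yield
        finally:
            _gpu_memory_lock.release()

    def train_model():
        with exclusive_gpu_memory():
            pass  # the training loop would run here; the reservation is released on completion
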
Liberty as modified by Shoeybi and Venkatraman does not explicitly teach:
logging, by one or more computer processors, results associated with the trained deep learning model into the rule-based database;
and retraining, by one or more computer processors, the large model predictor with the logged results.
Song teaches:
logging, by one or more computer processors, results associated with the trained deep learning model into the rule-based database (Pg. 6166-6167, Tables V and VI show logging of results associated with the model);
and retraining, by one or more computer processors, the large model predictor with the logged results (Pg. 6165, “After the training of DAN, we can obtain the predicted results of the testing data. The results are noisy but informative, which means that they can be used to optimize DAN model. Therefore, we try to use the predicted results to retrain the DAN (DAN-R) network.”)
It would have been obvious for a person of ordinary skill in the art to apply the retraining-with-predicted-results teaching of Song into the teachings of Liberty as modified by Shoeybi and Venkatraman at the time the application was filed in order to optimize the model ("After the training of DAN, we can obtain the predicted results of the testing data. The results are noisy but informative, which means that they can be used to optimize DAN model.")
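
Illustrative sketch (editorial aid only; hypothetical names, not the claimed implementation): logging results into the rule-based database and retraining the predictor with the logged results could look like:

    # Hypothetical sketch: log training outcomes and fold them back into the
    # predictor's rule store so later recommendations reflect observed results.
    results_log = []  # stands in for the rule-based database

    def log_result(model_characteristics, system_config, approach, metrics):
        results_log.append({
            "characteristics": model_characteristics,
            "configuration": system_config,
            "approach": approach,
            "metrics": metrics,
        })

    def retrain_predictor():
        """For each configuration key, keep the approach with the best logged metric."""
        best = {}
        for entry in results_log:
            key = (entry["configuration"]["num_gpus"],
                   entry["characteristics"]["model_size_gb"])
            if key not in best or (entry["metrics"]["throughput"]
                                   > best[key]["metrics"]["throughput"]):
                best[key] = entry
        return {key: entry["approach"] for key, entry in best.items()}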

Examiner Note: The examiner has addressed the claim language to ensure compact prosecution; however, as claimed, the entire claim language after "responsive to insufficient…." is contingent claim language and does not have patentable weight. For example, if the memory for the tensor size is sufficient, then the remaining steps need not be performed. Also, the amended limitations are very broad, as even learning will read on using the results to retrain, as the training is done based on the results.

Regarding claim 2, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the method of claim 1.
Liberty further teaches wherein determining the training approach for the deep learning model utilizing the trained large model predictor fed with the one or more identified model characteristics and the one or more identified system configurations, comprises:
generating, by one or more computer processors, a plurality of probabilities associated with one or more respective training approaches utilizing the trained large model predictor (Liberty, Fig. 5:

    [media_image3.png: Liberty, Fig. 5 (greyscale figure reproduced from the reference)]

Note: here, a performance metric is measured for each of the plurality of executions (step 515); the performance metric is interpreted to be the probability associated with the respective training approach, as it tells how each execution (an approach based on a different configuration, see step 510) performs; probability is how likely something is to happen, and in the instant case performance tells the likelihood of how accurate the model is (see Col 10, lines 44-35));
ranking, by one or more computer processors, one or more training approaches based on respective generated probability (Liberty, Col 10, lines 56-65, "Optionally, the flow of operations 500 can continue via path 550 to block 535, and executing the ML training job using one of the one or more recommended resource configurations. This block may be performed by selecting one of the one or more recommended resource configurations, which may occur according to a preference (e.g., indicated in the request of optional block 505) indicating a desired performance characteristic (e.g., a fastest training time, a lowest cost)." Note: here, the recommended approaches are selected based on ranking, such as the model with the lowest cost.);
and automatically selecting, by one or more computer processors, a highest ranked training approach (Liberty, Col 2, lines 50-67, "In some embodiments, characteristics of a desired machine learning (ML) training job can be automatically determined and used to provide configuration recommendations to a user and/or select optimal configurations for the ML training job according to some desired performance characteristic. In some embodiments, at least a portion of a ML training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric can be measured for each of the plurality of the executions, and used (at least in part) with a desired performance characteristic to generate one or more recommended resource configurations for the ML training job. The one or more recommended resource configurations can be provided to the user as recommendations, one of which could be selected by the user to be utilized for the training job. Alternatively or additionally, the ML training job can be executed using one of the one or more recommended resource configurations." Note: the above citation teaches that the optimal configuration (highest ranked approach) can be automatically determined and provided to the user; alternatively, the training job can be executed.).
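
Illustrative sketch (editorial aid only; hypothetical scores): generating a score per candidate approach, ranking the approaches, and automatically selecting the highest ranked one could be expressed as:

    # Hypothetical sketch: rank candidate training approaches by a
    # probability-like score and select the top-ranked one automatically.
    def select_training_approach(scores: dict) -> str:
        """scores maps approach name -> score in [0, 1]."""
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[0][0]

    print(select_training_approach(
        {"model_parallelism": 0.72, "data_parallelism": 0.55, "gradient_checkpointing": 0.31}))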

Regarding claim 3, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the method of claim 1.
 Liberty further teaches further comprising: 
responsive to a training completion, deploying, by one or more computer processors, the trained deep learning model to one or more environments (Liberty, Col 21, lines 49-64, “(109) As described below, in some embodiments, the model data stored in the training model data store 1075 is used by the model hosting system 1040 to deploy machine learning models. Alternatively or in addition, a user device 1002 or another computing device (not shown) can retrieve the model data from the training model data store 1075 to implement a learning algorithm in an external device. As an illustrative example, a robotic device can include sensors to capture input data. A user device 1002 can retrieve the model data from the training model data store 1075 and store the model data in the robotic device. The model data defines a machine learning model. Thus, the robotic device can provide the captured input data as an input to the machine learning model, resulting in an output. The robotic device can then perform an action (e.g., move forward, raise an arm, generate a sound, etc.) based on the resulting output.”)

Regarding claim 4, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the method of claim 1.
Liberty further teaches wherein training approaches are selected from the group consisting of: model parallelism, data parallelism, large model support, gradient checkpointing, large model supports with parallelism, gradient checkpointing with model parallelism, and utilizing host memory as swap space (Liberty, Col 15, lines 46-62, "The user devices 1002 can interact with the model training system 124 via frontend 1029 of the model training system 124. For example, a user device 1002 can provide a training request to the frontend 1029 that includes a container image (or multiple container images, or an identifier of one or multiple locations where container images are stored), an indicator of input data (e.g., an address or location of input data), one or more hyperparameter values (e.g., values indicating how the algorithm will operate, how many algorithms to run in parallel[model parallelism], how many clusters into which to separate data, etc.), and/or information describing the computing machine on which to train a machine learning model (e.g., a graphical processing unit (GPU) instance type, a central processing unit (CPU) instance type, an amount of memory to allocate, a type of virtual machine instance to use for training, etc.)."
Liberty, Col 19, lines 53-67, Col 20, lines 1-5, “In some embodiments, a plurality of virtual machine instances 1022 execute code 1036 stored in a plurality of ML training containers 1030. For example, the resources used to train a particular machine learning model can exceed the limitations of a single virtual machine instance 1022. However, the algorithm included in the container image can be in a format that allows for the parallelization of the training process. Thus, the model training system 124 can create multiple copies of the container image provided in a training request, initialize multiple virtual machine instances 1022, and cause each virtual machine instance 1022 to load a container image copy in one or more separate ML training containers 1030 [data parallelism]. The virtual machine instances 1022 can then each execute the code 1036 stored in the ML training containers 1030 in parallel. The model training system 124 can further provide configuration information to each ML training container 1030 via the virtual machine instances 1022 (e.g., information indicating that N ML training containers 1030 are collectively training a machine learning model and that a particular ML training container 1030 receiving the configuration information is ML training container 1030 number X of N, information indicating that M virtual machine instances 1022 are collectively training a machine learning model and that a particular ML training container 1030 receiving the configuration information is initialized in virtual machine instance 1022 number Y of M, etc.), which can be included in the resulting model data. As described above, by parallelizing the training process, the model training system 124 can significantly reduce the training time in some embodiments.”)

Regarding claim 6, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the method of claim 1.
Liberty further teaches wherein model characteristics are model information selected from the group consisting of: a number of neurons, a number of layers, a tensor size, a number of activations, a parameter size, trainable parameters, and non-trainable parameters; model execution information regarding a CPU utilization, a GPU utilization, a GPU memory utilization, a CPU memory utilization, and a number of spawned CPU processes; model considerations regarding a time per iteration, a CPU-GPU communication time, a GPU compute time, a CPU time utilization, scaling efficiency for multiple GPUs, and a network latency; model convergence information regarding hyperparameters, a batch size, training samples, evaluation samples, a loss function, optimizer, a learning rate, and momentum; and data configuration containing information regarding a dataset size and a data processing time (Liberty, Col 4, lines 1-19, “In some embodiments, a best “hardware” setup (e.g., in terms of virtual machines and/or the underlying hosts) can be determined for a ML training job by a resource analysis engine 108 of a training configuration system 106, e.g., based on parameters of the ML training job, potentially before its initiation. Embodiments can predictively determine how much time and energy would be required for a particular job. In some embodiments, additional machine learning jobs (or similar statistical models) can be trained by the resource analysis engine 108 (e.g., a software module executed by one or more computing devices) to determine, for a particular ML training job having particular characteristics (e.g., using a type of data, for a particular type of algorithms, using particular types of machines, using particular numbers of machines), how much time the job may take for the ML training job to conclude[a data processing time], how much of cost is incurred, etc.”
Liberty, Col 5, lines 37-57, “For example, the ML job metadata store 120 can store some or all of a variety of types of information, such as the particular types and values of hyperparameters used in the modeling/learning task, features of the application area to which the model resulting from the desired ML job 116 pertains (such as natural language processing, imaging, risk modeling, demand forecasting, insurance, etc.), features of the training data (such as the size[information regarding a dataset size], number of rows, number of features, class-ratios, mean, standard deviation of features, etc.), features of the learning task (such as the type or methods of algorithm, e.g., classification, regression, clustering, linear discriminant analysis, logistic regression, support vector machines (SVM), neural networks), performance metrics (e.g., Area Under Curve (AUC), precision, recall, root mean square error (RMSE), accuracy, log-likelihood, training time, time for one epoch, hardware cost, etc.). In some embodiments, the ML job metadata store 120 can store data characteristics such as the size and/or type of data attributes, the number of data attributes, attribute-wise statistics (e.g., mean, percentiles, etc.). Some examples data collected and stored by ML job metadata store 120 are shown later herein with regard to FIG. 3.”)
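
Illustrative sketch (editorial aid only; field names are hypothetical): the model characteristics and system configuration recited in claim 6 could be grouped into structured records that a predictor consumes, for example:

    # Hypothetical sketch: structured records for the kinds of features claim 6 recites.
    from dataclasses import dataclass

    @dataclass
    class ModelCharacteristics:
        num_layers: int
        num_neurons: int
        largest_tensor_size: int
        parameter_size_gb: float
        batch_size: int
        dataset_size_gb: float

    @dataclass
    class SystemConfiguration:
        num_gpus: int
        gpu_memory_gb: float
        cpu_memory_gb: float
        num_cpu_processes: int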

Regarding claim 7, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the method of claim 1.
Liberty further teaches wherein the large model predictor is a neural network.(Liberty,  Col 8, lines 55-67, Col 9, lines 1-3, “The job type characteristics 328 include a model size 302 and an algorithm type 304. The model size 302 can indicate how the large of a model is to be used for training (e.g., a size of ML training logic, a size of a container encapsulating ML training logic, etc.). In some embodiments, this value is provided explicitly by a user (e.g., in a drop-down user interface (UI) element allowing the user to select between a variety of choices, and/or provided in an API call), and in some embodiments the value is inferred. In some embodiments, the algorithm type 304 identifies a particular type of ML algorithm to be trained, e.g., a linear classifier, neural network (e.g., convolutional neural network (CNN)), a particular Apache MXNet algorithm, an XGBoost classifier, a support vector machine, etc. Similarly, in some embodiments, this value is provided explicitly by a user and in some embodiments the value is inferred.”)

Regarding claim 8, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the method of claim 7.
Liberty further teaches wherein the neural network is trained utilizing historical model characteristics, historical system configurations, and associated training approach labels (Liberty, Col 7, lines 13-39, “For example, in some embodiments, an “analyze performance of job” routine can be utilized in the ML service 118 (e.g., responsive to a request submitted by an electronic device of a user) that accepts a particular ML algorithm and training data (identified and/or provided by a user request), and the resource analysis engine 108 can run experiments (e.g., using different combinations of resource configurations selected by variant exploration engine 109, analysis of other similar jobs previously run by the model training system 124, etc.) to determine how the ML training job could be run and what results would occur under these different configurations. As one example, a particular job could be run with a single compute instance of type ‘A’, with two compute instances of type ‘A’, and/or with four compute instances of type ‘A’, to determine whether (and how much) the job “scales” well, enabling the resource analysis engine 108 to make fairly accurate predictions about the performance of using different numbers of compute instances to perform the ML training job. As another example, a particular job could be run with variants utilizing different types of compute instances (having different resource characteristics) to determine how differing resource types/amounts (e.g., random access memory (RAM), available bandwidth, accelerators, etc.) affects the job, again enabling the resource analysis engine 108 to make fairly accurate predictions about the performance resulting from use of different types of compute instances to perform the ML training job.”  
Liberty, Col 20, lines 43-47, Col 21, lines 1-3 “In some embodiments, the model training system 124 includes a ML model evaluator 1028. The ML model evaluator 1028 can monitor virtual machine instances 1022 as machine learning models are being trained, obtaining the generated model data and processing the obtained model data to generate model metrics. For example, the model metrics can include quality metrics, such as an error rate of the machine learning model being trained, a statistical distribution of the machine learning model being trained, a latency of the machine learning model being trained, a confidence level of the machine learning model being trained (e.g., a level of confidence that the accuracy of the machine learning model being trained is known, etc. The ML model evaluator 1028 can obtain the model data for a machine learning model being trained and evaluation data from the training data store 1060. The evaluation data is separate from the data used to train a machine learning model and includes both input data and expected outputs (e.g., known results), and thus the ML model evaluator 1028 can define a machine learning model using the model data and execute the machine learning model by providing the input data as inputs to the machine learning model. The ML model evaluator 1028 can then compare the outputs of the machine learning model to the expected outputs, and determine one or more quality metrics of the machine learning model being trained based on the comparison (e.g., the error rate can be a difference or distance between the machine learning model outputs and the expected outputs).”
Note: here, the engine is running experiments using analysis of other similar jobs previously run (historical) by the model training system. Note that the training uses different configurations and model characteristics to recommend an optimal approach, and the citation above teaches that the system can rely on similar jobs previously performed. In addition, it teaches associated training approach labels (known results).)
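
Illustrative sketch (editorial aid only; the classifier, library, and data below are hypothetical, not from the record): training a neural network predictor on historical model characteristics, system configurations, and associated training approach labels could look like:

    # Hypothetical sketch: fit a small classifier on historical
    # (model characteristics, system configuration) -> training-approach labels.
    from sklearn.neural_network import MLPClassifier

    # Each row: [num_layers, parameter_size_gb, gpu_memory_gb, num_gpus]
    X_history = [
        [12, 0.5, 16, 1],
        [48, 20.0, 16, 4],
        [96, 80.0, 32, 8],
    ]
    y_labels = ["single_gpu_training", "model_parallelism", "model_parallelism"]

    predictor = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    predictor.fit(X_history, y_labels)

    print(predictor.predict([[24, 10.0, 16, 2]]))  # recommended approach for a new job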

Regarding claim 9, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the method of claim 1.
Liberty further teaches wherein determining the training approach for the deep learning model utilizing the trained large model predictor fed with the one or more identified model characteristics and the one or more identified system configurations, comprises: maintaining, by one or more computer processors, one or more sets of deep learning models wherein each set shares training sets, machine learning techniques, and deep learning structures but utilizes a distinct training approach.(Liberty, Fig. 3:

    [media_image4.png: Liberty, Fig. 3 (greyscale figure reproduced from the reference)]

Note: here, Fig. 3 shows different approaches that have deep learning models (algorithm types), training data (datasets), and machine learning techniques (different combinations); note that in each of these approaches a different, distinct combination is used; for example, different configurations on the same data (shared dataset) are used to see model accuracy)

Regarding claim 10, Liberty  teaches a computer program product comprising (Col 33, lines 4-15): one or more computer readable storage media and program instructions stored on the one or more computer readable storage media (Col 33, lines 4-15), the stored program instructions comprising: 
program instructions to create a large model predictor trained with a rule-based database containing one or more sets of chain-based rules dictating a respective training approach (Fig. 5:

    [media_image1.png: Liberty, Fig. 5 (greyscale figure reproduced from the reference)]

Note: instant specification, paragraph 0022 discloses "Program 150 trains a large model predictor (step 202). In an embodiment, program 150 creates LMP 154 (a large model predictor) as a rule-based database containing one or more sets of chain-based rules (e.g., configuration and training approach pairs) dictating which features and categories result in an optimal training approach." Note: here, the chain-based rules are configuration and training approach pairs. As can be seen in Fig. 5 above, the type of model is identified in the request, different configurations for that model are tried, and one or more model and configuration pairs are recommended based on performance. Thus, we have model and configuration pairs that can be used to execute the ML training job (step 525). Also, note that there can be multiple model and configuration pairs, and they are stored in a database (see Fig. 3, ML training job metadata store 120)),

wherein the chain-based rules comprise configuration and training approach pairs, wherein the large model predictor is a neural network (Fig. 3:

    [media_image2.png: Liberty, Fig. 3 (greyscale figure reproduced from the reference)]

Note: in Fig. 3 above, we can see multiple configurations (model size, instance type, number of instances, etc.) and training approaches (algorithm type). Also, note that the training approach can also be the determination of the number of instances to be used, as that will result in approaches such as parallelism, large model support, etc.);
program instructions to monitor a model training service for a training of a deep learning model(Col 20, lines 36-44, “In some embodiments, the model training system 124 includes a ML model evaluator 1028. The ML model evaluator 1028 can monitor virtual machine instances 1022 as machine learning models are being trained, obtaining the generated model data and processing the obtained model data to generate model metrics.” Also, see Fig. 5));
program instructions to, responsive to the training of the deep learning model, identify one or more model characteristics associated with the deep learning model (Fig. 5 teaches that all of the identification of configurations takes place in response to the ML model to be trained (step 505).
Liberty, Col 18, lines 59-67, Col 19, lines 1-10, “In some embodiments, executing the executable instructions causes the virtual machine instance 1022 (e.g., the ML training container 1030) to generate model data. For example, the ML training container 1030 generates model data and stores the model data in a file system of the ML training container 1030. The model data includes characteristics of the machine learning model being trained, such as a number of layers in the machine learning model, hyperparameters of the machine learning model, coefficients of the machine learning model, weights of the machine learning model, and/or the like. In particular, the generated model data includes values for the characteristics that define a machine learning model being trained. In some embodiments, executing the executable instructions causes a modification to the ML training container 1030 such that the model data is written to the top container layer of the ML training container 1030 and/or the container image(s) that forms a portion of the ML training container 1030 is modified to include the model data.”  Note: here model characteristic associated with learning model are generated (identified), such as number of layers);
program instructions to identify one or more system configurations associated with a system training the deep learning model((Liberty, Col 2, lines 50-67, “In some embodiments, characteristics of a desired machine learning (ML) training job can be automatically determined and used to provide configuration recommendations to a user and/or select optimal configurations for the ML training job according to some desired performance characteristic. In some embodiments, at least a portion of a ML training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric can be measured for each of the plurality of the executions, and used (at least in part) with a desired performance characteristic to generate one or more recommended resource configurations for the ML training job. The one or more recommended resource configurations can be provided to the user as recommendations, one of which could be selected by the user to be utilized for the training job. Alternatively or additionally, the ML training job can be executed using one of the one or more recommended resource configurations.” Note:  here, different system configuration are tried, and optimal configuration for the learning model is determined.);
program instructions to determine the training approach for the deep learning model utilizing the trained large model predictor fed with the one or more identified model characteristics and the one or more identified system configurations, wherein the program instructions further comprise: (Liberty, Col 6, lines 57-67, Col 7, lines 1-3, “In some embodiments, some or all of these techniques can be performed ahead of time for particular ML algorithms (e.g., containers including ML code) that may be used by multiple users of the ML service 118, so that suggested configurations and execution characteristics can be presented when a user wishes to use one of these algorithms.

In some embodiments, some or all of these techniques can be performed responsive to a user request involving a particular algorithm (e.g., a container, which may be a customer container provided/created by the customer). Thus, before the user uses that container, or perhaps after the user uses that container for an ML task, suggested configurations and execution characteristics can be generated and presented to the user."  Note: here, different techniques (approaches) can be performed so the suggested configurations and execution characteristics can be presented to the user. Also note that when one is trying different configurations to select the optimal configuration, one is using different approaches to determine the optimal approach; also, these configurations are specific to model characteristics);
program instructions to verify that the training approach conforms with system specifications(Col 11 , line 50 “Thus, in some embodiments, the training control system 122 can monitor the execution and progress of the current utilized resources (e.g., the small VM), determine that the performance/progress of the ML job is not satisfactory (e.g., is greater than or less than some defined threshold), and add additional resources such as additional virtual machines (to execute additional containers for the ML task) to “scale up” the task…” Note: here the training approach is being verified to see if it conforms with system specification (available resources), as if the training approach does not conform, then additional resources can be added.),
 program instructions to train the deep learning model utilizing the verified  training approach. (Liberty, Col 2, lines 50-67, “In some embodiments, characteristics of a desired machine learning (ML) training job can be automatically determined and used to provide configuration recommendations to a user and/or select optimal configurations for the ML training job according to some desired performance characteristic. In some embodiments, at least a portion of a ML training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric can be measured for each of the plurality of the executions, and used (at least in part) with a desired performance characteristic to generate one or more recommended resource configurations for the ML training job. The one or more recommended resource configurations can be provided to the user as recommendations, one of which could be selected by the user to be utilized for the training job. Alternatively or additionally, the ML training job can be executed using one of the one or more recommended resource configurations.” Note:  here, training job is executed by using one or more recommended resource configuration. Also see Fig. 3, that teaches ML training that utilizes different training approaches).

Liberty does not explicitly teach:	
wherein the program instructions further comprise:
program instructions to, responsive to insufficient graphics processing unit (GPU) memory based on a determined largest tensor size for a layer, require model parallelism;
program instructions to restrict one or more other applications from utilizing the GPU memory until a training of the deep learning model is complete;
program instructions to log results associated with the trained deep learning model into the rule-based database;
and program instructions to retrain the large model predictor with the logged results.

Shoeybi teaches:
program instructions to, responsive to insufficient graphics processing unit (GPU) memory based on a determined largest tensor size for a layer, require a higher batch size and model parallelism (Abstract, "However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors."
Pg. 3, section 2.3, “By increasing the minibatch size proportionally to the number of available workers, one observes near linear scaling in training data throughput.”)
It would have been obvious for a person of ordinary skill in the art to apply model parallelism teaching of Shoeybi into the teachings of Liberty at the time the application was filed in order to train the large models that can be constrained by available resources such as memory (Abstract, “However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors.”)
Liberty as modified by Shoeybi does not explicitly teach:
program instructions to restrict one or more other applications from utilizing the GPU memory until a training of the deep learning model is complete;
program instructions to log results associated with the trained deep learning model into the rule-based database;
and program instructions to retrain the large model predictor with the logged results.
Venkatraman teaches program instructions to restrict one or more other applications from utilizing the GPU memory until a training of the deep learning model is complete(Para 0051, “…Alternatively, the application can be terminated, and the allocated memory of the application can be reclaimed. Swapping or terminating idle, suspended, and background applications is performed at block 630 to avoid the need to terminate one or more active foreground applications at block 640 if logic 600 determines at block 635 that the available memory remains low. The determination of whether to swap or terminate an application can be made based on a variety of factors.”
“[0066] As shown at block 904, after some period of time an application launcher can receive an indication to launch the swapped application. Upon receipt of such indication, the process 900 includes to copy compressed application memory and state from non-volatile storage to system memory, as shown at block 906. The process 900 additionally includes to resume execution of the application based on the stored application state, as shown at block 908.”)
It would have been obvious for a person of ordinary skill in the art to apply memory management teaching of Venkatraman into the teachings of Liberty as modified by Shoeybi  at the time the application was filed in order to manage the memory usage (“[0038] In one embodiment, the operating environment 201 includes a memory usage management module 215 to manage memory usage of the operating environment 201. The memory usage management module 215 can perform various operations to increase the amount of physical memory available to executing applications based on the current state of system memory usage. The various operations can include terminating applications to reclaim memory space and, according to embodiments described herein, swap applications and application memory to the mass storage devices 221…”)

Liberty as modified by Shoeybi and Venkatraman does not explicitly teach:
program instructions to log results associated with the trained deep learning model into the rule-based database;
and program instructions to retrain the large model predictor with the logged results.
Song teaches:
program instructions to log results associated with the trained deep learning model into the rule-based database (Pg. 6166-6167, Tables V and VI show logging of results associated with the model);
and program instructions to retrain the large model predictor with the logged results(Pg. 6165, “After the training of DAN, we can obtain the predicted results of the testing data. The results are noisy but informative, which means that they can be used to optimize DAN model. Therefore, we try to use the predicted results to retrain the DAN (DAN-R) network.”)
It would have been obvious for a person of ordinary skill in the art to apply the retraining-with-predicted-results teaching of Song into the teachings of Liberty as modified by Shoeybi and Venkatraman at the time the application was filed in order to optimize the model ("After the training of DAN, we can obtain the predicted results of the testing data. The results are noisy but informative, which means that they can be used to optimize DAN model.")

Examiner Note: The examiner has addressed the claim language to ensure compact prosecution; however, as claimed, the entire claim language after "responsive to insufficient…." is contingent claim language and does not have patentable weight. For example, if the memory for the tensor size is sufficient, then the remaining steps need not be performed.

Regarding claim 11, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the computer program product of claim 10.
Liberty further teaches wherein the program instructions, to determine the training approach for the deep learning model utilizing the trained large model predictor fed with the one or more identified model characteristics and the one or more identified system configurations, comprise:
program instructions to generate a plurality of probabilities associated with one or more respective training approaches utilizing the trained large model predictor (Liberty, Fig. 5:

    [media_image3.png: Liberty, Fig. 5 (greyscale figure reproduced from the reference)]

Note: here, a performance metric is measured for each of the plurality of executions (step 515); the performance metric is interpreted to be the probability associated with the respective training approach, as it tells how each execution (an approach based on a different configuration, see step 510) performs; probability is how likely something is to happen, and in the instant case performance tells the likelihood of how accurate the model is (see Col 10, lines 44-35));
program instructions to rank one or more training approaches based on respective generated probability (Liberty, Col 10, lines 56-65, "Optionally, the flow of operations 500 can continue via path 550 to block 535, and executing the ML training job using one of the one or more recommended resource configurations. This block may be performed by selecting one of the one or more recommended resource configurations, which may occur according to a preference (e.g., indicated in the request of optional block 505) indicating a desired performance characteristic (e.g., a fastest training time, a lowest cost)." Note: here, the recommended approaches are selected based on ranking, such as the model with the lowest cost.);
and program instructions to automatically select a highest ranked training approach (Liberty, Col 2, lines 50-67, "In some embodiments, characteristics of a desired machine learning (ML) training job can be automatically determined and used to provide configuration recommendations to a user and/or select optimal configurations for the ML training job according to some desired performance characteristic. In some embodiments, at least a portion of a ML training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric can be measured for each of the plurality of the executions, and used (at least in part) with a desired performance characteristic to generate one or more recommended resource configurations for the ML training job. The one or more recommended resource configurations can be provided to the user as recommendations, one of which could be selected by the user to be utilized for the training job. Alternatively or additionally, the ML training job can be executed using one of the one or more recommended resource configurations." Note: the above citation teaches that the optimal configuration (highest ranked approach) can be automatically determined and provided to the user; alternatively, the training job can be executed.).

Regarding claim 12, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the computer program product of claim 10.
 Liberty further teaches further comprising: wherein the program instructions, stored on the one or more computer readable storage media (Col 33, lines 4-15), comprise: program instructions to, responsive to a training completion, deploy the trained deep learning model to one or more environments (Liberty, Col 21, lines 49-64, “(109) As described below, in some embodiments, the model data stored in the training model data store 1075 is used by the model hosting system 1040 to deploy machine learning models. Alternatively or in addition, a user device 1002 or another computing device (not shown) can retrieve the model data from the training model data store 1075 to implement a learning algorithm in an external device. As an illustrative example, a robotic device can include sensors to capture input data. A user device 1002 can retrieve the model data from the training model data store 1075 and store the model data in the robotic device. The model data defines a machine learning model. Thus, the robotic device can provide the captured input data as an input to the machine learning model, resulting in an output. The robotic device can then perform an action (e.g., move forward, raise an arm, generate a sound, etc.) based on the resulting output.”)

Regarding claim 13, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the computer program product of claim 10.
Liberty further teaches wherein the large model predictor is a neural network  (Liberty,  Col 8, lines 55-67, Col 9, lines 1-3, “The job type characteristics 328 include a model size 302 and an algorithm type 304. The model size 302 can indicate how the large of a model is to be used for training (e.g., a size of ML training logic, a size of a container encapsulating ML training logic, etc.). In some embodiments, this value is provided explicitly by a user (e.g., in a drop-down user interface (UI) element allowing the user to select between a variety of choices, and/or provided in an API call), and in some embodiments the value is inferred. In some embodiments, the algorithm type 304 identifies a particular type of ML algorithm to be trained, e.g., a linear classifier, neural network (e.g., convolutional neural network (CNN)), a particular Apache MXNet algorithm, an XGBoost classifier, a support vector machine, etc. Similarly, in some embodiments, this value is provided explicitly by a user and in some embodiments the value is inferred.”).

Regarding claim 14, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the computer program product of claim 13.
Liberty further teaches wherein the neural network is trained utilizing historical model characteristics, historical system configurations, and associated training approach labels (Liberty, Col 7, lines 13-39, “For example, in some embodiments, an “analyze performance of job” routine can be utilized in the ML service 118 (e.g., responsive to a request submitted by an electronic device of a user) that accepts a particular ML algorithm and training data (identified and/or provided by a user request), and the resource analysis engine 108 can run experiments (e.g., using different combinations of resource configurations selected by variant exploration engine 109, analysis of other similar jobs previously run by the model training system 124, etc.) to determine how the ML training job could be run and what results would occur under these different configurations. As one example, a particular job could be run with a single compute instance of type ‘A’, with two compute instances of type ‘A’, and/or with four compute instances of type ‘A’, to determine whether (and how much) the job “scales” well, enabling the resource analysis engine 108 to make fairly accurate predictions about the performance of using different numbers of compute instances to perform the ML training job. As another example, a particular job could be run with variants utilizing different types of compute instances (having different resource characteristics) to determine how differing resource types/amounts (e.g., random access memory (RAM), available bandwidth, accelerators, etc.) affects the job, again enabling the resource analysis engine 108 to make fairly accurate predictions about the performance resulting from use of different types of compute instances to perform the ML training job.”  
Liberty, Col 20, lines 43-47, Col 21, lines 1-3 “In some embodiments, the model training system 124 includes a ML model evaluator 1028. The ML model evaluator 1028 can monitor virtual machine instances 1022 as machine learning models are being trained, obtaining the generated model data and processing the obtained model data to generate model metrics. For example, the model metrics can include quality metrics, such as an error rate of the machine learning model being trained, a statistical distribution of the machine learning model being trained, a latency of the machine learning model being trained, a confidence level of the machine learning model being trained (e.g., a level of confidence that the accuracy of the machine learning model being trained is known, etc. The ML model evaluator 1028 can obtain the model data for a machine learning model being trained and evaluation data from the training data store 1060. The evaluation data is separate from the data used to train a machine learning model and includes both input data and expected outputs (e.g., known results), and thus the ML model evaluator 1028 can define a machine learning model using the model data and execute the machine learning model by providing the input data as inputs to the machine learning model. The ML model evaluator 1028 can then compare the outputs of the machine learning model to the expected outputs, and determine one or more quality metrics of the machine learning model being trained based on the comparison (e.g., the error rate can be a difference or distance between the machine learning model outputs and the expected outputs).”
Note: here, the engine runs experiments using analysis of other similar jobs previously run (i.e., historical jobs) by the model training system. The training uses different configurations and model characteristics to recommend an optimal approach, and the citation above teaches that the system can rely on similar jobs previously performed. In addition, it teaches associated training approach labels (known results).)
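For illustration only, a minimal sketch (assuming a small feature encoding that is not taken from any cited reference) of training a neural-network large model predictor on historical model characteristics, historical system configurations, and associated training approach labels:

    # Hypothetical sketch: a small neural network trained on historical
    # (model characteristics, system configuration) records with training approach labels.
    import torch
    import torch.nn as nn

    APPROACHES = ["data_parallelism", "model_parallelism", "gradient_checkpointing"]

    # Each row: [num_layers, parameters_in_billions, gpu_count, gpu_memory_GB]; labels index APPROACHES.
    historical_features = torch.tensor([[24.0, 0.3, 1.0, 16.0],
                                        [96.0, 8.3, 8.0, 32.0],
                                        [48.0, 1.5, 2.0, 16.0]])
    historical_labels = torch.tensor([0, 1, 2])

    predictor = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, len(APPROACHES)))
    optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(200):  # tiny training loop over the historical records
        optimizer.zero_grad()
        loss = loss_fn(predictor(historical_features), historical_labels)
        loss.backward()
        optimizer.step()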

Regarding claim 15, Liberty  teaches a computer system comprising: one or more computer processors; one or more computer readable storage media (Col 33, lines 4-15); and program instructions stored on the computer readable storage media for execution by at least one of the one or more processors (Col 33, lines 4-15), the stored program instructions comprising:
program instructions to create a large model predictor trained with a rule-based database containing one or more sets of chain-based rules dictating a respective training approach (Fig. 5:

[Image: Liberty, Fig. 5]

Note: instant specification, paragraph 0022 discloses “Program 150 trains a large model predictor (step 202). In an embodiment, program 150 creates LMP 154 (a large model predictor) as a rule-based database containing one or more sets of chain-based rules (e.g., configuration and training approach pairs) dictating which features and categories result in an optimal training approach.” Note: here, chain-based rules are configuration and training approach pairs.  As can be seen, in Fig. 5 above, we can see that type of model is identified in request, and different configuration for that model are tried, and one or more model and configuration pair is recommended based on performance.  Thus, we have model and configuration pairs, that can be used to execute the ML training job (step 525).  Also, note, there can be multiple models, and configuration pairs, and they are stored in database (see, Fig. 3, ML training job metadata store 120)) , 
wherein the chain-based rules comprise configuration and training approach pairs, wherein the large model predictor is a neural network (Fig. 3:

[Image: Liberty, Fig. 3]

Note: in Fig. 3 above, we can see multiple configurations (model size, instance type, number of instances, etc.) and a training approach (algorithm type). Also, note that the training approach can also be the determination of the number of instances to be used, as that will result in approaches such as parallelism, large model support, etc.); 
program instructions to monitor a model training service for a training of a deep learning model (Col 20, lines 36-44, “In some embodiments, the model training system 124 includes a ML model evaluator 1028. The ML model evaluator 1028 can monitor virtual machine instances 1022 as machine learning models are being trained, obtaining the generated model data and processing the obtained model data to generate model metrics.” Also, see Fig. 5);
program instructions to, responsive to the training of the deep learning models, identify one or more model characteristics associated with the deep learning models (Fig. 5 teaches that all the identification of configuration takes place in response to the ML model to be trained (step 505). 
Liberty, Col 18, lines 59-67, Col 19, lines 1-10, “In some embodiments, executing the executable instructions causes the virtual machine instance 1022 (e.g., the ML training container 1030) to generate model data. For example, the ML training container 1030 generates model data and stores the model data in a file system of the ML training container 1030. The model data includes characteristics of the machine learning model being trained, such as a number of layers in the machine learning model, hyperparameters of the machine learning model, coefficients of the machine learning model, weights of the machine learning model, and/or the like. In particular, the generated model data includes values for the characteristics that define a machine learning model being trained. In some embodiments, executing the executable instructions causes a modification to the ML training container 1030 such that the model data is written to the top container layer of the ML training container 1030 and/or the container image(s) that forms a portion of the ML training container 1030 is modified to include the model data.”  Note: here model characteristic associated with learning model are generated (identified), such as number of layers);
program instructions to identify one or more system configurations associated with a system training the deep learning model ((Liberty, Col 2, lines 50-67, “In some embodiments, characteristics of a desired machine learning (ML) training job can be automatically determined and used to provide configuration recommendations to a user and/or select optimal configurations for the ML training job according to some desired performance characteristic. In some embodiments, at least a portion of a ML training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric can be measured for each of the plurality of the executions, and used (at least in part) with a desired performance characteristic to generate one or more recommended resource configurations for the ML training job. The one or more recommended resource configurations can be provided to the user as recommendations, one of which could be selected by the user to be utilized for the training job. Alternatively or additionally, the ML training job can be executed using one of the one or more recommended resource configurations.” Note:  here, different system configuration are tried, and optimal configuration for the learning model is determined.);
program instructions to determine the training approach for the deep learning model utilizing the trained large model predictor fed with the one or more identified model characteristics and the one or more identified system configurations, wherein the program instructions further comprise: (Liberty, Col 6, lines 57-67, Col 7, lines 1-3, “In some embodiments, some or all of these techniques can be performed ahead of time for particular ML algorithms (e.g., containers including ML code) that may be used by multiple users of the ML service 118, so that suggested configurations and execution characteristics can be presented when a user wishes to use one of these algorithms.

In some embodiments, some or all of these techniques can be performed responsive to a user request involving a particular algorithm (e.g., a container, which may be a customer container provided/created by the customer). Thus, before the user uses that container, or perhaps after the user uses that container for an ML task, suggested configurations and execution characteristics can be generated and presented to the user.”  Note: here, different techniques (approaches) can be performed so that the suggested configurations and execution characteristics can be presented to the user. Also note that when one is trying different configurations to select an optimal configuration, one is using different approaches to determine the optimal approach; also, these configurations are specific to model characteristics);  
program instructions to verify that the training approach conforms with system specifications(Col 11 , line 50 “Thus, in some embodiments, the training control system 122 can monitor the execution and progress of the current utilized resources (e.g., the small VM), determine that the performance/progress of the ML job is not satisfactory (e.g., is greater than or less than some defined threshold), and add additional resources such as additional virtual machines (to execute additional containers for the ML task) to “scale up” the task…” Note: here the training approach is being verified to see if it conforms with system specification (available resources), as if the training approach does not conform, then additional resources can be added.),
and program instructions to train the deep learning model utilizing the verified determined training approach (Liberty, Col 2, lines 50-67, “In some embodiments, characteristics of a desired machine learning (ML) training job can be automatically determined and used to provide configuration recommendations to a user and/or select optimal configurations for the ML training job according to some desired performance characteristic. In some embodiments, at least a portion of a ML training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric can be measured for each of the plurality of the executions, and used (at least in part) with a desired performance characteristic to generate one or more recommended resource configurations for the ML training job. The one or more recommended resource configurations can be provided to the user as recommendations, one of which could be selected by the user to be utilized for the training job. Alternatively or additionally, the ML training job can be executed using one of the one or more recommended resource configurations.” Note:  here, training job is executed by using one or more recommended resource configuration. Also see Fig. 3, that teaches ML training that utilizes different training approaches).
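For illustration only, a minimal sketch of the determine-then-verify flow mapped above: chain-based rules stored as configuration/training-approach pairs, an approach looked up from identified model characteristics and system configurations, and a check against system specifications before training proceeds. All rule contents, keys, and thresholds are illustrative assumptions, not taken from Liberty or the claims:

    # Hypothetical sketch of a rule-based determination followed by verification against system specs.
    rule_database = {
        ("transformer", "multi_gpu"): "model_parallelism",
        ("transformer", "single_gpu"): "gradient_checkpointing",
        ("cnn", "multi_gpu"): "data_parallelism",
    }

    def determine_training_approach(model_characteristics, system_configuration):
        key = (model_characteristics["architecture"], system_configuration["gpu_layout"])
        return rule_database.get(key, "data_parallelism")  # fall back to a default approach

    def conforms_to_system_specs(approach, system_configuration):
        # Example verification: model parallelism is assumed to need at least two GPUs.
        if approach == "model_parallelism":
            return system_configuration["gpu_count"] >= 2
        return True

    characteristics = {"architecture": "transformer", "num_parameters": 8_300_000_000}
    configuration = {"gpu_layout": "multi_gpu", "gpu_count": 4}

    approach = determine_training_approach(characteristics, configuration)
    if conforms_to_system_specs(approach, configuration):
        print(f"training with verified approach: {approach}")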
Liberty does not explicitly teach:
wherein the program instructions further comprise:
program instructions to determine a largest tensor size;
program instructions to, responsive to insufficient graphics processing unit (GPU) memory based on a determined largest tensor size for a layer, require model parallelism;
program instructions to restrict one or more other applications from utilizing the GPU memory until a training of the deep learning model is complete;
program instructions to logging results associated with the trained deep learning model into the rule-based database; 
and program instructions to retrain the large model predictor with the logged results.

Shoeybi teaches:
program instructions to, responsive to insufficient graphics processing unit (GPU) memory based on a determined largest tensor size for a layer, require a higher batch size and model parallelism (Abstract, “However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors.”
Pg. 3, section 2.3, “By increasing the minibatch size proportionally to the number of available workers, one observes near linear scaling in training data throughput.”)
It would have been obvious for a person of ordinary skill in the art to apply the model parallelism teaching of Shoeybi into the teachings of Liberty at the time the application was filed in order to train large models that would otherwise be constrained by available resources such as memory (Abstract, “However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split across multiple processors.”).
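For illustration only, a minimal sketch of the tensor-size check underlying this limitation: estimate the largest single-layer tensor, compare it to available GPU memory, and require model parallelism (splitting parameters across devices) when it does not fit. The layer shapes and the fallback memory value are assumptions:

    # Hypothetical sketch: decide on model parallelism from the largest per-layer tensor size.
    import torch

    def largest_tensor_bytes(layer_shapes, dtype_bytes=4):
        # layer_shapes: iterable of weight tensor shapes; returns the largest single tensor in bytes.
        return max(dtype_bytes * int(torch.tensor(shape).prod()) for shape in layer_shapes)

    layer_shapes = [(4096, 4096), (4096, 16384), (50257, 4096)]  # assumed weight shapes
    required = largest_tensor_bytes(layer_shapes)

    if torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()  # free and total memory on the current device
    else:
        free_bytes = 8 * 1024**3  # assumed 8 GiB when no GPU is present

    if required > free_bytes:
        approach = "model_parallelism"  # split parameters across multiple processors
    else:
        approach = "single_device_training"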
Liberty as modified by Shoeybi does not explicitly teach:
program instructions to restrict one or more other applications from utilizing the GPU memory until a training of the deep learning model is complete;
program instructions to logging results associated with the trained deep learning model into the rule-based database; 
and program instructions to retrain the large model predictor with the logged results.

Venkatraman teaches program instructions to restrict one or more other applications from utilizing the GPU memory until a training of the deep learning model is complete(Para 0051, “…Alternatively, the application can be terminated, and the allocated memory of the application can be reclaimed. Swapping or terminating idle, suspended, and background applications is performed at block 630 to avoid the need to terminate one or more active foreground applications at block 640 if logic 600 determines at block 635 that the available memory remains low. The determination of whether to swap or terminate an application can be made based on a variety of factors.”
Also, para, “[0066] As shown at block 904, after some period of time an application launcher can receive an indication to launch the swapped application. Upon receipt of such indication, the process 900 includes to copy compressed application memory and state from non-volatile storage to system memory, as shown at block 906. The process 900 additionally includes to resume execution of the application based on the stored application state, as shown at block 908.”)
It would have been obvious for a person of ordinary skill in the art to apply the memory management teaching of Venkatraman into the teachings of Liberty as modified by Shoeybi and Song at the time the application was filed in order to manage memory usage (“[0038] In one embodiment, the operating environment 201 includes a memory usage management module 215 to manage memory usage of the operating environment 201. The memory usage management module 215 can perform various operations to increase the amount of physical memory available to executing applications based on the current state of system memory usage. The various operations can include terminating applications to reclaim memory space and, according to embodiments described herein, swap applications and application memory to the mass storage devices 221…”).
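For illustration only, one cooperative way to keep other applications off the GPU until training completes is an advisory reservation that other processes check before allocating GPU memory; this convention (lock file path, error handling) is a hypothetical sketch, not a mechanism described by Venkatraman:

    # Hypothetical sketch: advisory GPU reservation held for the duration of a training job.
    import os
    from contextlib import contextmanager

    GPU_LOCK_PATH = "/tmp/gpu0.lock"  # assumed lock location shared by cooperating processes

    @contextmanager
    def reserve_gpu(lock_path=GPU_LOCK_PATH):
        if os.path.exists(lock_path):
            raise RuntimeError("GPU is reserved by another training job")
        with open(lock_path, "w") as f:
            f.write(str(os.getpid()))
        try:
            yield  # training runs while the reservation is held
        finally:
            os.remove(lock_path)  # release the GPU once training is complete

    # Other applications honoring the same convention skip GPU work while the lock file exists.
    with reserve_gpu():
        pass  # run the deep learning training job here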

Liberty as modified by Shoeybi and Venkatraman does not explicitly teach:
program instructions to logging results associated with the trained deep learning model into the rule-based database; 
and program instructions to retrain the large model predictor with the logged results.
Song teaches:
program instructions to logging results associated with the trained deep learning model into the rule-based database (Pg. 6166-6167, Tables V and VI show logging of results associated with the model);  
and program instructions to retrain the large model predictor with the logged results. (Pg. 6165, “After the training of DAN, we can obtain the predicted results of the testing data. The results are noisy but informative, which means that they can be used to optimize DAN model. Therefore, we try to use the predicted results to retrain the DAN (DAN-R) network.”)
Examiner Note: The examiner has addressed the claim language to ensure compact prosecution; however, as claimed, the entire claim language after “responsive to insufficient….” is contingent claim language and does not carry patentable weight. For example, if the memory for the tensor size is sufficient, then the remaining steps need not be performed.
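For illustration only, a minimal sketch of logging training results into a rule database and retraining the predictor from the logged records; the table schema, feature encoding, and the commented-out predictor call are assumptions:

    # Hypothetical sketch: log results of a completed training run, then retrain the predictor from the log.
    import sqlite3

    conn = sqlite3.connect("rule_database.sqlite")
    conn.execute("""CREATE TABLE IF NOT EXISTS results
                    (num_layers INTEGER, gpu_count INTEGER, approach TEXT, accuracy REAL)""")

    # Log the outcome of the just-completed training run.
    conn.execute("INSERT INTO results VALUES (?, ?, ?, ?)", (96, 8, "model_parallelism", 0.91))
    conn.commit()

    # Gather all logged results to retrain the large model predictor.
    rows = conn.execute("SELECT num_layers, gpu_count, approach FROM results").fetchall()
    features = [(layers, gpus) for layers, gpus, _approach in rows]
    labels = [approach for _layers, _gpus, approach in rows]
    # predictor.fit(features, labels)  # retraining step; predictor API assumed
    conn.close()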

Regarding claim 16, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the computer system of claim 15.
Liberty further teaches wherein the program instructions, to determine the training approach for the deep learning model utilizing the trained large model predictor fed with the one or more identified model characteristics and the one or more identified system configurations, comprise:
program instructions to generate a plurality of probabilities associated with one or more respective training approaches utilizing the trained large model predictor (Liberty, Fig. 5:

[Image: Liberty, Fig. 5]

Note: here, a performance metric is measured for each of the plurality of executions (step 515); the performance metric is interpreted as the probability associated with the respective training approach, as it tells how each execution (an approach based on a different configuration, see step 510) performs. Probability is how likely something is to happen; in the instant case, the performance tells the likelihood of how accurate the model is (see, Col 10, lines 44-35));
program instructions to rank one or more training approaches based on respective generated probability (Liberty, Col 10, lines 56-65, “Optionally, the flow of operations 500 can continue via path 550 to block 535, and executing the ML training job using one of the one or more recommended resource configurations. This block may be performed by selecting one of the one or more recommended resource configurations, which may occur according to a preference (e.g., indicated in the request of optional block 505) indicating a desired performance characteristic (e.g., a fastest training time, a lowest cost).” Note: here, the recommended approaches are selected based on ranking, such as the model with the lowest cost.); 
and program instructions to automatically select a highest ranked training approach (Liberty, Col 2, lines 50-67, “In some embodiments, characteristics of a desired machine learning (ML) training job can be automatically determined and used to provide configuration recommendations to a user and/or select optimal configurations for the ML training job according to some desired performance characteristic. In some embodiments, at least a portion of a ML training job is executed a plurality of times using a plurality of different resource configurations, where each of the plurality of resource configurations includes at least a different type or amount of compute instances. A performance metric can be measured for each of the plurality of the executions, and used (at least in part) with a desired performance characteristic to generate one or more recommended resource configurations for the ML training job. The one or more recommended resource configurations can be provided to the user as recommendations, one of which could be selected by the user to be utilized for the training job. Alternatively or additionally, the ML training job can be executed using one of the one or more recommended resource configurations.” Note: the above citation teaches that the optimal configuration (highest ranked approach) can be automatically determined and provided to the user; alternatively, the training job can be executed.).
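For illustration only, a minimal sketch of the generate-probabilities, rank, and automatic-selection steps recited in claim 16; the logits are assumed stand-ins for the predictor's outputs:

    # Hypothetical sketch: per-approach probabilities, ranking, and automatic selection.
    import torch

    APPROACHES = ["data_parallelism", "model_parallelism", "gradient_checkpointing"]
    logits = torch.tensor([1.2, 0.4, -0.3])           # assumed predictor outputs for one job
    probabilities = torch.softmax(logits, dim=0)      # one probability per training approach

    ranked = sorted(zip(APPROACHES, probabilities.tolist()),
                    key=lambda pair: pair[1], reverse=True)
    selected_approach = ranked[0][0]                  # automatically select the highest-ranked approach
    print(ranked, selected_approach)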

Regarding claim 17, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the computer system of claim 15.
 Liberty further teaches  wherein the program instructions, stored on the one or more computer readable storage media (Col 33, lines 4-15), comprise: program instructions to, responsive to a training completion, deploy the trained deep learning model to one or more environments (Liberty, Col 21, lines 49-64, “(109) As described below, in some embodiments, the model data stored in the training model data store 1075 is used by the model hosting system 1040 to deploy machine learning models. Alternatively or in addition, a user device 1002 or another computing device (not shown) can retrieve the model data from the training model data store 1075 to implement a learning algorithm in an external device. As an illustrative example, a robotic device can include sensors to capture input data. A user device 1002 can retrieve the model data from the training model data store 1075 and store the model data in the robotic device. The model data defines a machine learning model. Thus, the robotic device can provide the captured input data as an input to the machine learning model, resulting in an output. The robotic device can then perform an action (e.g., move forward, raise an arm, generate a sound, etc.) based on the resulting output.”)

Regarding claim 18, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the computer system of claim 15.
Liberty further teaches wherein the large model predictor is a neural network  (Liberty,  Col 8, lines 55-67, Col 9, lines 1-3, “The job type characteristics 328 include a model size 302 and an algorithm type 304. The model size 302 can indicate how the large of a model is to be used for training (e.g., a size of ML training logic, a size of a container encapsulating ML training logic, etc.). In some embodiments, this value is provided explicitly by a user (e.g., in a drop-down user interface (UI) element allowing the user to select between a variety of choices, and/or provided in an API call), and in some embodiments the value is inferred. In some embodiments, the algorithm type 304 identifies a particular type of ML algorithm to be trained, e.g., a linear classifier, neural network (e.g., convolutional neural network (CNN)), a particular Apache MXNet algorithm, an XGBoost classifier, a support vector machine, etc. Similarly, in some embodiments, this value is provided explicitly by a user and in some embodiments the value is inferred.”).

Regarding claim 19, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the computer system of claim 18. 
Liberty further teaches wherein the neural network is trained utilizing historical model characteristics, historical system configurations, and associated training approach labels (Liberty, Col 7, lines 13-39, “For example, in some embodiments, an “analyze performance of job” routine can be utilized in the ML service 118 (e.g., responsive to a request submitted by an electronic device of a user) that accepts a particular ML algorithm and training data (identified and/or provided by a user request), and the resource analysis engine 108 can run experiments (e.g., using different combinations of resource configurations selected by variant exploration engine 109, analysis of other similar jobs previously run by the model training system 124, etc.) to determine how the ML training job could be run and what results would occur under these different configurations. As one example, a particular job could be run with a single compute instance of type ‘A’, with two compute instances of type ‘A’, and/or with four compute instances of type ‘A’, to determine whether (and how much) the job “scales” well, enabling the resource analysis engine 108 to make fairly accurate predictions about the performance of using different numbers of compute instances to perform the ML training job. As another example, a particular job could be run with variants utilizing different types of compute instances (having different resource characteristics) to determine how differing resource types/amounts (e.g., random access memory (RAM), available bandwidth, accelerators, etc.) affects the job, again enabling the resource analysis engine 108 to make fairly accurate predictions about the performance resulting from use of different types of compute instances to perform the ML training job.”  
Liberty, Col 20, lines 43-47, Col 21, lines 1-3 “In some embodiments, the model training system 124 includes a ML model evaluator 1028. The ML model evaluator 1028 can monitor virtual machine instances 1022 as machine learning models are being trained, obtaining the generated model data and processing the obtained model data to generate model metrics. For example, the model metrics can include quality metrics, such as an error rate of the machine learning model being trained, a statistical distribution of the machine learning model being trained, a latency of the machine learning model being trained, a confidence level of the machine learning model being trained (e.g., a level of confidence that the accuracy of the machine learning model being trained is known, etc. The ML model evaluator 1028 can obtain the model data for a machine learning model being trained and evaluation data from the training data store 1060. The evaluation data is separate from the data used to train a machine learning model and includes both input data and expected outputs (e.g., known results), and thus the ML model evaluator 1028 can define a machine learning model using the model data and execute the machine learning model by providing the input data as inputs to the machine learning model. The ML model evaluator 1028 can then compare the outputs of the machine learning model to the expected outputs, and determine one or more quality metrics of the machine learning model being trained based on the comparison (e.g., the error rate can be a difference or distance between the machine learning model outputs and the expected outputs).”
Note: here, the engine runs experiments using analysis of other similar jobs previously run (i.e., historical jobs) by the model training system. The training uses different configurations and model characteristics to recommend an optimal approach, and the citation above teaches that the system can rely on similar jobs previously performed. In addition, it teaches associated training approach labels (known results).)

Regarding claim 20, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the computer system of claim 15.
Liberty further teaches wherein training approaches are selected from the group consisting of: model parallelism, data parallelism, large model support, gradient checkpointing, large model supports with parallelism, gradient checkpointing with model parallelism, or utilizing host memory as swap space (Liberty, Col 15, lines 46-62, “The user devices 1002 can interact with the model training system 124 via frontend 1029 of the model training system 124. For example, a user device 1002 can provide a training request to the frontend 1029 that includes a container image (or multiple container images, or an identifier of one or multiple locations where container images are stored), an indicator of input data (e.g., an address or location of input data), one or more hyperparameter values (e.g., values indicating how the algorithm will operate, how many algorithms to run in parallel [model parallelism], how many clusters into which to separate data, etc.), and/or information describing the computing machine on which to train a machine learning model (e.g., a graphical processing unit (GPU) instance type, a central processing unit (CPU) instance type, an amount of memory to allocate, a type of virtual machine instance to use for training, etc.).”
Liberty, Col 19, lines 53-67, Col 20, lines 1-5, “In some embodiments, a plurality of virtual machine instances 1022 execute code 1036 stored in a plurality of ML training containers 1030. For example, the resources used to train a particular machine learning model can exceed the limitations of a single virtual machine instance 1022. However, the algorithm included in the container image can be in a format that allows for the parallelization of the training process. Thus, the model training system 124 can create multiple copies of the container image provided in a training request, initialize multiple virtual machine instances 1022, and cause each virtual machine instance 1022 to load a container image copy in one or more separate ML training containers 1030 [data parallelism]. The virtual machine instances 1022 can then each execute the code 1036 stored in the ML training containers 1030 in parallel. The model training system 124 can further provide configuration information to each ML training container 1030 via the virtual machine instances 1022 (e.g., information indicating that N ML training containers 1030 are collectively training a machine learning model and that a particular ML training container 1030 receiving the configuration information is ML training container 1030 number X of N, information indicating that M virtual machine instances 1022 are collectively training a machine learning model and that a particular ML training container 1030 receiving the configuration information is initialized in virtual machine instance 1022 number Y of M, etc.), which can be included in the resulting model data. As described above, by parallelizing the training process, the model training system 124 can significantly reduce the training time in some embodiments.”).
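For illustration only, a minimal sketch of one member of the recited group, gradient checkpointing, using PyTorch's torch.utils.checkpoint: activations inside a block are recomputed during the backward pass instead of being stored, trading compute for GPU memory. The model and sizes are stand-in assumptions:

    # Hypothetical sketch: gradient checkpointing over one block of a model.
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
    head = nn.Linear(1024, 10)

    x = torch.randn(32, 1024, requires_grad=True)
    hidden = checkpoint(block, x)  # activations inside `block` are not kept; they are recomputed on backward
    loss = head(hidden).sum()
    loss.backward()                # backward recomputes the checkpointed segment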


Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Liberty as modified by Shoeybi, Venkatraman, and Song in view of Amaral et al. (Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments, November 12–17, 2017).

Regarding claim 5, Liberty as modified by Shoeybi, Venkatraman, and Song teaches the method of claim 1.
Liberty further teaches wherein system configurations are selected from the group consisting of: (i) CPU configurations information regarding a number of CPU cores, a number of threads per CPU core, non-uniform memory access nodes, a remote memory access latency, a memory bandwidth, a CPU-GPU link bandwidth and latency, and CPU-CPU interconnection bandwidth and latency; and (ii) graphical processing unit configurations information regarding a [number of] GPUs, [a GPU compute capability], a GPU memory, [a GPU topology, a GPU-GPU link bandwidth, and a GPU-GPU link latency] (Liberty, Col 15, lines 46-67, “The user devices 1002 can interact with the model training system 124 via frontend 1029 of the model training system 124. For example, a user device 1002 can provide a training request to the frontend 1029 that includes a container image (or multiple container images, or an identifier of one or multiple locations where container images are stored), an indicator of input data (e.g., an address or location of input data), one or more hyperparameter values (e.g., values indicating how the algorithm will operate, how many algorithms to run in parallel, how many clusters into which to separate data, etc.), and/or information describing the computing machine on which to train a machine learning model (e.g., a graphical processing unit (GPU) instance type, a central processing unit (CPU) instance type, an amount of memory to allocate, a type of virtual machine instance to use for training, etc.).”)
Liberty as modified by Shoeybi , Venkatraman and Song does not explicitly teach wherein system configurations are CPU configurations information regarding a number of CPU cores, a number of threads per CPU core, non-uniform memory access nodes, a remote memory access latency, a memory bandwidth, a CPU-GPU link bandwidth and latency, and CPU-CPU interconnection bandwidth and latency; [or graphical processing unit configurations information regarding a number of [GPUs], a GPU compute capability, [a GPU memory], a GPU topology, a GPU-GPU link bandwidth, and a GPU-GPU link latency.
 	Amaral teaches wherein system configurations are CPU configurations information regarding a number of CPU cores, a number of threads per CPU core, non-uniform memory access nodes, a remote memory access latency, a memory bandwidth, a CPU-GPU link bandwidth and latency, and CPU-CPU interconnection bandwidth and latency; [or graphical processing unit configurations information regarding a number of [GPUs], a GPU compute capability, [a GPU memory], a GPU topology, a GPU-GPU link bandwidth, and a GPU-GPU link latency (Amaral, section  2,  DEEP LEARNING WORKLOADS  “Additionally, a key parameter that plays a significant role in the communication is the batch size. It determines how many samples per GPU[number of GPUs] the NN will analyze in each training step, and directly impacts the amount of communication and computation[ a GPU compute capability] in each step. The lower the batch size is, the noisier the training signal is going to be; the higher it is, the longer it will take to compute the stochastic gradient descent. Noise is an important component for solving nonlinear problems.”
Amaral 3.1 Testing Platform and Configuration All experiments are conducted on an IBM Power8 System S822LC release, code-named as “Minsky” shown in Figure 1. The server has 2 sockets and 8 cores per socket that run at 3.32 GHz and two NVIDIA GPU P100’s per socket. Each GPU has 3584 processor cores at boot clocks from 1328 MHz to 1480 MHz, and 16 GB of memory[a GPU memory]. Each socket is connected with 256 GB of DRAM. Where the intra-socket CPU-to-GPU and GPU-to-GPU are linked via dual NVLinks that uses NVIDIA’s new High-Speed Signaling interconnects (NVHS). 
Amaral Abstract  “This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems[a GPU topology].”
Amaral 4.1, Topology Representation, 4.1.1 Job graph. “This graph represents the communication requirements of tasks (i.e. GPUs). Vertexes represent GPUs and edges represent communication. Each edge has an associated weight denoting the communication volume, given by the average GPU-toGPU bandwidth usage. During the mapping process, this weight is normalized by the total available bandwidth in the physical machine, where a value equal to 0 represents no communication and higher than 0 accounts for the communication level.”
Amaral, section 3, EVALUATING THE IMPACT OF PLACEMENT STRATEGIES “In this section, we evaluate two general purpose workload placement strategies:  “………… The main sources of performance perturbation on multi-GPU applications are how the allocated GPUs are connected, i.e. the topology, and how much of the shared bus bandwidth other applications are utilizing. To illustrate it, Figure 2 shows different workload placement strategies that can be defined on top a single machine with hardware topology composed of two sockets and two GPUs per socket (the same topology shown in Figure 1 for the Power8 system). The GPUs within the same socket are located at a “shorter” distance (from a topology perspective) than the GPUs located across sockets. Besides, GPUs on the same socket can utilize the higher bandwidth and lower latency network (e.g., NVLink) to communicate instead of going over the PCI-e and the QPI links to communicate across CPU sockets.”
In view of the teaching of Amaral, it would have been obvious for a person of ordinary skill in the art to apply the teaching of Amaral into the teachings of Liberty as modified by Shoeybi, Venkatraman, and Song at the time the application was filed in order to allocate the resources for optimal execution time and improved utilization (Amaral, Abstract, “…Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments.”).
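For illustration only, a minimal sketch of collecting the kinds of system configuration discussed in this rejection (CPU cores, number of GPUs, per-GPU memory). Topology and link bandwidth/latency would require vendor tooling and are omitted; the dictionary layout is an assumption:

    # Hypothetical sketch: gather basic CPU/GPU configuration available through standard APIs.
    import os
    import torch

    system_configuration = {
        "cpu_cores": os.cpu_count(),
        "gpu_count": torch.cuda.device_count(),
        "gpus": [
            {
                "name": torch.cuda.get_device_properties(i).name,
                "memory_GB": torch.cuda.get_device_properties(i).total_memory / 1024**3,
            }
            for i in range(torch.cuda.device_count())
        ],
    }
    print(system_configuration)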


Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Rejections with regard to 35 U.S.C. 112 have been withdrawn in view of the claim amendments.


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUMA WASEEM whose telephone number is (571)272-1316. The examiner can normally be reached Monday-Friday(9:00am - 5:00 pm) EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jason B. Dunham can be reached on (571) 272-8109. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/HUMA WASEEM/Examiner, Art Unit 3686 


/JASON B DUNHAM/Supervisory Patent Examiner, Art Unit 3686                                                                                                                                                                                                        