Patent Application 17425200 - CONTROLLING AN AGENT TO EXPLORE AN ENVIRONMENT - Rejection
Title: CONTROLLING AN AGENT TO EXPLORE AN ENVIRONMENT USING OBSERVATION LIKELIHOODS
Application Information
- Invention Title: CONTROLLING AN AGENT TO EXPLORE AN ENVIRONMENT USING OBSERVATION LIKELIHOODS
- Application Number: 17425200
- Submission Date: 2025-05-14T00:00:00.000Z
- Effective Filing Date: 2021-07-22T00:00:00.000Z
- Filing Date: 2021-07-22T00:00:00.000Z
- National Class: 706
- National Sub-Class: 015000
- Examiner Employee Number: 97128
- Art Unit: 2121
- Tech Center: 2100
Rejection Summary
- 102 Rejections: 0
- 103 Rejections: 5
Cited Patents
The following patents were cited in the rejection:
- US 2018/0032863 A1 (Graepel et al.)
- US 2006/0291580 A1 (Horvitz)
Office Action Text
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

Applicant's arguments with respect to claims 1, 16 and 17 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

In regard to the 101 rejections: On page 9 of 12, the applicant argues that the claimed method allows the agent to discover the items within the environment, so that greater knowledge of the environment is obtained, leading to faster execution of the task and less usage of computing resources, and further argues that this supports a technical improvement in the field of agent control (robots).

Examiner's response: The examiner first identifies that in [0010] a predictive model of the environment is explored, and the above argument about the environment relates to that modeling. The examiner finds that the argument applies the asserted improvement to the model itself and does not present any details of an improvement in computing performance in the context of applying the model to a practical application environment. The applicant argues that a robot-control example can be improved. With reference to the robot applications identified in [0035], [0037], [0038], [0041] and [0042], the examiner identifies the control actions, and the application has not provided any details of an improvement to such control actions as related to the usage of computational capability to perform the task. In CONCLUSION, the examiner rejects claims 1-14, 16-17 and 19-24 under 101 and MAINTAINS the 101 rejection.

In regard to the 103 rejections: On pages 10-11 of 12, the applicant argues based on the amendments to claims 1, 16 and 17, primarily arguing that the reference Graepel does not teach the reward as a measure of divergence. The applicant also argues that the reference Horvitz does not teach that such measures are obtained from two different models, and that the references do not teach first and second statistical models.

Examiner's response: The examiner identifies the amendments as relating to the use of two probability distributions. It is known in the art that a probability distribution can be considered a statistical model, as it forms the foundation for describing the likelihood of different outcomes in a random experiment or data set; a statistical model uses probability distributions to understand and predict data behavior. In this context, a new reference, "Liljeholm", teaches two probability distributions and two statistical models. The examiner also uses a new reference, "Jürgen Schmidhuber", to teach the amendment of claim 23. The examiner notes that claim 18 is CANCELED. In CONCLUSION, the examiner rejects claims 1-14, 16-17 and 19-24 under 103, MAINTAINS the 103 rejection, and MOVES the application to FINAL REJECTION.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-14, 16-17 and 19-24 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
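For reference, the limitation at the center of the Response to Amendment above recites a reward computed as a measure of the divergence between the likelihood of the newly obtained observation under a first statistical model conditioned on a first (more recent) history and its likelihood under a second statistical model conditioned on a second (older) history. The following is a minimal illustrative sketch of that kind of computation; the count-based models, names, and history handling are hypothetical stand-ins chosen for this example and are not taken from the application or from the cited references.

```python
# Illustrative sketch only. The simple count-based models below are hypothetical
# stand-ins for the "first" and "second" statistical models recited in claims 1,
# 16 and 17; they are not taken from the application or the cited references.
import math
from collections import Counter

class CountModel:
    """Toy statistical model: a smoothed categorical distribution over possible
    observations, estimated from whatever history it is given."""
    def __init__(self, possible_obs):
        self.possible_obs = possible_obs

    def likelihood(self, obs, history):
        # history is a list of (action, observation) pairs
        counts = Counter(o for (_, o) in history)
        total = sum(counts.values())
        # Laplace smoothing so unseen observations keep nonzero probability.
        return (counts[obs] + 1) / (total + len(self.possible_obs))

def exploration_reward(first_model, second_model, history, next_obs):
    """Reward for the generated action, per the claimed measure of divergence:
    the first model conditions on the full (more recent) history, the second on
    a history whose most recent observation is one step older (cf. claims 2-3).
    The log-difference form used here is the measure recited in claims 5 and 19."""
    p_first = first_model.likelihood(next_obs, history)         # first history
    p_second = second_model.likelihood(next_obs, history[:-1])  # second, older history
    return math.log(p_first) - math.log(p_second)

# Tiny usage example with made-up observations.
obs_space = ["A", "B", "C"]
history = [("left", "A"), ("right", "A"), ("left", "B")]
model_1, model_2 = CountModel(obs_space), CountModel(obs_space)
print(exploration_reward(model_1, model_2, history, next_obs="C"))
```

In this toy form the reward tends to be larger when the newly obtained observation is better accounted for by the model that has seen the more recent history, which is the general shape of the disputed limitation.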
Step 1: According to the first part of the analysis, claim 1 is a "method" claim, claim 16 is a "machine" claim (as a result of the computer-readable storage media) and claim 17 is a "machine" claim (as a result of a system). Thus, claims 1, 16 and 17 each fall into one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).

In regard to claim 1 (Currently Amended)

Step 2A Prong 1: "a current observation characterizing a current state of the environment at a current time" is a mental step of analyzing time-series data. "(d) generating a reward value for the generated action as a measure of the divergence between" is a mental step of comparing time-series data points. "under a first statistical model of the environment that generates a first probability distribution over a set of possible observations of states of the environment given a first history of past observations characterizing past states of the environment and past actions performed by the agent and (ii) the likelihood of the further observation of the further state of the environment" is a math function. "under a second, different statistical model of the environment that generates a second probability distribution over the set of possible observations of states of the environment given a second history of past observations characterizing past states of the environment and past actions performed by the agent" is a math function. "[[the]]a most recent past observation in the first history is more recent than [[the]]a most recent past observation in the second history;" is a mental step of data identification in a series of data.

Additional Elements

Step 2A Prong 2: "A method for generating actions to be performed by an agent interacting with an environment, the environment taking at successive times a corresponding one of a plurality of states, the method comprising", recited in the preamble, does not integrate the judicial exception into a practical application. This is a mere instruction to apply the judicial exception on a generic computer. See MPEP 2106.05(h). "using [[the]]a neural network to generate an action based on" is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h). "causing the agent to perform the generated action on the environment" does not integrate the judicial exception into a practical application. See MPEP 2106.05(h). "(c) obtaining a further observation of a further [[the]]state" does not integrate the judicial exception into a practical application. The element is data gathering and transmission. See MPEP 2106.05(h). "the environment transitioned into following the performance of the generated action by the agent" is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h). "wherein: and training the neural network based on the reward value for the generated action" is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h).

Step 2B: "A method for generating actions to be performed by an agent interacting with an environment, the environment taking at successive times a corresponding one of a plurality of states, the method comprising", recited in the preamble, does not amount to more than the judicial exception in the claim. This is a mere instruction to apply the judicial exception on a generic computer. See MPEP 2106.05(h). "using [[the]]a neural network to generate an action based on" is a mere instruction to apply the judicial exception on a generic computer and does not amount to more than the judicial exception in the claim. See MPEP 2106.05(h). "causing the agent to perform the generated action on the environment" does not amount to more than the judicial exception in the claim. The element is performing a generic computer function. See MPEP 2106.05(h). "(c) obtaining a further observation of a further [[the]]state" does not amount to more than the judicial exception in the claim. The element is data gathering and transmission. See MPEP 2106.05(h). "the environment transitioned into following the performance of the generated action by the agent" is a mere instruction to apply the judicial exception on a generic computer and does not amount to more than the judicial exception in the claim. See MPEP 2106.05(h). "wherein: and training the neural network based on the reward value for the generated action" is a mere instruction to apply the judicial exception on a generic computer and does not amount to more than the judicial exception in the claim. See MPEP 2106.05(h).

In regard to claim 2 (Currently Amended)

Step 2A Prong 1: "the most recent observation in the first history being H time steps into the past compared to the current time, where H is an integer greater than zero" is a mental step of data identification.
Step 2A Prong 2: no additional elements.
Step 2B: no additional elements.

In regard to claim 3 (Original)

Step 2A Prong 1: "the most recent observation in the second history is H+1 steps into the past compared to the current time" is a mental step of data identification.
Step 2A Prong 2: no additional elements.
Step 2B: no additional elements.

In regard to claim 4 (Previously Presented)

Step 2A Prong 1: "in which H is greater than one" is a mental step of data identification.
Step 2A Prong 2: no additional elements.
Step 2B: no additional elements.

In regard to claim 5 (Currently Amended)

Step 2A Prong 2: "the measure is the difference between a logarithmic function of the probability of the further observation under the first probability distribution, and the logarithmic function of the probability of the further observation under the second probability distribution" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "the measure is the difference between a logarithmic function of the probability of the further observation under the first probability distribution, and the logarithmic function of the probability of the further observation under the second probability distribution" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 6 (Currently Amended)

Step 2A Prong 2: "in which each statistical model is defined by a respective adaptive system defined by a plurality of parameters" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B: "each statistical model is defined by a respective adaptive system defined by a plurality of parameters" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 7 (Original)

Step 2A Prong 2: "each adaptive system comprises a respective probability distribution generation unit which receives an encoding of the respective history and data encoding actions performed by the agent after the most recent action recorded in the respective history, the probability generation unit being arranged to generate a probability distribution for the further observation over the plurality of states of the system" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "each adaptive system comprises a respective probability distribution generation unit which receives an encoding of the respective history and data encoding actions performed by the agent after the most recent action recorded in the respective history, the probability generation unit being arranged to generate a probability distribution for the further observation over the plurality of states of the system" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 8 (Original)

Step 2A Prong 2: "the probability generation unit is a multi-layer perceptron" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "the probability generation unit is a multi-layer perceptron" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 9 (Currently Amended)

Step 2A Prong 2: "in which the encoding of the history of previous observations and actions is generated by a recurrent unit which successively receives the actions and corresponding representations of the resulting observations" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "in which the encoding of the history of previous observations and actions is generated by a recurrent unit which successively receives the actions and corresponding representations of the resulting observations" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 10 (Original)

Step 2A Prong 2: "the representation of each observation is obtained as the output of a convolutional model which receives the observation" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "the representation of each observation is obtained as the output of a convolutional model which receives the observation" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 11 (Currently Amended)

Step 2A Prong 2: "the adaptive system for the second statistical model is a multi-action probability generation unit operative to generate a respective probability distribution over the possible observations of the state of the system following the successive application of a corresponding plural number of actions generated by the neural network and input to the multi-action probability generation unit" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "the adaptive system for the second statistical model is a multi-action probability generation unit operative to generate a respective probability distribution over the possible observations of the state of the system following the successive application of a corresponding plural number of actions generated by the neural network and input to the multi-action probability generation unit" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 12 (Currently Amended)

Step 2A Prong 2: "the reward value further comprises a reward component indicative of an extent to which the action performs a task" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "the reward value further comprises a reward component indicative of an extent to which the action performs a task" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 13 (Currently Amended)

Step 2A Prong 2: "neural network is a policy network which outputs a probability distribution over possible actions, the action being generated as a sample from the probability distribution" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "neural network is a policy network which outputs a probability distribution over possible actions, the action being generated as a sample from the probability distribution" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 14 (Currently Amended)

Step 2A Prong 2: "each observation is a sample from a probability distribution based on the state of the environment" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
Step 2B: "each observation is a sample from a probability distribution based on the state of the environment" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 16 (Currently Amended)

Step 2A Prong 1: "a current observation characterizing a current state of the environment at a current time" is a mental step of analyzing time-series data. "(d) generating a reward value for the generated action as a measure of the divergence between" is a mental step of comparing time-series data points. "under a first statistical model of the environment that generates a first probability distribution over a set of possible observations of states of the environment given a first history of past observations characterizing past states of the environment and past actions performed by the agent and (ii) the likelihood of the further observation of the further state of the environment" is a math function. "under a second, different statistical model of the environment that generates a second probability distribution over the set of possible observations of states of the environment given a second history of past observations characterizing past states of the environment and past actions performed by the agent" is a math function. "[[the]]a most recent past observation in the first history is more recent than [[the]]a most recent past observation in the second history;" is a mental step of data identification in a series of data.

Additional Elements

Step 2A Prong 2: "One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for generating actions to be performed by an agent interacting with an environment, the environment taking at successive times a corresponding one of a plurality of states", recited in the preamble, does not integrate the judicial exception into a practical application. This is a mere instruction to apply the judicial exception on a generic computer. See MPEP 2106.05(h). "using [[the]]a neural network to generate an action based on" is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h). "causing the agent to perform the generated action on the environment" does not integrate the judicial exception into a practical application. The element is performing a generic computer function. See MPEP 2106.05(h). "(c) obtaining a further observation of a further [[the]]state" does not integrate the judicial exception into a practical application. The element is data gathering and transmission. See MPEP 2106.05(h). "the environment transitioned into following the performance of the generated action by the agent" is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h). "wherein: and training the neural network based on the reward value for the generated action" is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h).

Step 2B: "One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for", recited in the preamble, does not amount to more than the judicial exception in the claim. This is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h). "using [[the]]a neural network to generate an action based on" is a mere instruction to apply the judicial exception on a generic computer and does not amount to more than the judicial exception in the claim. See MPEP 2106.05(h). "causing the agent to perform the generated action on the environment" does not amount to more than the judicial exception in the claim. The element is performing a generic computer function. See MPEP 2106.05(h). "(c) obtaining a further observation of a further [[the]]state" does not amount to more than the judicial exception in the claim. The element is data gathering and transmission. See MPEP 2106.05(h). "the environment transitioned into following the performance of the generated action by the agent" is a mere instruction to apply the judicial exception on a generic computer and does not amount to more than the judicial exception in the claim. See MPEP 2106.05(h). "wherein: and training the neural network based on the reward value for the generated action" is a mere instruction to apply the judicial exception on a generic computer and does not amount to more than the judicial exception in the claim. See MPEP 2106.05(h).

In regard to claim 17 (Currently Amended)

Step 2A Prong 1: "a current observation characterizing a current state of the environment at a current time" is a mental step of analyzing time-series data. "(d) generating a reward value for the generated action as a measure of the divergence between" is a mental step of comparing time-series data points. "under a first statistical model of the environment that generates a first probability distribution over a set of possible observations of states of the environment given a first history of past observations characterizing past states of the environment and past actions performed by the agent and (ii) the likelihood of the further observation of the further state of the environment" is a math function. "under a second, different statistical model of the environment that generates a second probability distribution over the set of possible observations of states of the environment given a second history of past observations characterizing past states of the environment and past actions performed by the agent" is a math function. "[[the]]a most recent past observation in the first history is more recent than [[the]]a most recent past observation in the second history;" is a mental step of data identification in a series of data.

Step 2A Prong 2: "A training system implemented by one or more computers and for training a neural network which is operative to generate actions to be performed by an agent interacting with an environment, the environment taking at successive times a corresponding one of a plurality of states, the method comprising" is a generic computer function and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h).
"using [[the]]a neural network to generate an action based on" is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h). "causing the agent to perform the generated action on the environment" does not integrate the judicial exception into a practical application. The element is performing a generic computer function. See MPEP 2106.05(h). "the environment transitioned into following the performance of the generated action by the agent" is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h). "training the neural network based on the reward value for the generated action" is a mere instruction to apply the judicial exception on a generic computer and does not integrate the judicial exception into a practical application. See MPEP 2106.05(h).

Step 2B: "A training system implemented by one or more computers and for training a neural network which is operative to generate actions to be performed by an agent on an environment" does not amount to more than the judicial exception in the claim. This is a mere instruction to apply the judicial exception on a generic computer. See MPEP 2106.05(h). "using [[the]]a neural network to generate an action based on" is a mere instruction to apply the judicial exception on a generic computer and does not amount to more than the judicial exception in the claim. See MPEP 2106.05(h). "causing the agent to perform the generated action on the environment" does not amount to more than the judicial exception in the claim. The element is performing a generic computer function. See MPEP 2106.05(h). "(c) obtaining a further observation of a further [[the]]state" does not amount to more than the judicial exception in the claim. The element is data gathering and transmission. See MPEP 2106.05(h). "the environment transitioned into following the performance of the generated action by the agent" is a mere instruction to apply the judicial exception on a generic computer and does not amount to more than the judicial exception in the claim. See MPEP 2106.05(h). "training the neural network based on the reward value for the generated action" is a mere instruction to apply the judicial exception on a generic computer and does not amount to more than the judicial exception in the claim. See MPEP 2106.05(h).

In regard to claim 19 (Currently Amended)

Step 2A Prong 2: "the measure is the difference between a logarithmic function of the probability of the further observation under the first probability distribution, and the logarithmic function of the probability of the further observation under the second probability distribution" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "the measure is the difference between a logarithmic function of the probability of the further observation under the first probability distribution, and the logarithmic function of the probability of the further observation under the second probability distribution" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
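For clarity, the measure recited in claims 5 and 19 is a difference of log-probabilities of the further observation under the two probability distributions. The notation below (r for the measure, o' for the further observation, P1 and P2 for the first and second probability distributions, h1 and h2 for the first and second histories) is introduced here only for illustration and does not appear in the claims:

```latex
r \;=\; \log P_{1}\!\left(o' \mid h_{1}\right) \;-\; \log P_{2}\!\left(o' \mid h_{2}\right)
```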
In regard to claim 20 (Currently Amended)

Step 2A Prong 2: "for each statistical model, a respective adaptive system defined by a plurality of parameters" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "for each statistical model, a respective adaptive system defined by a plurality of parameters" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 21 (Original)

Step 2A Prong 2: "each adaptive system comprises a respective probability distribution generation unit arranged to receive an encoding of the respective history and data encoding actions performed by the agent after the most recent action recorded in the respective history, the probability generation unit being arranged to generate a probability distribution for the further observation over the plurality of states of the system" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "each adaptive system comprises a respective probability distribution generation unit arranged to receive an encoding of the respective history and data encoding actions performed by the agent after the most recent action recorded in the respective history, the probability generation unit being arranged to generate a probability distribution for the further observation over the plurality of states of the system" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 22 (Original)

Step 2A Prong 2: "the probability generation unit is a multi-layer perceptron" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "the probability generation unit is a multi-layer perceptron" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

In regard to claim 23 (Currently Amended)

Step 2A Prong 2: "the encoding of the history of previous observations and actions is generated by a recurrent unit which successively receives the actions and corresponding representations of the resulting observations" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "the encoding of the history of previous observations and actions is generated by a recurrent unit which successively receives the actions and corresponding representations of the resulting observations" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).
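As an illustration of the kind of architecture recited in claims 20-23 (and the corresponding claims 7-10), the sketch below combines a convolutional model that produces a representation of each observation, a recurrent unit that successively receives actions and those representations to encode the history, and a multi-layer perceptron that maps the encoding to a probability distribution over possible observations. It is a hypothetical PyTorch example assembled for this discussion; the module, layer sizes and tensor layout are assumptions, not the applicant's implementation.

```python
# Illustrative sketch only (PyTorch). A hypothetical "adaptive system" of the kind
# described in claims 20-24: conv representation of observations (claims 10/24),
# recurrent encoding of the (action, observation) history (claims 9/23), and an
# MLP probability generation unit (claims 8/22). Not the applicant's implementation.
import torch
import torch.nn as nn

class AdaptiveSystem(nn.Module):
    def __init__(self, num_obs_classes, num_actions, obs_channels=3, hidden=128):
        super().__init__()
        # Convolutional model producing a representation of each observation.
        self.conv = nn.Sequential(
            nn.Conv2d(obs_channels, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Recurrent unit that successively receives (observation repr., action) pairs.
        self.rnn = nn.GRU(input_size=32 + num_actions, hidden_size=hidden, batch_first=True)
        # Probability generation unit: an MLP producing a distribution over observations.
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_obs_classes))

    def forward(self, observations, actions_onehot):
        # observations: (batch, time, channels, H, W); actions_onehot: (batch, time, num_actions)
        b, t = observations.shape[:2]
        reps = self.conv(observations.flatten(0, 1)).view(b, t, -1)
        encoding, _ = self.rnn(torch.cat([reps, actions_onehot], dim=-1))
        logits = self.mlp(encoding[:, -1])        # encoding of the history at the last step
        return torch.softmax(logits, dim=-1)      # probability distribution over observations
```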
In regard to claim 24 (Currently Amended)

Step 2A Prong 2: "the representation of each observation is obtained as the output of a convolutional model which receives the observation" does not integrate the judicial exception into a practical application. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Step 2B: "the representation of each observation is obtained as the output of a convolutional model which receives the observation" does not amount to more than the judicial exception in the claim. These additional elements are merely directed to using a computer as a tool to perform an abstract idea. See MPEP 2106.05(h).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 6-7, 11-12, 16-17, and 20-21 are rejected under 35 U.S.C. 103 as unpatentable over Graepel et al. (hereinafter Graepel), US 2018/0032863 A1, in view of Horvitz (hereinafter Horvitz), US 2006/0291580 A1, further in view of Mimi Liljeholm (hereinafter Liljeholm), "Neural Correlates of the Divergence of Instrumental Probability Distributions," The Journal of Neuroscience, Jul. 24, 2014, pages 12519-12527.

In regard to claim 1 (Currently Amended)

Graepel discloses:

- A method for generating actions to be performed by an agent interacting with an environment, the environment taking at successive times a corresponding one of a plurality of states: in [0010]: FIG. 2 is a flow diagram of an example process for training a collection of neural networks for use in selecting actions to be performed by an agent interacting with an environment; in [0046]: In particular, the action selection subsystem 120 maintains data representing a state tree of the environment 104. The state tree includes nodes that represent states of the environment 104 and directed edges that connect nodes in the tree. An outgoing edge from a first node to a second node in the tree represents an action that was performed in response to an observation characterizing the first state and resulted in the environment transitioning into the second state; in [0007]: By using neural networks in searching the state tree, the amount of computing resources and the time required to effectively select an action to be performed by the agent can be reduced. (BRI: state tree nodes represent the plurality of states.)

- the method at each of a plurality of successive time steps: using [[the]]a neural network to generate an action based on a current observation characterizing a current state of the environment at a current time: in [0014]: This specification generally describes a reinforcement learning system that selects actions to be performed by a reinforcement learning agent interacting with an environment. In order to interact with the environment, the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data; in [0049]: At any given time, the action score for an action represents the current likelihood that the agent 102 will complete the objectives if the action is performed, the visit count for the action is the current number of times that the action has been performed by the agent 102. (BRI: searching for data within a specified period of time can be considered a form of successive time steps. When analyzing data over a defined timeframe, the data points are examined at different points in that time, which can be seen as successive steps within that period.)

- (b) causing the agent to perform the generated action on the environment: in [0049]: At any given time, the action score for an action represents the current likelihood that the agent 102 will complete the objectives if the action is performed.

- (c) obtaining a further observation of a further [[the]]state [[of]] that the environment transitioned into following the performance of the generated action by the agent: in [0015]: Generally, the agent interacts with the environment in order to complete one or more objectives and the reinforcement learning system selects actions in order to maximize the objectives, as represented by numeric rewards received by the reinforcement learning system in response to actions performed by the agent; in [0007]: Actions to be performed by an agent interacting with an environment that has a very large state space can be effectively selected to maximize the rewards resulting from the performance of the action.

- and (e) training the neural network based on the reward value for the generated action: in [0015]: Generally, the agent interacts with the environment in order to complete one or more objectives and the reinforcement learning system selects actions in order to maximize the objectives, as represented by numeric rewards received by the reinforcement learning system in response to actions performed by the agent.
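As context for the state-tree passages of Graepel quoted above ([0046], [0049]), the following is a generic sketch of the kind of per-action bookkeeping those passages refer to, namely an action score and a visit count per outgoing edge. The class, field names, and running-average update are common MCTS-style conventions chosen for illustration, not a reproduction of the reference's method.

```python
# Generic illustration only: minimal state-tree node with per-action statistics of
# the kind the quoted Graepel passages refer to (action scores and visit counts).
from dataclasses import dataclass, field

@dataclass
class StateNode:
    children: dict = field(default_factory=dict)      # action -> child StateNode (edge = action taken)
    action_score: dict = field(default_factory=dict)  # action -> current estimate of success likelihood
    visit_count: dict = field(default_factory=dict)   # action -> times the action has been performed

    def update(self, action, outcome_value):
        """Running-average update of the action score after performing `action`."""
        n = self.visit_count.get(action, 0) + 1
        q = self.action_score.get(action, 0.0)
        self.visit_count[action] = n
        self.action_score[action] = q + (outcome_value - q) / n
```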
Graepel does not explicitly disclose:

- (d) generating a reward value for the generated action as a measure of the divergence between (i) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a first statistical model of the environment that generates a first probability distribution over a set of possible observations of states of the environment given a first history of past observations characterizing past states of the environment and past actions performed by the agent

- (ii) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a second, different statistical model of the environment that generates a second probability distribution over the set of possible observations of states of the environment given a second history of past observations characterizing past states of the environment and past actions performed by the agent,

- wherein: and [[the]]a most recent past observation in the first history is more recent than [[the]]a most recent past observation in the second history;

However, Liljeholm discloses:

- (d) generating a reward value for the generated action as a measure of the divergence between (i) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a first statistical model of the environment that generates a first probability distribution over a set of possible observations of states of the environment given a first history of past observations characterizing past states of the environment and past actions performed by the agent:

In [§ Results, page 12523]: A possible alternative explanation for our effects of JS divergence is that the IPL and SMA are encoding simpler representations of outcome probabilities. (BRI: IPL and SMA are two environments that relate two probability distributions, in which IPL and SMA are related to a medical area.) In [§ Results, page 12523]: To formally determine which of the three variables provided the best account of neural activity in the SMA and IPL, we performed a Bayesian model selection analysis. In [§ Introduction, page 12519]: values of actions are acquired by means of a reward prediction error (RPE), and a "model-based" class that constructs a mental map of the environment and generates decisions by flexibly combining estimates of state-transition probabilities. In [Abstract, page 12519]: Reinforcement learning theories formalize this map as a set of stochastic relationships between actions and states, such that for any given action considered in a current state, a probability distribution is specified over possible outcome states. (BRI: The probability distribution specified over possible outcome states may incorporate a past history of observations.) In [§ Materials and Methods, page 12522]: we entered as modulators the entropy conditional on the chosen action, the entropy conditional on both available actions, and the JS divergence of outcome probability distributions of available actions. For the outcome regressor, we entered as modulators, in order, the RPE, the SPE, and the utility of the received outcome. In [§ Results, page 12523]: To empirically assess the neural effects of simpler representations of outcome probabilities relative to those of JS divergence we specified two additional GLMs that were identical to our original model except for the replacement of the JS divergence modulator with a regressor modeling the difference between or the sum of reward probabilities. (BRI: In neural networks, a "divergence modulator" refers to a mechanism where a single neuron's output signal spreads out to influence multiple other neurons or even the entire network. The first GLM (general linear model) of the two additional GLMs is the first statistical model.) In [§ Introduction, page 12520]: Our primary objective was to assess neural correlates of the difference between outcome distributions associated with alternative actions, formalized as Jensen-Shannon (JS) divergence, a measure that quantifies the distance between probability distributions. The relationship between JS divergence and other decision variables is illustrated in Figure 1B. (BRI: Fig. 1B shows the JS divergence between the two distributions. Within the context of two statistical models, the first statistical model relates to the first probability distribution.) In [§ Introduction, page 12520]: The probabilities with which actions produced their outcomes were generated so as to minimize correlations between our three decision variables (i.e., between outcome probabilities, outcome values, and action values). (BRI: an outcome value, or value associated with a potential outcome, is an "observation".) In [§ Results, page 12523]: We implemented a model-based RL learner, which uses experience with state transitions to update a matrix, T(s,a,s'), of state transition probabilities, where each element of T(s,a,s') holds the current estimate of the probability of transitioning from state s to s' given action a.

- (ii) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a second, different statistical model of the environment that generates a second probability distribution over the set of possible observations of states of the environment given a second history of past observations characterizing past states of the environment and past actions performed by the agent:

In [§ Results, page 12523]: A possible alternative explanation for our effects of JS divergence is that the IPL and SMA are encoding simpler representations of outcome probabilities. (BRI: IPL and SMA are two environments that relate two probability distributions, in which IPL and SMA are related to a medical area.) In [§ Results, page 12523]: To formally determine which of the three variables provided the best account of neural activity in the SMA and IPL, we performed a Bayesian model selection analysis. In [§ Introduction, page 12519]: values of actions are acquired by means of a reward prediction error (RPE), and a "model-based" class that constructs a mental map of the environment and generates decisions by flexibly combining estimates of state-transition probabilities. In [Abstract, page 12519]: Reinforcement learning theories formalize this map as a set of stochastic relationships between actions and states, such that for any given action considered in a current state, a probability distribution is specified over possible outcome states. (BRI: The probability distribution specified over possible outcome states may incorporate a past history of observations.) In [§ Materials and Methods, page 12522]: we entered as modulators the entropy conditional on the chosen action, the entropy conditional on both available actions, and the JS divergence of outcome probability distributions of available actions. For the outcome regressor, we entered as modulators, in order, the RPE, the SPE, and the utility of the received outcome. In [§ Results, page 12523]: To empirically assess the neural effects of simpler representations of outcome probabilities relative to those of JS divergence we specified two additional GLMs that were identical to our original model except for the replacement of the JS divergence modulator with a regressor modeling the difference between or the sum of reward probabilities. (BRI: In neural networks, a "divergence modulator" refers to a mechanism where a single neuron's output signal spreads out to influence multiple other neurons or even the entire network. The second GLM (general linear model) of the two additional GLMs is the second statistical model.) In [§ Introduction, page 12520]: Our primary objective was to assess neural correlates of the difference between outcome distributions associated with alternative actions, formalized as Jensen-Shannon (JS) divergence, a measure that quantifies the distance between probability distributions. The relationship between JS divergence and other decision variables is illustrated in Figure 1B. (BRI: Fig. 1B shows the JS divergence between the two distributions. Within the context of two statistical models, the second statistical model relates to the second probability distribution.) In [§ Introduction, page 12520]: The probabilities with which actions produced their outcomes were generated so as to minimize correlations between our three decision variables (i.e., between outcome probabilities, outcome values, and action values). (BRI: an outcome value, or value associated with a potential outcome, is an "observation".) In [§ Results, page 12523]: We implemented a model-based RL learner, which uses experience with state transitions to update a matrix, T(s,a,s'), of state transition probabilities, where each element of T(s,a,s') holds the current estimate of the probability of transitioning from state s to s' given action a.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel and Liljeholm. Graepel teaches an agent interacting with an environment in a time series and generating actions as defined by the neural network parameters by observing the states of the environment. Liljeholm teaches two probability distributions for a set of observation states. One of ordinary skill would have had motivation to combine Graepel and Liljeholm to reduce the demand for a computationally costly binding of outcome probabilities with utilities (Liljeholm, § Introduction, page 12519).

Graepel and Liljeholm do not explicitly disclose:

- wherein: and [[the]]a most recent past observation in the first history is more recent than [[the]]a most recent past observation in the second history;

However, Horvitz discloses:

- wherein: and [[the]]a most recent past observation in the first history is more recent than [[the]]a most recent past observation in the second history: (BRI: the statement implies that one historical record is newer than the most recent past event in another historical record.) In [0137]: The contactee data 520 may also include a context data 522. The context data 522 is generally related to observations about the contactee. For example, observations concerning the type of activity in which the contactee is involved (e.g., on task, not on task), location of the contactee (e.g., office, home, car, shower), calendar (e.g., appointment status, appointment availability), history of communications with other party (e.g., have replied to email in the past). In [0137]: While seven observations are listed in the preceding sentence it is to be appreciated that a greater or lesser number of observations may be stored in the context data 522. In [0053]: Concerning utility of a communication, the value of a current potential communication can be evaluated by considering a measure of the history of utility of communication. In [0071]: The reliability predictor 222 can consider various factors associated with the communication 210 in predicting the reliability of the communication. One factor that is considered in predicting the reliability of a communication is the length of the communication. For example, a longer communication may be more subject to degradation due to more opportunities for failure. For some communications, the length may be known. By way of illustration, the length of an email message, or a file transfer may be determined. However, for other communications, the length may not be known (e.g., the length of a phone call may not be pre-determined, although it may be inferred).
Thus, the length of the communication may be predicted, for example, by analyzing historical data associated with previous communications between the communicating parties that occurred via the channels being considered. In [0182]: The reliability history of the communication channel may include, for example, data concerning dates, times and durations of recent degradations. The reliability history of the current communication may include, for example, data concerning degradation of the current communication. For example, if the current communication is a satellite telephone call, then the reliability history of the current communication may include an average signal to noise ratio, a maximum signal to noise ratio, and the number of times the call has been dropped. (BRI: the past is sending the email message (recent in the past) and the current is the telephone conversation (more recent in the current), within the context of a history of communications.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz. Graepel teaches an agent interacting with an environment in a time series and generating actions as defined by the neural network parameters by observing the states of the environment. Liljeholm teaches two probability distributions, each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. One of ordinary skill would have had motivation to combine Graepel, Liljeholm and Horvitz to optimize the utility of the communication (Horvitz [0180]).

In regard to claim 6 (Currently Amended)

Graepel and Liljeholm do not explicitly disclose:

- each statistical model is defined by a respective adaptive system defined by a plurality of parameters.

However, Horvitz discloses:

- each statistical model is defined by a respective adaptive system defined by a plurality of parameters: in [0011]: multiple attributes concerning people, including their preferences, contexts, tasks and priorities are analyzed and to further facilitate establishing and adapting communication policies for people; in [0114]: The present invention may order the channels by assigned expected value and attempt to create a connection or to advise the contactor and/or contactee concerning the best connection. While this expected value can be employed to initially identify the channel that is predicted to maximize the utility of the communication 410, in one example of the present invention the contactee 430 will be presented with options concerning the communication. The contactee 430 reaction to the options will then determine the channel that is selected for the communication 410. The reactions to the options can be employed in machine learning that facilitates adapting the channel manager 402; in [0185]: responses to options can be employed in ongoing machine learning to facilitate improving the performance of the method 900 by, for example, adapting the rules of 930 and 940; in [0179]: one decision may be made using simple priority rules; in [0179]: a decision may be made employing decision-theoretic reasoning concerning the value of the communication given a consideration of the uncertainties about the context. In addition, the decisions can be made sensitive to dates and times, considering specific assertions about particular time horizons to guide communications. (BRI: date and time is a parameter.) In [0179]: The rules may be selected on other parameters including, but not limited to, the number of matching preferences, the number of matching capabilities, the nature and quality of the contexts, the type and number of communications requested and the time critical nature of the desired communication.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz. Graepel teaches an agent interacting with an environment in a time series and generating actions as defined by the neural network parameters by observing the states of the environment. Liljeholm teaches two probability distributions, each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. One of ordinary skill would have had motivation to combine Graepel, Liljeholm and Horvitz to optimize the utility of the communication (Horvitz [0180]).

In regard to claim 7 (Original)

Graepel and Liljeholm do not explicitly disclose:

- each adaptive system comprises a respective probability distribution generation unit
- receives an encoding of the respective history and data encoding actions performed by the agent after the most recent action recorded in the respective history
- the probability generation unit being arranged to generate a probability distribution for the further observation over the plurality of states of the system.

However, Horvitz discloses:

- each adaptive system comprises a respective probability distribution generation unit: in [0228]: FIG. 23 is extended from the model in FIG. 22 to include considerations of the time expected to be required for a communication at 2310, as a function of the identity and goals of the communication, and the influence of a probability distribution over the time of the communication, and the channel reliability, on the likelihood of having different losses of fidelity and the dropping of the communication channel. These include the cost of loss of fidelity and cost of dropped connection, which influence the expected utility of the communication channel decision.

- receives an encoding of the respective history and data encoding actions performed by the agent after the most recent action recorded in the respective history: in [0053]: an expected utility can be computed through the use of decision models such as influence diagrams, which encode a set of random variables, representing key aspects of context, cost, and value, as well as potential communication actions, and overall preferences about outcomes; in [0073]: Data stored in the channel data store 250 can include, but is not limited to, reliability history of the communication channel; in [0053]: Concerning utility of a communication, the value of a current potential communication can be evaluated by considering a measure of the history of utility of communication; in [0092]: The reliability predictor 222 may alter the predicted reliability result based on the history of the current communication.

- the probability generation unit being arranged to generate a probability distribution for the further observation over the plurality of states of the system:
In [0111]: the channel manager 402 may rely on one or more conditional probabilities associated with the contactee 430 location based on information like time of day, day of the week and current task. In [0112]: there may be uncertainty concerning communication channels, preferences and one or more parameters employed to model a context. In this situation, a probability distribution over the different states of each variable can be inferred and expected values for each channel can be computed.

In regard to claim 11 (Currently Amended)

Graepel and Liljeholm do not explicitly disclose:

- the adaptive system for the second statistical model is a multi-action probability generation unit operative to generate a respective probability distribution over the possible observations of the state of the system following the successive application of a corresponding plural number of actions generated by the neural network and input to the multi-action probability generation unit.

However, Horvitz discloses: in [0114]: The reactions to the options can be employed in machine learning that facilitates adapting the channel manager 402; in [0185]: responses to options can be employed in ongoing machine learning to facilitate improving the performance of the method 900 by, for example, adapting the rules of 930 and 940; in [0228]: FIG. 23 is extended from the model in FIG. 22 to include considerations of the time expected to be required for a communication at 2310, as a function of the identity and goals of the communication, and the influence of a probability distribution over the time of the communication, and the channel reliability, on the likelihood of having different losses of fidelity and the dropping of the communication channel. These include the cost of loss of fidelity and cost of dropped connection, which influence the expected utility of the communication channel decision; in [0225]: FIGS. 21-23 represent influence diagrams 2100, 2200, and 2300 capturing in more general form the decision problem; in [0225]: The influence-diagram model includes key random variables (oval nodes), actions (square node), and the overall value of the outcome of actions; in [0225]: Influence diagram processing algorithms can be employed to identify the action with the highest expected utility; in [0111]: the channel manager 402 may rely on one or more conditional probabilities associated with the contactee 430 location based on information like time of day, day of the week and current task; in [0112]: a probability distribution over the different states of each variable can be inferred and expected values for each channel can be computed.

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz. Graepel teaches an agent interacting with an environment in a time series and generating actions as defined by the neural network parameters by observing the states of the environment. Liljeholm teaches two probability distributions, each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. One of ordinary skill would have had motivation to combine Graepel, Liljeholm and Horvitz to optimize the utility of the communication (Horvitz [0180]).

In regard to claim 12 (Currently Amended)

Graepel and Liljeholm do not explicitly disclose:

- the reward value further comprises a reward component indicative of an extent to which the action performs a task.

However, Horvitz discloses:

- the reward value further comprises a reward component indicative of an extent to which the action performs a task: in [0053]: employing the principles of maximum expected utility for optimization provides a useful method for computing the value of different communication actions; in [0053]: utility represents communication effectiveness correlated to adherence to user preferences. Such effectiveness can be measured by factors including, but not limited to, reliability achieved on the communication channel, quantity of information content transferred, quality of information content transferred, and relevancy of information content transferred.

In regard to claim 16 (Currently Amended)

Graepel discloses:

- One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations: in [0103], in [0104].

- generating actions to be performed by an agent interacting with an environment, the environment taking at successive times a corresponding one of a plurality of states, the operations comprising, at each of a plurality of successive time steps: in [0014]: This specification generally describes a reinforcement learning system that selects actions to be performed by a reinforcement learning agent interacting with an environment. In order to interact with the environment, the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data; in [0049]: At any given time, the action score for an action represents the current likelihood that the agent 102 will complete the objectives if the action is performed, the visit count for the action is the current number of times that the action has been performed by the agent 102. (BRI: searching for data within a specified period of time can be considered a form of successive time steps. When analyzing data over a defined timeframe, the data points are examined at different points in that time, which can be seen as successive steps within that period.)

- (a) using [[the]]a neural network to generate an action based on a current observation characterizing a current state of the environment at a current time: in [0014]: This specification generally describes a reinforcement learning system that selects actions to be performed by a reinforcement learning agent interacting with an environment. In order to interact with the environment, the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data; in [0049]: At any given time, the action score for an action represents the current likelihood that the agent 102 will complete the objectives if the action is performed, the visit count for the action is the current number of times that the action has been performed by the agent 102.

- (b) causing the agent to perform the generated action on the environment: in [0049]: At any given time, the action score for an action represents the current likelihood that the agent 102 will complete the objectives if the action is performed.

- (c) obtaining a further observation of a further [[the]]state [[of]] that the environment transitioned into following the performance of the generated action by the agent: in [0015]: Generally, the agent interacts with the environment in order to complete one or more objectives and the reinforcement learning system selects actions in order to maximize the objectives, as represented by numeric rewards received by the reinforcement learning system in response to actions performed by the agent; in [0007]: Actions to be performed by an agent interacting with an environment that has a very large state space can be effectively selected to maximize the rewards resulting from the performance of the action.

- training the neural network based on the reward value for the generated action: in [0015]: Generally, the agent interacts with the environment in order to complete one or more objectives and the reinforcement learning system selects actions in order to maximize the objectives, as represented by numeric rewards received by the reinforcement learning system in response to actions performed by the agent.

Graepel does not explicitly disclose:

- (d) generating a reward value for the generated action as a measure of the divergence between (i) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a first statistical model of the environment that generates a first probability distribution over a set of possible observations of states of the environment given a first history of past observations characterizing past states of the environment and past actions performed by the agent

- (ii) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a second, different statistical model of the environment that generates a second probability distribution over the set of possible observations of states of the environment given a second history of past observations characterizing past states of the environment and past actions performed by the agent, wherein: a most recent past observation in the first history is more recent than a most recent past observation in the second history;

However, Liljeholm discloses:

- (d) generating a reward value for the generated action as a measure of the divergence between (i) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a first statistical model of the environment that generates a first probability distribution over a set of possible observations of states of the environment given a first history of past observations characterizing past states of the environment and past actions performed by the agent:

In [§ Results, page 12523]: A possible alternative explanation for our effects of JS divergence is that the IPL and SMA are encoding simpler representations of outcome probabilities. (BRI: IPL and
SMA are related to a medical area) In [ § Results, Page 12523]: To formally determine which of the three variables provided the best account of neural activity in the SMA and IPL, we performed a Bayesian model selection analysis, In [§ Introduction, Page 12519]: values of actions are acquired by means of a reward prediction error (RPE), and a "model-based" class that constructs a mental map of the environment and generates decisions by flexibly combining estimates of state-transition probabilities In [Abstract, Page 12519]: Reinforcement learning theories formalize this map as a set of stochastic relationships between actions and states, such that for any given action considered in a current state, a probability distribution is specified over possible outcome states, (BRI: The probability distribution specified over possible outcome states may incorporate past history of observations) In [ § Materials and Methods , Page 12522]: we entered as modulators the entropy conditional on the chosen action, the entropy conditional on both available actions, and the JS divergence of outcome probability distributions of available actions. For the outcome regressor, we entered as modulators, in order, the RPE, the SPE, and the utility of the received outcome, In [ § Results , Page 12523]: To empirically assess the neural effects of simpler representations of outcome probabilities relative to those of JS divergence we specified two additional GLMs that were identical to our original model except for the replacement of the JS divergence modulator with a regressor modeling the difference between or the sum of reward probabilities, (BRI: In neural networks, "divergence modulator" refers to a mechanism where a single neuron's output signal spreads out to influence multiple other neurons or even the entire network. The first GLM (general linear model) among the two additional GLMs is the first statistical model) In [ § Introduction , Page 12520]: Our primary objective was to assess neural correlates of the difference between outcome distributions associated with alternative actions, formalized as Jensen-Shannon (JS) divergence, a measure that quantifies the distance between probability distributions. The relationship between JS divergence and other decision variables is illustrated in Figure 1B (BRI: The Fig 1B shows the JS divergence between the two distributions. Within the context of two statistical models, the first statistical model relates to the first probability distribution) In [ § Introduction , Page 12520]: The probabilities with which actions produced their outcomes were generated so as to minimize correlations between our three decision variables (i.e., between outcome probabilities, outcome values, and action values) (BRI: outcome value or value associated with potential outcome is an "observation") In [ § Results , Page 12523]: We implemented a model-based RL learner, which uses experience with state transitions to update a matrix, T(s,a,s'), of state transition probabilities, where each element of T(s,a,s') holds the current estimate of the probability of transitioning from state s to s' given action a. (A purely illustrative sketch of computing a divergence between two such predictive distributions is set out immediately below.)
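For illustration only, and not drawn from Graepel, Liljeholm, or the claims themselves, the following sketch outlines how a divergence-based reward of the kind recited in claims 1, 16 and 17 could be computed. The count-based categorical model, the four-state observation space, the recency offset H, and the use of a log-likelihood difference (cf. claims 5 and 19) alongside a Jensen-Shannon divergence are all assumptions made for the sketch.

import numpy as np

def categorical_predictive(history):
    # Hypothetical placeholder "statistical model": map a history of
    # (observation, action) pairs to a categorical distribution over 4 states.
    counts = np.ones(4)              # uniform prior keeps all probabilities > 0
    for obs, _action in history:
        counts[obs] += 1.0           # simple count-based predictive model
    return counts / counts.sum()

def divergence_reward(further_obs, history, H=1):
    # First model conditions on the full history; the second model conditions on
    # a history whose most recent observation is H steps older (H > 0 assumed).
    p1 = categorical_predictive(history)         # first statistical model
    p2 = categorical_predictive(history[:-H])    # second, less recent statistical model
    # Difference of logarithmic likelihoods of the further observation (claims 5, 19)
    log_ratio = np.log(p1[further_obs]) - np.log(p2[further_obs])
    # Jensen-Shannon divergence between the two full predictive distributions
    m = 0.5 * (p1 + p2)
    js = 0.5 * np.sum(p1 * np.log(p1 / m)) + 0.5 * np.sum(p2 * np.log(p2 / m))
    return log_ratio, js

On this sketch, a large value of either quantity indicates that the newest observation materially changed the model's predictions, which is the exploration signal the claimed reward is intended to capture.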
- (ii) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a second, different statistical model of the environment that generates a second probability distribution over the set of possible observations of states of the environment given a second history of past observations characterizing past states of the environment and past actions performed by the agent, In [ § Results , Page 12523]: A possible alternative explanation for our effects of JS divergence is that the IPL and SMA are encoding simpler representations of outcome probabilities, (BRI: IPL and SMA are two environments that relate two probability distributions in which IPL and SMA are related to a medical area) In [ § Results, Page 12523]: To formally determine which of the three variables provided the best account of neural activity in the SMA and IPL, we performed a Bayesian model selection analysis, In [§ Introduction, Page 12519]: values of actions are acquired by means of a reward prediction error (RPE), and a "model-based" class that constructs a mental map of the environment and generates decisions by flexibly combining estimates of state-transition probabilities In [Abstract, Page 12519]: Reinforcement learning theories formalize this map as a set of stochastic relationships between actions and states, such that for any given action considered in a current state, a probability distribution is specified over possible outcome states, (BRI: The probability distribution specified over possible outcome states may incorporate past history of observations) In [ § Materials and Methods , Page 12522]: we entered as modulators the entropy conditional on the chosen action, the entropy conditional on both available actions, and the JS divergence of outcome probability distributions of available actions. For the outcome regressor, we entered as modulators, in order, the RPE, the SPE, and the utility of the received outcome, In [ § Results , Page 12523]: To empirically assess the neural effects of simpler representations of outcome probabilities relative to those of JS divergence we specified two additional GLMs that were identical to our original model except for the replacement of the JS divergence modulator with a regressor modeling the difference between or the sum of reward probabilities, (BRI: In neural networks, "divergence modulator" refers to a mechanism where a single neuron's output signal spreads out to influence multiple other neurons or even the entire network. The second GLM (general linear model) among the two additional GLMs is the second statistical model) In [ § Introduction , Page 12520]: Our primary objective was to assess neural correlates of the difference between outcome distributions associated with alternative actions, formalized as Jensen-Shannon (JS) divergence, a measure that quantifies the distance between probability distributions. The relationship between JS divergence and other decision variables is illustrated in Figure 1B (BRI: The Fig 1B shows the JS divergence between the two distributions.
Within the context of two statistical models, the second statistical model relates to the second probability distribution) In [ § Introduction , Page 12520]: The probabilities with which actions produced their outcomes were generated so as to minimize correlations between our three decision variables (i.e., between outcome probabilities, outcome values, and action values) (BRI: outcome value or value associated with potential outcome is an "observation") In [ § Results , Page 12523]: We implemented a model-based RL learner, which uses experience with state transitions to update a matrix, T(s,a,s'), of state transition probabilities, where each element of T(s,a,s') holds the current estimate of the probability of transitioning from state s to s' given action a. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel and Liljeholm. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions for a set of observation states. One of ordinary skill would have been motivated to combine Graepel and Liljeholm to reduce the demand for a computationally costly binding of outcome probabilities with utilities (Liljeholm [ § Introduction , Page 12519]). Graepel and Liljeholm do not explicitly disclose: wherein: a most recent past observation in the first history is more recent than a most recent past observation in the second history; However, Horvitz discloses: wherein: a most recent past observation in the first history is more recent than a most recent past observation in the second history; In [0137]: The contactee data 520 may also include a context data 522. The context data 522 is generally related to observations about the contactee. For example, observations concerning the type of activity in which the contactee is involved (e.g., on task, not on task), location of the contactee (e.g., office, home, car, shower), calendar (e.g., appointment status, appointment availability), history of communications with other party (e.g., have replied to email in the past,) In [0137]: While seven observations are listed in the preceding sentence it is to be appreciated that a greater or lesser number of observations may be stored in the context data 522, In [0053]: Concerning utility of a communication, the value of a current potential communication can be evaluated by considering a measure of the history of utility of communication, In [0071]: The reliability predictor 222 can consider various factors associated with the communication 210 in predicting the reliability of the communication. One factor that is considered in predicting the reliability of a communication is the length of the communication. For example, a longer communication may be more subject to degradation due to more opportunities for failure. For some communications, the length may be known. By way of illustration, the length of an email message, or a file transfer may be determined. However, for other communications, the length may not be known (e.g., the length of a phone call may not be pre-determined, although it may be inferred). Thus, the length of the communication may be predicted, for example, by analyzing historical data associated with previous communications between the communicating parties that occurred via the channels being considered.
In [0182]: The reliability history of the communication channel may include, for example, data concerning dates, times and durations of recent degradations. The reliability history of the current communication may include, for example, data concerning degradation of the current communication. For example, if the current communication is a satellite telephone call, then the reliability history of the current communication may include an average signal to noise ratio, a maximum signal to noise ratio, and the number of times the call has been dropped. (BRI: sending the email message is the past observation (recent, in the past) and the telephone conversation is the current observation (more recent, in the current state), within the context of the history of communications) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. One of ordinary skill would have been motivated to combine Graepel, Liljeholm and Horvitz to optimize the utility of the communication (Horvitz [0180]). In regard to claim 17 (Currently Amended) Graepel discloses: - training system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to in [0104], [0037] - perform operations for generating actions to be performed by an agent interacting with an environment, the environment taking at successive times a corresponding one of a plurality of states, the operations comprising, at each of plurality of successive time steps; (a) using a neural network to generate an action based on a current observation characterizing a current state of the environment at a current time; In [0014]: This specification generally describes a reinforcement learning system that selects actions to be performed by a reinforcement learning agent interacting with an environment. In order to interact with the environment, the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data in [0049]: At any given time, the action score for an action represents the current likelihood that the agent 102 will complete the objectives if the action is performed, the visit count for the action is the current number of times that the action has been performed by the agent 102. (BRI: searching for data within a specified period of time can be considered a form of successive time steps.
When you analyze data over a defined timeframe, you are essentially examining the data points at different points in that time, which can be seen as successive steps within that period) - (b) causing the agent to perform the generated action on the environment; In [0049]: At any given time, the action score for an action represents the current likelihood that the agent 102 will complete the objectives if the action is performed - (c) obtaining a further observation of a further state that the environment transitioned into following the performance of the generated action by the agent; In [0015]: Generally, the agent interacts with the environment in order to complete one or more objectives and the reinforcement learning system selects actions in order to maximize the objectives, as represented by numeric rewards received by the reinforcement learning system in response to actions performed by the agent. in [0007]: Actions to be performed by an agent interacting with an environment that has a very large state space can be effectively selected to maximize the rewards resulting from the performance of the action, - training the neural network based on the reward value for the generated action. In [0015]: Generally, the agent interacts with the environment in order to complete one or more objectives and the reinforcement learning system selects actions in order to maximize the objectives, as represented by numeric rewards received by the reinforcement learning system in response to actions performed by the agent. Graepel does not explicitly disclose: - (d) generating a reward value for the generated action for the generated action as a measure of the divergence between (i) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a first statistical model of the environment that generates a first probability distribution over a set of possible observations of states of the environment given a first history of past observations characterizing past states of the environment and past actions performed by the agent - (ii) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a second, different statistical model of the environment that generates a second probability distribution over the set of possible observations of states of the environment given a second history of past observations characterizing past states of the environment and past actions performed by the agent, wherein: a most recent past observation in the first history is more recent than a most recent past observation in the second history; However, Liljeholm discloses: (d) generating a reward value for the generated action for the generated action as a measure of the divergence between (i) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a first statistical model of the environment that generates a first probability distribution over a set of possible observations of states of the environment given a first history of past observations characterizing past states of the environment and past actions performed by the agent in [ § Results , Page 12523]: A possible alternative explanation for our effects of JS divergence is that the IPL 
and SMA are encoding simpler representations of outcome probabilities, (BRI: IPL and SMA are two environments that relate two probability distributions in which IPL and SMA are related to a medical area) In [ § Results, Page 12523]: To formally determine which of the three variables provided the best account of neural activity in the SMA and IPL, we performed a Bayesian model selection analysis, In [§ Introduction, Page 12519]: values of actions are acquired by means of a reward prediction error (RPE), and a "model-based" class that constructs a mental map of the environment and generates decisions by flexibly combining estimates of state-transition probabilities In [Abstract, Page 12519]: Reinforcement learning theories formalize this map as a set of stochastic relationships between actions and states, such that for any given action considered in a current state, a probability distribution is specified over possible outcome states, (BRI: The probability distribution specified over possible outcome states may incorporate past history of observations) In [ § Materials and Methods , Page 12522]: we entered as modulators the entropy conditional on the chosen action, the entropy conditional on both available actions, and the JS divergence of outcome probability distributions of available actions. For the outcome regressor, we entered as modulators, in order, the RPE, the SPE, and the utility of the received outcome, In [ § Results , Page 12523]: To empirically assess the neural effects of simpler representations of outcome probabilities relative to those of JS divergence we specified two additional GLMs that were identical to our original model except for the replacement of the JS divergence modulator with a regressor modeling the difference between or the sum of reward probabilities, (BRI: In neural networks, "divergence modulator" refers to a mechanism where a single neuron's output signal spreads out to influence multiple other neurons or even the entire network. The first GLM (general linear model) among the two additional GLMs is the first statistical model) In [ § Introduction , Page 12520]: Our primary objective was to assess neural correlates of the difference between outcome distributions associated with alternative actions, formalized as Jensen-Shannon (JS) divergence, a measure that quantifies the distance between probability distributions. The relationship between JS divergence and other decision variables is illustrated in Figure 1B (BRI: The Fig 1B shows the JS divergence between the two distributions. Within the context of two statistical models, the first statistical model relates to the first probability distribution) In [ § Introduction , Page 12520]: The probabilities with which actions produced their outcomes were generated so as to minimize correlations between our three decision variables (i.e., between outcome probabilities, outcome values, and action values) (BRI: outcome value or value associated with potential outcome is an "observation") In [ § Results , Page 12523]: We implemented a model-based RL learner, which uses experience with state transitions to update a matrix, T(s,a,s'), of state transition probabilities, where each element of T(s,a,s') holds the current estimate of the probability of transitioning from state s to s' given action a.
(ii) the likelihood of the further observation of the further state of the environment that the environment transitioned into following the performance of the generated action by the agent under a second, different statistical model of the environment that generates a second probability distribution over the set of possible observations of states of the environment given a second history of past observations characterizing past states of the environment and past actions performed by the agent, In [ § Results , Page 12523]: A possible alternative explanation for our effects of JS divergence is that the IPL and SMA are encoding simpler representations of outcome probabilities, (BRI: IPL and SMA are two environments that relate two probability distributions in which IPL and SMA are related to a medical area) In [ § Results, Page 12523]: To formally determine which of the three variables provided the best account of neural activity in the SMA and IPL, we performed a Bayesian model selection analysis, In [§ Introduction, Page 12519]: values of actions are acquired by means of a reward prediction error (RPE), and a "model-based" class that constructs a mental map of the environment and generates decisions by flexibly combining estimates of state-transition probabilities in [Abstract, Page 12519]: Reinforcement learning theories formalize this map as a set of stochastic relationships between actions and states, such that for any given action considered in a current state, a probability distribution is specified over possible outcome states, (BRI: The probability distribution specified over possible outcome states may incorporate past history of observations) In [ § Materials and Methods , Page 12522]: we entered as modulators the entropy conditional on the chosen action, the entropy conditional on both available actions, and the JS divergence of outcome probability distributions of available actions. For the outcome regressor, we entered as modulators, in order, the RPE, the SPE, and the utility of the received outcome, In [ § Results , Page 12523]: To empirically assess the neural effects of simpler representations of outcome probabilities relative to those of JS divergence we specified two additional GLMs that were identical to our original model except for the replacement of the JS divergence modulator with a regressor modeling the difference between or the sum of reward probabilities, (BRI: In neural networks, "divergence modulator" refers to a mechanism where a single neuron's output signal spreads out to influence multiple other neurons or even the entire network. The second GLM (general linear model) among the two additional GLMs is the second statistical model) In [ § Introduction , Page 12520]: Our primary objective was to assess neural correlates of the difference between outcome distributions associated with alternative actions, formalized as Jensen-Shannon (JS) divergence, a measure that quantifies the distance between probability distributions. The relationship between JS divergence and other decision variables is illustrated in Figure 1B (BRI: The Fig 1B shows the JS divergence between the two distributions.
Within the context of two statistical models, the second statistical model relates to the second probability distribution) In [ § Introduction , Page 12520]: The probabilities with which actions produced their outcomes were generated so as to minimize correlations between our three decision variables (i.e., between outcome probabilities, outcome values, and action values) (BRI: outcome value or value associated with potential outcome is an "observation") In [ § Results , Page 12523]: We implemented a model-based RL learner, which uses experience with state transitions to update a matrix, T(s,a,s'), of state transition probabilities, where each element of T(s,a,s') holds the current estimate of the probability of transitioning from state s to s' given action a. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel and Liljeholm. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions for a set of observation states. One of ordinary skill would have been motivated to combine Graepel and Liljeholm to reduce the demand for a computationally costly binding of outcome probabilities with utilities (Liljeholm [ § Introduction , Page 12519]). Graepel and Liljeholm do not explicitly disclose: wherein: a most recent past observation in the first history is more recent than a most recent past observation in the second history; However, Horvitz discloses: wherein: a most recent past observation in the first history is more recent than a most recent past observation in the second history; (BRI: the statement implies that one historical record is newer than the most recent past event in another historical record.) In [0137]: The contactee data 520 may also include a context data 522. The context data 522 is generally related to observations about the contactee. For example, observations concerning the type of activity in which the contactee is involved (e.g., on task, not on task), location of the contactee (e.g., office, home, car, shower), calendar (e.g., appointment status, appointment availability), history of communications with other party (e.g., have replied to email in the past,) In [0137]: While seven observations are listed in the preceding sentence it is to be appreciated that a greater or lesser number of observations may be stored in the context data 522, In [0053]: Concerning utility of a communication, the value of a current potential communication can be evaluated by considering a measure of the history of utility of communication, In [0071]: The reliability predictor 222 can consider various factors associated with the communication 210 in predicting the reliability of the communication. One factor that is considered in predicting the reliability of a communication is the length of the communication. For example, a longer communication may be more subject to degradation due to more opportunities for failure. For some communications, the length may be known. By way of illustration, the length of an email message, or a file transfer may be determined. However, for other communications, the length may not be known (e.g., the length of a phone call may not be pre-determined, although it may be inferred).
Thus, the length of the communication may be predicted, for example, by analyzing historical data associated with previous communications between the communicating parties that occurred via the channels being considered. In [0182]: The reliability history of the communication channel may include, for example, data concerning dates, times and durations of recent degradations. The reliability history of the current communication may include, for example, data concerning degradation of the current communication. For example, if the current communication is a satellite telephone call, then the reliability history of the current communication may include an average signal to noise ratio, a maximum signal to noise ratio, and the number of times the call has been dropped. (BRI: sending the email message is the past observation (recent, in the past) and the telephone conversation is the current observation (more recent, in the current state), within the context of the history of communications) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. One of ordinary skill would have been motivated to combine Graepel, Liljeholm and Horvitz to optimize the utility of the communication (Horvitz [0180]). In regard to claim 20 (Currently Amended) Graepel and Liljeholm do not explicitly disclose: - each statistical model [[, ]] comprises a respective adaptive system defined by a plurality of parameters. However, Horvitz discloses: - each statistical model [[, ]] comprises a respective adaptive system defined by a plurality of parameters. in [0016]: a reliability predictor and a reliability prediction integrator. The reliability predictor generates a probability that a communication will be completed with desired transmission qualities. The reliability prediction integrator updates one or more pieces of information that are employed in selecting a channel for a communication, in [0185]: responses to options can be employed in ongoing machine learning to facilitate improving the performance of the method 900 by, for example, adapting the rules of 930 and 940, in [0179]: one decision may be made using simple priority rules in [0179]: a decision may be made employing decision-theoretic reasoning concerning the value of the communication given a consideration of the uncertainties about the context. In addition, the decisions can be made sensitive to dates and times, considering specific assertions about particular time horizons to guide communications, (BRI: date and time is a parameter) in [0179]: The rules may be selected on other parameters including, but not limited to, the number of matching preferences, the number of matching capabilities, the nature and quality of the contexts, the type and number of communications requested and the time critical nature of the desired communication. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz.
Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. One of ordinary skill would have been motivated to combine Graepel, Liljeholm and Horvitz to optimize the utility of the communication (Horvitz [0180]). In regard to claim 21 (Original) Graepel and Liljeholm do not explicitly disclose: - each adaptive system comprises a respective probability distribution generation unit - receives an encoding of the respective history and data encoding actions performed by the agent after the most recent action recorded in the respective history - the probability generation unit being arranged to generate a probability distribution for the further observation over the plurality of states of the system. However, Horvitz discloses: - each adaptive system comprises a respective probability distribution generation unit in [0228]: FIG. 23 is extended from the model in FIG. 22 to include considerations of the time expected to be required for a communication at 2310, as a function of the identity and goals of the communication, and the influence of a probability distribution over the time of the communication, and the channel reliability, on the likelihood of having different losses of fidelity and the dropping of the communication channel. These include the cost of loss of fidelity and cost of dropped connection, which influence the expected utility of the communication channel decision. - receives an encoding of the respective history and data encoding actions performed by the agent after the most recent action recorded in the respective history in [0053]: an expected utility can be computed through the use of decision models such as influence diagrams, which encode a set of random variables, representing key aspects of context, cost, and value, as well as potential communication actions, and overall preferences about outcomes, in [0073]: Data stored in the channel data store 250 can include, but is not limited to, reliability history of the communication channel, in [0053]: Concerning utility of a communication, the value of a current potential communication can be evaluated by considering a measure of the history of utility of communication, in [0092]: The reliability predictor 222 may alter the predicted reliability result based on the history of the current communication. - the probability generation unit being arranged to generate a probability distribution for the further observation over the plurality of states of the system. in [0111]: the channel manager 402 may rely on one or more conditional probabilities associated with the contactee 430 location based on information like time of day, day of the week and current task. in [0112]: there may be uncertainty concerning communication channels, preferences and one or more parameters employed to model a context. In this situation, a probability distribution over the different states of each variable can be inferred and expected values for each channel can be computed. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel and Liljeholm.
Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. One of ordinary skill would have been motivated to combine Graepel and Liljeholm to reduce the demand for a computationally costly binding of outcome probabilities with utilities (Liljeholm [ § Introduction , Page 12519]). Claims 2-4 are rejected under 35 U.S.C. 103 as being unpatentable over Graepel et al. (hereinafter Graepel) US 2018/0032863 A1, in view of Mimi Liljeholm (hereinafter Liljeholm) Neural Correlates of the Divergence of Instrumental Probability Distributions, The Journal of Neuroscience, Jul 24, 2014, Pages 12519-12527, in view of Horvitz (hereinafter Horvitz) US 2006/0291580 A1, further in view of Hoffberg et al. (hereinafter Hoffberg) US 2006/0167784 A1. In regard to claim 2 (Currently Amended) Graepel, Liljeholm and Horvitz do not explicitly disclose: - the most recent observation in the first history being H time steps into the past compared to the current time, where H is an integer greater than zero. However, Hoffberg discloses: - the most recent observation in the first history being H time steps into the past compared to the current time, where H is an integer greater than zero. in [1546]: A time domain process demonstrates a Markov property if the conditional probability density of the current event, given all present and past events, depends only on the jth most recent events (BRI: each event can relate to the history (current event) based on the sequential order. The first history depends on the most recent events (current and past)) In [1546]: The evaluation problem is that given an observation sequence and a model, what is the probability that the observed sequence was generated by the model (Pr(O|.lamda.)). If this can be evaluated for all competing models for an observation sequence, then the model with the highest probability can be chosen for recognition, (BRI: Pr(O|.lamda.) is the probability distribution) In [1587]: For a sequence of length T, we simply "unroll" the model for T time steps, In at least [1595]: We have a model .lamda.=(.DELTA., B, .pi.) and a sequence of observations O=o.sub.1, o.sub.2, . . . , o.sub.T, and p{O|.lamda.} must be found, In at least [1595]: this calculation involves a number of operations on the order of N^T. This is very large even if the length of the sequence, T, is moderate, In at least [1602]: the probability of the partial observation sequence O t+1, O t+2, ..., O T given that the current state is i, (BRI: T is H) It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm, Horvitz and Hoffberg. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. Hoffberg teaches time steps. One of ordinary skill would have been motivated to combine Graepel, Liljeholm, Horvitz, and Hoffberg to provide a better filter performance by tuning the parameters (Hoffberg [1889]).
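For illustration only, and not drawn from Hoffberg or the claims, the recited relationship among the current time, the first history (most recent observation H steps into the past, H an integer greater than zero) and the second history (most recent observation H+1 steps into the past, addressed in claims 3-4 below) can be sketched by truncating a single observation sequence at two different depths; the example sequence, the value of H, and the slicing are assumptions of the sketch.

# Hypothetical observation sequence o1 ... o6; the last entry is the current time step.
observations = ["o1", "o2", "o3", "o4", "o5", "o6"]

H = 2  # integer greater than zero (claim 2); greater than one (claim 4)

# First history: its most recent observation lies H time steps in the past (claim 2).
first_history = observations[:len(observations) - H]
# Second history: its most recent observation lies H + 1 time steps in the past (claim 3).
second_history = observations[:len(observations) - (H + 1)]

print(first_history[-1])   # "o4": 2 steps before the current time step 6
print(second_history[-1])  # "o3": 3 steps before the current time step 6

On this reading, the first history is strictly more recent than the second history, which is the offset the divergence-based reward compares.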
In regard to claim 3 (Original) Graepel, Liljeholm and Horvitz do not explicitly disclose: - the most recent observation in the second history is H+1 steps into the past compared to the current time. However, Hoffberg discloses: - the most recent observation in the second history is H+1 steps into the past compared to the current time. In at least [1602]: the probability of the partial observation sequence O t+1, O t+2, ..., O T given that the current state is i, (BRI: the observations at time steps O t+1, O t+2, ..., O T imply that the second history is at the next iteration i, which is at step T+1) In regard to claim 4 (Previously Presented) Graepel, Liljeholm and Horvitz do not explicitly disclose: - H is greater than one. However, Hoffberg discloses: - H is greater than one. In [1602]: the probability of the partial observation sequence O t+1, O t+2, ..., O T given that the current state is i, It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm, Horvitz and Hoffberg. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. Hoffberg teaches time steps. One of ordinary skill would have been motivated to combine Graepel, Liljeholm, Horvitz, and Hoffberg to provide a better filter performance by tuning the parameters (Hoffberg [1889]). Claims 5, 8, 13-14, 19 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Graepel et al. (hereinafter Graepel) US 2018/0032863 A1, in view of Mimi Liljeholm (hereinafter Liljeholm) Neural Correlates of the Divergence of Instrumental Probability Distributions, The Journal of Neuroscience, Jul 24, 2014, Pages 12519-12527, in view of Horvitz (hereinafter Horvitz) US 2006/0291580 A1, further in view of Divakaran et al. (hereinafter Diva) US 2017/0160813 A1. In regard to claim 5 (Currently Amended) Graepel, Liljeholm, and Horvitz do not explicitly disclose: - in which the measure is the difference between a logarithmic function of the probability of the further observation under the first probability distribution, and the logarithmic function of the probability of the further observation under the second probability distribution. However, Diva discloses: - the measure is the difference between a logarithmic function of the probability of the further observation under the first probability distribution, and the logarithmic function of the probability of the further observation under the second probability distribution. in [0194]: Features extracted by the duration feature extractor 1706 relate to the duration of events, and may be extracted from the time alignments of words and phones. in [0193]: these features include, for example, lexical content and linguistic content. N-gram classifiers can be applied to lexical content to produce a distribution of probabilities over a number of characteristics and emotional states, in [0198]: The feature combination and conditioning module 1710 may be implemented as a processing device, configured to combine and condition the features that are extracted by the feature extractors 1706.
In some implementations, multiple features are combined at different levels and can be modeled as joint features, which allows statistical models to account for dependencies and correlations. For example, a first group of features can be conditioned on a second group of features at specific events, (BRI: conditioning requires measuring the difference that relates to the probabilities as a result of application of the features) in [0196]: Features extracted by the energy feature extractor 1706d include the energy-related features of speech waveforms, such as the zeroeth cepstral coefficient, the logarithm of short time energy (hereinafter referred to simply as "energy"), and time alignment information (e.g., from automatic speech recognition results). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm, Horvitz, Hoffberg and Diva. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. Diva teaches probability distributions. One of ordinary skill would have been motivated to combine Graepel, Liljeholm, Horvitz, Hoffberg and Diva to train for higher performance for acoustic environments (Diva [0086]). In regard to claim 8 (Original) Graepel, Liljeholm and Horvitz do not explicitly disclose: - the probability generation unit is a multi-layer perceptron. However, Diva discloses: - the probability generation unit is a multi-layer perceptron. in [0182]: The Gaussian mixture densities may be stored as a weighted sum of simple Gaussian curves. The set of simple Gaussian curves used to model a particular state is often referred to as a "codebook." In a fully-tied speech recognition system, one codebook of simple Gaussian curves is used to model the probability density functions of all the speech states in the speech recognition system, in [0175]: An adaptive speech recognition system 1600 can be used to provide a semantic interpretation of audio input received by a virtual personal assistant-enabled device, in [0073]: a virtual personal assistant system 400 can be implemented using a layered approach, where lower layers provide basic or universal functionality, and upper layers provide domain and application-specific functionality, in [0214]: Once trained, the deep neural network can be used to associate a input sample 1830 with phonetic content. The deep neural network can produce bottleneck features 1817. Bottleneck features are generally generated by a multi-layer perceptron that has been trained to predict context-independent monophone states. Bottleneck features can improve the accuracy of automatic speech recognition systems. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. Diva teaches a multi-layer perceptron.
One of ordinary skill would have been motivated to combine Graepel, Liljeholm, Horvitz, and Diva to train for higher performance for acoustic environments (Diva [0086]). In regard to claim 13 (Currently Amended) Graepel discloses: - neural network is a policy network in [0029]: Generally, a policy neural network is a neural network that is configured to receive an observation and to process the observation in accordance with parameters of the policy neural network to generate a respective action probability for each action in the set of possible actions that can be performed by the agent to interact with the environment. Graepel, Liljeholm and Horvitz do not explicitly disclose: - which outputs a probability distribution over possible actions, the action being generated as a sample from the probability distribution. However, Diva discloses: - which outputs a probability distribution over possible actions, the action being generated as a sample from the probability distribution. in [0241]: the image capture device 2002 can be configured to capture certain aspects of a person and/or the person's environment. For example, the image capture device 2002 can be configured to capture images of the person's face, body, and/or feet, in [0163]: an Intelligent Interactive System, such as a virtual personal assistant, can have components that understand a person's multi-modal input, can interpret the multi-modal input as an intent and/or an input state, and can reason as to the best response or course of action that addresses the intent and/or input state, in [0194]: Features extracted by the duration feature extractor 1706 relate to the duration of events, and may be extracted from the time alignments of words and phones. in [0193]: these features include, for example, lexical content and linguistic content. N-gram classifiers can be applied to lexical content to produce a distribution of probabilities over a number of characteristics and emotional states, It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. Diva teaches a multi-layer perceptron. One of ordinary skill would have been motivated to combine Graepel, Liljeholm, Horvitz, and Diva to train for higher performance for acoustic environments (Diva [0086]). In regard to claim 14 (Currently Amended) Graepel, Liljeholm and Horvitz do not explicitly disclose: - each observation is a sample from a probability distribution based on the state of the environment. However, Diva discloses: - each observation is a sample from a probability distribution based on the state of the environment. in [0181]: a speech recognition system, such as the adaptive speech recognition system 1600, may use multi-dimensional Gaussian mixture densities to model the probability functions of various speech states in stored recognition models, in [0162]: explicit input system 1156 and the action or adaptation system 1158 may have access to data models 1160 that may assist these systems in making determinations. The data models 1160 may include temporal data models and/or domain models.
The temporal models attempt to store the temporal nature of the person's interaction with the Intelligent Interactive System 1100. For example, the temporal models may store observations and features that lead to those observations. in [0162]: The domain models may store observation and feature associations as well as intent and response associations for a specific domain. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states. Horvitz teaches more recent values in the current state than in the past observation. Diva teaches a multi-layer perceptron. One of ordinary skill would have been motivated to combine Graepel, Liljeholm, Horvitz, and Diva to train for higher performance for acoustic environments (Diva [0086]). In regard to claim 19 (Currently Amended) Graepel, Liljeholm and Horvitz do not explicitly disclose: - the measure is the difference between a logarithmic function of the probability of the further observation under the first probability distribution, and the logarithmic function of the probability of the further observation under the second probability distribution. However, Diva discloses: - the measure is the difference between a logarithmic function of the probability of the further observation under the first probability distribution, and the logarithmic function of the probability of the further observation under the second probability distribution. In [0194]: Features extracted by the duration feature extractor 1706 relate to the duration of events, and may be extracted from the time alignments of words and phones. In [0193]: these features include, for example, lexical content and linguistic content. N-gram classifiers can be applied to lexical content to produce a distribution of probabilities over a number of characteristics and emotional states, in [0198]: The feature combination and conditioning module 1710 may be implemented as a processing device, configured to combine and condition the features that are extracted by the feature extractors 1706. In some implementations, multiple features are combined at different levels and can be modeled as joint features, which allows statistical models to account for dependencies and correlations. For example, a first group of features can be conditioned on a second group of features at specific events, (BRI: conditioning requires measuring the difference that relates to the probabilities as a result of application of the features) In [0196]: Features extracted by the energy feature extractor 1706d include the energy-related features of speech waveforms, such as the zeroeth cepstral coefficient, the logarithm of short time energy (hereinafter referred to simply as "energy"), and time alignment information (e.g., from automatic speech recognition results). It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm and Horvitz. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions each for a set of observation states.
In regard to claim 22 (Original)
Graepel, Liljeholm and Horvitz do not explicitly disclose:
- the probability generation unit is a multi-layer perceptron.
However, Diva discloses:
- the probability generation unit is a multi-layer perceptron.
in [0182]: The Gaussian mixture densities may be stored as a weighted sum of simple Gaussian curves. The set of simple Gaussian curves used to model a particular state is often referred to as a "codebook." In a fully-tied speech recognition system, one codebook of simple Gaussian curves is used to model the probability density functions of all the speech states in the speech recognition system.
in [0175]: An adaptive speech recognition system 1600 can be used to provide a semantic interpretation of audio input received by a virtual personal assistant-enabled device.
in [0073]: a virtual personal assistant system 400 can be implemented using a layered approach, where lower layers provide basic or universal functionality, and upper layers provide domain and application-specific functionality.
in [0214]: Once trained, the deep neural network can be used to associate an input sample 1830 with phonetic content. The deep neural network can produce bottleneck features 1817. Bottleneck features are generally generated by a multi-layer perceptron that has been trained to predict context-independent monophone states. Bottleneck features can improve the accuracy of automatic speech recognition systems.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm, Horvitz and Diva. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions, each for a set of observation states. Horvitz teaches more recent values in the current state than the past observation. Diva teaches a multi-layer perceptron. One of ordinary skill would have motivation to combine Graepel, Liljeholm, Horvitz, and Diva to train for higher performance for acoustic environments (Diva [0086]).
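For illustration only: a minimal, hypothetical sketch of a multi-layer perceptron used as a probability generation unit, i.e., a small feed-forward network whose output layer is normalized into a probability distribution. It is not the claimed unit or any cited reference's network; the layer sizes and weights are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

IN_DIM, HIDDEN_DIM, OUT_DIM = 6, 16, 3
W1 = rng.normal(scale=0.1, size=(HIDDEN_DIM, IN_DIM)); b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.1, size=(OUT_DIM, HIDDEN_DIM)); b2 = np.zeros(OUT_DIM)

def mlp_probabilities(x):
    """Two-layer perceptron whose output is normalized into a probability distribution."""
    hidden = np.maximum(W1 @ x + b1, 0.0)           # ReLU hidden layer
    logits = W2 @ hidden + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                          # softmax output layer

probs = mlp_probabilities(rng.normal(size=IN_DIM))  # non-negative values summing to 1
```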
Claims 9-10 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Graepel et al. (hereinafter Graepel) US 2018/0032863 A1, in view of Mimi Liljeholm (hereinafter Liljeholm), Neural Correlates of the Divergence of Instrumental Probability Distributions, The Journal of Neuroscience, Jul 24, 2014, Pages 12519-12527, in view of Horvitz (hereinafter Horvitz) US 2006/0291580 A1, further in view of Donner et al. (hereinafter Donner) US 2020/0411164 A1.

In regard to claim 9 (Currently Amended)
Graepel and Liljeholm do not explicitly disclose:
- the encoding of the history of previous observations and actions is generated by a recurrent unit which successively receives the actions and corresponding representations of the resulting observations.
However, Horvitz discloses:
- the encoding of the history of previous observations and actions is generated
in [0053]: an expected utility can be computed through the use of decision models such as influence diagrams, which encode a set of random variables, representing key aspects of context, cost, and value, as well as potential communication actions, and overall preferences about outcomes.
in [0073]: Data stored in the channel data store 250 can include, but is not limited to, reliability history of the communication channel.
in [0092]: The reliability predictor 222 may alter the predicted reliability result based on the history of the current communication.
in [0137]: The context data 522 is generally related to observations about the contactee. For example, observations concerning the type of activity in which the contactee is involved (e.g., on task, not on task), location of the contactee (e.g., office, home, car, shower), calendar (e.g., appointment status, appointment availability), history of communications with the other party (e.g., have replied to email in the past, have spoken to on the telephone recently), the utility of the interaction.
in [0137]: Such context data may affect the reliability of a communication.
Graepel, Liljeholm and Horvitz do not explicitly disclose:
- by a recurrent unit which successively receives the actions and corresponding representations of the resulting observations.
However, Donner discloses:
- by a recurrent unit which successively receives the actions and corresponding representations of the resulting observations.
in [0090]: In the example shown in FIG. 1 the entries in the medical records of the patient are available as additional information 5 for the data record 1a in the database 2.
in [0090]: More comprehensive descriptions in clinical observations can be converted by self-trained neural networks, for example recurrent neural networks (RNNs).
in [0019]: in order to create a search request, at least one query image, in particular of a region of interest, is selected from at least one two-dimensional or higher-dimensional examination image or in an examination image sequence. (BRI: an examination image sequence is an "observed" image sequence)
in [0021]: the database is searched for data records with feature vectors that lie in the vicinity of the feature vector of the query image on the basis of a specified metric, and data records of which the partial images look similar to the selection region or are semantically relevant are output as the result of the search request. (BRI: a result is an "action")
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm, Horvitz and Donner. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions, each for a set of observation states. Horvitz teaches more recent values in the current state than the past observation. Donner teaches a convolutional and a recurrent unit. One of ordinary skill would have motivation to combine Graepel, Liljeholm, Horvitz, and Donner to provide an improvement in the distance function between two images on the basis of an improved data foundation (Donner [0031]).
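For illustration only: a minimal, hypothetical sketch of the claim 9 limitation discussed above, i.e., a recurrent unit that successively receives each action together with a representation of the resulting observation and maintains an encoding of the history. The simple tanh recurrence below is an assumed stand-in for any recurrent unit (e.g., a GRU or LSTM); it is not the claimed invention or any cited reference's network, and all dimensions are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

ACT_DIM, OBS_REP_DIM, HIDDEN = 4, 8, 32
W_in = rng.normal(scale=0.1, size=(HIDDEN, ACT_DIM + OBS_REP_DIM))
W_h  = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
b    = np.zeros(HIDDEN)

def recurrent_step(history_encoding, action_vec, obs_representation):
    """Fold one (action, observation representation) pair into the history encoding."""
    x = np.concatenate([action_vec, obs_representation])
    return np.tanh(W_in @ x + W_h @ history_encoding + b)

# Successively encode a short (action, observation) trajectory.
h = np.zeros(HIDDEN)
for _ in range(5):
    a = np.eye(ACT_DIM)[rng.integers(ACT_DIM)]   # one-hot encoding of the action taken
    o = rng.normal(size=OBS_REP_DIM)             # representation of the resulting observation
    h = recurrent_step(h, a, o)                  # h now encodes the history so far
```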
In regard to claim 10 (Original)
Graepel, Liljeholm and Horvitz do not explicitly disclose:
- the representation of each observation is obtained as the output of a convolutional model which receives the observation.
However, Donner discloses:
- the representation of each observation is obtained as the output of a convolutional model which receives the observation.
in [0008]: a projection for obtaining feature vectors is created from the partial images, which projection, in particular visually or semantically, maps similar partial images to feature vectors with a short distance, and wherein, in order to prepare the execution of the projection, a neural network, in particular a convolutional neural network, based on specified learning partial images, is created.
in [0098]: search request on the basis of image information to the database 2, one or more feature vectors 6′ of the query image 3′ is/are determined initially for the query. Partial images 3a, 3b, 3c, which are optionally sorted and which look similar to the selection region, are output as the result of the search request, optionally together with or replaced by the data records 1a, 1b, 1c.

In regard to claim 24 (Currently Amended)
Graepel, Liljeholm and Horvitz do not explicitly disclose:
- the representation of each observation is obtained as the output of a convolutional model which receives the observation.
However, Donner discloses:
- the representation of each observation is obtained as the output of a convolutional model which receives the observation.
in [0008]: a projection for obtaining feature vectors is created from the partial images, which projection, in particular visually or semantically, maps similar partial images to feature vectors with a short distance, and wherein, in order to prepare the execution of the projection, a neural network, in particular a convolutional neural network, based on specified learning partial images, is created, wherein the data records or part of the data records are/is used by the neural network within the scope of a metric learning method to learn the projection and creation of the feature vectors from learning partial images or groups of learning partial images and a specified similarity, that is to be achieved, between the learning partial images. (BRI: the projection is the observation and the similarity is the result of prediction (output))
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm, Horvitz and Donner. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions, each for a set of observation states. Horvitz teaches more recent values in the current state than the past observation. Donner teaches a convolutional and a recurrent unit. One of ordinary skill would have motivation to combine Graepel, Liljeholm, Horvitz, and Donner to provide an improvement in the distance function between two images on the basis of an improved data foundation (Donner [0031]).
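For illustration only: a minimal, hypothetical sketch of the claim 10 / claim 24 limitation discussed above, i.e., obtaining the representation of an observation as the output of a convolutional model that receives the observation. The single naive 2-D convolution with global average pooling below is an assumed stand-in for a full convolutional network; image size, kernel count, and weights are all assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def observation_representation(observation_image, kernels):
    """Representation of the observation: per-kernel ReLU feature maps, globally average-pooled."""
    feature_maps = [np.maximum(conv2d_valid(observation_image, k), 0.0) for k in kernels]
    return np.array([fm.mean() for fm in feature_maps])   # one feature per kernel

obs = rng.normal(size=(16, 16))                 # hypothetical image observation
kernels = rng.normal(size=(8, 3, 3))            # eight assumed 3x3 filters
rep = observation_representation(obs, kernels)  # 8-dimensional representation of the observation
```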
Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Graepel (hereinafter Graepel) US 2018/0032863 A1, in view of Mimi Liljeholm (hereinafter Liljeholm), Neural Correlates of the Divergence of Instrumental Probability Distributions, The Journal of Neuroscience, Jul 24, 2014, Pages 12519-12527, in view of Horvitz (hereinafter Horvitz) US 2006/0291580 A1, further in view of Hans Jürgen Schmidhuber (hereinafter Schmidhu) US 2019/0197403 A1.

In regard to claim 23 (Currently Amended)
Graepel, Liljeholm and Horvitz do not explicitly disclose:
- the encoding of the history of previous observations and actions is generated by a recurrent unit which successively receives the actions and corresponding representations of the resulting observations.
However, Schmidhu discloses:
- the encoding of the history of previous observations and actions is generated by a recurrent unit which successively receives the actions and corresponding representations of the resulting observations.
In [0020]: The input units receive input data from multiple time-varying data sources that are located outside of ONE.
In [0020]: The data sources are considered time-varying because, over time, the data being provided by the sources may change (e.g., as time progresses or as conditions outside of ONE change). In the illustrated implementation, the input units are configured to receive at discrete time step t (t=1, 2, 3 . . . ) of a given trial several real-valued, vector-valued inputs: a goal input, goal(t), a reward input, r(t), and a normal sensory input, in(t), from time-varying data sources outside of ONE.
In [0012]: FIG. 1 is a schematic representation of an exemplary recurrent neural network (referred to herein as ONE) coupled to a humanoid agent or other type of process to be controlled.
In [0060]: Some of the model units generate an output pattern pred(t) at time t which predicts sense(t+1)=(r(t+1), in(t+1), goal(t+1)), others generate an output pattern code(t) that may represent a compact encoding of the history of actions and observations and ONE's computations so far, one of them generates a real value PR(t) to predict the cumulative reward until the end of the current trial.
In [0022]: The output signal controls or influences the environment outside of ONE (e.g., by controlling the agent's actions). In this regard, the output signal can be sent to any components outside of ONE that are meant to be controlled or influenced by ONE (e.g., the agent). The history encoding signal may be sent to an external computer database to store an indication of ONE's historical performance.
In [0098]: replays of unsuccessful trials can still be used to retrain ONE to become a better predictor or world model, given past observations and actions.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Graepel, Liljeholm, Horvitz and Schmidhu. Graepel teaches an agent interacting with an environment in a time series and generating actions, as defined by the neural network parameters, by observing the states of the environment. Liljeholm teaches two probability distributions, each for a set of observation states. Horvitz teaches more recent values in the current state than the past observation. Schmidhu teaches encoding the history of observations. One of ordinary skill would have motivation to combine Graepel, Liljeholm, Horvitz, and Schmidhu to improve the encoding for better compression of the history (Schmidhu [0064]).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TIRUMALE KRISHNASWAMY RAMESH whose telephone number is (571) 272-4605. The examiner can normally be reached by phone.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Li B Zhen, can be reached at (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TIRUMALE K RAMESH/
Examiner, Art Unit 2121

/Li B. Zhen/
Supervisory Patent Examiner, Art Unit 2121