US Patent Application 18364601. APPARATUS AND METHOD FOR TRAINING PARAMETRIC POLICY simplified abstract

From WikiPatents

APPARATUS AND METHOD FOR TRAINING PARAMETRIC POLICY

Organization Name

Huawei Technologies Co., Ltd.

Inventor(s)

  • Vincent Moens of London (GB)

  • Hugues Van Assel of Lyon (FR)

  • Haitham Bou Ammar of London (GB)

APPARATUS AND METHOD FOR TRAINING PARAMETRIC POLICY - A simplified explanation of the abstract

This abstract first appeared for US patent application 18364601, titled 'APPARATUS AND METHOD FOR TRAINING PARAMETRIC POLICY'.

Simplified Explanation

The abstract describes an apparatus for training a parametric policy using a proposal distribution. The apparatus includes processors that perform the following steps repeatedly:

  • Form a proposal based on the proposal distribution.
  • Input the proposal to the policy to generate an output state.
  • Estimate the loss between the output state and a preferred state.
  • Use an adaptation algorithm to form a policy adaptation based on the loss.
  • Apply the policy adaptation to the policy to create an adapted policy.
  • Use the adapted policy to estimate the variance in the policy adaptation.
  • Adapt the proposal distribution based on the estimate of variance to reduce the variance of policy adaptations in subsequent iterations.

In summary, the apparatus trains a parametric policy by repeatedly generating proposals, evaluating their performance, adapting the policy based on the evaluation, and adjusting the proposal distribution to improve the training process.
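The loop above can be sketched in code. The following is a minimal illustrative example, not the patent's actual method: it assumes a one-parameter linear policy, a Gaussian proposal distribution, a squared-error loss against a preferred state of zero, a gradient step as the adaptation algorithm, and a simple shrink rule for adapting the proposal distribution. All names and update rules are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (assumption, not from the patent):
# policy output = theta * proposal; preferred state = 0.
theta = 2.0                # parametric policy's single parameter
mu, sigma = 0.0, 1.0       # proposal distribution N(mu, sigma^2)
lr = 0.1                   # adaptation-algorithm step size
preferred_state = 0.0

for step in range(100):
    # 1. Form proposals from the proposal distribution.
    proposals = rng.normal(mu, sigma, size=32)
    # 2. Input proposals to the policy to form output states.
    outputs = theta * proposals
    # 3. Estimate the loss between output states and the preferred state.
    losses = (outputs - preferred_state) ** 2
    # 4. Form per-sample policy adaptations via a gradient-style rule.
    grads = 2.0 * (outputs - preferred_state) * proposals
    per_sample_adaptations = -lr * grads
    # 5. Apply the (averaged) adaptation to form the adapted policy.
    theta += per_sample_adaptations.mean()
    # 6. Estimate the variance in the policy adaptation.
    adaptation_var = per_sample_adaptations.var()
    # 7. Adapt the proposal distribution to reduce that variance
    #    on subsequent iterations (simple shrink heuristic).
    if adaptation_var > 1e-3:
        sigma *= 0.99

print(theta, sigma)
```

Under these assumptions the policy parameter is driven toward the preferred state while the proposal distribution narrows whenever the adaptation estimates are noisy, which is the qualitative behaviour the abstract describes.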


Original Abstract Submitted

An apparatus for training a parametric policy in dependence on a proposal distribution, the apparatus comprising one or more processors configured to repeatedly perform the steps of: forming, in dependence on the proposal distribution, a proposal; inputting the proposal to the policy so as to form an output state from the policy responsive to the proposal; estimating a loss between the output state and a preferred state responsive to the proposal; forming, by means of an adaptation algorithm and in dependence on the loss, a policy adaption; applying the policy adaption to the policy to form an adapted policy; forming, by means of the adapted policy, an estimate of variance in the policy adaptation and adapting the proposal distribution in dependence on the estimate of variance so as to reduce the variance of policy adaptations formed on subsequent iterations of the steps.