US Patent Application 18372661: EFFICIENT TRANSFORMER TRAINING BASED ON SMALLER PRETRAINED MODELS (Massachusetts Institute of Technology)
Organization Name
Massachusetts Institute of Technology
Inventor(s)
Rameswar Panda of Medford, MA (US)
Leonid Karlinsky of Acton, MA (US)
Rogerio Schmidt Feris of West Hartford, CT (US)
Yoon Hyung Kim of Cambridge, MA (US)
This abstract first appeared for US patent application 18372661, titled 'EFFICIENT TRANSFORMER TRAINING BASED ON SMALLER PRETRAINED MODELS'.
Original Abstract Submitted
Parameters of a first transformer are accessed, and size dimensions of a second transformer that is to be trained and is larger than the first transformer are received. The parameters of the first transformer are linearly transformed using a combination of a width-growth operator and a depth-growth operator, wherein the linear transformation produces a set of new parameters, the set corresponding to the size dimensions of the second transformer. The second transformer is initialized with the set of new parameters.
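The abstract describes producing the larger model's parameters as a linear transformation of the smaller model's parameters, combining a width-growth operator (expanding each layer's dimensions) with a depth-growth operator (expressing each new layer as a combination of old layers). Below is a minimal NumPy sketch of that idea, assuming toy matrix-valued per-layer parameters; the operator names (`A_out`, `A_in`, `D`) are illustrative assumptions, and the operators here are random rather than learned or function-preserving as the actual method would require.

```python
import numpy as np

# Small pretrained transformer: L_small layers, hidden size d_small.
# Target (larger) transformer: L_large layers, hidden size d_large.
# Toy sizes chosen only for illustration.
L_small, d_small = 2, 4
L_large, d_large = 4, 6
rng = np.random.default_rng(0)

# Stand-ins for the pretrained parameters of the first (small) model:
# one (d_small x d_small) weight matrix per layer.
W_small = [rng.standard_normal((d_small, d_small)) for _ in range(L_small)]

# Width-growth operator: a pair of (d_large x d_small) expansion matrices
# applied on the output and input sides of each weight matrix. Random here;
# in the described method they would be chosen so the grown model behaves
# like the small one.
A_out = rng.standard_normal((d_large, d_small))
A_in = rng.standard_normal((d_large, d_small))

# Depth-growth operator: mixing weights expressing each of the L_large new
# layers as a convex combination of the width-grown old layers.
D = rng.random((L_large, L_small))
D /= D.sum(axis=1, keepdims=True)  # each row sums to 1

# The combined linear transformation producing the new parameter set:
#   W_large[k] = sum_l D[k, l] * (A_out @ W_small[l] @ A_in.T)
grown = [A_out @ W @ A_in.T for W in W_small]            # width growth
W_large = [sum(D[k, l] * grown[l] for l in range(L_small))
           for k in range(L_large)]                       # depth growth

# The second (larger) transformer would be initialized with W_large
# before training begins.
assert all(W.shape == (d_large, d_large) for W in W_large)
```

The point of the sketch is the structure of the transformation: every new parameter is a fixed linear function of the old parameters, matching the abstract's claim that the new parameter set corresponds to the size dimensions of the second transformer.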