Amazon Technologies, Inc. (20240428082). EFFICIENT RECOVERY FROM FAILURES DURING DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS

From WikiPatents
Revision as of 14:35, 29 December 2024 by Wikipatents (talk | contribs) (Creating a new page)

EFFICIENT RECOVERY FROM FAILURES DURING DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS

Organization Name

Amazon Technologies, Inc.

Inventor(s)

Zhuang Wang of Kirkland WA (US)

Zhen Jia of San Jose CA (US)

Shuai Zheng of Santa Clara CA (US)

Zhen Zhang of Santa Clara CA (US)

Xinwei Fu of Fremont CA (US)

Yida Wang of Palo Alto CA (US)

This abstract first appeared for US patent application 20240428082, titled 'EFFICIENT RECOVERY FROM FAILURES DURING DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS'.



Original Abstract Submitted

A placement plan for training state checkpoints of a machine learning model is generated based at least in part on a number of training servers of a distributed training environment. The plan indicates, with respect to an individual server, one or more other servers at which replicas of training state checkpoints of the individual server are to be stored. During selected periods of one or more training iterations of the model, respective portions of a replica of a training state checkpoint of a first server are transmitted to a second server selected based on the placement plan. After an event causes disruption of the training iterations, one of the checkpoints generated at the first server is retrieved from the second server and used to resume the training iterations.
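
Illustrative sketch (not part of the application)

The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method claimed in the application: the abstract says only that the plan depends on the number of servers, so the ring layout below (server i replicates to servers i+1 through i+k, modulo n) is an assumed placement policy. All names (`make_placement_plan`, `Server`, `replicate`, `recover`) are hypothetical, and the in-memory `Server` objects stand in for real network transfers between training hosts.

```python
from typing import Dict, List, Set


def make_placement_plan(num_servers: int, num_replicas: int = 1) -> Dict[int, List[int]]:
    """Map each server to the peers that hold replicas of its checkpoints.

    Assumed policy: a ring, where server s replicates to s+1 .. s+k (mod n).
    """
    if num_servers < 2:
        raise ValueError("need at least two servers to place a replica")
    if not 1 <= num_replicas < num_servers:
        raise ValueError("num_replicas must be in [1, num_servers - 1]")
    return {
        s: [(s + k) % num_servers for k in range(1, num_replicas + 1)]
        for s in range(num_servers)
    }


class Server:
    """Hypothetical stand-in for a training host's replica store."""

    def __init__(self, server_id: int) -> None:
        self.server_id = server_id
        self.replicas: Dict[int, bytearray] = {}  # owner id -> replica bytes

    def start_replica(self, owner_id: int) -> None:
        # Begin receiving a fresh replica, discarding any older one.
        self.replicas[owner_id] = bytearray()

    def receive_chunk(self, owner_id: int, chunk: bytes) -> None:
        self.replicas[owner_id].extend(chunk)


def replicate(owner_id: int, checkpoint: bytes, plan: Dict[int, List[int]],
              servers: Dict[int, Server], chunk_size: int) -> None:
    """Send the checkpoint to each planned peer in portions.

    In a real system each portion would be sent during a selected period of a
    training iteration; here the loop simply sends them back to back.
    """
    for peer in plan[owner_id]:
        servers[peer].start_replica(owner_id)
    for start in range(0, len(checkpoint), chunk_size):
        chunk = checkpoint[start:start + chunk_size]
        for peer in plan[owner_id]:
            servers[peer].receive_chunk(owner_id, chunk)


def recover(owner_id: int, plan: Dict[int, List[int]],
            servers: Dict[int, Server], failed: Set[int]) -> bytes:
    """Fetch the owner's checkpoint from the first surviving planned peer."""
    for peer in plan[owner_id]:
        if peer not in failed and owner_id in servers[peer].replicas:
            return bytes(servers[peer].replicas[owner_id])
    raise RuntimeError(f"no surviving replica for server {owner_id}")


if __name__ == "__main__":
    plan = make_placement_plan(num_servers=4, num_replicas=1)
    servers = {i: Server(i) for i in range(4)}
    state = b"model-weights-and-optimizer-state"
    replicate(0, state, plan, servers, chunk_size=8)
    # Server 0 fails; its checkpoint is read back from its planned peer.
    restored = recover(0, plan, servers, failed={0})
    print(restored == state)
```

With one replica per server, recovery succeeds as long as a failed server's ring successor survives. Raising `num_replicas` tolerates more simultaneous failures at the cost of extra memory and network traffic.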