Google LLC (20240195679). SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS simplified abstract
SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS
Organization Name
Google LLC
Inventor(s)
Yazhou Zu of Sunnyvale CA (US)
Alireza Ghaffarkhah of San Jose CA (US)
SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS - A simplified explanation of the abstract
This abstract first appeared for US patent application 20240195679, titled 'SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS'.
The patent application describes topology-aware link disabling and user job rescheduling strategies for repairing broken links online in the high-performance networks of supercomputers, which are common in machine learning (ML) and high-performance computing (HPC) applications. User jobs may continue to run while a disabled link is being repaired.
- Broken links may be detected during pre-flight checks before user jobs run, during job runtime, or both, by a distributed failure detection and mitigation software stack.
- The stack consists of a centralized network controller and multiple agents running on each node. The network controller may ensure that user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows.
- Once the links are repaired and tested, the network controller re-enables them for future user jobs.
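The link lifecycle described in the bullets above can be sketched as a small state tracker. This is an illustrative sketch only, not the implementation from the patent application: the class name `NetworkController`, its methods, and the node names are all hypothetical, and the agents are reduced to plain method calls.

```python
from enum import Enum


class LinkState(Enum):
    HEALTHY = "healthy"
    DISABLED = "disabled"  # taken out of service while repair is in progress


class NetworkController:
    """Hypothetical centralized controller that tracks per-link state."""

    def __init__(self, links):
        # Links are undirected, so each one is stored as a frozenset of its endpoints.
        self.state = {frozenset(link): LinkState.HEALTHY for link in links}

    def report_failure(self, link):
        # Assumed to be called by a node agent, either from a pre-flight
        # check or while a job is running.
        self.state[frozenset(link)] = LinkState.DISABLED

    def report_repaired(self, link, passed_tests):
        # Re-enable the link only if the repair workflow has fixed AND tested it.
        if passed_tests:
            self.state[frozenset(link)] = LinkState.HEALTHY

    def usable_links(self):
        # Links that user jobs may be scheduled on right now.
        return {l for l, s in self.state.items() if s is LinkState.HEALTHY}


controller = NetworkController([("n0", "n1"), ("n1", "n2"), ("n2", "n0")])
controller.report_failure(("n0", "n1"))
during_repair = controller.usable_links()   # excludes the n0-n1 link
controller.report_repaired(("n0", "n1"), passed_tests=True)
after_repair = controller.usable_links()    # all three links again
```

Keeping a disabled link out of `usable_links()` until it passes its tests mirrors the ordering in the abstract: jobs stay on healthy links, and the repaired link only becomes available to future jobs.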
Potential Applications:
- Supercomputers
- Machine learning applications
- High-performance computing environments
Problems Solved:
- Efficient online network repair in high-performance networks
- Minimizing downtime for user jobs during link repairs
Benefits:
- Continuous operation of user jobs during link repairs
- Improved network reliability and performance
- Enhanced user experience in high-performance computing environments
Commercial Applications:
Title: "Smart Topology-Aware Link Disabling Technology for High-Performance Networks"
This technology can be utilized in:
- Data centers
- Cloud computing environments
- Research institutions
Prior Art: Further research can be conducted in the field of online network repair strategies in high-performance computing environments to identify related prior art.
Frequently Updated Research: Stay updated on advancements in online network repair strategies and distributed failure detection technologies relevant to high-performance networks.
Questions about Smart Topology-Aware Link Disabling Technology:
1. How does the distributed failure detection and mitigation software stack work in detecting broken links?
2. What are the key differences between traditional network repair methods and the approach described in the patent application?
Original Abstract Submitted
Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high performance networks used in supercomputers that are common in machine learning (ML) and high-performance computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during the job run time via a distributed failure detection and mitigation software stack which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, in which case the broken links are enabled again by the network controller for future user jobs.
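As a rough illustration of rerouting onto healthy links within the same network, the sketch below finds a shortest path that avoids disabled links using breadth-first search. The abstract does not say how routes are chosen, so the use of BFS, the function name `find_route`, and the four-node ring topology are all assumptions made for the example.

```python
from collections import deque


def find_route(links, disabled, src, dst):
    """Breadth-first search for a shortest path that avoids disabled links."""
    disabled = {frozenset(l) for l in disabled}

    # Build an adjacency map from the healthy (non-disabled) links only.
    neighbors = {}
    for a, b in links:
        if frozenset((a, b)) in disabled:
            continue
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk parent pointers back to the source to recover the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # no healthy path exists


# Hypothetical four-node ring: n0 - n1 - n2 - n3 - n0
ring = [("n0", "n1"), ("n1", "n2"), ("n2", "n3"), ("n3", "n0")]
direct = find_route(ring, [], "n0", "n1")
detour = find_route(ring, [("n0", "n1")], "n0", "n1")
```

With every link healthy, traffic between `n0` and `n1` uses the direct link. With that link disabled, the search goes the other way around the ring. It returns `None` when no healthy path remains, which is the case in which a job could not simply be rerouted and would need to be rescheduled.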