Google LLC (20240195679). SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS simplified abstract

From WikiPatents

SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS

Organization Name

Google LLC

Inventor(s)

Yazhou Zu of Sunnyvale CA (US)

Alireza Ghaffarkhah of San Jose CA (US)

Dayou Du of San Jose CA (US)

SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS - A simplified explanation of the abstract

This abstract first appeared for US patent application 20240195679, titled 'SMART ONLINE LINK REPAIR AND JOB SCHEDULING IN MACHINE LEARNING SUPERCOMPUTERS'.

The patent application describes topology-aware link disabling and user job rescheduling strategies for repairing broken links online in the high-performance networks of supercomputers, which are common in machine learning (ML) and high-performance computing (HPC) applications. User jobs may continue to run while a disabled link is being repaired.

  • Broken links are detected during pre-flight checks before user jobs run, during job runtime, or both, by a distributed failure detection and mitigation software stack.
  • The stack includes a centralized network controller and multiple agents running on each node. The controller reroutes user jobs to healthy links within the same network until the broken links are fixed and tested by the repair workflows.
  • Once the links are repaired and tested, the network controller re-enables them for future user jobs.
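The link lifecycle described in the bullets above can be sketched in a few lines. This is a minimal illustration, not the method claimed in the application: the `NetworkController` class, the `LinkState` names, and the use of breadth-first search for rerouting are all assumptions introduced here, since the abstract does not specify how routes are chosen.

```python
from collections import deque
from enum import Enum


class LinkState(Enum):
    HEALTHY = "healthy"    # available to user jobs
    DISABLED = "disabled"  # broken and taken out of service for repair
    REPAIRED = "repaired"  # fixed, but not yet re-enabled after a passing test


class NetworkController:
    """Hypothetical centralized controller that tracks link health."""

    def __init__(self, links):
        # Each link is an undirected pair of node names.
        self.state = {frozenset(link): LinkState.HEALTHY for link in links}

    def disable(self, a, b):
        self.state[frozenset((a, b))] = LinkState.DISABLED

    def mark_repaired(self, a, b):
        self.state[frozenset((a, b))] = LinkState.REPAIRED

    def enable_if_tested(self, a, b, test_passed):
        # A repaired link returns to service only after its test passes.
        key = frozenset((a, b))
        if self.state[key] is LinkState.REPAIRED and test_passed:
            self.state[key] = LinkState.HEALTHY
        return self.state[key] is LinkState.HEALTHY

    def route(self, src, dst):
        """Shortest path from src to dst over HEALTHY links only, or None."""
        neighbors = {}
        for link, state in self.state.items():
            if state is LinkState.HEALTHY:
                a, b = tuple(link)
                neighbors.setdefault(a, []).append(b)
                neighbors.setdefault(b, []).append(a)
        parent = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in neighbors.get(node, []):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None


# Four nodes in a ring: traffic between A and B detours while A-B is disabled.
ctrl = NetworkController([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])
ctrl.disable("A", "B")
detour = ctrl.route("A", "B")
```

In this sketch a disabled link stays out of the routing graph until it has been both repaired and tested, which mirrors the order of events in the abstract.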

Potential Applications:

  • Supercomputers
  • Machine learning applications
  • High-performance computing environments

Problems Solved:

  • Efficient online network repair in high-performance networks
  • Minimizing downtime for user jobs during link repairs

Benefits:

  • Continuous operation of user jobs during link repairs
  • Improved network reliability and performance
  • Enhanced user experience in high-performance computing environments

Commercial Applications:

Title: "Smart Topology-Aware Link Disabling Technology for High-Performance Networks"

This technology can be utilized in:

  • Data centers
  • Cloud computing environments
  • Research institutions

Prior Art: Further research can be conducted in the field of online network repair strategies in high-performance computing environments to identify related prior art.

Frequently Updated Research: Stay updated on advancements in online network repair strategies and distributed failure detection technologies relevant to high-performance networks.

Questions about Smart Topology-Aware Link Disabling Technology:

  1. How does the distributed failure detection and mitigation software stack work in detecting broken links?
  2. What are the key differences between traditional network repair methods and the approach described in the patent application?


Original Abstract Submitted

Generally disclosed herein is an approach for smart topology-aware link disabling and user job rescheduling strategies for online network repair of broken links in high performance networks used in supercomputers that are common in machine learning (ML) and high-performance computing (HPC) applications. While a disabled link is repaired online, user jobs may continue to run. The broken links may be detected as part of pre-flight checks before the user jobs run and/or during the job run time via a distributed failure detection and mitigation software stack which includes a centralized network controller and multiple agents running on each node. The network controller may ensure that the user jobs are rerouted to healthy links within the same network until the broken links are fixed and tested by the repair workflows, in which case the broken links are enabled again by the network controller for future user jobs.
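The detection step in the abstract, where per-node agents feed a centralized controller, might look roughly like the sketch below. Everything in it is hypothetical: the report format, the function names `collect_reports` and `links_affecting_job`, and the rule that a link counts as broken when either endpoint reports a failed probe are assumptions made for illustration and do not come from the application.

```python
def collect_reports(agent_reports):
    """Merge per-node agent reports into a set of broken links.

    agent_reports maps a node name to {neighbor: probe_ok}.
    Assumption of this sketch: a link is treated as broken if either
    endpoint's agent reports a failed probe.
    """
    broken = set()
    for node, probes in agent_reports.items():
        for neighbor, ok in probes.items():
            if not ok:
                broken.add(frozenset((node, neighbor)))
    return broken


def links_affecting_job(job_nodes, broken_links):
    """Broken links whose two endpoints both belong to the job's nodes."""
    nodes = set(job_nodes)
    return {link for link in broken_links if link <= nodes}


# Pre-flight example: only n1's agent sees a failure on the n0-n1 link.
reports = {
    "n0": {"n1": True, "n3": True},
    "n1": {"n0": False, "n2": True},
    "n2": {"n1": True, "n3": True},
    "n3": {"n2": True, "n0": True},
}
broken = collect_reports(reports)
affected = links_affecting_job(["n0", "n1"], broken)
```

A controller could run the same aggregation on reports gathered before a job starts or while it runs; the sketch leaves out how probes are performed and what the controller does with the result.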