18054452. COLLECTIVE COMMUNICATION PHASES AT MIXTURE-OF-EXPERTS LAYER simplified abstract (Microsoft Technology Licensing, LLC)

COLLECTIVE COMMUNICATION PHASES AT MIXTURE-OF-EXPERTS LAYER

Organization Name

Microsoft Technology Licensing, LLC

Inventor(s)

Yifan Xiong of Beijing (CN)

Changho Hwang of Cheongju-si (KR)

Wei Cui of Beijing (CN)

Ziyue Yang of Beijing (CN)

Ze Liu of Beijing (CN)

Han Hu of Beijing (CN)

Zilong Wang of Beijing (CN)

Rafael Omar Salas of Tega Cay, SC (US)

Jithin Jose of Austin, TX (US)

Prabhat Ram of Los Altos, CA (US)

Ho-Yuen Chau of Redmond, WA (US)

Peng Cheng of Beijing (CN)

Fan Yang of Beijing (CN)

Mao Yang of Beijing (CN)

Yongqiang Xiong of Beijing (CN)

COLLECTIVE COMMUNICATION PHASES AT MIXTURE-OF-EXPERTS LAYER - A simplified explanation of the abstract

This abstract first appeared for US patent application 18054452, titled 'COLLECTIVE COMMUNICATION PHASES AT MIXTURE-OF-EXPERTS LAYER'.

Simplified Explanation

The computing system described in the abstract includes a plurality of processing devices that execute a Mixture-of-Experts (MoE) layer within an MoE model. The MoE layer splits its input tensors during a first collective communication phase, processes the split tensors at expert sub-models, and concatenates the experts' outputs during a second collective communication phase to produce the layer's output tensors.

  • The computing system includes multiple processing devices.
  • During a first collective communication phase, the processing devices split each first input tensor along a first dimension to obtain first output tensors.
  • The first output tensors are processed by expert sub-models to obtain second input tensors.
  • During a second collective communication phase, the second input tensors are concatenated along the same dimension to obtain the second output tensors, which form the MoE layer's output (see the sketch below).
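
The abstract does not name the collective primitive that implements the two phases. The following is a minimal sketch, assuming the dispatch and combine steps are realized with PyTorch's torch.distributed.all_to_all, one expert sub-model per device (local_expert), and a token count evenly divisible by world_size; these names and choices are illustrative assumptions, not details taken from the patent.

    import torch
    import torch.distributed as dist

    def moe_layer(x, local_expert, world_size):
        """One MoE layer step on a single device within a process group (sketch)."""
        # First collective communication phase: split the first input tensor
        # along its first dimension and exchange the chunks, so each device
        # receives the slice destined for its local expert. Assumes the size
        # of dim 0 is evenly divisible by world_size.
        send_chunks = list(torch.chunk(x, world_size, dim=0))
        recv_chunks = [torch.empty_like(c) for c in send_chunks]
        dist.all_to_all(recv_chunks, send_chunks)

        # Expert computation: process the received first output tensors with
        # this device's expert sub-model to obtain the second input tensors.
        # Assumes the expert preserves the tensor shape (e.g. a feed-forward
        # block with equal input and output width).
        expert_out = [local_expert(c) for c in recv_chunks]

        # Second collective communication phase: exchange the expert outputs
        # back and concatenate them along the same dimension to obtain the
        # second output tensors, i.e. the MoE layer's output.
        combined = [torch.empty_like(c) for c in expert_out]
        dist.all_to_all(combined, expert_out)
        return torch.cat(combined, dim=0)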

Potential Applications

This technology can be applied wherever large MoE models are trained or served across multiple processing devices, for example in natural language processing, image recognition, and recommendation systems.

Problems Solved

This technology addresses the cost of moving data between processing devices when an MoE layer is distributed across them, by organizing the splitting and concatenation of tensors into dedicated collective communication phases.

Benefits

The benefits of this technology include enhanced model accuracy, faster computation, and better utilization of resources.

Potential Commercial Applications

Potential commercial applications of this technology include improving search engines, personalized recommendations, and speech recognition systems.

Possible Prior Art

One possible prior art for this technology is the use of ensemble models in machine learning to combine multiple models for better performance.

Unanswered Questions

How does the system handle communication between processing devices efficiently?

The abstract mentions collective communication phases, but it does not provide details on the specific mechanisms used for communication optimization.

What is the impact of the MoE layer on model interpretability?

While the abstract focuses on the technical aspects of the MoE layer, it does not address how the use of this layer may affect the interpretability of the overall model.


Original Abstract Submitted

A computing system including a plurality of processing devices configured to execute a Mixture-of-Experts (MoE) layer included in an MoE model. The processing devices are configured to execute the MoE layer at least in part by, during a first collective communication phase between the processing devices, splitting each of a plurality of first input tensors along a first dimension to obtain first output tensors. Executing the MoE layer further includes processing the first output tensors at a respective plurality of expert sub-models to obtain a plurality of second input tensors. Executing the MoE layer further includes, during a second collective communication phase between the processing devices, receiving the second input tensors from the expert sub-models and concatenating the second input tensors along the first dimension to obtain second output tensors. Executing the MoE layer further includes outputting the second output tensors as output of the MoE layer.
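
To make the abstract's terminology concrete, here is a single-process walk-through of the tensor flow it describes. This is an illustrative sketch only: the shapes, the choice of dimension 0 as the "first dimension", and the linear expert sub-models are assumptions, not details from the patent.

    import torch
    import torch.nn as nn

    num_experts = 4
    first_input = torch.randn(8, 16)   # one "first input tensor"; dim 0 is the "first dimension"

    # First collective communication phase: split along the first dimension
    # to obtain the first output tensors.
    first_outputs = torch.chunk(first_input, num_experts, dim=0)   # 4 tensors of shape (2, 16)

    # Each expert sub-model processes one first output tensor, producing the
    # second input tensors.
    experts = [nn.Linear(16, 16) for _ in range(num_experts)]
    second_inputs = [expert(t) for expert, t in zip(experts, first_outputs)]

    # Second collective communication phase: concatenate along the same
    # dimension to obtain the second output tensors (the MoE layer's output).
    second_output = torch.cat(second_inputs, dim=0)
    print(second_output.shape)   # torch.Size([8, 16])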