The pursuit of larger, more capable Large Language Models (LLMs) is fundamentally constrained by the immense computational cost of their training and inference. While the Mixture-of-Experts (MoE) paradigm successfully decouples parameter count from per-token computational cost by dynamically scaling network width, it neglects the critical dimension of depth, enforcing a uniform and often wasteful computational graph for every token. This paper introduces Mixture-of-Experts-and-Depths (MoED), a novel architectural framework that unifies dynamic width and depth scaling. MoED employs a hierarchical routing mechanism in which a meta-controller at each layer makes a joint decision on expert selection and on a token's subsequent computational path: exit the network early, proceed to the next layer, or skip ahead. This approach creates a unique, input-adaptive sub-network for every token, optimizing the allocation of compute. The proposed architecture represents a fundamental shift toward more efficient and scalable LLMs, in principle enabling superior performance and reduced latency while managing the activation-memory bottlenecks that plague traditional trillion-parameter models.
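As a concrete illustration of the hierarchical routing described above, the following is a minimal PyTorch sketch of a single MoED-style layer. It assumes top-1 expert routing and a three-way depth head (exit, proceed, skip); the module and variable names (MoEDLayer, expert_router, depth_router) are illustrative assumptions, not taken from the paper's implementation.

```python
# Minimal sketch of a single MoED-style layer (illustrative, not the paper's code).
# Assumes top-1 expert routing and a three-way depth decision per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

EXIT, PROCEED, SKIP = 0, 1, 2  # depth actions the meta-controller can choose


class MoEDLayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # Standard MoE expert pool: independent feed-forward blocks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # Hierarchical meta-controller: one head scores experts (width),
        # the other scores depth actions (exit / proceed / skip).
        self.expert_router = nn.Linear(d_model, num_experts)
        self.depth_router = nn.Linear(d_model, 3)

    def forward(self, tokens: torch.Tensor):
        # tokens: [num_tokens, d_model]
        expert_probs = F.softmax(self.expert_router(tokens), dim=-1)
        gate, expert_idx = expert_probs.max(dim=-1)            # top-1 expert per token
        depth_action = self.depth_router(tokens).argmax(dim=-1)  # 0=EXIT, 1=PROCEED, 2=SKIP

        out = tokens.clone()
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Residual update scaled by the gate value, applied only to routed tokens.
                out[mask] = tokens[mask] + gate[mask].unsqueeze(-1) * expert(tokens[mask])
        return out, depth_action


if __name__ == "__main__":
    layer = MoEDLayer(d_model=64, d_ff=256, num_experts=4)
    x = torch.randn(10, 64)        # 10 tokens
    y, actions = layer(x)
    print(y.shape, actions.tolist())
```

In a full model, tokens whose depth action is EXIT would bypass all remaining layers and SKIP tokens would rejoin the computation some layers downstream; the hard argmax decisions shown here would typically be replaced by a differentiable or learned routing policy during training.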
Keywords: Large Language Models, Mixture of Experts, Adaptive Computation, Dynamic Networks, Hierarchical Routing, Efficient Inference
IRE Journals:
Kalyan Chakravarthy Kodela, "Mixture-of-Experts-and-Depths: A Hierarchical Dynamic Compute Architecture for Extreme-Scale Efficiency," Iconic Research And Engineering Journals, Volume 9, Issue 2, 2025, Pages 987-999.
IEEE:
K. C. Kodela, "Mixture-of-Experts-and-Depths: A Hierarchical Dynamic Compute Architecture for Extreme-Scale Efficiency," Iconic Research And Engineering Journals, vol. 9, no. 2, pp. 987-999, 2025.