
Microsoft’s WINA Cuts LLM Inference Costs with Sparse Activation, No Training Required

DATE: 6/1/2025

LLM inference strains budgets because every forward pass activates billions of parameters; a training-free sparse activation method promises substantial savings with no retraining required.


Large language models (LLMs) that incorporate billions of parameters form the backbone of many AI-enabled solutions in sectors such as finance, healthcare, and customer support. Their ability to generate text, summarize documents, and handle question-answering tasks has driven widespread adoption. Yet the computational burden of running inference through such extensive networks remains a major obstacle for organizations that seek to deploy these systems at scale. As each input requires a full forward pass, service providers face high latency and energy consumption costs.

The primary obstacle emerges during inference. Whenever an input, like a prompt or a document snippet, is fed into an LLM, every layer and every neuron activates in sequence, causing billions of parameter interactions. In reality, only a fraction of neurons contributes significant information to the final answer. Activating the entire parameter set leads to wasted computation. Researchers have developed sparse activation algorithms to address this inefficiency, turning off low-impact neurons at inference time. Most of these efforts, though, rely solely on measuring hidden state magnitudes and neglect how weight magnitudes shape error flow through the model.
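For intuition, the magnitude-only gating these methods rely on fits in a few lines. Below is a minimal PyTorch sketch (the function name and shapes are our own, not from any specific paper): keep the top-K hidden-state entries by absolute value and zero out the rest.

```python
import torch

def magnitude_gate(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries of a hidden state; zero the rest."""
    idx = x.abs().topk(k).indices   # indices of the k largest |x_i|
    mask = torch.zeros_like(x)
    mask[idx] = 1.0
    return x * mask                 # sparse hidden state; small entries dropped

# Example: 50% sparsity on a 4096-dim hidden state
x = torch.randn(4096)
x_sparse = magnitude_gate(x, k=2048)
```

Note that this score never looks at the weights the surviving entries are about to multiply, which is precisely the blind spot described above.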

Some existing sparse schemes build on Mixture-of-Experts (MoE) architectures, used in models such as Mistral’s Mixtral and reportedly in GPT-4, where a routing mechanism assigns each input token to a small collection of expert sub-networks. That approach requires extensive extra training so the gating network can learn expert selection for every input. Alternative strategies such as TEAL and CATS trim computation by dropping neurons with low activation values, but they can misidentify critical neurons or retain nodes with little influence. They also demand careful threshold tuning for each model, which limits portability across architectures.
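To make that contrast concrete, a learned MoE router of the kind these frameworks need might look like the toy sketch below (class and parameter names are hypothetical). The gate is itself a trained module, which is exactly the extra training cost a training-free method avoids.

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Toy MoE router: a learned gate sends each token to k of n experts."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # trained jointly with the experts
        self.k = k

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                           # (n_tokens, n_experts)
        weights, experts = logits.topk(self.k, dim=-1)  # top-k experts per token
        return weights.softmax(dim=-1), experts         # mixing weights, expert ids

router = TopKRouter(d_model=512, n_experts=8, k=2)
mix, ids = router(torch.randn(4, 512))  # 4 tokens, each routed to 2 of 8 experts
```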

To overcome these hurdles, teams from Microsoft, Renmin University of China, New York University and the South China University of Technology have unveiled WINA, short for Weight Informed Neuron Activation. WINA represents a training-free sparse activation method that considers both the activation strength of hidden units and the column-wise ℓ2 norms of weight matrices. By merging those two metrics, it consistently selects the most impactful neurons. WINA adapts automatically to each layer’s characteristics, eliminating the need for fine-tuning or manual threshold selection.

At the core of WINA is a simple computation: it multiplies the magnitude of each hidden-state element by the ℓ2 norm of the weight column that element feeds, then picks the top-K elements by that product. This process constructs a smaller sub-network that retains the majority of signal-carrying pathways. Beyond this, WINA applies a lightweight tensor transformation that enforces column-wise orthogonality in the weight matrices, often implemented via a QR decomposition step. This orthogonal constraint keeps the theoretical bounds on approximation error valid in practice, so model predictions stay stable.
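Based on that description, the selection rule can be reconstructed in a few lines of PyTorch. This is our own sketch, not the authors’ code, and the variable names are assumptions: for y = Wx, entry x_i scales column i of W, so its importance is estimated as |x_i| times that column’s ℓ2 norm.

```python
import torch

def wina_gate(x: torch.Tensor, W: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k entries of x with the largest |x_i| * ||W[:, i]||_2 score."""
    col_norms = W.norm(dim=0)      # column-wise l2 norms, shape (d_in,)
    scores = x.abs() * col_norms   # weight-informed importance per entry
    idx = scores.topk(k).indices
    mask = torch.zeros_like(x)
    mask[idx] = 1.0
    return x * mask

# Example: 65% sparsity on a 4096 -> 11008 projection
d_in, d_out = 4096, 11008
W = torch.randn(d_out, d_in)
x = torch.randn(d_in)
y = W @ wina_gate(x, W, k=int(0.35 * d_in))  # zeroed entries let columns be skipped
```

In a real deployment the mask would be used to skip the corresponding columns of W outright rather than multiplying by zeros; the dense multiply above is only for clarity. The orthogonality step is likewise not reproduced here; in rough terms, a QR factorization (for example via torch.linalg.qr) can re-express a weight matrix with orthonormal columns so that the error bound motivating this score holds in practice.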

The research team tested WINA on a suite of popular LLMs, including Qwen-2.5-7B, LLaMA-2-7B, LLaMA-3-8B and Phi-4-14B, across tasks such as reasoning, summarization and code generation, at sparsity rates from 30% up to 70%. In all experiments, WINA outperformed TEAL and CATS. On Qwen-2.5-7B with 65% of neurons inactive, it delivered 2.94% higher average accuracy than TEAL and a 1.41% gain over a TEAL variant with the transformation applied. On LLaMA-3-8B, WINA yielded a 1.06% improvement at 50% sparsity and 2.41% at 65%. Performance remained robust on demanding benchmarks such as GSM8K arithmetic reasoning and the ARC Challenge. Floating-point operations dropped by as much as 63.7% on LLaMA-2-7B and 62.7% on Phi-4-14B, translating into tangible inference speed-ups.

This approach provides a reliable, plug-and-play avenue for reducing inference costs in large language models without modifying original weights or adding training overhead. It can be deployed across varied architectures without extra configuration, making it an appealing option for settings that demand low latency, lower energy consumption and consistent output quality.

BOND’s May 2025 AI Trends report delivers a detailed, data-driven overview of current developments and shifts in the field, highlighting investment patterns, emerging applications and key performance metrics.

Academic investigations in areas such as chemistry, biology and artificial intelligence still rely heavily on domain experts to survey existing literature, formulate research questions, design experiments and validate results.

The customer experience model in B2B technology is shifting, driven by agentic AI advances; Cisco’s Agentic AI Report highlights practical use cases, risk mitigation strategies and deployment guidelines.

Reasoning challenges in AI cover commonsense understanding, mathematical problem-solving and symbolic logic; these tasks often require multi-step processing, world knowledge and specialized evaluation benchmarks.

A recent tutorial shows how to implement the Agent Communication Protocol in Python, building a flexible system that complies with ACP standards using Google’s Gemini API and featuring example message flows, error-handling routines and performance benchmarks.

State-of-the-art systems now match human-level accuracy on benchmarks like AIME, GPQA, MATH-500 and OlympiadBench, tackling competition-grade math problems and formal proofs with minimal fine-tuning.

Yandex released Yambda, the largest publicly available dataset for recommender system research, offering a vast collection of user-item interactions and metadata suited to collaborative filtering and ranking.

Biomedical research continues to explore disease mechanisms, identify novel therapeutic targets and accelerate drug discovery with cutting-edge experimental methods and computational screening techniques, supported by AI-enhanced data analytics and high-throughput sequencing pipelines.

Long-chain-of-thought methods improve reasoning in large models but can introduce latency; the think-then-answer approach may slow down response generation, affect user interaction speed and increase compute overhead.

DeepSeek, a leading Chinese AI startup, rolled out DeepSeek-R1-0528, an upgraded version of its R1 reasoning model, boosting its capacity for complex query handling, multi-hop inference and abstract deduction.

Keep building

Vibe Coding MicroApps (Skool community) — by Scale By Tech

Build ROI microapps fast — templates, prompts, and deploy on MicroApp.live included.


BUILD MICROAPPS, NOT SPREADSHEETS.

© 2025 Vibe Coding MicroApps by Scale By Tech — Ship a microapp in 48 hours.