Moonshot AI Debuts Kimi-Researcher, RL Agent That Tackles Complex Reasoning and Web-Scale Search
Reinforcement learning stands at the center of efforts to give computational agents the ability to learn from interactions. In this paradigm, an agent observes its environment, selects actions, and receives feedback in the form of rewards or penalties. Over many trials, the system adjusts its policy to maximize cumulative reward. This approach excels when explicit instructions prove impractical—situations such as complex game playing, real-time control, or adaptive resource allocation. Extending these capabilities to long-duration tasks that demand both memory of past observations and dynamic information retrieval remains a major technical challenge.
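To make that loop concrete, here is a minimal, self-contained sketch of the act-reward-update cycle, simplified to a stateless two-armed bandit with an epsilon-greedy agent. It is purely illustrative and not drawn from Moonshot AI's system.

```python
# Toy illustration of the act/reward/update cycle: the agent tries actions,
# receives rewards, and shifts its estimates toward the better-paying action.
import random

class TwoArmedBandit:
    """Toy environment: action 1 pays off more often than action 0."""
    def step(self, action):
        return 1.0 if random.random() < (0.3 if action == 0 else 0.7) else 0.0

class EpsilonGreedyAgent:
    """Keeps a running value estimate per action and mostly picks the best one."""
    def __init__(self, n_actions=2, epsilon=0.1, lr=0.1):
        self.values = [0.0] * n_actions
        self.epsilon, self.lr = epsilon, lr

    def act(self):
        if random.random() < self.epsilon:               # explore occasionally
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, action, reward):
        # Move the estimate for the chosen action toward the observed reward.
        self.values[action] += self.lr * (reward - self.values[action])

env, agent = TwoArmedBandit(), EpsilonGreedyAgent()
for _ in range(1000):
    a = agent.act()
    agent.update(a, env.step(a))
print(agent.values)   # the estimate for action 1 should settle near 0.7
```

Over many trials the value estimate for the better arm climbs toward its true payoff rate: the "adjust the policy to maximize cumulative reward" behavior in miniature.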
Designing agents that coordinate reasoning across multiple steps, update strategies as new data arrives, and plan deep searches calls for mechanisms beyond standard supervised training or rigid workflows. A single multi-turn query may require switching between web search, code execution, and logical inference, all while retaining coherence over dozens of actions. If the environment shifts—new documents appear, external APIs change, or task goals evolve—an agent trained on fixed prompts can falter, sticking to outdated procedures or repeating steps that no longer apply.
Contemporary development methods fall into two camps, each with drawbacks. Multi-agent systems divide work among expert sub-agents assigned to roles like search, summarization, or planning, then coordinate them via fixed protocols. This works under well-structured conditions but demands extensive manual reengineering whenever roles or tasks change. Supervised fine-tuning relies on human-labeled demonstrations, teaching agents to mimic expert behavior. It cuts down on prompt engineering but requires large volumes of annotations and yields brittle performance when situations stretch beyond the training examples.
Researchers at Moonshot AI addressed these limitations by building Kimi-Researcher, an autonomous agent trained end-to-end with reinforcement learning. Based on the internal Kimi k-series model, this system learns reasoning and tool use exclusively from its own trial-and-error experience, without prewired workflows or human demonstration labels. During training, the agent explores a range of strategies on complex tasks, rates each run according to success and efficiency, and updates its policy iteratively. This unified approach treats tool calls, search queries, browsing steps, and code execution as part of one continuous decision process.
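The sketch below illustrates that "one continuous decision process" idea: a single policy emits every step of a trajectory, whether a search, a browsing action, a code run, or the final answer, and the whole trajectory is scored at the end. The `policy.next_action` interface, the tool registry, and the `scorer` are assumptions made for illustration, not Moonshot AI's actual API.

```python
# One policy, one action stream: search, browse, code, and answer steps all
# flow through the same loop, and reward is assigned to the full trajectory.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    reward: float = 0.0

def run_task(policy, tools, scorer, task, max_steps=50):
    traj = Trajectory()
    context = [task]
    for _ in range(max_steps):
        action = policy.next_action(context)        # e.g. {"tool": "search", "input": "..."}
        traj.steps.append(action)
        if action["tool"] == "answer":              # terminal action ends the episode
            break
        observation = tools[action["tool"]](action["input"])
        context.append(observation)                 # feed the result back to the policy
    traj.reward = scorer(traj)                      # one reward for the whole trajectory
    return traj
```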
To prepare the agent for real-world complexity, the team generated a synthetic training corpus of diverse tasks. Scenarios ranged from detailed mathematical derivations and multi-stage logic puzzles to extended search challenges and algorithmic problem solving. Each example was crafted so that success depends on choosing the right combination of computational tools. An automated validation pipeline checks solution correctness, verifies that code executes as intended, and ensures search queries return relevant results. Invalid trajectories are filtered out so that feedback remains consistent throughout training.
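A pipeline of that kind can be pictured as a chain of validators that a synthetic example must pass before it enters the corpus. The checks and field names below are assumptions chosen to mirror the description above, not the actual Moonshot AI pipeline.

```python
# A sketch of automated filtering: keep a trajectory only if every check passes.

def answer_is_correct(example):
    """Compare the solved answer against the example's ground truth."""
    return example["predicted_answer"] == example["ground_truth"]

def code_runs_cleanly(example):
    """Confirm the recorded code steps finished without errors."""
    return all(step.get("exit_code") == 0 for step in example["code_steps"])

def searches_returned_results(example):
    """Confirm every recorded search query produced at least one hit."""
    return all(len(q["results"]) > 0 for q in example["search_queries"])

VALIDATORS = [answer_is_correct, code_runs_cleanly, searches_returned_results]

def filter_corpus(examples):
    # Discard any trajectory that fails a check, keeping training feedback consistent.
    return [ex for ex in examples if all(check(ex) for check in VALIDATORS)]
```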
At the core of the learning framework lies the REINFORCE algorithm, chosen for its ability to handle sequential decision-making without requiring differentiable rewards. Training data is generated on-policy so that the agent samples actions according to its current policy, avoiding stale off-policy examples. Negative trajectories—those ending in failure—are managed carefully to prevent destabilizing weight updates. The reward function blends two key terms: final result accuracy and trajectory efficiency. A gamma-decay factor places higher value on shorter successful runs, guiding the agent toward concise, focused reasoning.
Additional refinements help stabilize learning and reduce variance. The system estimates a baseline performance level and subtracts it from raw returns, curbing noise in policy gradient updates. Reward clipping limits extreme values that might otherwise dominate training. Together, these practices enable Kimi-Researcher to acquire complex skills without oscillations or catastrophic forgetting.
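Putting the last two paragraphs together, the update rule can be sketched as a REINFORCE loss over whole trajectories: a gamma-decayed return that favors short, correct runs, a batch-mean baseline subtracted from it, and clipping on the resulting advantages. The constants and field names here are illustrative assumptions, not published hyperparameters.

```python
# Sketch of an on-policy REINFORCE update with a gamma-decayed return,
# a batch-mean baseline, and clipped advantages.
import torch

GAMMA = 0.99   # assumed decay factor: longer runs earn a smaller bonus
CLIP = 5.0     # assumed clipping bound on advantages

def reinforce_loss(batch):
    """batch: list of dicts, one per sampled trajectory, with 'log_probs'
    (1-D tensor, one entry per action), 'correct' (bool), 'num_steps' (int)."""
    # Gamma-decayed return: a correct answer reached in fewer steps scores higher.
    returns = torch.tensor([
        (1.0 if traj["correct"] else 0.0) * (GAMMA ** traj["num_steps"])
        for traj in batch
    ])
    baseline = returns.mean()                            # variance-reduction baseline
    advantages = (returns - baseline).clamp(-CLIP, CLIP)
    # Standard REINFORCE: weight each trajectory's log-likelihood by its advantage.
    loss = -torch.stack([
        traj["log_probs"].sum() * adv
        for traj, adv in zip(batch, advantages)
    ]).mean()
    return loss
```

In this sketch a failed trajectory earns a return of zero, so after baseline subtraction it receives a negative advantage that the clip keeps bounded, which is one simple way to stop negative examples from destabilizing weight updates.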
Benchmark results attest to the strength of this end-to-end approach. On Humanity’s Last Exam (HLE), a rigorous test of autonomous reasoning and search, pass@1 accuracy jumped from 8.6% in the initial zero-shot setting to 26.9% after reinforcement learning. On xbench-DeepSearch—a suite designed to measure deep exploration and multi-step inference—the agent reached a 69% pass@1 rate, outpacing strong baselines such as the o3 model equipped with search tools. In practice, Kimi-Researcher averages more than 20 reasoning steps per task and visits over 200 distinct URLs to gather and integrate information.
Sustaining coherent performance across many decision cycles required a specialized context-management system. A hierarchical buffer summarizes past observations, discards extraneous text, and highlights key facts. This structure preserves essential context through up to 50 sequential interactions, preventing memory overload and keeping the decision process focused. Attention-based scoring ranks incoming data by relevance, compressing or removing less critical content.
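One way to picture such a buffer is a store of scored entries in which only the most relevant text is kept verbatim while the rest is compressed into short summaries. The scoring and summarization functions below are placeholders standing in for whatever Kimi-Researcher uses internally.

```python
# Sketch of a relevance-ranked context buffer: score new observations,
# keep the top entries verbatim, and compress the rest.

class ContextBuffer:
    def __init__(self, score_fn, summarize_fn, max_verbatim=20):
        self.score_fn = score_fn           # e.g. attention- or embedding-based relevance
        self.summarize_fn = summarize_fn   # compresses an entry to a short key-fact line
        self.max_verbatim = max_verbatim
        self.entries = []                  # list of (relevance, text) pairs

    def add(self, query, observation):
        self.entries.append((self.score_fn(query, observation), observation))
        # Keep the highest-scoring entries verbatim; compress everything else.
        self.entries.sort(key=lambda pair: pair[0], reverse=True)
        self.entries = [
            (score, text if i < self.max_verbatim else self.summarize_fn(text))
            for i, (score, text) in enumerate(self.entries)
        ]

    def render(self):
        """Concatenate the retained context for the next model call."""
        return "\n".join(text for _, text in self.entries)
```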
Training throughput improved through an asynchronous rollout infrastructure that eliminates idle compute time. Partial trajectories are preserved across weight updates and resumed with the latest policy parameters. When a batch of simulations completes, new snapshots propagate to worker threads without halting active rollouts. This design accelerates convergence by at least 1.5× compared to traditional synchronous updates and makes more efficient use of distributed hardware.
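The key idea can be sketched as a rollout worker that checks for a fresh policy snapshot between steps and swaps it in without discarding its partially completed trajectory. The queue-based snapshot channel and the policy/environment interfaces below are illustrative assumptions rather than Moonshot AI's infrastructure.

```python
# Sketch of an asynchronous rollout worker: it picks up new weights whenever
# the trainer publishes them, so rollouts never stall during policy updates.
import queue

def rollout_worker(env, policy, snapshot_queue, result_queue, max_steps=100):
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        # Swap in the latest policy snapshot if one is available, keeping the
        # partially completed trajectory intact.
        try:
            policy = snapshot_queue.get_nowait()
        except queue.Empty:
            pass
        action = policy.act(obs)
        obs, reward, done = env.step(action)
        trajectory.append((action, reward))
        if done:
            result_queue.put(trajectory)   # hand the finished rollout to the trainer
            trajectory, obs = [], env.reset()
```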
By demonstrating that a single model can learn advanced reasoning, dynamic search, and multi-step planning solely from reinforcement signals, Moonshot AI’s work marks a significant step toward truly autonomous agents. Kimi-Researcher’s success shows that manual workflows and imitation data are not the only paths to high performance. The innovations in synthetic task design, reward shaping, context management, and rollout parallelism offer a roadmap for building scalable, adaptable intelligence capable of tackling open-ended challenges with minimal human intervention.