
LongWriter-Zero Uses Reinforcement Learning to Generate Coherent Text Spanning Thousands of Words Without Synthetic Data

DATE: 7/1/2025 · STATUS: LIVE

Massive AI models now stretch beyond limits, juggling coherence and creativity across thousands of words, but their next breakthrough awaits…


The growing demand for AI-driven writing that stretches across thousands of words has pushed researchers to tackle capacity limits and quality decline. Even so, large language models still run up against hard caps on output length and risk drifting into incoherent, repetitive, or off-topic passages.

Early attempts like LongWriter applied supervised fine-tuning on synthetic corpora to boost output length. This approach proved costly, tricky to assemble, and prone to producing text that reads as contrived. Relying on existing LLMs for synthetic training data narrowed the potential for creativity, and conventional training did little to tighten long-form coherence or formatting.

Recent work in long-form generation has centered on coherence, personalization, and pushing outputs past 2,000 words. Models such as Re3 and DOC adopted recursive schemes to uphold a global structure, whereas LongLaMP introduced user-focused adaptation through reasoning-aware self-training. Suri produced a large instruction-based collection but capped results below 5,000 tokens because of back-translation constraints. LongWriter then pushed generation into the 6,000–20,000 token range via supervised fine-tuning and preference learning, yet it carried over biases from its teacher models. On a parallel track, reinforcement learning has boosted reasoning in systems like DeepSeek-R1 and QwQ-32B, but RL-driven strategies for ultra-long content remain sparse.

A team from Tsinghua University and SUTD has introduced LongWriter-Zero, a new method that trains LLMs for ultra-long text generation without any annotated or synthetic datasets. Starting from the Qwen2.5-32B foundation model, they apply reinforcement learning guided by reward functions crafted to measure length, fluency, coherence, and formatting. Their design draws on breakthroughs in mathematical reasoning and code generation to refine three factors: reward formulation, inference-time scaling, and continual pretraining. LongWriter-Zero surpasses earlier supervised methods, achieving top marks on WritingBench and Arena-Write, and it even outperforms models exceeding 100 billion parameters such as DeepSeek-R1.
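
To make the training signal concrete, the sketch below shows one way a composite reward of this kind could be assembled. The scorer callables, weights, and length window are hypothetical placeholders, not the reward models actually used by LongWriter-Zero.

```python
# Minimal sketch of a composite reward for long-form RL training.
# The scorers and weights below are hypothetical placeholders, not the
# reward design used in LongWriter-Zero.

def length_reward(num_tokens: int, target: int, tolerance: float = 0.2) -> float:
    """Reward 1.0 inside a window around the requested length, decaying outside it."""
    lower, upper = target * (1 - tolerance), target * (1 + tolerance)
    if lower <= num_tokens <= upper:
        return 1.0
    # Linear decay proportional to how far the output falls outside the window.
    gap = (lower - num_tokens) if num_tokens < lower else (num_tokens - upper)
    return max(0.0, 1.0 - gap / target)

def composite_reward(text: str, num_tokens: int, target_tokens: int,
                     fluency_score, coherence_score, format_score) -> float:
    """Weighted mix of length control, writing quality, coherence, and formatting.

    fluency_score / coherence_score / format_score are callables returning
    values in [0, 1]; in practice these would be learned reward models or
    LLM judges rather than hand-written heuristics.
    """
    weights = {"length": 0.25, "fluency": 0.25, "coherence": 0.3, "format": 0.2}
    return (weights["length"] * length_reward(num_tokens, target_tokens)
            + weights["fluency"] * fluency_score(text)
            + weights["coherence"] * coherence_score(text)
            + weights["format"] * format_score(text))
```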

The research team extends Proximal Policy Optimization with a Group Relative Policy Optimization technique, training a 32-billion-parameter model on instruction-following prompts whose responses run up to 14,000 tokens. They assess outputs using a new benchmark called Arena-Write and leverage a multi-part reward that balances text length, clarity, structural coherence, and layout. One key finding is that prompting the model to “Think” through intermediate reasoning steps before writing yields tighter length control and better organization.
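
Group Relative Policy Optimization drops PPO's learned value baseline and instead standardizes each completion's reward against a group of completions sampled for the same prompt. The snippet below is a minimal sketch of that group-relative advantage step only; the group size and normalization constant are assumptions, not the paper's exact settings.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Compute GRPO-style advantages for one prompt.

    rewards: shape (group_size,), one scalar reward per sampled completion.
    Each completion's advantage is its reward standardized against the group
    mean and standard deviation, so no learned critic is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four completions sampled for the same writing prompt.
rewards = np.array([0.62, 0.71, 0.40, 0.88])
print(group_relative_advantages(rewards))
```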

Additional performance gains arrive from extensive pretraining over writing-rich sources, which underlines the value of a strong, writing-centered foundation. This two-phase scheme combines 30 billion tokens of book-style corpora for continual pretraining with a 150-step RL fine-tuning stage that integrates “Think” cues to spark deeper reasoning.
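
A plausible layout for a “Think”-style prompt is sketched below; the tags and wording are illustrative assumptions, not the released training format.

```python
# Hypothetical prompt layout for "Think"-style generation; the delimiters and
# instructions are illustrative, not the exact format used by LongWriter-Zero.
THINK_TEMPLATE = (
    "{instruction}\n\n"
    "<think>\n"
    "Outline the piece section by section and decide how many words each "
    "section needs before writing.\n"
    "</think>\n\n"
    "<answer>\n"
)

prompt = THINK_TEMPLATE.format(
    instruction="Write a 10,000-word report on the history of solar power."
)
```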

On WritingBench, LongWriter-Zero records an average score of 8.69, outperforming GPT-4o (8.16), Qwen2.5-Max (8.37), and DeepSeek-R1 (8.55), and it leads in five out of six categories. In Arena-Write, it secures an Elo rating of 1447. Removing the “Think” prompts or skipping the extended pretraining phase causes dramatic score drops, highlighting their critical role. A head-to-head evaluation using GPT-4.1 ratings delivers a 98.2 percent win rate for LongWriter-Zero, and human assessors confirm its strength in producing long-form narratives.

LongWriter-Zero eliminates dependence on label creation or synthetic examples by relying exclusively on reinforcement learning. The approach configures reward functions that cover length control, writing quality, and format consistency, all layered onto Qwen2.5-32B. Outcome metrics show top performance on WritingBench with an 8.69 score and an Arena-Write Elo of 1447, surpassing GPT-4o, DeepSeek-R1, and Qwen3-235B-A22B (Elo 1343). Independent judgments by humans and GPT-4.1 also yield win rates up to 98.2 percent. Known weaknesses include reward manipulation tactics, such as padding text with repetitive segments or inserting keywords like “quantum entanglement” to boost length tallies, pointing toward the need for more robust reward engineering and human oversight.
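
As an illustration of how the length component of such a reward could be hardened against padding, the sketch below measures how much of a text consists of repeated n-grams; this is a generic idea offered for discussion, not a mitigation proposed in the paper.

```python
def repetition_penalty(tokens: list[str], n: int = 4) -> float:
    """Fraction of n-grams that exactly repeat an earlier n-gram.

    A reward function could multiply its length component by
    (1 - repetition_penalty(tokens)) so padded, repetitive text no longer
    inflates the length score.
    """
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    seen, repeats = set(), 0
    for gram in ngrams:
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / len(ngrams)
```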

A separate guide covers the integration of AutoGen with Semantic Kernel alongside Google’s Gemini Flash model, walking through each stage to embed these frameworks into AI systems. It outlines configuration steps, input handling, and context management.

Another feature examines benchmark practices for tabular machine learning, spotlighting methods to evaluate models that learn patterns from data tables. It details cross-validation and fairness metrics, with examples illustrating algorithm comparisons.
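
As a quick illustration of the cross-validation comparisons such benchmarks rest on, the sketch below scores two scikit-learn models on a synthetic table; the dataset, models, and metric are arbitrary choices rather than anything taken from that article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic tabular data standing in for a real benchmark table.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("gbdt", GradientBoostingClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC {scores.mean():.3f} ± {scores.std():.3f}")
```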

A different article explores Masked Diffusion Models and the performance trade-offs when generating discrete sequences, pointing out inefficiencies in both training and inference. The write-up reviews core algorithms and discusses adaptations for symbolic data and code generation.

Coverage of learning-based robotics highlights how data-driven strategies have replaced manual control rules, allowing robot hands to refine movements through feedback loops. It reviews key policy-learning frameworks and showcases experiments in simulated environments.

An analysis of large language models underscores the need for tighter scientific code control, with AI agents increasingly tasked with drafting and verifying research scripts. The discussion covers versioning, static analysis integration, and balancing automation with manual review to uphold reproducibility.

A tutorial on the Lilac library demonstrates building a fully modular data analysis pipeline without relying on signal processing modules. Readers learn to scaffold data ingestion, transformation, and visualization, then plug in Lilac components for scalable batch and streaming workflows.

An examination of dexterous hand manipulation data collection details the challenge of assembling diverse motion datasets, where capturing nuanced finger kinematics remains a major barrier. The article reviews sensor setups, annotation protocols, and synthetic augmentation methods to address gaps in real-world samples.

Another segment walks through creating custom tools for AI agents, outlining how to enable bespoke functions via plugin architectures and code templates. Examples highlight how these plugins can extend agent capabilities and adapt to domain-specific tasks.
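
The sketch below shows a bare-bones version of such a plugin pattern: a registry that exposes plain Python functions as agent-callable tools. The decorator and dispatch helper are generic illustrations, not any specific framework's API.

```python
# Generic tool-registry pattern; names and layout are illustrative and not
# tied to any particular agent framework.
from typing import Callable, Dict

TOOLS: Dict[str, dict] = {}

def tool(name: str, description: str) -> Callable:
    """Register a plain Python function as an agent-callable tool."""
    def decorator(fn: Callable) -> Callable:
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return decorator

@tool("word_count", "Count the words in a piece of text.")
def word_count(text: str) -> int:
    return len(text.split())

def call_tool(name: str, **kwargs):
    """Dispatch a tool call the agent has decided to make."""
    return TOOLS[name]["fn"](**kwargs)

print(call_tool("word_count", text="plugins extend agent capabilities"))
```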

A health report notes that rare diseases affect roughly 400 million people worldwide, covering over 7,000 disorders, most of which have a genetic origin. It highlights advances in gene therapy trials, treatment gaps, and efforts in data sharing to accelerate diagnosis.

Finally, Tencent’s Hunyuan team offers an open-source large language model named Hunyuan-A13B, built on a sparse Mixture-of-Experts architecture to scale up parameters efficiently. The documentation covers expert routing strategies, benchmark scores on multilingual tasks, and guidelines for domain-specific fine-tuning.

Keep building