Article

NUS Debuts Dimple, Diffusion Model That Speeds Up Multimodal Text Generation and Enhances Control

DATE: 5/29/2025

Learn how diffusion language models rethink text generation with parallel denoising and explicit template control.

In recent months, interest has surged in applying diffusion methods, originally developed for continuous data such as images, to language processing. This trend has produced Discrete Diffusion Language Models (DLMs), which treat text generation as a denoising task. Rather than producing tokens one at a time, DLMs decode whole sequences in parallel and offer fine-grained control over output format. They can initialize every position at once and steer the final sequence toward desired patterns. DLMs support strict output templates, richer infilling through bidirectional context, and flexible sequence initialization, and the non-sequential framework shortens generation paths. Yet most multimodal large language models (MLLMs), including LLaVA, Qwen-VL and InternVL, still rely on autoregressive decoding.
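
To make the parallel-denoising idea concrete, here is a minimal toy sketch, not Dimple's actual code: every position starts as a mask token, and a placeholder predictor fills the most confident slots over a few parallel steps. The vocabulary, the stub_model function, and the per-step schedule are illustrative assumptions.

```python
# Toy sketch of discrete-diffusion text decoding (illustrative only, not Dimple's code).
# Every position is initialized at once as [MASK]; a stub predictor proposes tokens
# for all masked slots, and the most confident proposals are committed in parallel.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]
MASK = "[MASK]"

def stub_model(tokens):
    """Placeholder for a bidirectional denoiser: for each masked slot,
    return (proposed_token, confidence). A real DLM would run a
    transformer over the full sequence."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(tokens) if tok == MASK}

def diffusion_decode(length=8, steps=4):
    tokens = [MASK] * length                 # whole sequence initialized up front
    per_step = max(1, length // steps)       # fixed number of slots committed per step
    while MASK in tokens:
        proposals = stub_model(tokens)
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _) in ranked[:per_step]:
            tokens[pos] = tok                # parallel commits, not left-to-right
    return " ".join(tokens)

print(diffusion_decode())
```

Unlike autoregressive decoding, nothing here forces a left-to-right order: any position may be filled at any step, which is what enables templates and bidirectional infilling.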

Work in diffusion has covered continuous and discrete domains. Continuous variants such as DiffuSeq and SED employ embedding or relaxed categorical spaces to smooth transitions. Discrete methods like SDDM and RDM apply noise directly on token sets to respect language boundaries. Training often uses masked language objectives or entropy-based score matching. Hybrid approaches, including AR-Diffusion and SSD-LM, layer autoregressive routines over diffusion. Open-source MLLMs such as LLaVA and InternVL use visual instruction tuning and joint pretraining yet retain token-by-token decoding.

Researchers at the National University of Singapore introduced Dimple, the first discrete diffusion multimodal LLM. It pairs a vision encoder with a discrete diffusion-based language component. To address stability issues, the group used a two-stage training plan: autoregressive alignment for vision-language coherence, followed by masked diffusion training to rebuild generation ability. Dimple-7B outperforms LLaVA-NEXT by 3.9% on standard benchmarks. The team also introduced Confident Decoding for dynamic token updates and explored Structure Priors to control output format. These changes improve inference speed, flexibility and structural precision without sacrificing accuracy.
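
The sketch below hedges at what the two stages optimize: a causal next-token loss in stage one and a masked-reconstruction loss in stage two. The TinyLM stand-in, mask_id, and mask_ratio are assumptions for illustration, not Dimple's training code.

```python
# Hedged sketch of the two training losses behind the two-stage plan
# (assumed shapes and a tiny stand-in model, not Dimple's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Stand-in language model: embedding + linear head. The real model is a
    transformer, causal in stage one and bidirectional in stage two."""
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        return self.head(self.emb(ids))          # (batch, seq, vocab)

def autoregressive_loss(model, ids):
    """Stage 1: next-token prediction, used to align visual and text features."""
    logits = model(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))

def masked_diffusion_loss(model, ids, mask_id=0, mask_ratio=0.5):
    """Stage 2: replace a random fraction of tokens with [MASK] and score the
    model only on the corrupted positions."""
    corrupted = torch.rand(ids.shape) < mask_ratio
    noisy = ids.clone()
    noisy[corrupted] = mask_id
    logits = model(noisy)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids.reshape(-1), reduction="none")
    return (loss * corrupted.reshape(-1).float()).sum() / corrupted.sum().clamp(min=1)

# Tiny smoke test with random token ids.
model = TinyLM()
batch = torch.randint(1, 1000, (2, 16))
print(autoregressive_loss(model, batch).item(), masked_diffusion_loss(model, batch).item())
```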

Dimple links a visual encoder to a discrete diffusion language engine. The developers split training into two steps to overcome limited supervision and gaps in generation coverage. Step one uses autoregressive learning with a causal attention mask to align visual and text features. Step two applies masked diffusion training to recover generation strength. At inference, a Confident Decoding routine dynamically decides which tokens to update at each step based on prediction certainty. Although the system relies on far fewer training examples, Dimple matches or surpasses comparable autoregressive models across several benchmarks, even if it falls short of the top-tier, larger-scale designs.
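
One plausible reading of Confident Decoding, sketched below as a variant of the toy decoder above: instead of committing a fixed number of tokens per step, every masked position whose prediction clears a confidence threshold is committed, so the number of tokens decoded per step adapts to the model's certainty. The threshold value and the stub predictor are assumptions, not the authors' implementation.

```python
# Hedged sketch of confidence-thresholded parallel decoding
# (an assumed reading of Confident Decoding, not the authors' code).
import random

VOCAB = ["yes", "no", "maybe", "."]
MASK = "[MASK]"

def stub_model(tokens):
    """Placeholder predictor returning (token, confidence) for each masked slot."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(tokens) if tok == MASK}

def confident_decode(length=10, threshold=0.7):
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        steps += 1
        proposals = stub_model(tokens)
        # Commit every proposal that clears the confidence threshold...
        confident = {p: tc for p, tc in proposals.items() if tc[1] >= threshold}
        # ...but always commit at least the single best one, so decoding advances.
        if not confident:
            pos, best = max(proposals.items(), key=lambda kv: kv[1][1])
            confident = {pos: best}
        for pos, (tok, _) in confident.items():
            tokens[pos] = tok
    return tokens, steps

tokens, steps = confident_decode()
print(steps, "steps:", " ".join(tokens))
```

When many positions are predicted with high certainty, an entire span can be committed in one pass, which is where the reported reduction in inference steps would come from.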

Evaluation pitted Dimple against pure autoregressive counterparts on complex instruction-following benchmarks. Dimple’s mixed training approach, blending autoregressive and diffusion updates, produced top-tier results, outscoring models trained on similar data volumes on most tasks. It still trails systems trained on vastly larger collections, but it benefits from a robust, expanded base language backbone. Ablation trials showed that the hybrid tuning reduces sequence-length bias and stabilizes output. A prefilling technique sped up decoding by pre-establishing high-confidence tokens, cutting runtime with only minor accuracy trade-offs. These traits make Dimple an efficient, competitive choice for multimodal comprehension.

Dimple’s hybrid design tackles key issues in pure discrete diffusion training, including instability, performance drops and sequence-length bias. The model uses an initial autoregressive phase for data alignment, followed by diffusion-based tuning to refine output quality across varied tasks. The resulting Dimple-7B achieves a 3.9% gain over LLaVA-NEXT on established benchmarks. Confident Decoding cuts the number of inference steps by focusing on high-certainty tokens, and a prefilling option accelerates throughput with only slight accuracy shifts. Structure Priors grant explicit control over output format and length, giving greater precision and more consistent layouts than typical autoregressive schemes provide.
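
How a Structure Prior might look in practice, under the assumption, based on the description above, that the output is initialized from a fixed template whose non-mask tokens are never rewritten, so format and length are pinned down before decoding begins. The template and the fill_masks stub are invented for illustration and are not Dimple's API.

```python
# Hedged sketch of a structure prior: the sequence starts from a fixed template
# rather than from all-mask, so layout and length are enforced up front.
# The TEMPLATE and fill_masks stub are illustrative assumptions, not Dimple's code.
import random

MASK = "[MASK]"
TEMPLATE = ["Answer:", MASK, MASK, "Reason:", MASK, MASK, MASK, "."]

def fill_masks(tokens):
    """Stand-in for the diffusion model: fill only masked slots; the fixed
    template tokens are left untouched, which is what enforces the structure."""
    words = ["cats", "sleep", "because", "they", "can"]
    return [random.choice(words) if tok == MASK else tok for tok in tokens]

print(" ".join(fill_masks(TEMPLATE)))
```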
