Photo editing plays a key role in digital photography, letting professionals and hobbyists adjust image attributes such as tone, exposure, and contrast to achieve specific visual effects. Whether for a commercial campaign or a personal portfolio, producing a refined result demands both technical proficiency and artistic sensibility, so high-grade retouching typically requires either a steep learning curve or extensive manual effort.
Users must therefore choose between manual suites such as Adobe Lightroom, which offer powerful controls but take considerable time to master, and AI-driven apps that simplify editing but lack the precision needed for subtle adjustments. These automated services often misinterpret user goals, produce inconsistent output under varying lighting or color conditions, and struggle to apply changes selectively across diverse scenes.
Early techniques applied zeroth- and first-order optimization or reinforcement learning to guide retouching, while others used diffusion-based models for image synthesis. These approaches improved results in certain cases but often fell short on pixel-level control, high-resolution fidelity, and content preservation. More recent text-driven systems, such as GPT-4o and Gemini-2-Flash, accept natural-language directives but offer limited user control, and their generative pipelines can overwrite critical image details.
A joint team from Xiamen University, the Chinese University of Hong Kong, ByteDance, the National University of Singapore, and Tsinghua University introduced JarvisArt, a retouching agent built on a multimodal language model. It interprets user instructions provided via text or images, replicates the workflow of a skilled artist, and invokes over 200 Lightroom functions through a dedicated integration protocol.
The approach relies on three key components. The first is MMArt, a comprehensive dataset of 5,000 base images and 50,000 Chain-of-Thought-annotated samples spanning global edits, targeted local adjustments, and complex multi-step workflows. The second is a two-stage training regimen: supervised fine-tuning first instills reasoning and tool-selection abilities, and Group Relative Policy Optimization for Retouching (GRPO-R) then refines them with custom rewards for factors such as editing accuracy and perceptual quality; during GRPO-R, each proposed action is scored on how closely it matches the target edit and how natural the result looks. The third is an Agent-to-Lightroom (A2L) protocol that transparently executes Lightroom commands, enabling real-time, interactive control of each tool.
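The paper's reward code is not reproduced here, but the core idea can be sketched: score each candidate edit on accuracy and perceptual quality, then standardize scores within a sampled group, which is the published GRPO formulation of the advantage. In the toy sketch below, the function names, the weights, and both scoring terms (inverted pixel error as "editing accuracy", a contrast ratio as a stand-in for a learned perceptual model) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def retouch_reward(pred: np.ndarray, target: np.ndarray,
                   w_acc: float = 0.7, w_perc: float = 0.3) -> float:
    """Toy reward in the spirit of GRPO-R for one candidate edit.

    pred/target: float RGB arrays in [0, 1] with identical shape.
    Both terms and the weights are illustrative assumptions.
    """
    # Editing accuracy: inverted mean absolute pixel error vs. the target.
    accuracy = 1.0 - float(np.abs(pred - target).mean())
    # Perceptual quality: crude local-contrast proxy standing in for a
    # learned perceptual or aesthetic scorer.
    perceptual = float(np.clip(pred.std() / (target.std() + 1e-8), 0.0, 1.0))
    return w_acc * accuracy + w_perc * perceptual

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize each reward against its group."""
    r = np.asarray(rewards, dtype=np.float64)
    return ((r - r.mean()) / (r.std() + 1e-8)).tolist()
```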
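The summary describes A2L only as a protocol for transparent, real-time execution of Lightroom commands; its message format is not given. A minimal sketch of what one such tool invocation might look like follows, where every field name (`tool`, `scope`, `params`, `rationale`) and the tool name itself are hypothetical illustrations rather than the actual protocol.

```python
import json

# Hypothetical A2L-style message: the agent requests a single named
# Lightroom operation with explicit parameters, so each step remains
# inspectable and individually adjustable by the user.
a2l_message = {
    "tool": "adjust_exposure",                            # one of the 200+ exposed functions
    "scope": {"type": "region", "mask": "subject_face"},  # global edit or local region
    "params": {"exposure": 0.35, "highlights": -20},
    "rationale": "Brighten the subject's face per the instruction.",
}

print(json.dumps(a2l_message, indent=2))
```

Structuring each call as explicit, human-readable data is what makes the "transparent execution" claim plausible: the user can see, replay, or tweak any single step rather than accepting an opaque end-to-end edit.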
To measure performance, the researchers created MMArt-Bench, a benchmark built from authentic user edits. JarvisArt achieved a 60% improvement over GPT-4o in average pixel-level content fidelity while maintaining comparable instruction-following performance. The system handles both global modifications, such as exposure correction and color grading, and precise regional refinements, including skin smoothing, eye brightening, and hair sharpening. It operates at arbitrary resolutions and preserves the original aesthetic intent without introducing unwanted artifacts.
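The exact metric suite behind "average pixel-level content fidelity" is not spelled out in this summary. As a generic illustration only, a full-reference score such as PSNR captures how much of the source content an edit preserves; the snippet below is one standard measure of that kind, not the benchmark's scoring code.

```python
import numpy as np

def psnr(edited: np.ndarray, reference: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two same-shape float images in [0, 1].

    Higher values mean the edit preserved more of the reference content.
    """
    mse = float(np.mean((edited - reference) ** 2))
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))
```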
By combining synthetic data, reasoning-driven training, and seamless integration with industry-standard software, the team delivered a tool that brings professional-grade retouching within reach of nonexpert users. JarvisArt merges flexibility and quality in a single interface, offering creative individuals fine-grained command over every adjustment. This system lowers the barrier to entry for advanced image editing and sets the stage for future innovations in accessible retouching.

