Tencent has released ArtifactsBench, a new testing platform aimed at assessing AI models that generate interactive code and interface designs with an emphasis on user experience and visual quality.
Developers frequently find that AI-created web pages or charting widgets do run, yet they suffer from misplaced controls, clashing colors, awkward layouts, or choppy animations that hamper usability.
Until recently, model performance was judged almost entirely on code correctness: whether scripts compiled, executed without errors, or met predefined unit tests.
Those suites confirm a module works, but they ignore its visual polish and interactive behavior, leaving developers to review every user interface manually as the number of AI-generated components multiplies.
Earlier automated tests missed subtle layout flaws, off-screen buttons, or inconsistencies in color schemes and typography that degrade an application’s overall appeal.
ArtifactsBench steps in to reduce that manual workload by automating the assessment of both technical function and visual experience.
It begins by assigning the model under test one of more than 1,800 predefined challenges, drawn from a catalog that includes data dashboards, interactive charts, simple browser games, mini e-commerce sites, mapping tools, and adaptive forms.
Each task comes with a reference specification outlining required features, performance targets, and design guidelines, giving the system clear criteria to judge against.
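The article does not publish the benchmark’s task schema, but as a rough illustration, a single challenge entry might pair a prompt with the criteria the judge later scores against (all field names and values below are hypothetical, not ArtifactsBench’s actual format):

```python
# Hypothetical sketch of an ArtifactsBench-style task entry; the schema,
# IDs, and targets are invented for illustration only.
task = {
    "id": "dashboard-042",
    "category": "data dashboard",
    "prompt": "Build a single-page dashboard that charts monthly revenue "
              "from the embedded JSON and lets the user filter by region.",
    "required_features": [
        "bar chart renders from the provided data",
        "region filter updates the chart without a page reload",
    ],
    "design_guidelines": [
        "controls grouped above the chart",
        "readable labels at 1280x800 and 375x667 viewports",
    ],
    "performance_targets": {"first_render_ms": 1500},
}
```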
When a model submits its code, ArtifactsBench compiles and runs it inside a secure, sandboxed environment to prevent unintended side effects.
During execution, the platform captures a series of screenshots at defined intervals to document hover states, button-click transitions, animation quality, and responses to user interaction.
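The capture mechanism itself is not detailed in the article; a minimal sketch of the general idea, using Playwright as a stand-in headless browser with made-up URLs, paths, and timings, could look like this:

```python
# Sketch of periodic screenshot capture for a generated page. Playwright is
# used here as a stand-in; the served URL, output paths, and intervals are
# assumptions, not ArtifactsBench's actual harness.
from pathlib import Path
from playwright.sync_api import sync_playwright

ARTIFACT_URL = "http://localhost:8000/index.html"  # generated page served locally (assumed)
OUT_DIR = Path("captures")
OUT_DIR.mkdir(exist_ok=True)

with sync_playwright() as p:
    browser = p.chromium.launch()  # headless by default
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    page.goto(ARTIFACT_URL)

    # Capture the initial render, then a few frames spaced one second apart
    # to catch animations and state changes.
    page.screenshot(path=OUT_DIR / "00_initial.png")
    for i in range(1, 4):
        page.wait_for_timeout(1000)
        page.screenshot(path=OUT_DIR / f"{i:02d}_t{i}s.png")

    browser.close()
```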
These artifacts – the original prompt, the generated code, and the visual record – feed into a specialized multimodal LLM that combines image analysis with text understanding.
To equip the LLM with an eye for quality, researchers fine-tuned it on a dataset of thousands of human-rated interface examples, capturing variations in layout, color schemes, typography, and accessibility needs.
The LLM applies a detailed, per-task checklist to assign scores across ten metrics (a sketch of how such a rubric might be aggregated follows the list):
- functional accuracy
- layout consistency
- color harmony and contrast
- text readability
- component spacing and alignment
- input handling and validation
- smoothness of animated transitions
- load and response times
- cross-browser stability
- clarity of error messages
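
How the per-metric ratings are combined is not spelled out in the article; the sketch below assumes a 0–10 scale and equal weighting purely for illustration:

```python
# Illustrative aggregation of per-metric judge ratings into one score.
# The metric names mirror the list above; the 0-10 scale, equal weighting,
# and the judge output format are assumptions, not the published rules.
METRICS = [
    "functional_accuracy",
    "layout_consistency",
    "color_harmony_and_contrast",
    "text_readability",
    "component_spacing_and_alignment",
    "input_handling_and_validation",
    "animation_smoothness",
    "load_and_response_times",
    "cross_browser_stability",
    "error_message_clarity",
]

def aggregate(judge_scores: dict[str, float]) -> float:
    """Average the judge's 0-10 ratings; missing metrics count as 0."""
    return sum(judge_scores.get(m, 0.0) for m in METRICS) / len(METRICS)

example = {m: 7.0 for m in METRICS} | {"functional_accuracy": 9.0}
print(f"overall score: {aggregate(example):.2f}")  # -> overall score: 7.20
```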
By enforcing uniform standards, ArtifactsBench turns what was once a subjective review into a consistent, reproducible scoring process.
When its rankings were compared with votes on WebDev Arena – a community platform where developers cast ballots for top AI creations – the system matched human preferences 94.4 percent of the time.
This marks a large improvement over previous benchmarks, which managed only 69.4 percent consistency with human judgment.
The framework’s feedback aligned with professional developers more than 90 percent of the time, confirming its ability to capture core elements of user satisfaction.
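The article does not say how agreement with human votes was measured; one common approach is pairwise ranking agreement, sketched here under that assumption with invented model names and ranks:

```python
# Pairwise ranking agreement between a benchmark's ordering and a human
# ordering. The measure, model names, and ranks are illustrative assumptions.
from itertools import combinations

def pairwise_agreement(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Fraction of model pairs ordered the same way by both rankings (rank 1 = best)."""
    models = sorted(set(rank_a) & set(rank_b))
    pairs = list(combinations(models, 2))
    same = sum((rank_a[x] < rank_a[y]) == (rank_b[x] < rank_b[y]) for x, y in pairs)
    return same / len(pairs)

bench = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
human = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
print(f"{pairwise_agreement(bench, human):.1%}")  # -> 83.3%
```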
In a large-scale evaluation, Tencent ran more than 30 top AI models through ArtifactsBench and published a leaderboard of results.
Leading commercial systems from Google (Gemini-2.5-Pro) and Anthropic (Claude 4.0-Sonnet) claimed the top spots, yet one of the most surprising findings involved a general-purpose model.
Qwen-2.5-Instruct, which is trained on a wide mix of text and code, outscored its own code-specialized variant Qwen-2.5-Coder and its vision-tuned sibling Qwen-2.5-VL across tasks that ranged from mini-games to interactive chart builders.
That runs counter to the assumption that specialization guarantees superior performance in a given niche.
Researchers suggest that building a complete, polished interface requires more than coding skill or visual processing alone.
They cite “Robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics” as essential capabilities that generalist models can develop through exposure to varied data.
By offering a systematic way to evaluate both functional performance and design quality, ArtifactsBench can help development teams track progress as AI systems learn to produce applications that satisfy technical requirements and appeal to users.
Early adopters in Tencent’s internal research groups report that ArtifactsBench has already reduced evaluation time by over 60 percent, letting engineers shift effort from manual quality checks to refining model architectures.
Tencent intends to share ArtifactsBench with the wider research community, providing open-source tools for defining custom challenges and plug-in support for popular front-end frameworks like React, Vue, and Angular.
By highlighting where AI systems struggle with interactive feedback or visual balance, teams can refine training pipelines, adjust prompt patterns, or tweak model designs to better fuse code logic with user interaction flows.
Tencent expects this benchmark to serve as a key reference for measuring the next wave of AI tools that generate interactive experiences people want to use.

