In a recent demonstration, a single round of self-critique turned a basic reply into a verified answer for a well-known train-speed puzzle. The process had the model compute an initial solution, review its own reasoning, and then generate a refined walkthrough that arrived at the correct result. The demonstration shows how embedding a feedback loop in a prompt pipeline can improve accuracy and clarity with minimal overhead.
One representative challenge asked: “A train travels 120 km at a certain speed. If the speed had been 20 km/h faster, it would have taken 30 minutes less to cover the same distance. What was the original speed of the train?” Using one pass of refinement, the model solved the resulting quadratic equation and identified 60 km/h as the correct answer.
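As a quick sanity check outside the model (not part of the original demonstration), the puzzle reduces to a quadratic: with original speed v in km/h, 120/v − 120/(v + 20) = 0.5 hours, which rearranges to v² + 20v − 4800 = 0, whose positive root is 60. A few lines of Python confirm it:

```python
# Sketch only: verifying the puzzle's answer independently of the LLM.
# Let v be the original speed (km/h): 120/v - 120/(v + 20) = 0.5 h
# rearranges to v**2 + 20*v - 4800 = 0.
import math

a, b, c = 1, 20, -4800
v = (-b + math.sqrt(b**2 - 4 * a * c)) / (2 * a)  # positive root of the quadratic

print(v)  # 60.0 -> the original speed was 60 km/h
```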
Self-Refine involves three main steps: the LLM produces an initial response to a user prompt; it then critiques its own output, pointing out errors or gaps; finally, it uses that critique to generate an improved reply, as sketched below. Repeating these steps lets the model gradually sharpen its reasoning, correct subtle mistakes, and strip away ambiguous phrasing.
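The loop itself is only a few lines once the model call is abstracted away. The sketch below is schematic: `llm` stands in for any chat-completion call, and the three prompt strings are illustrative rather than taken from a specific implementation.

```python
from typing import Callable

def self_refine(llm: Callable[[str], str], question: str, iterations: int = 1) -> str:
    """Generate, critique, and refine an answer `iterations` times."""
    draft = llm(f"Answer the following question, showing your reasoning:\n{question}")
    for _ in range(iterations):
        # Critique step: the model reviews its own draft.
        feedback = llm(
            f"Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
            "Critique this draft: list any errors, gaps, or unclear steps."
        )
        # Refine step: the critique is folded back into a rewrite.
        draft = llm(
            f"Question:\n{question}\n\nPrevious answer:\n{draft}\n\n"
            f"Feedback:\n{feedback}\n\nRewrite the answer, addressing every point of feedback."
        )
    return draft
```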
Tasks that demand multi-step reasoning, code generation, or structured content creation benefit from Self-Refine. Each pass helps iron out logical flaws in calculations, tighten code snippets, or clarify narrative flow in a generated article. Use cases include complex math puzzles, API integration examples, or draft outlines for reports.
A sample pipeline uses Mirascope’s @openai.call and @prompt_template decorators to structure each stage. First, the model receives a formatted prompt and returns a raw answer. Next, it critiques its own draft, outputting feedback comments. The self_refine function then loops through this cycle for a configured number of iterations, merging each critique with the original prompt to produce a refined reply at each step.
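A minimal sketch of that pipeline is below. It assumes Mirascope's v1-style `openai.call` and `prompt_template` decorators from `mirascope.core`; the function names (`answer`, `critique`, `refine`, `self_refine`), the prompt wording, and the model name are illustrative, and exact decorator signatures may differ across Mirascope versions.

```python
from mirascope.core import openai, prompt_template

@openai.call("gpt-4o-mini")
@prompt_template("Answer the question, showing each reasoning step: {question}")
def answer(question: str): ...

@openai.call("gpt-4o-mini")
@prompt_template(
    "Question: {question}\nDraft answer: {draft}\n"
    "Critique the draft: list calculation errors, logical gaps, and unclear steps."
)
def critique(question: str, draft: str): ...

@openai.call("gpt-4o-mini")
@prompt_template(
    "Question: {question}\nPrevious answer: {draft}\nFeedback: {feedback}\n"
    "Rewrite the answer so that every issue in the feedback is fixed."
)
def refine(question: str, draft: str, feedback: str): ...

def self_refine(question: str, iterations: int = 2) -> str:
    """Run the generate -> critique -> refine cycle a fixed number of times."""
    draft = answer(question).content
    for _ in range(iterations):
        feedback = critique(question, draft).content
        draft = refine(question, draft, feedback).content
    return draft
```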
Built in Python, the Mirascope framework offers utilities for prompt templating, API connectivity, and logging. It supports chaining multiple steps through clear function decorators and handles error reporting when a call fails. Logging hooks can capture each prompt, raw response, feedback comment, and final refined text for audit trails and debugging.
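Exact logging hooks depend on the Mirascope version in use; a framework-agnostic alternative is to wrap each stage with the standard `logging` module, as in this sketch (the stage labels are arbitrary):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("self_refine")

def logged(stage: str, text: str) -> str:
    """Record each stage's output so prompts, critiques, and rewrites are auditable."""
    logger.info("%s: %s", stage, text.replace("\n", " ")[:200])  # truncate for readability
    return text

# Usage inside the refinement loop:
# draft = logged("draft", answer(question).content)
# feedback = logged("critique", critique(question, draft).content)
# draft = logged("refined", refine(question, draft, feedback).content)
```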
The project repository includes example notebooks and Docker configurations that let developers run the pipeline locally or in cloud environments without extra setup.
API credentials are loaded from environment variables to avoid hardcoding secrets. Developers can configure iteration limits, prompt templates, and feedback criteria via a centralized settings file. This design maintains consistency across local, staging, and production deployments without modifying code.
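One way to wire this up (a hypothetical layout, since the article does not show the settings file itself) is to read the API key from the environment and keep tunable values in a small settings object; the environment variable names here are assumptions:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    # Loaded from the environment; never hardcoded in source.
    openai_api_key: str = os.getenv("OPENAI_API_KEY", "")
    max_iterations: int = int(os.getenv("SELF_REFINE_ITERATIONS", "2"))
    model: str = os.getenv("SELF_REFINE_MODEL", "gpt-4o-mini")

settings = Settings()
# The same code runs locally, in staging, or in production;
# only the environment variables change between deployments.
```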
To make outputs machine-readable and predictable, the pipeline can integrate Pydantic models. For example, a MathSolution schema defines fields for a step-by-step derivation and a final numeric result. An enhanced decorator then reads raw output plus critique notes, parses them into this schema, and returns a typed object. Downstream tools that consume such objects gain clear contracts around data shape and validation.
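A sketch of that structured-output step, assuming Mirascope's `response_model` argument on `openai.call`; the field names and prompt wording are illustrative:

```python
from pydantic import BaseModel, Field
from mirascope.core import openai, prompt_template

class MathSolution(BaseModel):
    steps: list[str] = Field(description="Step-by-step derivation")
    final_answer: float = Field(description="Final numeric result")

@openai.call("gpt-4o-mini", response_model=MathSolution)
@prompt_template(
    "Question: {question}\nDraft answer: {draft}\nCritique: {feedback}\n"
    "Return a corrected, step-by-step solution."
)
def refine_structured(question: str, draft: str, feedback: str): ...

# solution = refine_structured(question, draft, feedback)
# solution.final_answer  # e.g. 60.0 for the train puzzle
```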
Benefits of this self-review approach include:
- Higher accuracy by incorporating model-generated critique into each iteration.
- More transparent reasoning, with explicit steps for variable definition, equation setup, and algebraic solving.
- Stronger trust in results, since users see both the original output and the feedback-driven corrections.
This workflow extends well beyond math puzzles. Engineering teams can employ it to review and polish code snippets, preparing API clients or data pipelines for production. Content creators can auto-generate first drafts of blog posts or reports and use self-review to refine structure, tone, and factual accuracy. In data analysis, it can verify SQL queries or chart captions before final delivery.
Each refinement cycle adds calls to the API, raising computation time and cost. Users should select an iteration count that balances quality and performance. In simpler tasks, one or two loops may suffice. More intricate analysis or lengthy narratives may benefit from three or more cycles. Prompt designers can tune feedback questions to focus on calculation errors, logical gaps, or clarity issues based on project needs.
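For example, the critique prompt can be parameterized by a task-specific focus; the focus strings below are hypothetical, not taken from the article:

```python
# Hypothetical focus strings for steering the critique step.
CRITIQUE_FOCUS = {
    "math": "Check every calculation and unit conversion.",
    "code": "Look for bugs, unhandled edge cases, and missing error handling.",
    "prose": "Check structure, tone, and factual accuracy.",
}

def critique_prompt(question: str, draft: str, focus: str) -> str:
    """Build a critique prompt that targets one class of issue."""
    return (
        f"Question: {question}\nDraft: {draft}\n"
        f"{CRITIQUE_FOCUS[focus]}\nList every issue you find, one per line."
    )
```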
Current research explores self-review layers alongside chain-of-thought prompts or few-shot templates. Having the model audit its own output helps catch errors that might slip through single-pass generation. Initial trials show that two focused review passes often correct issues in reasoning chains or code samples. This extra check yields higher reliability in scenarios where precision matters, such as test case generation or formal documentation.
Future extensions could integrate self-refinement with retrieval-augmented generation or incorporate external calculators for numeric verification. By combining model introspection with domain-specific tools, teams may achieve even higher confidence in outputs that span facts, code, and structured data.
By building internal review loops into prompt workflows, teams gain a structured way to let models “check their work” before presenting final results. This pattern adds predictable quality control on top of raw generative capability, making it easier to trust outcomes without exhaustive manual review. As LLM platforms evolve, self-driven refinement may become a standard step in production pipelines for technical and creative applications.

