In modern AI systems, keeping outputs tied to verifiable information is critical. Upstage’s Groundedness Check API enables developers to assess whether a generated answer truly reflects its supporting context. By sending context–answer pairs to an Upstage endpoint, teams receive a verdict on whether the answer is grounded in that context, along with a confidence rating. This article provides a hands-on tutorial on single-shot verification, batch processing, and cross-domain testing to help practitioners maintain factual consistency in their NLP pipelines.
To begin, install LangChain’s core library alongside the Upstage integration package using your preferred package manager. Next, import the Python modules you need for data structures and type annotations. Then assign your Upstage API key to an environment variable (for example, UPSTAGE_API_KEY) so that every call to the groundedness endpoint is properly authenticated. This setup works equally well for interactive sessions and automated scripts.
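A minimal setup sketch, assuming the langchain-core and langchain-upstage packages and the UpstageGroundednessCheck class from LangChain’s Upstage integration, might look like this (the placeholder key is illustrative):

```python
# Shell: pip install -qU langchain-core langchain-upstage
import os

# Assumes the key is supplied via the environment; replace with your own key management.
os.environ.setdefault("UPSTAGE_API_KEY", "<your-upstage-api-key>")

from langchain_upstage import UpstageGroundednessCheck

# One client instance can be reused across interactive sessions and automated scripts.
groundedness_check = UpstageGroundednessCheck()
```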
A central component of this walkthrough is the AdvancedGroundednessChecker class. It wraps Upstage’s HTTP interface in a clean Python object that records every context–answer check in a list as it is performed. The class exposes two primary methods: run_single to verify one pair at a time, and run_batch to process a list of pairs in one call. Additional helpers extract a categorical confidence label from each response and calculate overall accuracy metrics after all checks complete.
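A sketch of such a class is shown below. The class and method names follow the walkthrough; the internals (the verdict strings, the confidence mapping, and the use of the Runnable batch method for concurrent checks) are illustrative assumptions rather than the tutorial’s exact implementation.

```python
from typing import Any, Dict, List

from langchain_upstage import UpstageGroundednessCheck


class AdvancedGroundednessChecker:
    """Thin wrapper around Upstage's groundedness check that records every result."""

    def __init__(self) -> None:
        self.checker = UpstageGroundednessCheck()
        self.results: List[Dict[str, Any]] = []

    def run_single(self, context: str, answer: str) -> Dict[str, Any]:
        # The LangChain integration returns a verdict string such as "grounded" or "notGrounded".
        verdict = self.checker.invoke({"context": context, "answer": answer})
        record = {
            "context": context,
            "answer": answer,
            "verdict": verdict,
            "confidence": self._confidence_label(verdict),
        }
        self.results.append(record)
        return record

    def run_batch(self, pairs: List[Dict[str, str]]) -> List[Dict[str, Any]]:
        # Each item is a dict with "context" and "answer" keys; Runnable.batch runs
        # the checks concurrently rather than one by one.
        verdicts = self.checker.batch(pairs)
        records = []
        for pair, verdict in zip(pairs, verdicts):
            record = {**pair, "verdict": verdict, "confidence": self._confidence_label(verdict)}
            self.results.append(record)
            records.append(record)
        return records

    @staticmethod
    def _confidence_label(verdict: str) -> str:
        # Assumed mapping: a clear verdict ("grounded"/"notGrounded") counts as high
        # confidence; anything else (e.g. "notSure") is low.
        return "high" if verdict in ("grounded", "notGrounded") else "low"

    def accuracy(self) -> float:
        # Share of recorded checks judged grounded.
        if not self.results:
            return 0.0
        grounded = sum(1 for r in self.results if r["verdict"] == "grounded")
        return grounded / len(self.results)
```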
In our first examples, we execute four distinct checks to showcase the API’s responses. The first pair contains an incorrect claim about the Eiffel Tower’s height. The second offers an accurate fact about a well-known landmark. The third shows how the service handles an answer that only partially overlaps with the provided context. Finally, a contradictory statement is submitted to see how the service flags ungrounded replies. Each result prints immediately to the console.
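Using the class above, the four checks could look like the following; the context and answer texts are illustrative placeholders rather than the tutorial’s exact data:

```python
checker = AdvancedGroundednessChecker()

# Check 1: incorrect height claim -- expected to come back as not grounded.
r1 = checker.run_single(
    context="The Eiffel Tower is located in Paris and stands about 330 metres tall.",
    answer="The Eiffel Tower is 500 metres tall.",
)
print("Check 1:", r1["verdict"])

# Check 2: accurate fact about a well-known landmark -- expected to be grounded.
r2 = checker.run_single(
    context="The Great Wall of China is more than 13,000 miles long.",
    answer="The Great Wall of China stretches over 13,000 miles.",
)
print("Check 2:", r2["verdict"])

# Check 3: answer that only partially overlaps the context.
r3 = checker.run_single(
    context="Photosynthesis converts sunlight, water, and carbon dioxide into glucose and oxygen.",
    answer="Photosynthesis produces oxygen and also requires soil nutrients.",
)
print("Check 3:", r3["verdict"])

# Check 4: direct contradiction -- expected to be flagged as ungrounded.
r4 = checker.run_single(
    context="The Amazon River flows through South America.",
    answer="The Amazon River is located in Africa.",
)
print("Check 4:", r4["verdict"])
```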
For larger workloads, we demonstrate batch processing. A list of context–answer objects is passed at once to run_batch, and Upstage returns an array of judgment objects. After iterating through these responses, the script summarizes the number of grounded versus ungrounded answers and computes an overall accuracy percentage. The detailed output includes confidence scores, making it easy to filter borderline cases for review. This approach highlights how teams can validate hundreds or thousands of AI outputs in a scalable manner.
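A batch run might then look like this, continuing with the same checker instance (the example pairs are again illustrative):

```python
batch_items = [
    {"context": "Water boils at 100 degrees Celsius at sea level.",
     "answer": "Water boils at 100°C under standard atmospheric pressure."},
    {"context": "Python was first released by Guido van Rossum in 1991.",
     "answer": "Python was first released in 1991."},
    {"context": "The Pacific Ocean is the largest ocean on Earth.",
     "answer": "The Atlantic Ocean is the largest ocean on Earth."},
]

batch_results = checker.run_batch(batch_items)

grounded = sum(1 for r in batch_results if r["verdict"] == "grounded")
print(f"Grounded: {grounded} / Ungrounded: {len(batch_results) - grounded}")
print(f"Overall accuracy: {100 * checker.accuracy():.1f}%")

# Borderline cases can be filtered out for manual review.
for r in batch_results:
    if r["confidence"] == "low":
        print("Review:", r["answer"])
```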
We then apply the same batch workflow to three subject areas: science, history, and geography. For science, context–answer pairs include topics such as chemical formulas and planetary data; history cases cover dates and key events; geography examples ask about capitals and landforms. Each domain batch returns grounding verdicts accompanied by confidence ratings. Comparing results shows consistent performance even when handling specialized terminology. This cross-domain capability demonstrates the API’s versatility in diverse AI systems.
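The same workflow extends to domain-specific batches. The pairs below are illustrative; each record is tagged with its domain so later reports can group results by subject area:

```python
domain_batches = {
    "science": [
        {"context": "Water's chemical formula is H2O.",
         "answer": "Water consists of two hydrogen atoms and one oxygen atom."},
        {"context": "Mars is the fourth planet from the Sun.",
         "answer": "Mars is the second planet from the Sun."},
    ],
    "history": [
        {"context": "World War II ended in 1945.",
         "answer": "The Second World War concluded in 1945."},
    ],
    "geography": [
        {"context": "Canberra is the capital of Australia.",
         "answer": "Sydney is the capital of Australia."},
    ],
}

for domain, pairs in domain_batches.items():
    records = checker.run_batch(pairs)
    for record in records:
        record["domain"] = domain  # tag for per-domain reporting
    grounded = sum(1 for r in records if r["verdict"] == "grounded")
    print(f"{domain}: {grounded}/{len(records)} grounded")
```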
To consolidate findings, a helper function named create_test_report gathers all recorded checks into a structured summary. It computes overall accuracy, groups cases by domain, and highlights items that fall below a configurable confidence threshold. The report is formatted with clear labels for grounded and ungrounded examples, and it prints a concise table of metrics. The helper also exports results as JSON or Markdown for downstream analysis. This tool simplifies tracking model performance and identifying areas requiring further validation.
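One possible shape for this helper is sketched below; it relies on the checker sketched earlier, treats the categorical "low" label as the review threshold, and shows only the JSON export (a Markdown variant would follow the same pattern):

```python
import json
from collections import defaultdict
from typing import Any, Dict, List


def create_test_report(checker: AdvancedGroundednessChecker,
                       path: str = "groundedness_report.json") -> Dict[str, Any]:
    """Summarise every recorded check and export the summary as JSON."""
    results = checker.results
    grounded = [r for r in results if r["verdict"] == "grounded"]

    # Group verdicts by subject area; untagged records fall into "general".
    by_domain: Dict[str, List[str]] = defaultdict(list)
    for r in results:
        by_domain[r.get("domain", "general")].append(r["verdict"])

    report = {
        "total_checks": len(results),
        "grounded": len(grounded),
        "ungrounded": len(results) - len(grounded),
        "accuracy_pct": round(100 * checker.accuracy(), 1),
        "by_domain": {d: f"{v.count('grounded')}/{len(v)} grounded" for d, v in by_domain.items()},
        "low_confidence_cases": [r for r in results if r["confidence"] == "low"],
    }

    # Print a concise metrics table.
    print(f"{'Metric':<16}{'Value':>8}")
    for key in ("total_checks", "grounded", "ungrounded", "accuracy_pct"):
        print(f"{key:<16}{report[key]:>8}")

    with open(path, "w") as fh:
        json.dump(report, fh, indent=2)

    return report
```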
Integrating Upstage’s groundedness service into a continuous integration pipeline allows teams to enforce quality checks on every model update. Developers can script nightly runs to detect shifts in factual accuracy, trigger alerts when confidence drops below a target, and archive historical reports for audit purposes. Pairing groundedness assessments with existing logging and monitoring tools creates a robust workflow that continuously guards against content drift and helps maintain high standards for AI-generated knowledge.
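As a rough illustration, a nightly CI job could call the report helper and fail the build when accuracy dips below a chosen target; the 90% threshold here is purely hypothetical:

```python
import sys

ACCURACY_TARGET_PCT = 90.0  # hypothetical quality gate; tune per project

report = create_test_report(checker)
if report["accuracy_pct"] < ACCURACY_TARGET_PCT:
    # A non-zero exit code fails the CI job, blocking the model update for review.
    print(f"Groundedness {report['accuracy_pct']}% is below the {ACCURACY_TARGET_PCT}% target")
    sys.exit(1)
```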
For projects seeking deeper insights, the checker’s parameters can be tuned to adjust sensitivity thresholds or to include additional metadata in requests. Extending the class with custom domain-specific rules or integrating multilingual support broadens its applicability, as sketched below. Teams may also incorporate feedback loops, refining prompts or retraining models based on flagged cases. This flexible, domain-agnostic solution equips organizations with a reliable mechanism to verify AI outputs at scale.
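For example, a small, purely illustrative extension could tag each check with its domain and mark ungrounded answers for a prompt-refinement feedback loop:

```python
class DomainAwareChecker(AdvancedGroundednessChecker):
    """Illustrative extension: tag checks with a domain and flag cases for review."""

    def run_single(self, context: str, answer: str, domain: str = "general") -> Dict[str, Any]:
        record = super().run_single(context, answer)
        record["domain"] = domain
        # Hypothetical feedback hook: mark ungrounded answers so prompts can be revised later.
        if record["verdict"] != "grounded":
            record["needs_review"] = True
        return record
```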

