August 19, 2025

Agent and Model Evaluations with Paradigm

By Kyle Tianshi, Arnav Adhikari, Anna Monaco, and June Lee

As Paradigm’s evals team, we set out to design a robust evaluation system to assess and improve the quality of the data our agents generate. Our system needed to handle two types of evaluations at scale: model and agent. Model evaluations measure how well different models (e.g., GPT, Claude, Gemini, Grok) perform against each other, with everything else held equal. Agent evaluations assess how well the whole system (the model, scraping tools, the prompt, etc.) performs in practical scenarios. This presented us with three primary difficulties:

  1. Multi-step reasoning. Paradigm’s reasoning is much more involved than calling an AI model to return an answer. Each cell is generated using both column- and row-level context: columns are defined by a column name and optionally a column prompt, type, and format; rows are generated incrementally, with cells serving as context for other cells (a sketch of this cell-level context follows the list). As a result, the agent may take a long series of reasoning steps to arrive at a data point. This makes it necessary to evaluate not only the final answer but also the reasoning process and sources used.

An example of an agent’s cell enrichment process, which is a combination of tools (web search, scraping, private data APIs) and model reasoning.

  2. Open-ended outputs. Our evaluator should be able to handle subjective outputs (e.g., founder bios, candidate rankings, quality or relevance scores) where verifying cells is difficult and be flexible enough to address use cases that we haven’t even identified.
  3. Live data. Use cases for Paradigm can involve live data (e.g., stock prices, exchange rates, competitor prices), meaning that the same tool call can return different outputs depending on when it’s run.
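To make the multi-step setup concrete, here is a minimal sketch of the context a single cell enrichment might start from. The class and field names are illustrative rather than our actual schema; the point is that a cell request combines column-level context (name, optional prompt, type, and format) with row-level context (cells already filled in for the same row).

```python
from dataclasses import dataclass, field

@dataclass
class ColumnSpec:
    # Column-level context: only the name is required.
    name: str
    prompt: str | None = None   # optional natural-language instructions
    type: str | None = None     # e.g. "url", "number", "text"
    format: str | None = None   # e.g. "https://...", "$0.00"

@dataclass
class CellRequest:
    column: ColumnSpec
    # Row-level context: cells already filled in for this row,
    # which the agent can use as evidence for the next cell.
    row_context: dict[str, str] = field(default_factory=dict)

# Example: enrich a "GitHub URL" cell for a row that already
# has the engineer's name and LinkedIn URL filled in.
request = CellRequest(
    column=ColumnSpec(name="GitHub URL", type="url"),
    row_context={"Name": "Jane Doe", "LinkedIn": "https://linkedin.com/in/janedoe"},
)
```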

Our first design was a ground truth evaluation system. We provided Paradigm with initial information and asked it to enrich several columns, then calculated accuracy by string matching against manually scraped results. See the example below: we gave Paradigm a list of engineer names, asked it to fill in their LinkedIn, GitHub, and Twitter URLs, and compared the results against a manually gathered list.

Version one of Paradigm’s evaluation system.
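For illustration, the v1 scoring logic boiled down to something like the sketch below. The helper is hypothetical, not our production code, and it shows exactly why the approach was brittle: two URLs that point to the same profile can still fail an exact string match.

```python
def exact_match_accuracy(predicted: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Naive v1 scoring: a cell counts as correct only if the enriched
    string exactly matches the manually scraped value."""
    correct = sum(
        predicted.get(key, "").strip().lower() == value.strip().lower()
        for key, value in ground_truth.items()
    )
    return correct / len(ground_truth)

# Example: comparing an enriched profile URL for one engineer.
ground_truth = {"GitHub": "https://github.com/janedoe"}
predicted = {"GitHub": "https://www.github.com/janedoe"}    # semantically the same...
print(exact_match_accuracy(predicted, ground_truth))        # ...but scored 0.0
```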

This approach quickly proved to be infeasible. It only worked for data that could be clearly verified as true or false, string comparisons were often unreliable due to slight variations in text, and gathering the ground truth datasets required significant manual effort.

Evaluation Pipeline

Our solution was to incorporate a second agent into the evaluation pipeline.

A schematic of our current pipeline is shown below. We use an evaluator agent to label cell agent reasoning and outputs as either true or false. In our agent evaluation pipeline, we randomly select enrichment outputs to send to the evaluator. In our model evaluation pipeline, we run different AI models across a control dataset of 30+ well-known use cases to compare performance. In both cases, the evaluator agent runs directly after the cell agent to minimize issues caused by live data. These evaluations accumulate into a dataset of labeled cell agent outputs that informs improvements to our research and scraping capabilities.

Evaluation pipeline.
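In pseudocode terms, the agent evaluation loop looks roughly like the sketch below. The function names and the sample rate are illustrative assumptions; the key idea is that each sampled enrichment is evaluated right away, so live data has little time to drift, and the verdict is appended to the labeled dataset.

```python
import random

def run_agent_evaluation(enrichments, evaluate_cell, sample_rate=0.05):
    """Randomly sample cell agent outputs and evaluate each one immediately
    after it is produced, accumulating a labeled dataset over time."""
    labeled = []
    for enrichment in enrichments:
        if random.random() < sample_rate:
            verdict = evaluate_cell(enrichment)   # evaluator agent call
            labeled.append({"enrichment": enrichment, "correct": verdict})
    return labeled  # the dataset that informs agent improvements
```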

In the sections that follow, we’ll take a closer look at each part of this pipeline.

Evaluator Agent

Rather than evaluating only the final spreadsheet data, we also evaluate the steps the agent takes along the way.

We employ a separate AI evaluator to trace the line of reasoning that the agent takes and confirm that (a) each step is verifiable and makes sense, (b) the linked sources are correct and relevant, (c) the result is correct, and (d) the path to the solution is optimal. The evaluator returns a Boolean value of false if it finds an error in any of these criteria, and true otherwise. This approach still assigns a truth value to each cell, but has the advantage of handling more subjective results like startup quality ratings and professional summaries.

The prompt that the evaluator receives with each cell.
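Conceptually, the evaluator's verdict can be thought of as four Boolean checks that must all pass, as in the sketch below. The field names are hypothetical; in practice the criteria are expressed in the prompt above rather than in code.

```python
from dataclasses import dataclass

@dataclass
class EvaluatorVerdict:
    steps_verifiable: bool    # (a) each reasoning step is verifiable and makes sense
    sources_relevant: bool    # (b) the linked sources are correct and relevant
    result_correct: bool      # (c) the final cell value is correct
    path_optimal: bool        # (d) the path to the solution is optimal
    notes: str = ""

    @property
    def passed(self) -> bool:
        # A single failed criterion makes the whole cell evaluate to false.
        return all([self.steps_verifiable, self.sources_relevant,
                    self.result_correct, self.path_optimal])
```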

For example, we can run the evaluator on a company information spreadsheet that contains the names of big tech companies and asks Paradigm to enrich fields like company LinkedIn, employee count, and annual revenue. For a straightforward task like finding the Tesla company website, the evaluator simply checks that the link generated is valid. For a more subjective task like determining Apple’s competitive landscape, the evaluator checks the validity of each of the three links accessed by the initial agent, verifies that the listed companies are accurate and relevant, and ensures the reasoning path is efficient. Per the prompt that the evaluator receives, any error found in the agent’s reasoning automatically results in the evaluator returning false to avoid feeding false positives into the fine-tuning pipeline.

An evaluator output for a cell about Apple’s competitive landscape.

Since the evaluator is LLM-powered, it is prone to errors and hallucinations. A recurring issue in our testing is scrape failures (for example, when the evaluator cannot verify a source because of a CAPTCHA), which arise because the evaluator must go beyond the cell agent’s logic to independently validate the cell output. Because we would rather miss a correct cell than pass an incorrect one, we err on the side of avoiding false positives, which is why the evaluator returns false if it finds even one error. We also minimize incorrect evaluations by using a more powerful model for the evaluator and by checking for consistency across repeated evaluations of the same cell agent outputs. In practice, we treat evaluator outputs as ground truth, while recognizing that our accuracy estimates are likely conservative.
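One way to picture the consistency check is the sketch below: re-run the evaluator on the same cell agent output and only trust verdicts that agree. The fixed number of runs and the unanimity rule are illustrative simplifications, not a description of the exact production logic.

```python
def consistent_verdict(evaluate, cell_output, runs=3):
    """Re-evaluate the same cell agent output several times and only trust
    the verdict if every run agrees; otherwise flag it as unreliable."""
    verdicts = [evaluate(cell_output) for _ in range(runs)]
    if all(verdicts):
        return True
    if not any(verdicts):
        return False
    return None  # inconsistent evaluations -> send for manual review
```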

Agent Evaluation Pipeline

To get a sense of how well Paradigm’s infrastructure performs in practice, we automatically send samples of cells generated by users into the evaluator pipeline. This lets us evaluate the more varied use cases produced by real users and get immediate feedback on agent performance in new scenarios.

Model Evaluation Pipeline

Our model evaluation pipeline is a controlled setting for measuring small changes in our agent setup: right now we use it to compare model performance, but it can also track the impact of giving the model access to new tools and APIs.

To establish a baseline for how models perform against each other, we needed a control dataset representative of most Paradigm use cases. We ultimately compiled a dataset of over thirty use cases, ranging from people and company lookups to email, LinkedIn, GitHub, Twitter, Instagram, and TikTok scraping, as well as financial research, research paper analysis, and price comparisons.

We also bucketed these use cases by difficulty level. Easier use cases are well-indexed with well-defined answers (e.g., finding Meta’s website URL), while harder use cases are open-ended, involve unstructured information, require multiple steps, or require corroboration across multiple sources (e.g., summarizing how changes in privacy laws affect a certain software product).

Control dataset.
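As a rough sketch, each entry in the control dataset can be thought of as a named use case with a difficulty bucket, the columns to enrich, and the seed rows handed to the model. The structure and example entries below are illustrative, not the dataset itself.

```python
from dataclasses import dataclass
from enum import Enum

class Difficulty(Enum):
    EASY = "easy"   # well-indexed, well-defined answers (e.g., Meta's website URL)
    HARD = "hard"   # open-ended, multi-step, or requires corroborating sources

@dataclass
class UseCase:
    name: str
    difficulty: Difficulty
    columns: list[str]      # columns the model is asked to enrich
    seed_rows: list[dict]   # starting data given to the model

control_dataset = [
    UseCase("company_lookup", Difficulty.EASY,
            columns=["Website", "LinkedIn", "Employee count"],
            seed_rows=[{"Company": "Meta"}, {"Company": "Tesla"}]),
    UseCase("privacy_law_impact", Difficulty.HARD,
            columns=["Impact summary"],
            seed_rows=[{"Product": "Example SaaS"}]),
]
```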

With this control dataset, we’re able to easily run evaluations and get a sense of how models perform. We can also more finely examine performance on specific use cases and difficulty levels, which is useful for evaluating the impact of certain tools on the agent (e.g., how well it does at scraping Twitter posts after adding a Twitter tool).
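Slicing the labeled results is then a simple aggregation, sketched below with hypothetical field names: group evaluator verdicts by use case or difficulty and compute per-group accuracy.

```python
from collections import defaultdict

def accuracy_by_group(labeled_results, key):
    """Aggregate evaluator verdicts by an arbitrary key (e.g., use case or
    difficulty) to compare models or measure the impact of a new tool."""
    totals, correct = defaultdict(int), defaultdict(int)
    for result in labeled_results:
        group = key(result)
        totals[group] += 1
        correct[group] += bool(result["correct"])
    return {group: correct[group] / totals[group] for group in totals}

# e.g. accuracy_by_group(results, key=lambda r: r["use_case"])
#      accuracy_by_group(results, key=lambda r: r["difficulty"])
```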

Results

Our agent evaluations are ongoing as we onboard more users to the product. We’ve already run 71,660 individual evaluations on samples of real user enrichments and plan to release more results once we have a larger dataset.

Our current model evaluation results come from 59,390 individual evaluations of cells generated within our control dataset, with Claude 4 Sonnet as the consistent evaluator.

Future Work

We are constantly working to improve the performance of our cell agents across all use cases and expand the suite of web APIs and scrapers available to the agent. We plan to continually release evals for new models as they come out and refine our control dataset to cover all use cases as accurately as possible, with the hope that our system can serve as a valuable benchmark for model performance in web scraping-related tasks.