How to Score a User Simulator: Introducing USR-8

Authors: José Santos, Shuo Qiu, Morteza Ziyadi

User simulators have become a standard part of the agent-building toolkit: cheaper than real-user pilots, faster than scripted dialogues, and the only way to push an agent through hundreds of plausible conversations between every code change. The hard question is the one that comes next: how do you know the simulator itself is any good?

It's easy to underestimate how much can go wrong. A simulator that ends every turn with "Thanks so much for your help!" will quietly inflate your agent's scores. A simulator that coaches your agent ("could you check the fare rules first?") will hide failures behind helpful nudges. A simulator that scores well on a holistic "user-coherence" metric can still produce conversations that cluster around the same two openings and the same four invented names. None of these failure modes is visible from the agent's score sheet; all of them distort it. If the test harness isn't measured with the same rigor as the agent, you're back to vibes, just with bigger numbers attached.

This post is a research study of how we developed and measured the quality of Microsoft Foundry's user simulator. It offers two contributions: 1) an Eight-Metric User Simulation Rubric (USR-8), a minimal, orthogonal, and sufficient framework for scoring a user simulator that separates behavior from style and surfaces failure modes a single composite score erases, and 2) a set of empirical findings from running that rubric across 1,200 conversations, three domains, and four simulator configurations.

The short version, up front: across those 1,200 conversations, the Foundry user simulator scored at or near ceiling on every per-conversation metric except realism, and a small prompt revision closed the realism gap.
More interestingly, the same prompt loaded into a third-party harness produced indistinguishable conversations on every metric we measured, so most of what makes a "good" simulator lives in the prompt policy, not the orchestration code. That last finding should generalize well past Foundry, and it's the one we'd most like other groups doing this kind of evaluation to test against their own systems.

Two philosophies of user simulation

Before you can evaluate a user simulator, you have to decide what you want it to be. There are two coherent answers, and they pull in opposite directions.

Philosophy A: Realistic foil. The simulator stays in character and never coaches the agent. If the agent skips a step, the simulator notices the way a real customer would (by getting confused, getting frustrated, or moving on), not by saying "you forgot to check the fare rules." Conversation failures attributable to the agent stay visible and clean. This is what you want when you're measuring the agent's intrinsic quality.

Philosophy B: Helpful tester. The simulator is part of an end-to-end task-success benchmark. It may nudge the agent back on track when it goes off-script, but only as a real cooperative user would: by answering clarifying questions and supplying missing details, not by coaching the agent on how to do its job. That help can mask failures, so this setup measures whether the task gets completed by the agent and simulator together. Treat the result as an optimistic system-level success rate: it credits the simulator's help and can overstate the agent's standalone ability if that help slips into coaching.

Neither philosophy is wrong. They serve different product use cases, and the choice has direct consequences for what your scores mean.

| You're trying to… | Use Philosophy A | Use Philosophy B |
|---|---|---|
| Compare two agent prompts | ✅ | ⚠️ (simulator help can flatten differences) |
| Catch regressions during development | ✅ | ⚠️ |
| Measure end-to-end task success | ⚠️ (no help may lead to lower pass rates) | ✅ |
| Red-team for adversarial inputs | needs persona seeds | ✅ |
| Generate realistic-looking sample conversations | ✅ | ❌ |

Microsoft Foundry's simulator is built around Philosophy A: stay in character, don't coach. Several widely used third-party user-simulator frameworks default to Philosophy B: a fixed opening prompt and explicit instructions for the simulated user to correct the agent when it misses a step. We'll come back to that contrast when we look at the results.

The first thing to do, before you score anything, is pick the philosophy that matches your use case. Otherwise, you'll end up measuring a simulator on a job it wasn't designed to do.

Eight metrics for scoring a user simulator

Once you've picked a philosophy, you need a rubric. Generic agent metrics (task adherence, intent resolution, tool-call accuracy) don't apply, because the simulator isn't trying to complete the task; it's trying to be a user.

We built USR-8: eight LLM-judge metrics specifically for evaluating user-simulator output. We designed these metrics to be minimal, orthogonal, and sufficient: each captures a distinct failure mode that cannot be reliably inferred from the others. Seven are per-conversation; one is cohort-level. All score on a 1–5 integer scale with a short rationale, and all were judged by a current-generation GPT model running at low reasoning effort.

Per-conversation metrics (each judge sees the full transcript and the original scenario):

Clarity: Does the user state their request clearly enough that a competent agent could act on it without guessing? Penalizes ambiguous, fragmented, or under-specified asks.
Relevance: Does the user stay on the topic the scenario specifies?
Penalizes drift, off-topic detours, and wholesale topic swaps (e.g., scenario asks for a California vendor agreement; user opens with an English NDA). Does not penalize the user when the agent goes off-topic.
Steering: Does the user keep the conversation moving toward their goal without coaching the agent, i.e., without telling the agent how to do its job? A high score requires productive steering that stops short of becoming the agent's manager. This is the Philosophy A guard rail; expect a Philosophy B simulator to score lower here by design.
Responsiveness: Does each user turn acknowledge and respond to what the agent just said, rather than ignoring it or repeating prior asks? Picks up on options the agent actually offered.
Consistency (coherence): Does the user maintain a single coherent identity, set of facts, and conversational thread across the whole conversation? Penalizes self-contradiction, e.g., changing a booking code mid-conversation.
Realism: Does the user sound like a human writing in the moment (hesitations, false starts, contractions, emotional register that fits the scenario, plausible imperfection) rather than a polished ghostwriter playing a role? This metric scores how human the prose reads, not whether the behavior is correct.
Persona fidelity: When the scenario specifies a persona ("frustrated returning customer", "junior on-call engineer", "in-house counsel"), does the user embody it faithfully? Defaults to 5 when the scenario specifies no particular traits.

Cohort metric (judged on an entire run of conversations):

Diversity: Across, say, 100 conversations of the same simulator on the same scenario set, how varied are the conversations? Considers distinct names, distinct opening framings, distinct emotional registers, and unique scoring terms. 1 = visibly clustered (same names, same opening template); 5 = high variety across all axes.

A few notes on the design:

Separate behavior from style.
Clarity, relevance, steering, responsiveness, consistency, and persona fidelity describe what the simulator does. Realism describes how it sounds. They tend to move independently, and conflating them hides important signal.
Include a no-coaching guard. If you're under Philosophy A, you need at least one metric that explicitly penalizes the simulator for coaching the agent. Otherwise the judge's prior about "helpful" user turns will let coaching slip through.
Score the cohort, not just the conversation. Per-conversation metrics can be at ceiling while the cohort is visibly clustered (same names, same opening, same closer). Diversity catches that.
Use scenarios, not scripts. A "scenario" in our setup is a short prose task description (typically 80–200 words) given to the simulator as "the thing the user is trying to accomplish today." The simulator invents concrete details (names, booking codes, dates, error counts) that the scenario doesn't specify. This intentionally probes the simulator's ability to stay in character while improvising plausibly.

The minimum subset, in our experience, is relevance + steering + consistency + realism: it covers on-topic behavior, no-coaching, within-conversation coherence, and prose quality. The other four sharpen the picture, especially when comparing two simulators that look close on the minimum set.

How we put it to the test

Three agents, in three domains. We created three Microsoft Foundry agents, each in simulation mode (tool responses generated by a current-generation GPT model against a tool catalog rather than a live backend), chosen to span both consumer-vs-professional tone and short-turn-vs-long-turn conversational shape:

airline-customer-service-sim: refunds, rebookings, fare-rule lookups, baggage allowance, seat changes; 12 tools.
sre-incident-triage-sim: on-call engineer that confirms alerts, looks up service ownership, opens incidents, paginates runbooks, coordinates rollback; 9 tools.
legal-contract-review-sim: vendor agreements, redline liability and indemnity clauses, deviations from a contract template, negotiation drafts; 8 tools.

Ten scenarios per domain, ten conversations per scenario. Per simulator configuration: 10 scenarios × 10 conversations × 3 domains = 300 conversations. With four simulator configurations, that's 1,200 conversations scored. The 10 × 10 design gives 95% CI half-widths of 0.05–0.15 on per-conversation metrics, tight enough to detect effects above ~0.1.

Four simulator configurations. This is where the comparative claims come from:

Foundry user simulator, baseline prompt: an earlier prompt, before the realism revision, included here as an ablation baseline.
Foundry user simulator, production prompt: the current production prompt, whose realism-focused revision is the change we isolate below.
A publicly available third-party user-simulator framework with its default prompt.
The same third-party framework with our production prompt ported in through its prompt-customization mechanism. This was the isolation experiment: same underlying model, same third-party harness, same scenarios; only the prompt swapped. If results changed dramatically, the prompt was the lever; if they held steady, the harness was.

Every conversation in all four configurations was scored by the eight judges. Total judgments: roughly 10,000.

What we found

Three findings are worth reporting in detail.

1.
The Foundry simulator is at or near ceiling on every metric except realism

For the baseline prompt (before the realism revision), at n=100 conversations per domain:

| Metric | Airline | SRE | Legal |
|---|---|---|---|
| Clarity | 5.00 | 4.89 | 5.00 |
| Relevance | 5.00 | 5.00 | 5.00 |
| Steering | 5.00 | 4.86 | 5.00 |
| Responsiveness | 5.00 | 5.00 | 5.00 |
| Consistency | 5.00 | 5.00 | 5.00 |
| Persona fidelity | 4.48 | 4.89 | 4.86 |
| Realism | 3.97 | 4.13 | 3.34 |

Diversity, the eighth USR-8 metric, is cohort-level rather than per-domain, so it does not appear in this per-conversation table; on our current scenario set it does not yet discriminate between simulators.

Five of the seven per-conversation metrics are essentially pinned at 5.00. Persona fidelity is strong without being at ceiling. Realism is the consistent underperformer. The judge's notes pinpoint the problem in plain language: clean, polished phrasing; no hesitation; no false starts; no contractions; names that read like marketing copy ("Priya Singh", "Olivia Hartwell"); and a polished closing tic that thanks the agent the way a real customer rarely does. In other words, the simulator sounds less like a customer and more like a professional ghostwriter playing a customer.

2. A single prompt revision closed the realism gap

The realism revision (now the production default), which focused entirely on register, hesitation, and dropping the closing-thanks tic, moved realism by +0.61 on airline, +0.62 on SRE, and +1.06 on legal, with no regression on any other metric. The deltas are 4–10× the 95% CI half-width on each domain, well outside noise.

This says something specific about what "improving a user simulator" actually looks like in practice: at the production-prompt stage, the high-leverage edits are about prose register, not about behavior. The behavior-side metrics were already pinned at ceiling; the work that moved the needle was style work.

3. Most of the gap lived in the prompt, not the harness

A surprising finding came from the isolation experiment.
The third-party simulator with its default prompt scored realism in the range 1.5–2.0 across the three domains, 2.5–3.2 points below the Foundry simulator. The judge's notes named two recurring failures: meta-reasoning leakage in the user turn ("Step 1: Analyze what the Agent just said. Step 2: …") and explicit coaching of the agent ("It looks like you skipped a step. Could you please check the fare rules first?"). Both behaviors are required by the third-party default prompt and forbidden by Foundry's.

When we ported the Foundry production prompt into the third-party framework's prompt-customization hook (same model, same harness, same scenarios, only the prompt swapped), realism on the third-party harness jumped to 4.4–4.6, essentially matching the Foundry production reference (4.4–4.7). Across all per-conversation metrics, gap closure landed in the 96–99% range. Both failure modes we had flagged in the default run, the reasoning leakage and the coaching, were gone once the Foundry prompt was in place.

Here is what that actually means: simulator behavior is encoded primarily in the prompt policy, not in the orchestration code. The two harnesses are similar enough that, given the same prompt, they produce indistinguishable conversations on every measured per-conversation metric. Where they diverge is in the default policy each one ships with, and that policy is editable.

The practical consequence, with the caveat that we compared only two harnesses: within that pair, the choice of framework mattered far less than the prompt running on top of it. We would not over-generalize from two systems, but the direction is clear: most of the behavior we measured lived in the prompt policy, and that is the asset worth investing in.

A note on diversity

Across all four configurations, cohort diversity scored a flat 2/5.
This is a feature of the scenario set, not the simulator: when the simulator is given the same 10 prose scenarios 10 times, it produces conversations that cluster on names, openings, and emotional registers. Discriminating between simulators on diversity will require either much richer scenarios or a persona-seed mechanism that supplies named persona archetypes, e.g. cooperative happy-path; frustrated; misinformed; adversarial; non-native speaker; low-effort. That's on our roadmap; it's not a near-term fix you can drop into an existing simulator.

Methodological recommendations

A checklist for any team doing this kind of evaluation, on Foundry's simulator or a different one, based on the gaps we ran into and the choices that paid off:

Pick the philosophy first. Decide whether you're measuring a realistic foil or a helpful tester, write that decision down, and design your metrics around it. Mixing them inside one rubric produces incoherent scores.
Separate behavior from style. Use at least one metric per category. Concretely: in our baseline-versus-production comparison, the entire improvement landed on realism while every behavior metric stayed flat, which shows that a single composite score would have averaged away the only signal that moved.
Add an explicit no-coaching metric (under Philosophy A). LLM judges tend to reward helpfulness; without an explicit penalty, coaching slips through and gets scored as conscientious behavior.
Run at n ≥ 100 per condition. Below that, CIs are too wide to call effects under ~0.5 reliably. We caught one apparent catastrophic failure at n=10 that turned out to be a sampling artifact at n=100.
Score the cohort, not just the conversation. A cohort-level diversity judge is what told us our diversity weakness was scenario-set-bound, not simulator-bound.
Use scenarios, not scripts. Forcing the simulator to improvise specifics is what surfaces persona fidelity and realism.
Scripted dialogues let a weak simulator pass by giving it nothing to invent.
Build a human-rated calibration set. A few dozen conversations, hand-rated by at least two annotators, is a reasonable starting point for bounding judge bias via Spearman correlation, though the right size also depends on inter-human agreement and needs to be validated. We have not done human calibration for this study, and this is an important methodological gap to close in future work.
Always compare against at least one external baseline. A score on a custom rubric only means something relative to another simulator scored on the same rubric. "4.7 out of 5" reads as excellent in isolation, but if a baseline simulator also scores 4.7, you've learned nothing about your simulator, only that the rubric is generous. External baselines defend you against your own optimism.
When you compare, isolate the variable. A simulator's output depends on multiple components: at minimum the prompt and the surrounding harness, often also the judge configuration and the model. A head-to-head between two simulators changes all of them at once, so the result is uninterpretable: you can't tell which component drove the gap. Run at least one swap experiment that holds everything constant except the variable you care about: port your prompt into the other harness, or theirs into yours, and re-score. Without a controlled swap, framework-vs-framework numbers are a confounded measurement, not a result.

Where this kind of measurement has limits

User simulators are not the right instrument for every evaluation. Three honest caveats:

A deterministic oracle beats a simulated conversation, when one exists. If you can assert programmatically that a refund record was created, that check is tighter and cheaper than judging a whole conversation, so reserve user simulation for what no oracle can see: tone, persistence, and recovery from confusion.
You cannot validate invented facts against a real system.
Because the user simulator will make up specifics such as booking codes, account numbers, and dates, nothing it asserts can be cross-checked against ground-truth records, so simulation cannot test correctness that depends on real identifiers. This is structural, not a defect of the simulator.
Adversarial behaviors are a different design. A simulator built to stay in character is, by construction, not built to actively try to break the agent. Red-teaming belongs in a separate evaluation track with its own metrics.

Within those limits, the methodology does what we need: it quantifies realism, coherence, on-topic behavior, and persona fidelity at a scale hand-rating cannot reach.

Final thoughts

Three takeaways are worth carrying out of this work, whether or not you ever look at the Foundry user simulator specifically.

Evaluation tools can be wrong too. Evaluation is recursive: any tool you use to score your agent is itself a system that can be wrong. A simulator that sounds polished can flatter a mediocre agent; a simulator that coaches when it shouldn't can hide real regressions.
Behavior and style are separate axes. Style failures (the polished-ghostwriter problem) and behavior failures (the coaching problem) move independently and call for different fixes. A single composite score erases that signal.
The prompt did most of the work. Swapping the prompt between two harnesses closed 96–99% of the measured gap on every per-conversation metric. A fair comparison between simulator frameworks has to control for prompt; otherwise you're measuring prompts, not frameworks. We saw this across the two harnesses we compared; a harness with more simulation-side logic could move outcomes more, so treat it as a strong signal, not a universal law.

User simulators are how teams ship conversational agents without waiting for real users. Measuring them with the same rigor as the agents they're meant to test is what keeps the end-to-end evaluation honest.
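
For readers who want to reproduce a gap-closure number like the 96–99% figure on their own swap experiment, here is a minimal sketch of one way to compute it. The definition used (the share of the default-to-reference gap recovered after the prompt swap) is an assumption about how to read "gap closure", not a formula stated in this study. The function name `gap_closure` is hypothetical, and the example scores are illustrative values picked from inside the realism ranges reported above, not measured data.

```python
def gap_closure(default_score: float, swapped_score: float, reference_score: float) -> float:
    """Fraction of the default-to-reference gap recovered after a prompt swap.

    Assumed definition (not confirmed by the study):
        (swapped - default) / (reference - default)
    """
    gap = reference_score - default_score
    if gap == 0:
        raise ValueError("reference and default scores are equal; there is no gap to close")
    return (swapped_score - default_score) / gap


# Illustrative realism means for one domain (hypothetical, not the study's data):
# third-party default prompt, third-party harness with the ported prompt,
# and the reference simulator with the same prompt.
example = {"default": 1.8, "swapped": 4.5, "reference": 4.6}

closure = gap_closure(example["default"], example["swapped"], example["reference"])
print(f"gap closure: {closure:.0%}")  # prints "gap closure: 96%"
```

Under this definition, a value near 1.0 means the swapped configuration has caught up with the reference, and a value above 1.0 means it overshot. Computing it per metric and per domain, rather than on a pooled average, keeps a large gain on one metric from hiding a shortfall on another.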
Get started

Start building in Microsoft Foundry: ai.azure.com
Get the BRK252 session: aka.ms/build26-BRK252
Build 2026 Observability team blog post: From observability to ROI for AI agents on any framework
Read the docs: Microsoft Foundry user simulation documentation
Join the community: aka.ms/ai/discord

https://techcommunity.microsoft.com/t5/microsoft-foundry-blog/how-to-score-a-user-simulator-introducing-usr-8/ba-p/4523642