A 5-step playbook for building data agents on a governed semantic layer

Conversational data agents are useful when they are grounded in a governed semantic model. Here is the sequencing we use to ship one that holds up in front of an executive audience.

Conversational data agents (ask a question in English, get an answer with citations) are finally ready for production use, but only for the small set of organizations that did the semantic-layer work first. The 5-step playbook below is the sequencing we use on every engagement where leadership wants a real agent in production, not another demo that breaks on the third question.

The order matters more than the tooling. We have shipped agents on Power BI Copilot, on warehouse-native chat surfaces, and on bespoke LangChain-style stacks. The 5 steps below are tool-agnostic. They are what separates a data agent that an executive will actually open from one that gets quietly retired after a month.

Step 1: pick one domain and one audience

Every successful production agent we have shipped started with a deliberately narrow scope: one business domain, one audience. Sales pipeline for the RevOps team. Financial performance for FP&A. Customer health for the CS leadership team. The temptation to build a 'company-wide' agent on day one is the single biggest failure mode in this category.

We pick the domain by intersecting three criteria: a clean enough semantic model to ground against, an audience sophisticated enough to spot wrong answers, and an executive sponsor willing to be the user of last resort during the pilot. If any one of the three is missing, we hold the agent rollout until it is in place.

Step 2: curate the measures the agent is allowed to use

Agents are translators, not analysts. They translate a natural-language question into a query against measures and tables that already exist in the semantic model. If the measure is named clearly, defined canonically, and described in metadata, the agent can use it. If it is not, the agent invents, and an invented answer in front of a leadership audience is a trust event, not a feature.

The measure curation checklist

Every exposed measure has a single canonical definition with a named owner
Every measure carries a human-readable description and synonyms list
Filter context expectations are documented (what is implied, what must be specified)
Measures that overlap or contradict are hidden from the agent's surface
Linguistic synonyms for common business terms are wired into the model's metadata
Each measure has at least one test question with a known-correct answer
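The checklist above can be enforced mechanically before a measure is exposed. The following is a minimal sketch, not a reference implementation: the Measure record, its field names, and the curation_gaps and exposable helpers are hypothetical, and a real semantic layer would read this metadata from its own model definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Measure:
    """Hypothetical metadata record for one measure in the semantic model."""
    name: str
    definition: str = ""
    owner: str = ""
    description: str = ""
    synonyms: list[str] = field(default_factory=list)
    filter_context_notes: str = ""
    test_questions: list[str] = field(default_factory=list)
    conflicts_with: list[str] = field(default_factory=list)


def curation_gaps(measure: Measure) -> list[str]:
    """Return the checklist items this measure fails; empty means exposable."""
    gaps = []
    if not measure.definition or not measure.owner:
        gaps.append("missing canonical definition or named owner")
    if not measure.description or not measure.synonyms:
        gaps.append("missing description or synonyms")
    if not measure.filter_context_notes:
        gaps.append("filter context expectations undocumented")
    if measure.conflicts_with:
        gaps.append("overlaps or contradicts: " + ", ".join(measure.conflicts_with))
    if not measure.test_questions:
        gaps.append("no test question with a known-correct answer")
    return gaps


def exposable(measures: list[Measure]) -> list[str]:
    """Names of measures that pass every check and may be shown to the agent."""
    return [m.name for m in measures if not curation_gaps(m)]
```

Running curation_gaps over the whole model gives the data team a concrete to-do list per measure, and exposable gives the allow-list for the agent's surface.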

Step 3: ground the agent and constrain its outputs

Grounding is the discipline of making the agent cite what it used to answer. Without grounding, the agent's responses are uncheckable, and the failure modes are silent. With grounding, every answer comes with a measure name, a filter context, and a query that an analyst can rerun to verify. We treat grounding as non-negotiable for any agent that touches a leadership audience.

We also constrain the agent's output format. Numbers are returned with their unit, their period, their currency, and their filter context. Comparisons are explicit about which baseline is being used. Uncertainty is surfaced when the agent's confidence is below a threshold. Each of these constraints feels pedantic in isolation; together they are the difference between an agent that earns trust and one that loses it.
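One way to make these constraints hard to skip is to have the agent return a structured answer and render text only from that structure. The sketch below assumes a hypothetical GroundedAnswer type and an illustrative confidence threshold of 0.8; neither comes from a specific tool, and the query string is a placeholder for whatever the agent actually ran.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # assumed value for illustration; tune per deployment


@dataclass(frozen=True)
class GroundedAnswer:
    """Hypothetical answer payload: the number plus everything needed to verify it."""
    measure: str
    value: float
    unit: str
    currency: str
    period: str
    filters: dict[str, str]
    query: str
    confidence: float

    def render(self) -> str:
        """Format the answer so unit, currency, period, filters, and query are always shown."""
        filter_text = ", ".join(f"{k}={v}" for k, v in sorted(self.filters.items())) or "none"
        lines = [
            f"{self.measure}: {self.value:,.2f} {self.unit} {self.currency}, {self.period}",
            f"Filters: {filter_text}",
            f"Query: {self.query}",
        ]
        # Surface uncertainty instead of hiding it when confidence is low.
        if self.confidence < CONFIDENCE_THRESHOLD:
            lines.append(f"Low confidence ({self.confidence:.2f}); verify before use.")
        return "\n".join(lines)
```

Because every field is required, an answer without a period or a rerunnable query cannot be constructed at all, which is the point of the constraint.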

Step 4: roll out with audit logging from day one

Every question the agent answers gets logged. Every answer the agent produces gets logged. Every signal of user dissatisfaction (a thumbs-down, a follow-up correction, an angry reply) gets logged. Without this, the team is shipping a product they cannot iterate on. With it, the wrong answers become the backlog: not a hypothetical exercise but a ranked list of definitional, modeling, and prompt-pattern issues to fix.

What the audit log captures

The user's question verbatim
The query or queries the agent ran to answer it
The measures, tables, and filters used
The final answer returned, with its citations
User feedback signals (thumbs, follow-ups, corrections, escalations)
The session identifier so multi-turn conversations can be reconstructed
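The fields listed above map naturally onto one structured record per answered question. The following is a minimal sketch assuming an append-only JSON Lines log; the AuditRecord type, its field names, and the helper functions are hypothetical rather than the schema of any particular product.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    """Hypothetical audit entry: one record per question the agent answers."""
    session_id: str
    question: str            # the user's question, verbatim
    queries: list[str]       # every query the agent ran
    measures: list[str]
    tables: list[str]
    filters: dict[str, str]
    answer: str
    citations: list[str]
    feedback: list[str] = field(default_factory=list)  # e.g. "thumbs_down"
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


def reconstruct_session(log_lines: list[str], session_id: str) -> list[dict]:
    """Return one session's turns; the log is append-only, so line order is turn order."""
    rows = (json.loads(line) for line in log_lines)
    return [row for row in rows if row["session_id"] == session_id]
```

Keeping the question verbatim and the queries as executed means a reviewer can replay exactly what happened instead of guessing from a summary.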

Step 5: review the wrong answers weekly

The wrong-answer review is the single highest-leverage activity in the entire program. We schedule it weekly for the first 90 days. The data team and an analyst from the pilot audience walk through every flagged answer from the prior week. Each wrong answer is classified: definitional gap, modeling gap, prompt-pattern gap, or genuine model failure. Each classification produces a fix that goes into the next sprint.

After 90 days the cadence usually drops to every two weeks. The agent should be measurably more accurate at week 12 than at week 1, and the audit log is where that shows up. Without this review, the agent's accuracy plateaus around launch quality and slowly degrades as the underlying data drifts. With it, accuracy compounds.
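The weekly review reduces to two numbers worth tracking: how the flagged answers split across the four classes, and the wrong-answer rate for the week. A minimal sketch, with the Gap enum and both helper functions as hypothetical names introduced for illustration:

```python
from collections import Counter
from enum import Enum


class Gap(Enum):
    """The four classes assigned to flagged answers during the weekly review."""
    DEFINITIONAL = "definitional gap"
    MODELING = "modeling gap"
    PROMPT_PATTERN = "prompt-pattern gap"
    MODEL_FAILURE = "genuine model failure"


def backlog(classified: list[Gap]) -> list[tuple[str, int]]:
    """Rank gap classes by how many flagged answers fell into each."""
    return [(gap.value, count) for gap, count in Counter(classified).most_common()]


def wrong_answer_rate(flagged: int, answered: int) -> float:
    """Share of the week's answers that were flagged as wrong."""
    return flagged / answered if answered else 0.0
```

Plotting wrong_answer_rate week by week from the audit log is one simple way to check whether accuracy is actually compounding or has stalled.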

Where this goes wrong

Shipping the agent before the semantic layer is governed — the agent inherits every model defect
Exposing every measure in the model — confuses the agent and the user
Skipping grounding to ship faster — produces an agent that nobody can debug
Treating the wrong-answer review as optional — accuracy stalls
Rolling out to a broad audience before the pilot audience has stabilized the model
Picking a domain where leadership does not actually need conversational access — the agent gets opened twice and abandoned

What good looks like at day 90

A successful agent rollout produces three signals at the 90-day mark. The pilot audience opens the agent unprompted at least once a week. The wrong-answer rate is trending down week over week. And the executive sponsor is willing to use the agent in front of their peers without checking the number against a spreadsheet first. Two out of three means the program is still iterating. One out of three points to a model-foundation problem.

“A data agent is not an interface decision. It is a model decision wearing an interface.”

The one-line takeaway

Build the agent on a governed semantic model, curate the measures it can use, ground every answer, log every interaction, and review the wrong answers weekly. The companies that follow that sequence ship agents that compound; the ones that skip steps ship demos that decay.

Published May 13, 2026 · 12 min read