ai-assisted

How AI agent observability should influence Fabric data product design

Frank Garofalo

12 May 2026 — 9 min read

Observability is now part of the data product. If your Fabric estate cannot explain why an agent answered or acted, the product is not finished.

Most Fabric data products are still designed like BI assets: curated, governed, and optimized for dashboards. But once AI agents start querying, chaining, and triggering downstream actions, observability stops being an operations add-on and becomes part of the product contract.

Microsoft’s direction is clear. Fabric is being positioned as a unified analytics platform where shared business semantics can serve analytics, AI agents, and applications through capabilities like Fabric IQ, ontology, and data agents. That is not “BI plus chat.” It is an agent-facing data plane with a much larger accountability surface.

If a data product cannot explain agent behavior, it is under-designed

The old definition of a “good” data product focused on freshness, quality, discoverability, performance, and governance. Those still matter. But with agents, the trust question changes from “Is this dataset certified?” to “Can we reconstruct why this answer happened?”

That is why I take a hard position here: observability is part of product design, not a downstream platform concern.

Leaders should care for three reasons:

Governance exposure: if an agent pulls from a semantic model, joins context from connected systems, and produces a recommendation, you need evidence of what it touched and why.
Hidden cost growth: agent traffic creates new query patterns, retrieval loops, and repeated semantic resolution that get expensive fast without visibility.
Operational trust: when a business user says “that answer is wrong,” lineage diagrams are not enough.

One field example made this painfully clear. In a finance copilot pilot I reviewed with a 14-person data team, two users received different margin answers from the same Fabric semantic model. The team had no trace IDs, no entity-level capture, and no replay path, so they could not determine whether the issue came from prompt context, semantic ambiguity, or retrieval behavior. One pilot does not prove a universal pattern, but it shows the failure mode that appears when agent observability is missing.

A second failure mode is cost. Teams often discover only after rollout that an agent is repeatedly hitting the same semantic objects or re-running expensive retrieval paths because nothing in the product emits enough telemetry to show the loop.
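
One way to surface that loop, as a minimal sketch: count how often the same semantic object is hit inside a single trace. The event fields (`trace_id`, `agent_id`, `semantic_entity`) and the threshold of 3 are assumptions for illustration, not a Fabric schema.

```python
from collections import Counter

# Hypothetical retrieval events; field names are illustrative assumptions.
events = [
    {"trace_id": "trc-9", "agent_id": "finance-copilot", "semantic_entity": "Finance.GrossMargin"},
    {"trace_id": "trc-9", "agent_id": "finance-copilot", "semantic_entity": "Finance.GrossMargin"},
    {"trace_id": "trc-9", "agent_id": "finance-copilot", "semantic_entity": "Finance.GrossMargin"},
    {"trace_id": "trc-9", "agent_id": "finance-copilot", "semantic_entity": "Finance.Revenue"},
    {"trace_id": "trc-10", "agent_id": "finance-copilot", "semantic_entity": "Finance.GrossMargin"},
]

def find_retrieval_loops(events, threshold=3):
    """Return (trace_id, entity, hits) where one trace hit an entity at least `threshold` times."""
    counts = Counter((e["trace_id"], e["semantic_entity"]) for e in events)
    return [(trace_id, entity, hits) for (trace_id, entity), hits in counts.items() if hits >= threshold]

loops = find_retrieval_loops(events)
print(loops)
```

A check like this only works if the product emits an entity-level event per retrieval, which is the point of the argument.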

Why now: Fabric is becoming agent-facing

This is not speculative. Microsoft Learn materials describe Copilot and data agents as part of the Fabric workflow itself, not as a separate sidecar. Fabric data agents can answer over lakehouses, warehouses, Power BI semantic models, KQL databases, ontologies, and Microsoft Graph in Fabric. Azure guidance now treats agent orchestration and retrieval patterns as architecture concerns, not experiments.

That changes the upstream design brief.

If Fabric products are going to support agents, they must emit enough telemetry and semantic context to make those interactions traceable across system boundaries. The question is no longer whether agent behavior is opaque. The question is whether your product emits the identifiers, semantic anchors, and policy checkpoints needed to make that behavior explainable.

A simple way to frame the shift: the answer is no longer the whole product. The explainable path to the answer is part of the product too.

The design point is that observability is not a passive sink. It sits alongside the agent-data interaction and captures metadata, answer context, and replay signals.

From BI-era products to agent-era products

Traditional Fabric success criteria are still valid:

data freshness
data quality
certified metrics
dashboard performance
role-based access
lineage from source to report

But agent-era products need additional criteria:

traceability of prompt-to-data paths
replayability of interactions
semantic accountability across systems
evaluation signals tied to business outcomes
policy visibility for what was allowed, filtered, or blocked

A semantic model designed for a human analyst is not automatically sufficient for an agent. Humans can infer ambiguity and ask careful follow-ups. Agents follow the retrieval and orchestration path they are given. If that path spans a lakehouse, warehouse, semantic model, KQL database, ontology, and connected context, the product boundary now includes what happens downstream in that chain.

That is the bridge to the real design question: what should a Fabric product emit so those paths can actually be investigated?

What first-class observability looks like in Fabric product design

Here is the minimum observability contract I would require for any Fabric data product that may be consumed by agents:

Request telemetry

A unique trace ID, timestamps, latency, caller or agent identity, and product identifier.

Data lineage

The source assets used: lakehouse, warehouse, semantic model, KQL database, ontology, or connected context.

Semantic context

The business entities, measures, and ontology terms involved in the answer path.

Policy decisions

Which access checks, masking rules, or governance controls influenced retrieval.

Retrieval traces

Which datasets, entities, or semantic objects were actually referenced.

Outcome and evaluation signals

User feedback, quality scores, escalation flags, or downstream action success/failure.

If you do not define that contract up front, governance, security, and AI engineering teams will be forced to reverse-engineer behavior later.
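
As a rough sketch, the six parts of that contract can be written down as required fields and checked before an event is accepted. Every field name below is hypothetical; the grouping mirrors the contract above rather than any Fabric API.

```python
# Required fields per contract area; all names are illustrative assumptions.
CONTRACT = {
    "request_telemetry": ["trace_id", "timestamp_utc", "latency_ms", "agent_id", "product_id"],
    "data_lineage": ["source_assets"],
    "semantic_context": ["semantic_entities"],
    "policy_decisions": ["policy_decisions"],
    "retrieval_traces": ["retrieved_objects"],
    "outcome_signals": ["quality_score"],
}

def missing_fields(event: dict) -> dict:
    """Map each contract area to the required fields the event does not carry."""
    gaps = {}
    for area, fields in CONTRACT.items():
        absent = [name for name in fields if name not in event]
        if absent:
            gaps[area] = absent
    return gaps

# An event that covers telemetry, semantics, and outcome, but nothing else.
event = {
    "trace_id": "trc-3003",
    "timestamp_utc": "2026-05-12T09:00:00+00:00",
    "latency_ms": 512,
    "agent_id": "sales-copilot",
    "product_id": "fabric-ds-sales-001",
    "semantic_entities": ["Sales.NetRevenue"],
    "quality_score": 0.8,
}

gaps = missing_fields(event)
print(gaps)
```

Rejecting or flagging events with gaps at write time is cheaper than discovering during an incident that policy decisions were never recorded.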

In a Fabric-centric architecture, these events should land somewhere durable and queryable: for example a lakehouse table for audit history, an event stream for near-real-time routing, or a KQL store for investigation and replay workflows.

# Python: Define a lightweight trace envelope for Fabric-backed agent answers
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import List

@dataclass
class AgentTrace:
    trace_id: str
    prompt: str
    answer: str
    fabric_dataset_id: str
    semantic_entities: List[str]
    latency_ms: int
    quality_score: float
    timestamp_utc: str

trace = AgentTrace(
    trace_id="trc-1001",
    prompt="Summarize Q4 margin drivers for retail.",
    answer="Margin improved due to lower logistics cost and higher premium mix.",
    fabric_dataset_id="fabric-ds-sales-001",
    semantic_entities=["Retail.Margin", "Retail.LogisticsCost", "Retail.ProductMix"],
    latency_ms=842,
    quality_score=0.91,
    timestamp_utc=datetime.now(timezone.utc).isoformat(),
)

print(asdict(trace))

What matters here is not the exact schema. It is that the trace includes a dataset identifier, semantic entities, latency, and a quality signal so the answer can be tied back to both the Fabric asset and the business meaning behind it.

# Python: Correlate an agent response with Fabric metadata and compute a simple quality score
import time
import uuid

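def quality_score(answer: str, entities: list[str], grounded: bool) -> float:
    # Illustrative heuristic only: rewards grounding, entity coverage, and a
    # non-trivial answer length. It is not a substitute for real evaluation.
    score = 0.4 if grounded else 0.1
    score += min(len(entities) * 0.15, 0.45)
    score += 0.15 if len(answer.split()) > 8 else 0.05
    return round(min(score, 1.0), 2)

# Start the timer before the agent call; the values below stand in for that call.
start = time.perf_counter()
dataset_id = "fabric-ds-finance-007"
entities = ["Finance.Revenue", "Finance.GrossMargin"]
answer = "Revenue increased 8% and gross margin expanded due to favorable product mix."
# No real agent call runs here, so 120 ms is added as simulated latency.
latency_ms = int((time.perf_counter() - start) * 1000) + 120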

trace = {
    "trace_id": str(uuid.uuid4()),
    "dataset_id": dataset_id,
    "semantic_entities": entities,
    "latency_ms": latency_ms,
    "quality_score": quality_score(answer, entities, grounded=True),
}
print(trace)

The next step is persistence. If traces are not stored in a durable, queryable format, there is no serious replay or governance workflow.

# Python: Persist observability events for replay and governance review as JSON lines
import json
from datetime import datetime, timezone

event = {
    "trace_id": "trc-2002",
    "agent_id": "sales-copilot",
    "dataset_id": "fabric-ds-sales-001",
    "semantic_entities": ["Sales.Region", "Sales.NetRevenue"],
    "latency_ms": 633,
    "quality_score": 0.88,
    "review_flag": "retain_for_governance",
    "timestamp_utc": datetime.now(timezone.utc).isoformat(),
}

path = "agent_trace_store.jsonl"
with open(path, "a", encoding="utf-8") as f:
    f.write(json.dumps(event) + "\n")

with open(path, "r", encoding="utf-8") as f:
    print(f.readlines()[-1].strip())

After this, the design task becomes concrete: define where these events live, who can access them, and how long they are retained.

Lineage is no longer enough

Classic lineage answers: “Where did this data come from?”

Agent traceability answers: “How was this answer assembled?”

Replayability answers: “Can we reproduce the path closely enough to investigate and improve it?”

Those are different questions.

Lineage is necessary because it establishes provenance. But lineage without interaction traces leaves teams unable to investigate hallucinations, policy breaches, or expensive query behavior. If a user gets a bad answer, you need more than the source dataset. You need the semantic entities referenced, the retrieval steps taken, and the versioned state of the product at the time.

That has direct design implications for Fabric products:

version semantic models intentionally
snapshot or timestamp critical source states
retain ontology changes with effective dates
preserve policy decision events
store enough interaction context to support replay without oversharing sensitive content
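
Those implications can be sketched as a small replay context attached to each trace. The class, its fields, and the version strings are hypothetical. The prompt is stored as a SHA-256 digest so a replay can confirm it used the same prompt without the trace holding the raw text; the raw prompt, if kept at all, would live in a more tightly controlled store.

```python
import hashlib
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ReplayContext:
    # All field names are illustrative assumptions, not a Fabric schema.
    trace_id: str
    semantic_model_version: str
    ontology_version: str
    source_snapshot_utc: str
    policy_decision_ids: tuple
    prompt_sha256: str

def build_replay_context(trace_id, prompt, model_version, ontology_version, snapshot_utc, policy_ids):
    """Capture version stamps plus a prompt digest instead of the raw prompt."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return ReplayContext(trace_id, model_version, ontology_version, snapshot_utc, tuple(policy_ids), digest)

ctx = build_replay_context(
    trace_id="trc-5005",
    prompt="Summarize Q4 margin drivers for retail.",
    model_version="sales-model@v12",
    ontology_version="retail-ontology@2026-04-01",
    snapshot_utc="2026-05-12T06:00:00+00:00",
    policy_ids=["rls-region-emea", "mask-customer-name"],
)
print(asdict(ctx))
```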

A simple replay workflow can expose where product design is weak.

# Python: Replay low-quality traces to identify data product design gaps
traces = [
    {"trace_id": "trc-1", "dataset_id": "fabric-ds-sales-001", "quality_score": 0.92},
    {"trace_id": "trc-2", "dataset_id": "fabric-ds-finance-007", "quality_score": 0.54},
    {"trace_id": "trc-3", "dataset_id": "fabric-ds-ops-003", "quality_score": 0.61},
]

for t in traces:
    if t["quality_score"] < 0.7:
        recommendation = {
            "trace_id": t["trace_id"],
            "dataset_id": t["dataset_id"],
            "action": "improve semantic model coverage or add entity-level lineage",
        }
        print(recommendation)

That is the operational loop that matters: not “the agent failed,” but “the product lacked the semantic or lineage structure needed for reliable agent use.”

And that leads directly to the next layer of accountability: semantics.

The semantic layer is now an accountability layer

This is where Fabric IQ and ontology matter most. Fabric IQ is designed to organize data across OneLake using business language for analytics, AI agents, and applications. Fabric ontology extends that with a shared governed business model across teams and systems.

Those are not just discoverability features. They are accountability infrastructure.

If business semantics are fragmented, no amount of downstream tracing will fully restore trust.

That is why semantic model sprawl becomes much more dangerous in the agent era. Duplicated metrics, overlapping ontologies, and inconsistent entity naming do more than confuse analysts. They make traces harder to interpret and ownership harder to assign. Two teams can both expose “gross margin,” but if the definitions diverge, a trace that references “GrossMargin” is not enough. You need governed semantic ownership and versioning behind that term.

Treat semantic governance as part of observability architecture:

assign owners to core business entities and measures
standardize entity IDs across Fabric assets where possible
record semantic model and ontology versions in traces
log policy checkpoints when access or masking changes the answer path
review semantic drift as an observability issue, not just a modeling issue
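
Reviewing drift can start very simply. The sketch below flags any measure name that carries more than one definition across models. The registry rows, model names, and definitions are invented for illustration, and comparing normalized text is a crude stand-in for a real semantic comparison.

```python
from collections import defaultdict

# Hypothetical measure registry; every value here is an illustrative assumption.
measures = [
    {"model": "sales-model", "measure": "GrossMargin", "definition": "(Revenue - COGS) / Revenue", "owner": "finance"},
    {"model": "ops-model", "measure": "GrossMargin", "definition": "(Revenue - COGS - Freight) / Revenue", "owner": "operations"},
    {"model": "sales-model", "measure": "NetRevenue", "definition": "Revenue - Returns", "owner": "finance"},
    {"model": "ops-model", "measure": "NetRevenue", "definition": "revenue -  returns", "owner": "finance"},
]

def find_semantic_drift(measures):
    """Return measure names whose definitions differ after whitespace and case normalization."""
    definitions = defaultdict(set)
    for row in measures:
        normalized = " ".join(row["definition"].split()).lower()
        definitions[row["measure"]].add(normalized)
    return sorted(name for name, variants in definitions.items() if len(variants) > 1)

drifted = find_semantic_drift(measures)
print(drifted)
```

Here "GrossMargin" is flagged because the two teams define it differently, while "NetRevenue" passes because its two entries differ only in spacing and case.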

If the semantic layer is the shared language agents use, it is also the layer through which leaders will ask, “Who owns this answer?”

The trade-offs leaders have to own

There are real tensions here.

Retention

Longer trace retention improves auditability, replay, and evaluation. It also increases storage cost and expands the blast radius of poor data handling.

Privacy

Richer traces help debugging, but they can capture sensitive prompts, row-level context, or user intent. Minimization, masking, and access control have to be designed in from day one.
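
A minimal sketch of minimization at write time, assuming a hypothetical `capture_prompt` setting: drop the prompt entirely when capture is off, and mask obvious identifiers when it is on. The two regular expressions are deliberately simple examples and are nowhere near complete detection of sensitive data.

```python
import re

# Simple illustrative patterns; real masking needs a far broader rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{6,}\b")

def minimize_trace(event: dict, capture_prompt: bool) -> dict:
    """Return a copy of the event with the prompt removed or masked."""
    cleaned = dict(event)
    if "prompt" not in cleaned:
        return cleaned
    if not capture_prompt:
        del cleaned["prompt"]
        return cleaned
    prompt = EMAIL.sub("[EMAIL]", cleaned["prompt"])
    cleaned["prompt"] = LONG_DIGITS.sub("[NUMBER]", prompt)
    return cleaned

event = {"trace_id": "trc-4004", "prompt": "Show margin for account 12345678 owned by ana@example.com"}
masked = minimize_trace(event, capture_prompt=True)
dropped = minimize_trace(event, capture_prompt=False)
print(masked)
print(dropped)
```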

Cost

Under-observability creates expensive blind spots. Over-instrumentation creates telemetry sprawl.

Performance and complexity

Every instrumentation choice adds request-path overhead, schema management, integration work, and operational burden.

So the recommendation is not maximal logging. It is decision-grade observability.

Capture enough information to support governance, debugging, replay, and executive accountability for the class of decisions the agent can influence. An exploratory internal copilot does not need the same standard as a workflow agent that can trigger actions.

# Python: Route observability tier decisions into data product design choices
def observability_tier(agent_criticality: str, contains_sensitive_data: bool) -> str:
    if agent_criticality == "high" or contains_sensitive_data:
        return "tier-3"
    if agent_criticality == "medium":
        return "tier-2"
    return "tier-1"

def design_controls(tier: str) -> dict:
    # Higher tiers keep traces longer and capture prompts. Captured prompts
    # should still pass through masking before they are stored.
    return {
        "tier-1": {"retention_days": 7, "capture_prompt": False, "capture_entities": True},
        "tier-2": {"retention_days": 30, "capture_prompt": True, "capture_entities": True},
        "tier-3": {"retention_days": 90, "capture_prompt": True, "capture_entities": True},
    }[tier]

tier = observability_tier(agent_criticality="high", contains_sensitive_data=False)
print({"tier": tier, "controls": design_controls(tier)})
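
Retention settings only matter if something enforces them. As a hedged sketch, the check below decides which stored traces have outlived their `retention_days`. The trace records and the fixed `now` value are made up for illustration; a real job would run against wherever the events actually land.

```python
from datetime import datetime, timedelta, timezone

def is_expired(timestamp_utc: str, retention_days: int, now: datetime) -> bool:
    """True when the trace is older than its retention window."""
    recorded = datetime.fromisoformat(timestamp_utc)
    return now - recorded > timedelta(days=retention_days)

# Fixed reference time so the example is deterministic.
now = datetime(2026, 5, 12, tzinfo=timezone.utc)

# Hypothetical stored traces, each stamped with the retention of its tier.
stored = [
    {"trace_id": "trc-a", "timestamp_utc": "2026-05-10T08:00:00+00:00", "retention_days": 7},
    {"trace_id": "trc-b", "timestamp_utc": "2026-04-01T08:00:00+00:00", "retention_days": 30},
    {"trace_id": "trc-c", "timestamp_utc": "2026-04-01T08:00:00+00:00", "retention_days": 90},
]

kept = [t["trace_id"] for t in stored if not is_expired(t["timestamp_utc"], t["retention_days"], now)]
purged = [t["trace_id"] for t in stored if is_expired(t["timestamp_utc"], t["retention_days"], now)]
print({"kept": kept, "purged": purged})
```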

A practical design stance for Fabric leaders

If I were setting the standard for agent-facing Fabric data products today, I would require this checklist before calling one ready:

Semantic ownership defined

Who owns the business entities, measures, and ontology terms?

Trace schema defined

Which IDs, entities, timings, and quality signals are emitted?

Lineage depth agreed

How far across Fabric and connected systems will provenance be captured?

Evaluation metrics selected

What counts as a good answer, a risky answer, or a failed answer?

Retention policy approved

How long are traces kept, and under what review model?

Privacy controls implemented

Which prompts or contexts are masked, sampled, or excluded?

Replay strategy documented

Can the team reproduce enough of the interaction path to investigate incidents?

Cross-functional ownership in place

Data platform, governance, security, and AI engineering must all sign off.
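
The checklist can also be kept as data, so readiness is a reviewable artifact and not a meeting outcome. This is a sketch with hypothetical item keys and an invented status record.

```python
# Checklist items as stable keys; the names are illustrative assumptions.
CHECKLIST = [
    "semantic_ownership_defined",
    "trace_schema_defined",
    "lineage_depth_agreed",
    "evaluation_metrics_selected",
    "retention_policy_approved",
    "privacy_controls_implemented",
    "replay_strategy_documented",
    "cross_functional_signoff",
]

def readiness_gaps(status: dict) -> list:
    """Return checklist items that are missing or not yet marked True."""
    return [item for item in CHECKLIST if not status.get(item, False)]

# Example status for one data product; values are made up.
status = {
    "semantic_ownership_defined": True,
    "trace_schema_defined": True,
    "lineage_depth_agreed": True,
    "evaluation_metrics_selected": False,
    "retention_policy_approved": True,
    "privacy_controls_implemented": True,
    "cross_functional_signoff": True,
}

gaps = readiness_gaps(status)
print({"ready": not gaps, "gaps": gaps})
```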

The strategic point is simple: agent-ready data products must be designed to explain themselves in operation.

Fabric’s direction is toward unified data, shared business semantics, and AI-assisted consumption. Once Fabric products sit inside larger agent architectures, observability has to span boundaries across Fabric and adjacent systems to preserve governance, cost control, and trust.

So here is the better question for practitioners:

What observability field is your team missing today: trace ID, semantic entity, policy decision, version stamp, or replay context? And if a bad answer happened this afternoon, could you actually replay it?

#MicrosoftFabric #AIAgents #DataArchitecture
