Agent Monitors

Overview

Agent Observability gives you full visibility into your AI agents, from the data that powers them to the prompts they receive and the outputs they generate. Instead of treating agents as black box systems, Monte Carlo provides a unified, end-to-end view across your data pipelines, models, tools, and agent workflows, so you can see exactly what happened, why it happened, and how to fix it.

Every agent run becomes a trace containing prompts, context, completions, token usage, latency, model metadata, errors, and workflow attributes. This level of detail enables teams to systematically evaluate output quality, detect silent failures, identify regressions, monitor cost and performance, and trace issues back to the upstream data or logic that shaped agent behavior.

With a warehouse-native architecture, all telemetry — including prompts and outputs — remains in your environment. Monte Carlo reads directly from your warehouse to provide the governance, security, and auditability enterprises expect, while giving you full visibility across diverse models, architectures, and workflows.

What Agent Observability unlocks

Agent Observability brings together data and AI in one connected view so you can deliver reliable, production-grade AI and detect issues fast.

  • Trusted, production-grade AI with measurable quality
  • Detection of subtle regressions, incomplete context, and behavioral drift before users are affected
  • Quality evaluation at scale using customizable LLM-as-judge templates or deterministic checks
  • Unified root-cause analysis across both data and AI layers
  • Faster debugging with trace-level visibility
  • Support for any model and any agent framework
  • Monitoring of agents alongside the pipelines and data that feed them

Supported Warehouses

Generally available:

  • Snowflake
  • Databricks

Coming soon:

  • BigQuery
  • Redshift

Agent telemetry ingestion pattern

  1. Agents emit OpenTelemetry (OTLP) traces via the Monte Carlo SDK.
  2. An OTLP collector receives and processes the traces.
  3. The collector writes telemetry to object storage and/or directly into your warehouse.
  4. Monte Carlo reads the warehouse telemetry table for monitoring, evaluation, and alerting.

Using the same warehouse that stores your operational data makes it easy to correlate agent behavior with lineage, data health, and pipeline integrity.
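
Because the telemetry sits in the same warehouse as your operational data, you can query it next to your other tables. The sketch below is a hedged illustration assuming Snowflake; the table name (AGENT_TELEMETRY.SPANS) and columns are placeholders for your own telemetry schema, not a Monte Carlo-defined layout.

  # Hypothetical example: summarize agent failures per day straight from the
  # warehouse so they can be compared with data incidents over the same window.
  # Table and column names are placeholders for your own telemetry schema.
  import snowflake.connector  # pip install snowflake-connector-python

  conn = snowflake.connector.connect(
      account="my_account", user="my_user", password="***", warehouse="ANALYTICS_WH"
  )
  query = """
      SELECT DATE_TRUNC('day', start_time)  AS day,
             COUNT(*)                       AS total_spans,
             COUNT_IF(status = 'ERROR')     AS failed_spans
      FROM AGENT_TELEMETRY.SPANS
      WHERE span_name = 'llm.completion'
      GROUP BY 1
      ORDER BY 1
  """
  for day, total, failed in conn.cursor().execute(query):
      print(day, total, failed)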

Instrumenting your agent

To begin collecting agent telemetry, configure the following components:

  1. Install the Monte Carlo OpenTelemetry SDK, set up tracing, and enrich your trace data with identifying attributes (a minimal sketch follows this list).
  2. Deploy an OpenTelemetry (OTLP) Collector, which receives and processes the spans emitted by your agent.
  3. Configure your warehouse to ingest AI agent traces.
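
The Monte Carlo SDK handles this wiring for you; the sketch below shows the equivalent setup in the plain OpenTelemetry Python SDK, purely to illustrate what emitting OTLP traces to a collector looks like. The endpoint, service name, span name, and attribute values are assumptions for illustration, not the Monte Carlo SDK's API.

  # Illustrative OpenTelemetry setup (plain OTel Python SDK, not the Monte Carlo
  # SDK API): create a tracer that exports OTLP spans to a local collector.
  from opentelemetry import trace
  from opentelemetry.sdk.resources import Resource
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor
  from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

  # Identifying attributes let the backend tell your agents and environments apart.
  resource = Resource.create({
      "service.name": "support-agent",          # assumed agent name
      "deployment.environment": "production",   # assumed environment label
  })
  provider = TracerProvider(resource=resource)
  provider.add_span_processor(
      BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
  )
  trace.set_tracer_provider(provider)

  tracer = trace.get_tracer("support-agent")
  with tracer.start_as_current_span("llm.completion") as span:
      span.set_attribute("gen_ai.request.model", "gpt-4o")  # model metadata on the span
      # ... call your model and record prompt/response details here ...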

Configuring Agent Monitors

Agent monitors allow you to measure output quality, reliability, and operational performance across agent runs. They function similarly to Metric Monitors but operate on spans within agent traces.

Agent monitors can be created by navigating to Add monitor → Agent. Configuration steps include:

Choose data

📘 If your agent does not appear for selection in the Choose data → Agent list, Monte Carlo has not yet detected telemetry flowing into your warehouse. Verify that your agent is properly instrumented.

  1. Agent: select an agent to monitor. The agent appears here once telemetry is flowing into your warehouse.
  2. Spans to monitor: select which spans within the agent trace you want the monitor to evaluate.
    • A specific LLM completion span — evaluates a single LLM call at a time.
    • A workflow or task grouping multiple spans — evaluates a group of spans that represent a multi-step workflow or task.
  3. Optional filters: refine which spans are included by filtering on:
    • Model name
    • Latency or token usage
    • Metadata or other custom attributes
  4. Group data: choose whether to bucket the data hourly or daily. Rows will be grouped according to their ingestion time.
  5. Segment data: select up to 5 fields to segment the data by, or compose a segment with a SQL expression. When segmenting, the monitor will track metrics grouped by the values in that field or SQL expression.
    If multiple fields are selected, the monitor will calculate metrics grouped by each distinct combination of values from those fields. The sketch after this list shows how such attributes can be recorded on your spans.
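
Filters and segments operate on whatever attributes your spans carry, so it helps to record them explicitly at instrumentation time. A minimal sketch, reusing the tracer from the instrumentation example above; the attribute names and values are illustrative, not a required schema.

  # Attributes recorded on a span become candidates for filtering and segmenting.
  # Assumes `tracer` from the earlier instrumentation sketch.
  with tracer.start_as_current_span("agent.workflow") as span:
      span.set_attribute("gen_ai.request.model", "gpt-4o")    # filter by model name
      span.set_attribute("gen_ai.usage.total_tokens", 1184)   # filter by token usage
      span.set_attribute("customer_tier", "enterprise")       # custom metadata segment
      span.set_attribute("workflow", "refund-approval")       # custom metadata segment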

Add evaluations

Evaluations measure agent behavior, output quality, and adherence to expected patterns so you can detect issues early. Two evaluation types are available: LLM-as-judge and SQL. You can combine them within the same monitor, mixing AI-based scoring with deterministic rules.

LLM-as-judge evaluator

Use built-in templates, or create your own, to evaluate the quality of agent outputs and determine whether a response is relevant and aligned with the prompt and context. Each evaluation produces a numeric score (for example, 1–5), which can be used directly in alert conditions.

Built-in templates include:

  • Task completion – assesses whether the agent successfully fulfilled the task or objective
  • Answer relevance – evaluates how well the response addresses the question or intent
  • Helpfulness – measures whether the response provides useful, actionable information
  • Clarity – checks if the response is easy to understand and well-communicated
  • Prompt adherence – determines how closely the response follows the given instructions
  • Language match – verifies that the response is in the expected language

Advanced capabilities:

  • Customizable templates to fit your use case
  • Choice of evaluator model
  • Previewing evaluations on sample spans
  • Selecting which segment to evaluate (e.g., first/last prompt or response)
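
To make the scoring mechanics concrete, here is a minimal sketch of an answer-relevance judge that returns a 1–5 score. The template wording, the call_llm callable, and the score parsing are illustrative placeholders, not Monte Carlo's built-in template or API.

  # Minimal LLM-as-judge sketch: grade a response on a 1-5 relevance scale.
  # `call_llm` is a placeholder for whatever evaluator model client you use.
  import re

  JUDGE_TEMPLATE = """You are grading an AI agent's response.
  Question: {prompt}
  Response: {response}
  Rate ANSWER RELEVANCE from 1 (off-topic) to 5 (fully addresses the question).
  Reply with the number only."""

  def judge_answer_relevance(prompt: str, response: str, call_llm) -> int:
      raw = call_llm(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
      match = re.search(r"[1-5]", raw)
      return int(match.group()) if match else 0  # 0 = could not parse a score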

SQL evaluator

SQL evaluations run deterministic checks directly against your telemetry table. Use them when success can be defined with clear, objective conditions — such as returning valid JSON or including required fields. These checks are cost-efficient and scale well for high-volume workloads.

They are well suited for checks like:

  • Verifying output length or word count
  • Confirming required keys are present in a response payload
  • Flagging forbidden or missing keywords
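
A sketch of what such a deterministic check might look like, expressed as Snowflake-flavored SQL embedded in Python. The table name, column names, and required key are assumptions about your telemetry schema, not Monte Carlo's evaluator syntax.

  # Hypothetical deterministic check: is the agent's output valid JSON, does it
  # contain a required key, and how long is it? Schema names are placeholders.
  REQUIRED_KEY_CHECK = """
      SELECT span_id,
             TRY_PARSE_JSON(output_text) IS NOT NULL            AS is_valid_json,
             TRY_PARSE_JSON(output_text):order_id IS NOT NULL   AS has_order_id,
             ARRAY_SIZE(SPLIT(output_text, ' '))                AS word_count
      FROM AGENT_TELEMETRY.SPANS
      WHERE span_name = 'llm.completion'
  """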

Configure sampling

Sampling is applied before evaluations run, giving you control over how many spans are evaluated while still maintaining useful coverage. This is particularly important for keeping LLM-based checks cost-efficient.

You can apply percentage sampling, set hard caps (for example, up to 200 spans per hour), or use a combination of percentage and caps for predictable costs.
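
For intuition, the sketch below combines both strategies: a deterministic percentage sample plus a hard hourly cap. It illustrates the idea only and is not Monte Carlo's sampling implementation.

  # Illustrative sampling decision: percentage sampling combined with a hard cap.
  import hashlib

  def should_evaluate(span_id: str, sample_pct: float,
                      evaluated_this_hour: int, hourly_cap: int) -> bool:
      if evaluated_this_hour >= hourly_cap:   # hard cap, e.g. 200 spans per hour
          return False
      # Hash-based sampling so the same span always gets the same decision.
      bucket = int(hashlib.sha256(span_id.encode()).hexdigest(), 16) % 100
      return bucket < sample_pct * 100        # e.g. sample_pct = 0.10 for 10%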

Define alert conditions

Alert conditions determine when Monte Carlo should generate a new Alert based on the evaluations and span attributes you’ve selected.

To define a condition, choose the metric or field you want to monitor (for example, an evaluation score, latency, or token usage) and specify the operator and threshold that should trigger an Alert. Conditions can be based on outputs (LLM-as-judge or SQL evaluations) or operational attributes captured during the agent run.

Once you select a field, you can either target a specific attribute or choose All supported fields. If All supported fields is selected, only automated thresholds are available.
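
Conceptually, each condition is a field, an operator, and a threshold evaluated against span-level data. The sketch below models that shape with illustrative field names; it is not Monte Carlo's configuration format.

  # Illustrative model of an alert condition: field + operator + threshold.
  from dataclasses import dataclass
  from operator import gt, lt
  from typing import Callable

  @dataclass
  class AlertCondition:
      field: str                           # an evaluation score, latency, token usage, ...
      op: Callable[[float, float], bool]   # comparison operator
      threshold: float

      def breached(self, span: dict) -> bool:
          return self.field in span and self.op(span[self.field], self.threshold)

  conditions = [
      AlertCondition("answer_relevance_score", lt, 3),   # quality-based condition
      AlertCondition("latency_ms", gt, 10_000),          # operational condition
  ]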

Operational alerts

Monitor the health and stability of your agent so you can spot reliability issues early — even when outputs appear correct on the surface.

Operational alerts highlight runtime conditions that can impact agent reliability — for example, latency increases, elevated error rates, or unusual cost patterns. Using the span attributes recorded in your telemetry, these alerts make it easy to spot issues such as recurring SLA breaches, clusters of failures, or abnormal token usage.

These signals are useful for identifying problems that impact agent reliability, including:

  • Failures in external tools or upstream APIs
  • Execution slowdowns within specific workflows
  • Resource usage patterns that suggest instability

Operational alerts complement quality-based conditions by ensuring your agents are running in a healthy, predictable environment.
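
As a rough illustration of the signals involved, the sketch below rolls span attributes up into error rate, tail latency, and token totals. The field names are assumptions about what your spans record, not a Monte Carlo schema.

  # Illustrative roll-up of span attributes into operational health signals.
  from statistics import quantiles

  def operational_signals(spans: list[dict]) -> dict:
      errors = sum(1 for s in spans if s.get("status") == "ERROR")
      latencies = [s["latency_ms"] for s in spans if "latency_ms" in s]
      return {
          "error_rate": errors / len(spans) if spans else 0.0,
          "p95_latency_ms": quantiles(latencies, n=20)[-1] if len(latencies) >= 2 else None,
          "total_tokens": sum(s.get("total_tokens", 0) for s in spans),
      }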

Anomaly detection

Agent monitors support automated thresholds, which are powered by machine learning.

When creating an alert condition, select the is anomalous operator to have Monte Carlo detect anomalous deviations in metrics, such as:

  • Variability in evaluation scores that suggests regressions
  • Unexpected increases in word count
  • Shifts in token usage patterns
  • Emerging drift in agent behavior over time
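
Automated thresholds are learned by Monte Carlo; as a simplified stand-in for the idea, the sketch below flags a daily metric value that deviates sharply from its recent history using a z-score.

  # Simple z-score check as an illustrative stand-in for ML-based thresholds.
  from statistics import mean, stdev

  def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
      if len(history) < 7:
          return False                      # not enough history to judge
      sigma = stdev(history)
      if sigma == 0:
          return False                      # flat baseline; nothing to compare against
      return abs(latest - mean(history)) / sigma > z_threshold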

Send notifications

Alerts can be routed to all existing notification channels already configured in Monte Carlo, so they fit naturally into your existing incident response workflows.

Select which audiences should receive notifications when an anomaly is detected.

Text in the Notes section will be included directly in notifications. The "Show notes tips" dropdown includes details on how to @mention an individual or team if you are sending notifications to Slack.

Additional settings let you customize the monitor's description, pre-set a priority for any Alerts the monitor generates, or turn off failure notifications.