Agent Evaluation Monitors

Agent Evaluation Monitors measure output quality and correctness by alerting when agent responses contain hallucinations, fail accuracy checks, or don't meet quality standards. Use templated or custom LLM-as-judge evaluations alongside deterministic SQL checks to systematically validate agent outputs at scale. They function similarly to Metric Monitors but operate on individual spans or entire conversations within agent traces.

Creating Agent Evaluation Monitors

Agent evaluation monitors can be created by navigating to Add monitor → Agent evaluation.

Choose data

If your agent does not appear for selection in the Choose data → Agent list, Monte Carlo has not yet detected telemetry flowing into your warehouse. Verify that your agent is properly instrumented and that you've configured the agent trace table from the Agent Observability settings page.

  1. Agent: Select an agent to monitor. The agent appears here once telemetry is flowing into your warehouse.
  2. Monitor: Choose the scope the monitor evaluates:
    • Specific spans: evaluate individual spans within an agent trace.
      • A specific LLM completion span: evaluates a single LLM call at a time.
      • A workflow or task grouping multiple spans: evaluates a group of spans that represent a multistep workflow or task.
    • Entire conversations: evaluate complete multi-turn conversations end to end. Each conversation is scored as a whole using the full transcript across all of its turns, rather than scoring spans individually. Use this to assess outcomes that only emerge over a full exchange, such as whether the agent ultimately completed the user's task. This option is available for agents whose trace data supports conversation-level evaluation, and is disabled for agents that don't.
  3. Optional filters: Refine which spans or conversations are included.
    • For specific spans, filter on:
      • Model name
      • Latency or token usage
      • Metadata or other custom attributes
    • For entire conversations, filter on conversation properties such as workflow, number of turns, status, and duration, along with metadata and other custom attributes. When filtering on a per-turn attribute, a conversation is included only when every turn matches the filter.
  4. Group data: Choose whether to bucket the data hourly or daily.
    • For entire conversations, data is grouped by each conversation's last turn time. You can also set how long to wait before scoring with ignore the last N hours. Conversations with more recent activity are treated as still in progress and aren't scored until they've been idle for that period.
  5. Segment data: Select up to 5 fields to segment the data by, or compose one with a SQL expression. When segmenting, the monitor will track metrics grouped by the values in that field or SQL expression. If multiple fields are selected, the monitor will calculate metrics grouped by each distinct combination of values from those multiple fields.
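The selection rules above (per-turn filters, the idle window, bucketing by last turn time, and segmentation) can be sketched in a few lines of Python. This is only an illustration of the described behavior, not Monte Carlo's implementation: the `Turn` and `Conversation` classes, the `model` and `workflow` fields, and both function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Turn:
    model: str           # hypothetical per-turn attribute
    ended_at: datetime


@dataclass
class Conversation:
    conversation_id: str
    workflow: str        # hypothetical conversation-level attribute
    turns: list


def eligible(conv, now, model, ignore_last_hours):
    """Apply a per-turn filter and the idle window to one conversation."""
    # Per-turn filter: every turn must match, not just one of them.
    if not all(t.model == model for t in conv.turns):
        return False
    # Idle window: recent activity means the conversation may still be in progress.
    last_turn = max(t.ended_at for t in conv.turns)
    return now - last_turn >= timedelta(hours=ignore_last_hours)


def bucket_and_segment(convs, segment_fields):
    """Group by the hour of the last turn and by each distinct combination of segment values."""
    groups = defaultdict(list)
    for conv in convs:
        last_turn = max(t.ended_at for t in conv.turns)
        bucket = last_turn.replace(minute=0, second=0, microsecond=0)
        segment = tuple(getattr(conv, name) for name in segment_fields)
        groups[(bucket, segment)].append(conv.conversation_id)
    return dict(groups)


now = datetime(2024, 1, 1, 12, 0)
convs = [
    Conversation("a", "support", [Turn("m1", datetime(2024, 1, 1, 8, 5)),
                                  Turn("m1", datetime(2024, 1, 1, 8, 40))]),
    # Mixed models: excluded, because one turn fails the per-turn filter.
    Conversation("b", "support", [Turn("m1", datetime(2024, 1, 1, 8, 10)),
                                  Turn("m2", datetime(2024, 1, 1, 8, 20))]),
    # Last activity 30 minutes ago: treated as still in progress.
    Conversation("c", "billing", [Turn("m1", datetime(2024, 1, 1, 11, 30))]),
]
ready = [c for c in convs if eligible(c, now, model="m1", ignore_last_hours=2)]
groups = bucket_and_segment(ready, ["workflow"])
```

Only conversation `a` is scored here, and it lands in the 08:00 bucket because its last turn ended at 08:40. Passing more than one name in `segment_fields` would produce one group per distinct combination of values.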

Add evaluations

Evaluations measure agent behavior, output quality, and adherence to expected patterns. Agent Evaluation Monitors support two evaluation types: LLM-as-judge and SQL. You can combine both types within the same monitor, mixing AI-based scoring with deterministic rules.

LLM-as-judge evaluator

Use built-in templates, or create your own, to evaluate the quality of agent outputs and determine whether a response is relevant and aligned with the prompt and context. Each evaluation produces a numeric score (for example, 1–5), which can be used directly in alert conditions.

Built-in templates include:

  • Task completion – Assesses whether the agent successfully fulfilled the task or objective
  • Answer relevance – Evaluates how well the response addresses the question or intent
  • Helpfulness – Measures whether the response provides useful, actionable information
  • Clarity – Checks if the response is easy to understand and well-communicated
  • Prompt adherence – Determines how closely the response follows the given instructions
  • Language match – Verifies that the response is in the expected language

Advanced capabilities:

  • Customizable templates to fit your use case
  • Choice of evaluator model
  • Previewing evaluations on sample spans
  • Selecting which segment to evaluate (e.g., first/last prompt or response)

When monitoring entire conversations, evaluations score the full conversation transcript rather than a single span. The transcript is rendered with # Turn N markers and USER: / ASSISTANT: / TOOL_CALL prefixes so the evaluator can assess the agent's trajectory across the whole exchange. Conversation-level prompts reference the transcript through the {{conversation}} variable. The same built-in templates listed above are available, scored across the entire conversation.
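To make the transcript format more concrete, here is a minimal Python sketch of how a conversation might be rendered and substituted into a prompt. The exact layout Monte Carlo produces is not specified beyond the markers and prefixes named above, so the line layout, the colon after TOOL_CALL, and the helper names `render_transcript` and `fill_prompt` are assumptions.

```python
def render_transcript(turns):
    """Render turns as text with '# Turn N' markers and role prefixes."""
    lines = []
    for number, turn in enumerate(turns, start=1):
        lines.append(f"# Turn {number}")
        for role, text in turn:
            # role is one of USER, ASSISTANT, TOOL_CALL (layout is assumed)
            lines.append(f"{role}: {text}")
    return "\n".join(lines)


def fill_prompt(template, transcript):
    """Substitute the {{conversation}} variable in a conversation-level prompt."""
    return template.replace("{{conversation}}", transcript)


turns = [
    [("USER", "Where is my order?"),
     ("TOOL_CALL", "lookup_order(order_id=123)"),
     ("ASSISTANT", "It ships tomorrow.")],
    [("USER", "Thanks!"),
     ("ASSISTANT", "You're welcome.")],
]
transcript = render_transcript(turns)
template = "Rate task completion from 1 to 5.\n\n{{conversation}}"
prompt = fill_prompt(template, transcript)
print(prompt)
```

A custom conversation-level template would place `{{conversation}}` wherever the evaluator should read the full exchange.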

SQL evaluator

SQL evaluations run deterministic checks directly against your telemetry table. Use them when success can be defined with clear, objective conditions, such as returning valid JSON or including required fields. These checks are cost-efficient and scale well for high-volume workloads.

SQL evaluations are well suited for checks like:

  • Verifying output length or word count
  • Confirming required keys are present in a response payload
  • Flagging forbidden or missing keywords
  • Validating JSON structure
  • Checking for specific patterns or formats
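A few of these checks can be sketched as a single SQL query. The example below runs the query through Python's built-in `sqlite3` module purely so it is self-contained; in practice the SQL would run against your warehouse's telemetry table. The `spans` table, its `span_id` and `output` columns, the 50-character limit, and the forbidden phrase are all hypothetical, and the substring test for `"answer"` is a crude stand-in for the JSON functions your warehouse likely provides.

```python
import sqlite3

# In-memory stand-in for a telemetry table (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (span_id TEXT, output TEXT)")
conn.executemany(
    "INSERT INTO spans VALUES (?, ?)",
    [
        ("s1", '{"answer": "Paris", "source": "atlas"}'),
        ("s2", "I cannot help with that."),
        ("s3", '{"answer": "As an AI language model, I think it is Rome"}'),
    ],
)

query = """
SELECT
    span_id,
    -- output length check
    LENGTH(output) <= 50 AS within_length,
    -- required key present (substring check only)
    INSTR(output, '"answer"') > 0 AS has_answer_key,
    -- forbidden phrase absent, case-insensitive
    INSTR(LOWER(output), 'as an ai language model') = 0 AS no_forbidden_phrase
FROM spans
ORDER BY span_id
"""

# Each check evaluates to 1 (pass) or 0 (fail) per span.
results = {row[0]: row[1:] for row in conn.execute(query)}
for span_id, checks in results.items():
    print(span_id, checks)
```

Because each check yields a deterministic 0 or 1 per span, the results can be aggregated and compared against a threshold in the same way as an LLM-as-judge score.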

Configure sampling

Sampling is applied before evaluations run, so you control how many spans or conversations are evaluated while still getting useful coverage. This is particularly important for LLM-based checks.

The evaluation sample size can be configured in one of two ways:

  • Set a hard cap – Limit the maximum number of spans evaluated per time bucket (e.g., up to 200 spans per run)
  • Set a percentage with a cap – Evaluate a percentage of all spans with an upper limit (e.g., 30% of spans, capped at 1,000 per run)

When monitoring entire conversations, sampling limits the number of conversations evaluated per run instead of spans. Set a count of conversations or a percentage of conversations per run.
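The two sampling modes reduce to simple arithmetic, sketched below in Python. The function names are hypothetical, and two details are assumptions the text does not specify: that a percentage is rounded up, and that items are picked at random.

```python
import math
import random


def sample_size(total, cap, percentage=None):
    """Number of spans or conversations to evaluate in one run."""
    if percentage is None:
        # Hard cap: evaluate everything up to the cap.
        return min(total, cap)
    # Percentage with a cap (rounding up is an assumption).
    wanted = math.ceil(total * percentage / 100)
    return min(wanted, cap)


def draw_sample(items, cap, percentage=None, seed=0):
    """Pick the sample uniformly at random (selection method is an assumption)."""
    k = sample_size(len(items), cap, percentage)
    return random.Random(seed).sample(items, k)


print(sample_size(150, cap=200))                      # fewer spans than the cap
print(sample_size(5000, cap=200))                     # cap applies
print(sample_size(1000, cap=1000, percentage=30))     # percentage applies
print(sample_size(10000, cap=1000, percentage=30))    # cap overrides percentage
```

With 30% and a cap of 1,000, the percentage governs until a run contains more than about 3,333 items, after which the cap takes over.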

Define alert conditions

Alert conditions determine when Monte Carlo should generate a new Alert based on the evaluations you've configured.

To define a condition, choose the metric and the evaluation output field to monitor (an evaluation score from LLM-as-judge or SQL). Then specify the operator and threshold that should trigger an Alert. You can use manual thresholds or automated machine learning-based thresholds.
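Conceptually, a manual-threshold condition aggregates an evaluation output field into a metric and compares it to a threshold. The Python sketch below illustrates that idea only; the metric names, operator symbols, and the `breaches` function are hypothetical and do not reflect the options the monitor UI actually offers.

```python
import operator
from statistics import mean

# Hypothetical operator and metric catalogs for illustration.
OPERATORS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
METRICS = {"mean": mean, "min": min, "max": max}


def breaches(scores, metric, op, threshold):
    """Return True when the aggregated score satisfies the alert condition."""
    value = METRICS[metric](scores)
    return OPERATORS[op](value, threshold)


# Scores from one evaluation (for example, helpfulness on a 1-5 scale) in one bucket.
scores = [4, 5, 2, 3, 1]
alert_on_mean = breaches(scores, "mean", "<", 3.5)   # mean is 3, below 3.5
alert_on_min = breaches(scores, "min", "<", 1)       # min is 1, not below 1
```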

Anomaly detection

Agent Evaluation Monitors support automated thresholds powered by machine learning.

When creating an alert condition, select the is anomalous operator to have Monte Carlo detect anomalous deviations in evaluation metrics, such as:

  • Unexpected variability in evaluation scores that suggests regressions
  • Sudden drops in quality metrics
  • Emerging drift in agent behavior over time
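The model behind the is anomalous operator is not described here, so the following is not Monte Carlo's implementation. As rough intuition for what an automated threshold does, this Python sketch uses a plain z-score baseline: it flags a new bucket whose average score sits far from the recent history. The function name and the default of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` when it lies more than z_threshold standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Daily average evaluation scores for previous buckets.
history = [4.2, 4.1, 4.3, 4.2, 4.0, 4.1]
sudden_drop = is_anomalous(history, 2.9)   # far below the usual range
normal_day = is_anomalous(history, 4.2)    # within the usual range
```

A production model would typically also account for trend and seasonality, which a fixed z-score does not.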

Define schedule

Select when the monitor should run. Agent evaluation monitors support the following options:

  • On a schedule: Input a regular, periodic schedule. Options for handling daylight saving time are available in the advanced dropdown.
  • Manual trigger: Run the monitor manually from the monitor details page using the Run button, or programmatically via the runMonitor API call.

Send notifications

Alerts can be routed to all existing notification channels already configured in Monte Carlo, so they fit naturally into your existing incident response workflows.

Select which audiences should receive notifications when an evaluation-based alert is triggered.

Define Notes

Text in the Notes section will be included directly in Alert notifications. The "Show notes tips" dropdown includes details on how to @mention an individual or team if you are sending notifications to Slack.

Notes support rich-text formatting, including bold, italic, underline, strike-through, lists, links, and code blocks. Rich-text channels display these styles, while text-only channels show a plain-text equivalent.

Monitor properties can be dynamically inserted into Notes through variables. Supported variables include Created by, Last updated at, Last updated by, Priority, and Tags.

Additional settings exist for customizing the description of the monitor, pre-setting a priority on any Alerts generated by the monitor, or for turning off failure notifications.