Agent Evaluation Monitors

Agent Evaluation Monitors measure output quality and correctness by alerting when agent responses contain hallucinations, fail accuracy checks, or don't meet quality standards. Use templated or custom LLM-as-judge evaluations alongside deterministic SQL checks to systematically validate agent outputs at scale. They function similarly to Metric Monitors but operate on spans within agent traces.

Creating Agent Evaluation Monitors

To create an Agent Evaluation Monitor, navigate to Add monitor → Agent evaluation.

Choose data

📘

If your agent does not appear for selection in the Choose data → Agent list, Monte Carlo has not yet detected telemetry flowing into your warehouse. Verify that your agent is properly instrumented and that you've configured the agent trace table from the Agent Observability settings page.

  1. Agent: Select an agent to monitor. The agent appears here once telemetry is flowing into your warehouse.
  2. Spans to monitor: Select which spans within the agent trace you want the monitor to evaluate.
    • A specific LLM completion span – evaluates a single LLM call at a time.
    • A workflow or task grouping multiple spans – evaluates a group of spans that represent a multistep workflow or task.
  3. Optional filters: Refine which spans are included by filtering on:
    • Model name
    • Latency or token usage
    • Metadata or other custom attributes
  4. Group data: Choose whether to bucket the data hourly or daily.
  5. Segment data: Select up to 5 fields to segment the data by, or compose a segment with a SQL expression. The monitor tracks metrics grouped by the values of that field or SQL expression; if multiple fields are selected, metrics are calculated for each distinct combination of their values (see the sketch after this list).
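
As an illustration, a segmentation SQL expression can derive coarser buckets than the raw field values. The sketch below is hypothetical: it assumes the agent trace table exposes a model_name column, which may differ in your warehouse.

```sql
-- Hypothetical segmentation expression: group spans into model families.
-- The model_name column is an assumption; use the fields available in your
-- agent trace table.
CASE
  WHEN model_name LIKE 'gpt-4%'   THEN 'gpt-4 family'
  WHEN model_name LIKE 'claude-%' THEN 'claude family'
  ELSE 'other'
END
```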

Add evaluations

Evaluations measure agent behavior, output quality, and adherence to expected patterns. Agent Evaluation Monitors support two evaluation types: LLM-as-judge and SQL. You can combine both types within the same monitor, mixing AI-based scoring with deterministic rules.

LLM-as-judge evaluator

Use built-in templates, or create your own, to evaluate the quality of agent outputs and determine whether a response is relevant and aligned with the prompt and context. Each evaluation produces a numeric score (for example, 1–5), which can be used directly in alert conditions.

Built-in templates include:

  • Task completion – Assesses whether the agent successfully fulfilled the task or objective
  • Answer relevance – Evaluates how well the response addresses the question or intent
  • Helpfulness – Measures whether the response provides useful, actionable information
  • Clarity – Checks if the response is easy to understand and well-communicated
  • Prompt adherence – Determines how closely the response follows the given instructions
  • Language match – Verifies that the response is in the expected language

Advanced capabilities:

  • Customizable templates to fit your use case
  • Choice of evaluator model
  • Previewing evaluations on sample spans
  • Selecting which segment to evaluate (e.g., first/last prompt or response)

SQL evaluator

SQL evaluations run deterministic checks directly against your telemetry table. Use them when success can be defined with clear, objective conditions — such as returning valid JSON or including required fields. These checks are cost-efficient and scale well for high-volume workloads.

SQL evaluations are well suited for checks like the following (a sketch follows the list):

  • Verifying output length or word count
  • Confirming required keys are present in a response payload
  • Flagging forbidden or missing keywords
  • Validating JSON structure
  • Checking for specific patterns or formats
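
For instance, a check that a response is valid JSON and contains a required key could be written as the expression below, which returns 1 (pass) or 0 (fail) per span so the result can feed an alert condition. This is a sketch only: the response column name is an assumption, and TRY_PARSE_JSON is Snowflake syntax, so adjust it for your telemetry table and warehouse dialect.

```sql
-- Hypothetical pass/fail check: the response parses as JSON and contains a
-- non-null "answer" key. Column name and JSON functions are assumptions.
CASE
  WHEN TRY_PARSE_JSON(response) IS NOT NULL
       AND TRY_PARSE_JSON(response):answer IS NOT NULL
  THEN 1
  ELSE 0
END
```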

Configure sampling

Sampling is applied before evaluations run, giving you control over how many spans are evaluated while still providing useful coverage. This is particularly important for LLM-based checks.

The evaluation sample size can be configured in one of two ways (illustrated after this list):

  • Set a hard cap – Limit the maximum number of spans evaluated per time bucket (e.g., up to 200 spans per run)
  • Set a percentage with a cap – Evaluate a percentage of all spans with an upper limit (e.g., 30% of spans, capped at 1,000 per run)
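
To make the percentage-with-a-cap semantics concrete, the query below sketches what "30% of spans, capped at 1,000 per run" means for a single time bucket. It is an illustration only, not how Monte Carlo implements sampling, and the table and column names are hypothetical.

```sql
-- Illustration of "30% of spans, capped at 1,000 per run" for one daily bucket.
-- agent_trace_spans and span_start are hypothetical names.
WITH bucket AS (
  SELECT *,
         ROW_NUMBER() OVER (ORDER BY RANDOM()) AS rn,
         COUNT(*)    OVER ()                   AS total_spans
  FROM agent_trace_spans
  WHERE span_start >= TIMESTAMP '2025-01-01 00:00:00'
    AND span_start <  TIMESTAMP '2025-01-02 00:00:00'
)
SELECT *
FROM bucket
WHERE rn <= LEAST(1000, CEIL(0.30 * total_spans));
```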

Define alert conditions

Alert conditions determine when Monte Carlo should generate a new Alert based on the evaluations you've configured.

To define a condition, choose the metric and the evaluation output field to monitor (an evaluation score from an LLM-as-judge or SQL evaluation), then specify the operator and threshold that should trigger an Alert; for example, alert when the Task completion score drops below 3. You can use manual thresholds or automated, machine learning-based thresholds.

Anomaly detection

Agent Evaluation Monitors support automated thresholds powered by machine learning.

When creating an alert condition, select the is anomalous operator to have Monte Carlo detect anomalous deviations in evaluation metrics, such as:

  • Unexpected variability in evaluation scores that suggests regressions
  • Sudden drops in quality metrics
  • Emerging drift in agent behavior over time

Define schedule

Select when the monitor should run. Two scheduling options are available:

  • On a schedule: Define a regular, periodic schedule. Options for handling daylight saving time are available in the advanced dropdown.
  • Manual trigger: Run the monitor manually from the monitor details page using the Run button, or programmatically via the runMonitor API call.

Send notifications

Alerts can be routed to any notification channel already configured in Monte Carlo, so they fit naturally into your existing incident response workflows.

Select which audiences should receive notifications when an evaluation-based alert is triggered.

Define Notes

Text in the Notes section will be included directly in Alert notifications. The "Show notes tips" dropdown includes details on how to @mention an individual or team if you are sending notifications to Slack.

Notes support rich-text formatting, including bold, italic, underline, strike-through, lists, links, and code blocks. Rich-text channels display these styles, while text-only channels show a plain-text equivalent.

Monitor properties can be dynamically inserted into Notes through variables. Supported variables include Created by, Last updated at, Last updated by, Priority, and Tags.

Additional settings let you customize the monitor's description, pre-set a priority on any Alerts it generates, or turn off failure notifications.