Agent Evaluation Monitor

Score agent outputs with LLM-as-judge evaluations using Monitors as Code.

Overview

Agent evaluation monitors score an agent's outputs (for example answer relevance, prompt adherence, or hallucination checks) using LLM-as-judge transforms, then alert on the resulting scores. An evaluation monitor is an Agent Metric Monitor with two evaluation-only fields: transforms (the eval prompts) and sampling_config (required; how many spans to score per run).

Reference scope
This page covers MaC YAML configuration. For how agent monitors work, see Agent Monitors Overview. Alert conditions and the available metrics follow the Metric Monitor reference.

MaC key: agent_evaluation.

Quick Start

montecarlo:
  agent_evaluation:
    - name: cortex_prompt_adherence
      description: Prompt-adherence eval on the Cortex agent
      agent: "ANALYTICS:AGENTS.support_cortex_agent"
      transforms:
        - function: custom_prompt
          alias: prompt_adherence
          output_type: number
          prompt: "Adheres to the system prompt? {{prompts}} -> {{completions}}"
      sampling_config:
        count: 1000
      alert_conditions:
        - metric: NULL_RATE
          fields: [prompt_adherence]
          operator: AUTO
      schedule:
        type: fixed
        interval_minutes: 60
        start_time: "2025-01-01T00:00:00+00:00"
      aggregate_by: HOUR

Configuration

Evaluation monitors accept every Agent Metric Monitor field — agent, trace_table, alert_conditions, agent_span_filters, aggregate_by, is_agent_trace_aggregation / is_agent_conversation_aggregation, sensitivity, schedule, and the common fields — plus the two below.

transforms — evaluation prompts

array of objects · optional

LLM-as-judge evaluation prompts, declared at the top level alongside agent. Each transform produces a named score (its alias) that you can then alert on via alert_conditions.fields.

transforms:
  - function: custom_prompt
    alias: prompt_adherence
    output_type: number
    prompt: "Adheres to the system prompt? {{prompts}} -> {{completions}}"
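
Since transforms is an array, one monitor can plausibly carry more than one evaluation prompt. The sketch below pairs the prompt_adherence transform with the answer_relevance prompt used in the OpenTelemetry example on this page, and lists both aliases in a single alert condition. Whether one condition accepts several transform aliases in fields is an assumption here, not something this page confirms; if it does not, declare one condition per alias.

```yaml
# Sketch only: two judge prompts on one monitor, each scored under its own alias.
transforms:
  - function: custom_prompt
    alias: prompt_adherence
    output_type: number
    prompt: "Adheres to the system prompt? {{prompts}} -> {{completions}}"
  - function: custom_prompt
    alias: answer_relevance
    output_type: number
    prompt: "Is the answer relevant to the question? {{prompts}} -> {{completions}}"
alert_conditions:
  - metric: NULL_RATE
    # Assumption: fields may list more than one transform alias.
    fields: [prompt_adherence, answer_relevance]
    operator: AUTO
```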

sampling_config — how many spans to evaluate

object · required

How many spans an evaluation samples per run, controlling LLM-judge cost. Required on evaluation monitors.

sampling_config:
  count: 1000
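
To make the cost implication concrete, here is a rough worked example that combines count with the hourly schedule used elsewhere on this page. The arithmetic assumes every run samples the full count and that the judge is invoked once per sampled span per transform; this page does not state either point explicitly, so treat the totals as an upper-bound estimate.

```yaml
sampling_config:
  count: 1000            # up to 1000 spans scored per run
schedule:
  type: fixed
  interval_minutes: 60   # one run per hour, so 24 runs per day
  start_time: "2025-01-01T00:00:00+00:00"
# Estimated upper bound per transform: 1000 spans x 24 runs = 24,000 scored spans per day.
# Lowering count to 500 (as in the OpenTelemetry example) halves that to 12,000.
```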

For all other fields, see the Agent Metric Monitor reference.

Examples

Platform agent evaluation with transforms

An evaluation on a platform agent: a custom_prompt transform produces a prompt_adherence score, which the alert condition then monitors. sampling_config caps the spans scored per run.

montecarlo:
  agent_evaluation:
    - name: cortex_prompt_adherence
      description: Prompt-adherence eval on the Cortex agent
      agent: "ANALYTICS:AGENTS.support_cortex_agent"
      transforms:
        - function: custom_prompt
          alias: prompt_adherence
          output_type: number
          prompt: "Adheres to the system prompt? {{prompts}} -> {{completions}}"
      sampling_config:
        count: 1000
      alert_conditions:
        - metric: NULL_RATE
          fields: [prompt_adherence]
          operator: AUTO
      schedule:
        type: fixed
        interval_minutes: 60
        start_time: "2025-01-01T00:00:00+00:00"
      aggregate_by: HOUR

OpenTelemetry agent evaluation

An answer-relevance eval against an OpenTelemetry agent, addressed by its service_name. The monitor reads from the default trace store, so trace_table is omitted.

montecarlo:
  agent_evaluation:
    - name: otel_answer_relevance
      description: Answer-relevance eval on an OTel agent
      agent: my-otel-agent # the agent's service_name
      transforms:
        - function: custom_prompt
          alias: answer_relevance
          output_type: number
          prompt: "Is the answer relevant to the question? {{prompts}} -> {{completions}}"
      sampling_config:
        count: 500
      alert_conditions:
        - metric: NULL_RATE
          fields: [answer_relevance]
          operator: AUTO
      schedule:
        type: fixed
        interval_minutes: 60
        start_time: "2025-01-01T00:00:00+00:00"
      aggregate_by: HOUR

Troubleshooting

Evaluation monitors share the agent-source and trace-table behavior of the Agent Metric Monitor. Evaluation-specific points:

Alerting on a transform score. Reference a transform's alias in alert_conditions.fields to alert on its score.
Setting both aggregation flags. is_agent_trace_aggregation and is_agent_conversation_aggregation are mutually exclusive — set at most one.
Forgetting PUT semantics on updates. Updating a monitor replaces its full configuration — omitted fields revert to defaults, including transforms, sampling_config, and the aggregation flags. Always specify the complete desired configuration.
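
As an illustration of the full-replacement behavior, suppose the only intended change to the cortex_prompt_adherence monitor is lowering count from 1000 to 250. A minimal sketch of the updated definition restates every field, not just the changed one; the value 250 is arbitrary and chosen only for the example.

```yaml
montecarlo:
  agent_evaluation:
    - name: cortex_prompt_adherence
      description: Prompt-adherence eval on the Cortex agent
      agent: "ANALYTICS:AGENTS.support_cortex_agent"
      # Restated unchanged: omitting transforms would reset them to the default.
      transforms:
        - function: custom_prompt
          alias: prompt_adherence
          output_type: number
          prompt: "Adheres to the system prompt? {{prompts}} -> {{completions}}"
      sampling_config:
        count: 250   # the only value that actually changes (was 1000)
      alert_conditions:
        - metric: NULL_RATE
          fields: [prompt_adherence]
          operator: AUTO
      schedule:
        type: fixed
        interval_minutes: 60
        start_time: "2025-01-01T00:00:00+00:00"
      aggregate_by: HOUR
      # If the monitor sets an aggregation flag, restate it here as well.
```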

Updated 17 days ago
