Agent Evaluation Monitor
Overview
Agent evaluation monitors score an agent's outputs β answer relevance, prompt adherence, hallucination checks β using LLM-as-judge transforms, then alert on the resulting scores. They are an Agent Metric Monitor plus two evaluation-only fields: transforms (the eval prompts) and sampling_config (required β how many spans to score).
Reference scopeThis page covers MaC YAML configuration. For how agent monitors work, see Agent Monitors Overview. Alert conditions and the available metrics follow the Metric Monitor reference.
MaC key: agent_evaluation.
Quick Start
montecarlo:
agent_evaluation:
- name: cortex_answer_relevance
description: Answer-relevance eval on the Cortex agent
agent: "ANALYTICS:AGENTS.support_cortex_agent"
transforms:
- function: custom_prompt
alias: prompt_adherence
output_type: number
prompt: "Adheres to the system prompt? {{prompts}} -> {{completions}}"
sampling_config:
count: 1000
alert_conditions:
- metric: NULL_RATE
fields: [prompt_adherence]
operator: AUTO
schedule:
type: fixed
interval_minutes: 60
start_time: "2025-01-01T00:00:00+00:00"
aggregate_by: HOURConfiguration
Evaluation monitors accept every Agent Metric Monitor field β agent, trace_table, alert_conditions, agent_span_filters, aggregate_by, is_agent_trace_aggregation / is_agent_conversation_aggregation, sensitivity, schedule, and the common fields β plus the two below.
array of objects Β· optional
LLM-as-judge evaluation prompts, declared at the top level alongside agent. Each transform produces a named score (its alias) that you can then alert on via alert_conditions.fields.
transforms:
- function: custom_prompt
alias: prompt_adherence
output_type: number
prompt: "Adheres to the system prompt? {{prompts}} -> {{completions}}"object Β· required
How many spans an evaluation samples per run, controlling LLM-judge cost. Required on evaluation monitors.
sampling_config:
count: 1000For all other fields, see the Agent Metric Monitor reference.
Examples
Platform agent evaluation with transforms
An evaluation on a platform agent: a custom_prompt transform produces a prompt_adherence score, which the alert condition then monitors. sampling_config caps the spans scored per run.
montecarlo:
agent_evaluation:
- name: cortex_prompt_adherence
description: Prompt-adherence eval on the Cortex agent
agent: "ANALYTICS:AGENTS.support_cortex_agent"
transforms:
- function: custom_prompt
alias: prompt_adherence
output_type: number
prompt: "Adheres to the system prompt? {{prompts}} -> {{completions}}"
sampling_config:
count: 1000
alert_conditions:
- metric: NULL_RATE
fields: [prompt_adherence]
operator: AUTO
schedule:
type: fixed
interval_minutes: 60
start_time: "2025-01-01T00:00:00+00:00"
aggregate_by: HOUROpenTelemetry agent evaluation
The same eval against an OpenTelemetry agent, addressed by its service_name (default trace store, no trace_table).
montecarlo:
agent_evaluation:
- name: otel_answer_relevance
description: Answer-relevance eval on an OTel agent
agent: my-otel-agent # the agent's service_name
transforms:
- function: custom_prompt
alias: answer_relevance
output_type: number
prompt: "Is the answer relevant to the question? {{prompts}} -> {{completions}}"
sampling_config:
count: 500
alert_conditions:
- metric: NULL_RATE
fields: [answer_relevance]
operator: AUTO
schedule:
type: fixed
interval_minutes: 60
start_time: "2025-01-01T00:00:00+00:00"
aggregate_by: HOURTroubleshooting
Evaluation monitors share the agent-source and trace-table behavior of the Agent Metric Monitor. Evaluation-specific points:
- Alerting on a transform score. Reference a transform's
aliasinalert_conditions.fieldsto alert on its score. - Setting both aggregation flags.
is_agent_trace_aggregationandis_agent_conversation_aggregationare mutually exclusive β set at most one. - Forgetting PUT semantics on updates. Updating a monitor replaces its full configuration β omitted fields revert to defaults, including
transforms,sampling_config, and the aggregation flags. Always specify the complete desired configuration.
