Agent Observability Data Platform

Self-hosted data platform for Agent Observability: OpenTelemetry Collector and ClickHouse on EKS, deployed with Terraform

Overview

The Agent Observability data platform is a self-hosted pipeline that ingests OpenTelemetry trace data from your AI agents and stores it in ClickHouse for Monte Carlo to query. It runs entirely in your own AWS account and is deployed with a single Terraform module that provisions everything from the OpenTelemetry Collector through to ClickHouse on an EKS cluster.

Once deployed, Monte Carlo connects to the platform to power Trace Exploration and agent monitors, all queried through the Monte Carlo Agent.

📘

This is a new deployment option that stores trace data directly in ClickHouse. It is distinct from the warehouse-based ingestion path described in Agent with OpenTelemetry Collector, which routes traces into a data warehouse (Snowflake, Databricks, BigQuery, or Athena).

☁️

AWS only. This self-hosted platform runs on Amazon EKS and is available for AWS accounts only. If your agents run on Azure or GCP, use the warehouse-based Agent with OpenTelemetry Collector path (Snowflake, Databricks, or BigQuery), or contact your Monte Carlo representative to discuss options.

📘

Public artifacts. The platform is deployed from the terraform-aws-ao-data-platform module on the Terraform Registry, with the ao-data-platform Helm chart and ao-llm-worker image on Docker Hub. Pulling them requires no registry credentials; see Prerequisites.

Architecture

The platform has two independent data flows:

  • Ingestion: your instrumented agents send OpenTelemetry traces to the OpenTelemetry Collector, which writes them directly into ClickHouse.
  • Query: the Monte Carlo platform sends all of its SQL through the Monte Carlo Agent (a Lambda function running in the same VPC), which queries ClickHouse for Trace Exploration and agent monitors.

flowchart TB
  classDef ext fill:#F8FAFC,stroke:#64748B,color:#0F172A;
  classDef node fill:#FEF2F2,stroke:#DC2626,color:#7F1D1D;
  classDef net fill:#FFF7ED,stroke:#EA580C,color:#7C2D12;
  classDef agent fill:#EEF2FF,stroke:#4F46E5,color:#1E1B4B;

  APPS["Instrumented agents / apps"]:::ext
  MC["Monte Carlo platform"]:::ext
  BR["Amazon Bedrock"]:::ext

  subgraph ACCT["Your AWS account"]
    AGENT["Monte Carlo Agent · Lambda, in-VPC<br/>customer-deployed, outside EKS"]:::agent
    NLB1["Internal NLB · OTLP"]:::net
    NLB2["Internal NLB · ClickHouse"]:::net

    subgraph EKS["EKS cluster · montecarlo namespace"]
      OTEL["OpenTelemetry Collector"]:::node
      LLM["LLM worker"]:::node
      CH[("ClickHouse")]:::node
    end
  end

  APPS -->|"OTLP 4317 / 4318 · TLS"| NLB1
  NLB1 --> OTEL
  OTEL -->|"write traces"| CH
  MC -->|"SQL · Monte Carlo initiates"| AGENT
  AGENT -->|"forwards SQL · 8443 TLS · otel user"| NLB2
  NLB2 --> CH
  LLM -->|"read / write"| CH
  LLM -->|"evaluations · InvokeModel"| BR

  style ACCT fill:#FFFBEB,stroke:#EA580C,color:#7C2D12
  style EKS fill:#ECFEFF,stroke:#0891B2,color:#164E63
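
The query path in the diagram can be illustrated with a small sketch. It assumes that port 8443 exposes ClickHouse's standard HTTPS interface; the hostname, the password, and the function name are placeholders invented for illustration, and only the otel user name is taken from the diagram. The request is built but never sent, so it can be inspected safely.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Port taken from the diagram; assumed here to be ClickHouse's HTTPS interface.
CLICKHOUSE_TLS_PORT = 8443


def clickhouse_request(host: str, user: str, password: str, sql: str) -> Request:
    """Build (but do not send) a read-only query request for ClickHouse over HTTPS."""
    url = f"https://{host}:{CLICKHOUSE_TLS_PORT}/?{urlencode({'query': sql})}"
    # ClickHouse's HTTP interface accepts credentials in these two headers.
    headers = {"X-ClickHouse-User": user, "X-ClickHouse-Key": password}
    return Request(url, headers=headers, method="GET")


if __name__ == "__main__":
    # Hypothetical hostname and password, for illustration only.
    req = clickhouse_request("clickhouse.example.internal", "otel", "change-me", "SELECT 1")
    print(req.get_method(), req.full_url)
```

In the deployed platform this request would originate from the Monte Carlo Agent inside the VPC, since the NLB in front of ClickHouse is internal.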

Components

The Terraform module deploys the following into your AWS account:

| Component | Description |
| --- | --- |
| EKS cluster + VPC | A new cluster and VPC, or your existing ones. Hosts the data-plane workloads below. |
| ClickHouse | The trace data store. Deployed and managed by the Altinity ClickHouse Operator. |
| OpenTelemetry Collector | Receives OTLP traces and writes them to ClickHouse. |
| LLM worker | Runs evaluations against trace data using Amazon Bedrock (see Evaluation). |
| Cluster controllers | AWS Load Balancer Controller, cert-manager, External Secrets Operator, and external-dns. |
| Supporting AWS resources | ACM certificates, Route 53 records, IAM/IRSA roles, and Secrets Manager secrets. |

The Monte Carlo Agent (Lambda) is not part of this module; it is deployed separately and pointed at ClickHouse. See Deploy the agent and connect to Monte Carlo.

Ingestion

Instrumented agents send traces to the OpenTelemetry Collector over OTLP, using gRPC on port 4317 or HTTP on port 4318; both are TLS-terminated at a Network Load Balancer (NLB). The Collector batches incoming spans and writes them directly into ClickHouse using the ClickHouse exporter; there is no intermediate object storage or data warehouse.
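
As a minimal sketch of pointing an agent at these ports, the helper below builds the standard OpenTelemetry SDK environment variables (OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_PROTOCOL) for either protocol. The function name and the hostname are hypothetical; substitute the DNS name your deployment actually exposes for the OTLP load balancer.

```python
# Ports come from the text above; the hostname used below is a placeholder.
OTLP_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def otlp_env(host: str, protocol: str = "grpc") -> dict[str, str]:
    """Return standard OpenTelemetry SDK environment variables for one protocol."""
    if protocol not in OTLP_PORTS:
        raise ValueError(f"unsupported OTLP protocol: {protocol!r}")
    # TLS terminates at the NLB, so the scheme is https for both protocols.
    endpoint = f"https://{host}:{OTLP_PORTS[protocol]}"
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
    }


if __name__ == "__main__":
    # Print shell exports for an agent that sends OTLP over HTTP.
    for key, value in otlp_env("otel.example.internal", "http/protobuf").items():
        print(f"export {key}={value}")
```

With the http/protobuf protocol, OpenTelemetry SDKs append the /v1/traces path to the base endpoint themselves, which is why the helper leaves the path off.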

The Collector can be hosted in-cluster (deployed by this module) or, where supported, hosted by Monte Carlo with only the data store running in your account.

Evaluation

The platform includes an LLM worker that runs evaluations against trace data. It reads pending work from ClickHouse, invokes a model through Amazon Bedrock in your account, and writes the results back to ClickHouse. The Bedrock region defaults to your deployment region and is configurable.
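
The read, invoke, write-back cycle described above can be pictured with a toy loop. This is an illustrative sketch, not the worker's actual implementation: the EvalJob shape, the in-memory list standing in for ClickHouse, and the invoke_model callable standing in for a Bedrock call are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalJob:
    """Hypothetical unit of work; the real schema is not documented here."""
    trace_id: str
    prompt: str
    result: Optional[str] = None


def run_pending(jobs: list[EvalJob], invoke_model: Callable[[str], str]) -> int:
    """Evaluate every job that has no result yet and store the model output on it."""
    done = 0
    for job in jobs:
        if job.result is not None:
            continue  # already evaluated, nothing to do
        job.result = invoke_model(job.prompt)  # stands in for a Bedrock call
        done += 1
    return done


if __name__ == "__main__":
    # The list plays the role of the data store; the lambda plays the model.
    store = [EvalJob("t1", "Was the answer grounded?"), EvalJob("t2", "x", result="pass")]
    count = run_pending(store, lambda prompt: "pass")
    print(count, [job.result for job in store])
```

Passing the model call in as a function keeps the loop testable without network access or AWS credentials.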

End-to-end setup

Getting agent traces into Monte Carlo is a four-step journey. This guide covers steps 2–3 (deploy and connect); the links below cover instrumenting your agents and creating monitors.

| Step | What you do | Where |
| --- | --- | --- |
| 1. Instrument | Add the Monte Carlo OpenTelemetry SDK to your agents so they emit OTLP traces. | Monte Carlo SDK; the instrument-agent skill in the Agent Toolkit |
| 2. Deploy | Provision the OpenTelemetry Collector and ClickHouse in your AWS account with Terraform. | Prerequisites → Installation |
| 3. Connect | Deploy the Monte Carlo Agent and hand off the ClickHouse connection. | Connect to Monte Carlo |
| 4. Monitor | Explore traces and create agent monitors in Monte Carlo. | Agent Monitors Overview |

Deploying the platform

Work through these pages in order:

| Step | Page |
| --- | --- |
| 1 | Prerequisites: tooling, AWS account and permissions, domains, and chart access |
| 2 | Installation: configure and apply the Terraform module |
| 3 | Deploy the agent and connect to Monte Carlo: deploy the Monte Carlo Agent and hand off credentials |
| - | Configuration reference: TLS, retention, ClickHouse users, and resource sizing |
| - | Self-managed Helm install: advanced, for managing the Helm release yourself |
| - | Troubleshooting & FAQ: common installation and runtime issues |