Agent Observability Platform: Troubleshooting & FAQ
Common issues and questions when deploying and running the Agent Observability data platform
This page covers common issues during and after installation, plus frequently asked questions. For the deployment steps themselves, see Installation and Connect to Monte Carlo.
Installation issues
The ClickHouse pod is stuck in Pending
Almost always a scheduling or volume placement problem on the dedicated ClickHouse node group:
- Availability Zone mismatch. EBS volumes are AZ-locked. If the dedicated node group is in a different AZ than the existing ClickHouse persistent volume, the pod can't mount its volume. Confirm the clickhouse_node_group.availability_zone output matches the volume's AZ; override it if needed and re-apply.
- No matching node. The ClickHouse pod has a nodeSelector/toleration for the tainted dedicated=clickhouse:NoSchedule node group. Confirm the node group exists and a node is Ready:
kubectl get nodes -l dedicated=clickhouse
kubectl describe pod -n montecarlo -l clickhouse.altinity.com/chi=otel
The pod's events will name the unsatisfied constraint (AZ, taint, or insufficient resources).
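The AZ comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the module: it assumes the persistent volume exposes its zone through node affinity under a key ending in /zone (for example topology.kubernetes.io/zone or topology.ebs.csi.aws.com/zone), and the helper names and the sample volume are hypothetical.

```python
import json

# Assumption: the volume's zone is pinned via a node-affinity key ending in "/zone",
# e.g. topology.kubernetes.io/zone or topology.ebs.csi.aws.com/zone.
ZONE_KEY_SUFFIX = "/zone"


def pv_zones(pv: dict) -> set[str]:
    """Collect the zones a PersistentVolume's required node affinity pins it to."""
    required = pv.get("spec", {}).get("nodeAffinity", {}).get("required", {})
    zones: set[str] = set()
    for term in required.get("nodeSelectorTerms", []):
        for expr in term.get("matchExpressions", []):
            is_zone_key = expr.get("key", "").endswith(ZONE_KEY_SUFFIX)
            if is_zone_key and expr.get("operator") == "In":
                zones.update(expr.get("values", []))
    return zones


def az_matches(pv: dict, node_group_az: str) -> bool:
    """True if the node group's AZ is one the volume can attach in.

    A volume with no zone affinity is treated as unrestricted.
    """
    zones = pv_zones(pv)
    return not zones or node_group_az in zones


# Hypothetical sample, shaped like the JSON form of a PersistentVolume.
sample_pv = json.loads("""
{
  "spec": {
    "nodeAffinity": {
      "required": {
        "nodeSelectorTerms": [
          {"matchExpressions": [
            {"key": "topology.kubernetes.io/zone",
             "operator": "In",
             "values": ["us-east-1a"]}
          ]}
        ]
      }
    }
  }
}
""")

print(az_matches(sample_pv, "us-east-1a"))
print(az_matches(sample_pv, "us-east-1b"))
```

To try it against a real volume, the JSON from kubectl get pv <pv-name> -o json could be loaded in place of sample_pv, with the clickhouse_node_group.availability_zone output passed as the second argument.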
A TLS certificate never becomes Ready
kubectl get certificates -n montecarlo
kubectl describe certificate -n montecarlo clickhouse-server-tls
Check, in order:
- cert-manager is installed and running. Internal Collector-to-ClickHouse TLS is always enabled and depends on it. If you deployed into an existing cluster, confirm you didn't skip it with helm.install_cert_manager = false while it was actually absent.
- The CA issuer is ready. The chart creates an ao-data-platform-ca issuer by default (tls.certManager.createCA = true).
- DNS-01 validation (for the public ACM-fronted endpoints). Confirm hosted_zone_id is set and the cert-manager IRSA role can manage Route 53 records. kubectl describe on the CertificateRequest/Order shows the validation error.
An ExternalSecret is not SecretSynced
kubectl get externalsecret -n montecarlo
kubectl describe externalsecret -n montecarlo ao-clickhouse-otel-credentials
The External Secrets Operator syncs the ClickHouse passwords from AWS Secrets Manager into the cluster. If the status isn't SecretSynced:
- Confirm the ClusterSecretStore exists and is Valid.
- Confirm the ESO service account's IRSA role can read the Secrets Manager secret and decrypt with the KMS key.
- Confirm the referenced secret exists in Secrets Manager (the clickhouse_*_credentials_secret_arn outputs).
The schema-migration job doesn't complete
kubectl get jobs -n montecarlo
kubectl logs -n montecarlo job/clickhouse-schema-<n>
The clickhouse-schema-<n> job (where <n> is the Helm release revision) creates the otel_traces database and tables and applies the TTLs. It waits for ClickHouse to accept connections as the otel user, so a failure here usually traces back to ClickHouse not being healthy or the otel secret not being synced (see above). The Collector and LLM worker block on this job via an init container, so if it never completes they stay in Init.
I'm on a chart version older than 1.3.0
The module requires ao-data-platform chart >= 1.3.0. On older charts you may see:
- ClickHouse pods that never schedule. Pre-1.3.0 charts used an anti-affinity rule instead of the dedicated tainted node group, which deadlocks with the new node-group layout.
- A readonly_user secret but no SQL user. The read-only user was added in chart 1.2.0; below that, the Secrets Manager secret is created but no ClickHouse user is provisioned.
Pin helm.chart_version to >= 1.3.0.
(Existing cluster) Terraform says the OIDC provider already exists
Import the existing provider before applying:
terraform import 'module.ao_data_platform.aws_iam_openid_connect_provider.cluster[0]' <arn>
kubectl / aws eks update-kubeconfig is denied
You need eks:DescribeCluster plus access to authenticate to the cluster. The principal that ran terraform apply is granted cluster-administrator access automatically. To use a different principal, add an EKS access entry for it.
After installation
Traces aren't arriving in ClickHouse
If everything is healthy but no trace data appears:
- NLB source ranges. If you set otel_collector_nlb_allowed_source_ranges (or clickhouse_nlb_allowed_source_ranges), confirm the range your agents send from is included: an overly narrow list silently drops connections. Widen the list to the correct source CIDR (do not open it to 0.0.0.0/0 in production; scope it to the sending network).
- Endpoint and ports. Confirm your agents target the OpenTelemetry Collector endpoint over OTLP (gRPC 4317 or HTTP 4318) with TLS.
- Collector logs. kubectl logs -n montecarlo -l app.kubernetes.io/name=opentelemetry-collector shows receiver/exporter errors.
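Because a source-range mismatch fails silently, it can help to check an agent's egress IP against the configured list before changing anything. The following is a minimal sketch using Python's standard ipaddress module; the function name, the example CIDRs, and the example addresses are illustrative assumptions, not values from the module.

```python
import ipaddress


def source_allowed(source_ip: str, allowed_ranges: list[str]) -> bool:
    """Return True if source_ip falls inside any of the allowed CIDR ranges."""
    ip = ipaddress.ip_address(source_ip)
    for cidr in allowed_ranges:
        # strict=False tolerates entries written with host bits set, e.g. 10.20.0.1/16.
        network = ipaddress.ip_network(cidr, strict=False)
        if ip.version == network.version and ip in network:
            return True
    return False


# Hypothetical values standing in for otel_collector_nlb_allowed_source_ranges.
allowed = ["10.20.0.0/16", "192.0.2.0/24"]

print(source_allowed("10.20.3.7", allowed))   # inside 10.20.0.0/16
print(source_allowed("10.21.0.5", allowed))   # outside both ranges
```

If the check returns False for the address your agents actually egress from (for example a NAT gateway IP), that range is the one to add to the list.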
The Collector is running but isn't writing to ClickHouse
Check the Collector logs for ClickHouse exporter errors. Common causes: the schema-migration job hasn't completed (so otel_traces doesn't exist yet), or the otel credentials secret isn't synced. Verify both as described under Installation issues.
The LLM worker can't run evaluations
The LLM worker calls Amazon Bedrock. If evaluations fail:
- Permissions. Confirm the LLM worker's IRSA role allows bedrock:InvokeModel for the target model.
- Region / model availability. The worker uses your deployment region by default. If the model you need isn't available there, set helm.llm_worker.bedrock_region to a region where it is.
- Logs. kubectl logs -n montecarlo -l app.kubernetes.io/component=llm-worker.
FAQ
Which ClickHouse user does Monte Carlo connect as?
Currently the otel user. Provide the ClickHouse endpoint and the otel credentials (from the clickhouse_otel_credentials_secret_arn output) to Monte Carlo; see Connect to Monte Carlo. A dedicated least-privilege user is planned for a future release.
Do I need to create a read-only ClickHouse user for Monte Carlo?
No. Monte Carlo's general ClickHouse integration guide describes creating a read-only user, but that doesn't apply here: the otel user is already provisioned with the permissions the Agent needs. The optional readonly_user is for your own external SQL clients (e.g. DataGrip), not for the Monte Carlo connection.
How do I change trace retention?
Set clickhouse_ttl_days (default 30). The schema job re-applies it to the telemetry tables on the next install or upgrade. See the Configuration reference.
How do I restrict who can reach ClickHouse?
Set clickhouse_nlb_allowed_source_ranges to your VPC CIDR or a specific list. These NLB source ranges are the primary network control. See Network access.
