Agent Observability Platform: Troubleshooting & FAQ

Common issues and questions when deploying and running the Agent Observability data platform

Common issues during installation and after, plus frequently asked questions. For the deployment steps themselves, see Installation and Connect to Monte Carlo.

Installation issues

The ClickHouse pod is stuck in Pending

This is almost always a scheduling or volume placement problem on the dedicated ClickHouse node group:

  • Availability Zone mismatch. EBS volumes are AZ-locked. If the dedicated node group is in a different AZ than the existing ClickHouse persistent volume, the pod can't mount its volume. Confirm the clickhouse_node_group.availability_zone output matches the volume's AZ; override it if needed and re-apply.

  • No matching node. The ClickHouse pod has a nodeSelector/toleration for the tainted dedicated=clickhouse:NoSchedule node group. Confirm the node group exists and a node is Ready:

    kubectl get nodes -l dedicated=clickhouse
    kubectl describe pod -n montecarlo -l clickhouse.altinity.com/chi=otel

    The pod's events will name the unsatisfied constraint (AZ, taint, or insufficient resources).
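
The Availability Zone case can be checked directly by comparing the zone of the dedicated nodes with the zone each persistent volume is pinned to. The following is a minimal sketch, not part of the module: the function names are hypothetical, and it assumes the nodes carry the standard topology.kubernetes.io/zone label, that the volumes record their zone in spec.nodeAffinity (as EBS-backed volumes typically do), and that the ClickHouse volumes are the only ones claimed from the montecarlo namespace.

```shell
# Hypothetical helpers; the names are illustrative and not part of the module.

# Zones of the nodes in the dedicated ClickHouse node group.
node_zones() {
  kubectl get nodes -l dedicated=clickhouse \
    -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' |
    sort -u
}

# Zones recorded in the node affinity of volumes claimed from the montecarlo namespace.
pv_zones() {
  kubectl get pv \
    -o jsonpath='{range .items[?(@.spec.claimRef.namespace=="montecarlo")]}{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values[*]}{"\n"}{end}' |
    sort -u
}

# Succeeds only when both lists are non-empty and identical.
zones_match() {
  nodes=$(node_zones)
  volumes=$(pv_zones)
  [ -n "$nodes" ] && [ "$nodes" = "$volumes" ]
}
```

For example, zones_match || { node_zones; pv_zones; } prints both lists only when they differ.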

A TLS certificate never becomes Ready

kubectl get certificates -n montecarlo
kubectl describe certificate -n montecarlo clickhouse-server-tls

Check, in order:

  • cert-manager is installed and running. Internal Collector↔ClickHouse TLS is always enabled and depends on it. If you deployed into an existing cluster, confirm you didn't skip it with helm.install_cert_manager = false while it was actually absent.
  • The CA issuer is ready. The chart creates an ao-data-platform-ca issuer by default (tls.certManager.createCA = true).
  • DNS-01 validation (for the public ACM-fronted endpoints). Confirm hosted_zone_id is set and the cert-manager IRSA role can manage Route 53 records. Running kubectl describe on the CertificateRequest/Order shows the validation error.
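
The three checks above can be walked through in one pass. This is a sketch under assumptions: the function name is hypothetical, and the cert-manager namespace is assumed to be cert-manager, which is the common default but may differ in your cluster.

```shell
# Hypothetical helper; the cert-manager namespace below is an assumption.
cert_triage() {
  ns="${1:-montecarlo}"
  # 1. Is cert-manager itself running?
  kubectl get pods -n cert-manager
  # 2. Are the issuers (including ao-data-platform-ca) Ready?
  kubectl get issuers,clusterissuers -A
  # 3. Where did issuance stop? Pending requests, orders, and challenges show up here.
  kubectl get certificaterequests,orders,challenges -n "$ns"
}
```

Run cert_triage, then kubectl describe whichever resource is not Ready to read its events.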

An ExternalSecret is not SecretSynced

kubectl get externalsecret -n montecarlo
kubectl describe externalsecret -n montecarlo ao-clickhouse-otel-credentials

The External Secrets Operator syncs the ClickHouse passwords from AWS Secrets Manager into the cluster. If the status isn't SecretSynced:

  • Confirm the ClusterSecretStore exists and is Valid.
  • Confirm the ESO service account's IRSA role can read the Secrets Manager secret and decrypt with the KMS key.
  • Confirm the referenced secret exists in Secrets Manager (the clickhouse_*_credentials_secret_arn outputs).
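
Two of these checks are easy to script. The sketch below uses hypothetical function names and assumes that you run it from the Terraform working directory, that your root module exposes the clickhouse_otel_credentials_secret_arn output, and that your own AWS credentials can describe the secret. It confirms that the secret exists, not that the ESO role can read it.

```shell
# Hypothetical helpers; names are illustrative.

# Print the condition message the operator recorded for an ExternalSecret.
externalsecret_message() {
  kubectl get externalsecret -n montecarlo "${1:-ao-clickhouse-otel-credentials}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
}

# Print the secret's name if the ARN from the Terraform output exists in Secrets Manager.
secret_exists() {
  arn=$(terraform output -raw clickhouse_otel_credentials_secret_arn) || return 1
  aws secretsmanager describe-secret --secret-id "$arn" --query 'Name' --output text
}
```

The condition message usually states whether the failure is a missing secret, an access error, or a store problem.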

The schema-migration job doesn't complete

kubectl get jobs -n montecarlo
kubectl logs -n montecarlo job/clickhouse-schema-<n>

The clickhouse-schema-<n> job (where <n> is the Helm release revision) creates the otel_traces database and tables and applies the TTLs. It waits for ClickHouse to accept connections as the otel user, so a failure here usually traces back to ClickHouse not being healthy or the otel secret not being synced (see above). The Collector and LLM worker block on this job via an init container, so if it never completes they stay in Init.
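
Because <n> changes with every release revision, it can help to resolve the newest job by creation time rather than typing the number. A minimal sketch with a hypothetical function name:

```shell
# Hypothetical helper: print the most recently created schema-migration job,
# in the resource/name form that kubectl logs and kubectl wait accept.
latest_schema_job() {
  kubectl get jobs -n montecarlo --sort-by=.metadata.creationTimestamp -o name |
    grep '/clickhouse-schema-' | tail -n 1
}
```

For example: kubectl logs -n montecarlo "$(latest_schema_job)", or kubectl wait -n montecarlo --for=condition=complete --timeout=300s "$(latest_schema_job)".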

I'm on a chart version older than 1.3.0

The module requires ao-data-platform chart >= 1.3.0. On older charts you may see:

  • ClickHouse pods that never schedule. Pre-1.3.0 charts used an anti-affinity rule instead of the dedicated tainted node group, which deadlocks with the new node-group layout.
  • A readonly_user secret but no SQL user. The read-only user was added in chart 1.2.0; below that, the Secrets Manager secret is created but no ClickHouse user is provisioned.

Pin helm.chart_version to >= 1.3.0.
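
To check a deployed chart version against this minimum in a script, a version-aware comparison avoids the trap of comparing 1.10.0 and 1.3.0 as strings. The helper below is a hypothetical sketch and relies on sort -V, which GNU coreutils and recent BSD/macOS sort provide.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# sort -V orders version strings numerically per component; the smaller one comes first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

The deployed version appears in the CHART column of helm list -n montecarlo (assuming the release lives in the montecarlo namespace); pass it as the first argument, for example version_ge 1.2.4 1.3.0 || echo "upgrade required".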

(Existing cluster) Terraform says the OIDC provider already exists

Import the existing provider before applying:

terraform import 'module.ao_data_platform.aws_iam_openid_connect_provider.cluster[0]' <arn>
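
If you don't have the provider's ARN at hand, it can be looked up from the cluster's OIDC issuer. This is a sketch with a hypothetical function name; it assumes your AWS credentials allow eks:DescribeCluster and iam:ListOpenIDConnectProviders, and that the cluster name you pass is the existing cluster's.

```shell
# Hypothetical helper: print the ARN of the IAM OIDC provider that matches
# the given EKS cluster's issuer URL.
find_oidc_provider_arn() {
  cluster_name="$1"
  issuer=$(aws eks describe-cluster --name "$cluster_name" \
    --query 'cluster.identity.oidc.issuer' --output text) || return 1
  issuer_id="${issuer##*/}"   # the trailing ID segment of the issuer URL
  aws iam list-open-id-connect-providers \
    --query 'OpenIDConnectProviderList[].Arn' --output text |
    tr '\t' '\n' | grep "/id/${issuer_id}\$"
}
```

The printed ARN is the value to substitute for <arn> in the import command.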

kubectl / aws eks update-kubeconfig is denied

You need eks:DescribeCluster plus access to authenticate to the cluster. The principal that ran terraform apply is granted cluster-administrator access automatically. To use a different principal, add an EKS access entry for it.
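
A quick way to see why access is denied is to compare the identity your CLI is actually using with the principals that have access entries. The sketch below uses a hypothetical function name and assumes the calling principal is allowed eks:ListAccessEntries, which a denied principal may not be; in that case run it as the principal that applied Terraform.

```shell
# Hypothetical helper: print the caller's ARN, then every principal that has
# an access entry on the cluster, one per line.
access_check() {
  cluster="$1"
  aws sts get-caller-identity --query 'Arn' --output text
  aws eks list-access-entries --cluster-name "$cluster" \
    --query 'accessEntries' --output text | tr '\t' '\n'
}
```

Note that for an assumed role, the first line is an STS ARN of the form arn:aws:sts::<account>:assumed-role/<role>/<session>, while access entries reference the IAM role ARN, so compare the role names rather than the full strings.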

After installation

Traces aren't arriving in ClickHouse

If everything is healthy but no trace data appears:

  • NLB source ranges. If you set otel_collector_nlb_allowed_source_ranges (or clickhouse_nlb_allowed_source_ranges), confirm the range your agents send from is included: an overly narrow list silently drops connections. Widen the list to the correct source CIDR (do not open it to 0.0.0.0/0 in production; scope it to the sending network).
  • Endpoint and ports. Confirm your agents target the OpenTelemetry Collector endpoint over OTLP (gRPC 4317 or HTTP 4318) with TLS.
  • Collector logs. kubectl logs -n montecarlo -l app.kubernetes.io/name=opentelemetry-collector shows receiver/exporter errors.
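
For the source-range check, it is easy to misjudge by eye whether an address falls inside a CIDR block. The following sketch tests IPv4 membership with shell arithmetic. The function names are hypothetical, it handles IPv4 only, and it assumes octets are written without leading zeros.

```shell
# Hypothetical helpers; IPv4 only.

# Convert a dotted-quad address to a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak to the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeeds when address $1 falls inside CIDR block $2 (for example 10.0.0.0/16).
ip_in_cidr() {
  ip="$1"; net="${2%/*}"; bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}
```

For example, ip_in_cidr 10.1.2.3 10.0.0.0/8 succeeds and ip_in_cidr 192.168.1.5 10.0.0.0/8 fails. Test the address the load balancer actually sees, which is the NAT or egress address if your agents sit behind one.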

The Collector is running but isn't writing to ClickHouse

Check the Collector logs for ClickHouse exporter errors. Common causes: the schema-migration job hasn't completed (so otel_traces doesn't exist yet), or the otel credentials secret isn't synced. Verify both as described under Installation issues.
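
When kubectl logs is used with a label selector it returns only the last 10 lines per pod by default, so exporter errors are easy to miss. A minimal sketch with a hypothetical function name that widens the window and keeps only ClickHouse-related lines:

```shell
# Hypothetical helper: show recent Collector log lines that mention ClickHouse.
# $1 is the number of lines to read per pod (default 500).
collector_clickhouse_errors() {
  kubectl logs -n montecarlo -l app.kubernetes.io/name=opentelemetry-collector \
    --tail="${1:-500}" --prefix | grep -i 'clickhouse'
}
```

The --prefix flag labels each line with its pod and container, which helps when several Collector replicas are running.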

The LLM worker can't run evaluations

The LLM worker calls Amazon Bedrock. If evaluations fail:

  • Permissions. Confirm the LLM worker's IRSA role allows bedrock:InvokeModel for the target model.
  • Region / model availability. The worker uses your deployment region by default. If the model you need isn't available there, set helm.llm_worker.bedrock_region to a region where it is.
  • Logs. kubectl logs -n montecarlo -l app.kubernetes.io/component=llm-worker.
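
Regional availability can be checked from the CLI before changing helm.llm_worker.bedrock_region. The sketch below uses a hypothetical function name and assumes your credentials allow bedrock:ListFoundationModels. A model being listed in a region does not by itself mean your account has been granted access to it.

```shell
# Hypothetical helper: succeeds when the exact model ID is offered in the region.
bedrock_model_available() {
  region="$1"; model_id="$2"
  aws bedrock list-foundation-models --region "$region" \
    --query 'modelSummaries[].modelId' --output text |
    tr '\t' '\n' | grep -Fxq "$model_id"
}
```

Usage: bedrock_model_available <region> <model-id> || echo "not offered in this region".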

FAQ

Which ClickHouse user does Monte Carlo connect as?

Currently the otel user. Provide the ClickHouse endpoint and the otel credentials (from the clickhouse_otel_credentials_secret_arn output) to Monte Carlo; see Connect to Monte Carlo. A dedicated least-privilege user is planned for a future release.

Do I need to create a read-only ClickHouse user for Monte Carlo?

No. Monte Carlo's general ClickHouse integration guide describes creating a read-only user, but that doesn't apply here: the otel user is already provisioned with the permissions the Agent needs. The optional readonly_user is for your own external SQL clients (e.g. DataGrip), not for the Monte Carlo connection.

How do I change trace retention?

Set clickhouse_ttl_days (default 30). The schema job re-applies it to the telemetry tables on the next install or upgrade. See the Configuration reference.

How do I restrict who can reach ClickHouse?

Set clickhouse_nlb_allowed_source_ranges to your VPC CIDR or a specific list. These NLB source ranges are the primary network control. See Network access.