LLM Training & Observability
Overview
Monte Carlo uses large language models (LLMs) to power certain AI-driven features. Monte Carlo does not perform model training (building a model from scratch on very large datasets) or fine-tuning (adjusting an already-trained model with specific datasets so that it produces domain-specific outputs) of LLMs.
Instead, Monte Carlo uses pre-trained models via Amazon Bedrock and focuses on ensuring reliable, performant usage through careful prompt design, evaluation, and iterative improvement guided by observability tooling.
LLM Source and Training
As noted above, Monte Carlo uses models provided by Amazon Bedrock. These models are hosted entirely within Monte Carlo’s AWS environment.
Although Amazon Bedrock offers the ability to fine-tune some foundation models, Monte Carlo does not fine-tune or retrain them. We exclusively use the pre-trained versions provided.
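To make this usage pattern concrete, the sketch below shows how a pre-trained Bedrock model can be invoked directly through the AWS SDK, with behavior shaped entirely by the prompt rather than by any fine-tuning. This is a minimal sketch under assumed settings: the model ID, region, prompt, and inference parameters are illustrative placeholders, not a description of Monte Carlo's actual configuration.

```python
# Minimal sketch: calling a pre-trained model hosted on Amazon Bedrock via boto3.
# Model ID, region, prompt, and inference parameters are illustrative assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # a pre-trained model, used as-is
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize the anomalies detected in the orders table today."}],
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

# The response text comes back in the assistant message's first content block.
print(response["output"]["message"]["content"][0]["text"])
```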
Observability and Monitoring Tools
Monte Carlo employs a multi-layered observability and monitoring strategy, tailored to the needs of each feature and LLM, to ensure LLM-powered features meet enterprise expectations for accuracy, reliability, and performance. This strategy includes the following layers:
System-level monitoring tracks health, performance metrics, and latency to ensure AI features maintain reliability and availability.
LLM-as-judge and deterministic evaluations assess the quality of agent responses, detecting low-quality or faulty outputs as well as performance issues (sketched below).
LLM observability tooling supports prompt evaluation, versioning, and detailed trace analysis, helping measure LLM behavior and accuracy across different contexts.
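The sketch below illustrates the LLM-as-judge evaluation pattern mentioned above: a separate model call grades an agent response and flags low-quality outputs. The judge model, prompt wording, scoring scale, and threshold are all assumptions for illustration, not Monte Carlo's actual evaluation logic.

```python
# Illustrative sketch of an LLM-as-judge evaluation. Judge model, prompt, and
# threshold are assumptions; it also assumes the judge returns valid JSON.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_PROMPT = (
    "You are grading an AI assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Return JSON with keys 'score' (1-5) and 'reason'."
)

def judge_response(question: str, answer: str) -> dict:
    """Ask a judge model to score an agent answer and flag low-quality outputs."""
    result = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # hypothetical judge model
        messages=[
            {
                "role": "user",
                "content": [{"text": JUDGE_PROMPT.format(question=question, answer=answer)}],
            }
        ],
        inferenceConfig={"maxTokens": 200, "temperature": 0.0},
    )
    verdict = json.loads(result["output"]["message"]["content"][0]["text"])
    verdict["flagged"] = verdict["score"] < 4  # threshold chosen for illustration
    return verdict
```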
Iteration and Improvements
Monte Carlo follows a data-driven improvement process designed to ensure that AI features evolve in a predictable, measurable, and customer-centric way. Improvements are guided by performance metrics, feature observability, and user feedback loops, so that changes demonstrably enhance accuracy, reliability, and usability.
Prompt Engineering
Observability data guides adjustments to prompts to yield more accurate responses.
Feature Performance Analysis
Data from observability tools informs refinements in design and functionality.
Feedback Integration
Customer and internal feedback loops are incorporated into iterative updates.
Guardrail Evaluation
Monitoring tools help detect and minimize risks such as hallucinations, irrelevant outputs, or excessive response latency.
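As a concrete illustration of the kind of deterministic checks that can back such guardrails, the sketch below flags empty responses, responses that blow a latency budget, and responses that share no vocabulary with the prompt. The thresholds and heuristics are assumptions for illustration only, not Monte Carlo's actual guardrail logic.

```python
# Minimal sketch of deterministic guardrail checks for a single LLM response.
# Thresholds and the relevance heuristic are illustrative assumptions.
import time

MAX_LATENCY_SECONDS = 10.0  # assumed budget for an interactive response

def check_guardrails(prompt: str, output: str, started_at: float) -> list[str]:
    """Return a list of guardrail violations detected for one response."""
    violations = []
    if not output.strip():
        violations.append("empty_response")
    if time.monotonic() - started_at > MAX_LATENCY_SECONDS:
        violations.append("latency_budget_exceeded")
    # Crude relevance heuristic: the response should reference at least one
    # meaningful term from the prompt.
    prompt_terms = {t.lower() for t in prompt.split() if len(t) > 4}
    if prompt_terms and not any(t in output.lower() for t in prompt_terms):
        violations.append("possibly_irrelevant")
    return violations

# Usage: capture time.monotonic() before the model call, then check the output.
started = time.monotonic()
issues = check_guardrails("Summarize anomalies in the orders table", "No anomalies found in orders.", started)
print(issues)  # [] when no guardrail is tripped
```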
Feature-Specific Considerations
For additional information on specific features, click here.