Self-improving AI agents

Catch what your agent is getting wrong — before users complain.

Convoy learns what good and bad look like from your code and from the runs you flag as bad. We watch every agent run in production, surface silent failures with their root cause, and ship the fix.

OpenTelemetry · Vercel AI SDK · Mastra · OpenAI · Anthropic

What Convoy caught this week
12,847 runs analyzed
Agent promised a 60-day return window we don’t offer
Rule from your code
Your store policy says 30 days. Agent doubled it on 23 conversations.
Agent answered the wrong question on a cancellation
Like your last flag
User asked to cancel; agent replied about shipping. Same shape as Issue #138 you flagged Tuesday.
Agent looped on a failed knowledge search
Auto-validated fix
Got stuck retrying instead of moving on. Fix shipped to 5% of traffic on Tuesday; promoted to 100% on Wednesday.

Used by AI startups from

Y Combinator · Launch · a16z Speedrun · Founders Inc

Your tests pass. Your agent still gets it wrong.

The dangerous failures aren't exceptions — they're a confident, well-formatted reply that's subtly wrong. Tests pass. Logs look clean. You only find out when a user complains.

Without Convoy

Ship to production
Wait for a user to complain
Read through traces one by one to find the cause
Guess at the prompt fix

With Convoy

Convoy learns good vs bad from your code and the runs you flag as bad
Silent failures surface as Issues with their root cause
A concrete fix to your prompt, tool, or workflow
Validated on real traffic before it promotes

What Convoy does

From the first bad run to the fix that ships itself.

Most observability stops at “here's a trace.” Convoy watches every run, learns what good looks like in your domain, surfaces failures, and ships the fix.

Watches every run in production

Convoy ingests OpenTelemetry, so you keep the tracer you already use. Works across the Vercel AI SDK, Mastra, OpenAI, and Anthropic.

Knows what good looks like in your domain

Convoy learns the rules your agent should follow from your codebase, your prompts, and the runs you flag as bad in our UI. No eval suite to write — your code and your judgment are the eval.

Surfaces silent failures with root cause

Hundreds of bad runs collapse into a handful of named Issues, each with the actual reason your agent went off the rails: hallucinations, looping tools, misrouted intents.

Ships the fix and validates it

Convoy proposes a concrete change to your prompt, tool, or workflow — then ships it to a slice of users, watches the same signals it learned from your code, and promotes or rolls back automatically.

Detect

Every silent failure, surfaced as an Issue.

Clustered by root cause

Hundreds of bad runs collapse into a handful of named Issues. Runs with the same root cause are grouped together, so you get one Issue per cause instead of one alert per request.

Catches what evals miss

Tool retries that don't error, intents misrouted to the wrong workflow, hallucinations against your own docs — the failures that pass type checks and unit tests.

Slack and email alerts

New issue detected? You hear about it in the channel where you already work, not on a dashboard you forgot to open.

Silent failures — today
47 runs
Promised a 60-day return window we don't offer
Your store policy says 30 days. Agent doubled it on 23 conversations.
23×
Reply tone didn't match the customer's
Customer was frustrated; reply came back chipper and templated.
11×
Agent kept retrying a failed knowledge search
Search timed out; agent looped instead of moving on or asking for help.
8×
Issued a refund without verifying the user first
Your workflow requires user verification before any refund. Agent skipped it.
5×
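
The panel above groups 47 runs into four Issues. As a rough illustration of that grouping step only, here is a minimal sketch that assumes each flagged run already carries a root-cause label. The `FlaggedRun` and `Issue` shapes and the `clusterByRootCause` helper are hypothetical and are not Convoy's API; deriving the label itself is the hard part and is not shown.

```typescript
// Hypothetical sketch: collapse flagged runs into Issues keyed by root cause.
// Assumes the root-cause label has already been assigned to each run.

interface FlaggedRun {
  runId: string;
  rootCause: string;
}

interface Issue {
  rootCause: string;
  count: number;
  runIds: string[];
}

function clusterByRootCause(runs: FlaggedRun[]): Issue[] {
  const byCause = new Map<string, Issue>();
  for (const run of runs) {
    // Reuse the existing Issue for this cause, or start a new one.
    const issue = byCause.get(run.rootCause) ?? {
      rootCause: run.rootCause,
      count: 0,
      runIds: [],
    };
    issue.count += 1;
    issue.runIds.push(run.runId);
    byCause.set(run.rootCause, issue);
  }
  // Largest cluster first, so the most frequent failure leads the list.
  return Array.from(byCause.values()).sort((a, b) => b.count - a.count);
}
```

One Issue per cause, sorted by frequency, is what keeps the list short: the alert count tracks the number of distinct causes, not the number of bad requests.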

Define good vs bad

Your code is the eval. Your flags sharpen it.

Learned from your codebase

Convoy reads your tool definitions, workflow code, and system prompts to figure out what your agent is supposed to do — without you writing a single test case.

Trained on your flagged runs

Flag a bad run once, and Convoy applies that judgment to every run after.

Updates as you ship

Change a prompt, add a tool, edit your knowledge base — the definition of good moves with you. No drift.

Your codebase
agent.ts, tools/*, system prompts
Your prompts
what you tell the model good looks like
Manually flagged runs
the bad replies you flagged in the UI
Convoy's learned model
Refunds must cite a real entry from your knowledge base
Angry messages go to escalate_to_human, not the FAQ flow
An empty knowledge-base result caused by a search timeout is not the same as no results
verify_user must run before any refund or account change
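
To make the last rule concrete, here is a minimal sketch of how a rule like "verify_user must run before any refund or account change" could be checked against one run's ordered tool calls. This is an illustration, not how Convoy evaluates rules internally. The `ToolCall` shape and the tool names `issue_refund` and `update_account` are assumptions; only `verify_user` comes from the rule above.

```typescript
// Hypothetical sketch: an ordering rule checked against a run's tool calls.

interface ToolCall {
  name: string;
}

// Tools the rule treats as sensitive (illustrative names, not from a real agent).
const SENSITIVE_TOOLS = new Set(["issue_refund", "update_account"]);

// Returns the index of the first sensitive call made before verify_user,
// or -1 if the run complies with the rule.
function firstUnverifiedSensitiveCall(calls: ToolCall[]): number {
  let verified = false;
  for (let i = 0; i < calls.length; i++) {
    const name = calls[i].name;
    if (name === "verify_user") {
      verified = true;
    } else if (!verified && SENSITIVE_TOOLS.has(name)) {
      return i;
    }
  }
  return -1;
}
```

Returning the offending index instead of a boolean points straight at the step in the trace where the run went wrong.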

Fix

From root cause to a concrete fix.

A specific edit, not a hint

Convoy proposes an actual change to the prompt, tool schema, or workflow step that's causing the failure — with the exact lines added or removed.

Wired into your coding agent

Open the suggested change in Cursor or Claude Code with one click. Review it like any other code change before it ships.

Backed by the runs that proved it

Every fix links to the cluster of failing traces it was derived from. You can see exactly why Convoy thinks this is the right change.

Issue #142 · Hallucinated refund policy
23 runs
Root cause
The system prompt asks the agent to cite the refund policy, but never lists which products are non-refundable. On those products the model invents a window that isn't in your knowledge base.
// system_prompt.md +2 -0
  Cite our refund policy when relevant.
+ Non-refundable: gift cards, digital downloads, custom orders.
+ If the product isn't in the knowledge base, escalate — do not invent a window.
Generated from 23 flagged runs

Validate

On the roadmap

Try the fix. Keep what works.

Convoy doesn't guess at one fix and call it done. It tries a few variants on real traffic, watches which one actually improves the runs, and surfaces the winner.

Variants run on a slice

Convoy routes proposed fixes to a small share of users, session-sticky so each user sees a consistent experience.

Judged on what you taught it

The same rules Convoy learned from your code decide whether a variant is winning or losing. No new evals to write.

Keep the winner. Drop the rest.

Variants that improve quality get promoted. Variants that don't get rolled back. You see which change actually moved the needle, and why.
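
Session-sticky routing is commonly done by hashing a stable identifier into a bucket, so the same session always lands on the same variant without storing any assignment. The sketch below shows that general technique. It is an assumption about how such routing could work, not a description of Convoy's implementation, and `assignVariant`, the `Variant` names, and the rollout share are all hypothetical.

```typescript
// Hypothetical sketch: deterministic, session-sticky variant assignment.

type Variant = "baseline" | "variant-a" | "variant-b";

// FNV-1a style 32-bit hash over UTF-16 code units: small, dependency-free,
// and stable across processes, which is what makes the assignment sticky.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

function assignVariant(
  sessionId: string,
  issueId: string,
  rolloutShare: number, // fraction of sessions that see a candidate, e.g. 0.05
  candidates: Variant[],
): Variant {
  // Salt with the issue id so different experiments bucket independently.
  const bucket = fnv1a(`${issueId}:${sessionId}`) / 0x100000000; // in [0, 1)
  if (bucket >= rolloutShare || candidates.length === 0) {
    return "baseline";
  }
  // Split the rollout slice evenly across the candidate variants.
  const index = Math.min(
    candidates.length - 1,
    Math.floor((bucket / rolloutShare) * candidates.length),
  );
  return candidates[index];
}
```

Because the assignment is a pure function of the session and issue ids, widening the rollout share keeps every session that already saw a candidate inside the slice, though its specific variant can shift when the slice is re-divided.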

Issue #142 — Trying 2 fixes
1,318 runs analyzed
Variant A — add non-refundable list to prompt
92 · WINNER
Variant B — restructure refund step in workflow
81 · +7
Baseline (current prompt)
74
Variant A scores 18 points above baseline on your learned rules. Promoting to 100% in 2h unless you object.
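
The arithmetic behind that panel is simple: each variant's lift is its score minus the baseline score (92 − 74 = 18 for Variant A, 81 − 74 = 7 for Variant B). As a minimal sketch of the promote-or-roll-back decision, assuming a single quality score per variant and a hypothetical minimum-lift threshold (`minLift` is our assumption, not a documented Convoy setting):

```typescript
// Hypothetical sketch: pick the variant to promote, or null to keep the baseline.

interface VariantScore {
  name: string;
  score: number;
}

interface Winner {
  name: string;
  lift: number;
}

function pickWinner(
  baselineScore: number,
  variants: VariantScore[],
  minLift: number, // assumed threshold a variant must clear to be promoted
): Winner | null {
  let best: Winner | null = null;
  for (const variant of variants) {
    const lift = variant.score - baselineScore;
    if (lift >= minLift && (best === null || lift > best.lift)) {
      best = { name: variant.name, lift };
    }
  }
  return best;
}

// With the panel's numbers:
// pickWinner(74, [{ name: "Variant A", score: 92 }, { name: "Variant B", score: 81 }], 5)
// returns { name: "Variant A", lift: 18 }
```

A `null` result corresponds to the rollback case: no variant beat the baseline by enough, so the current prompt stays.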

Point your traces at Convoy.

Convoy ingests OpenTelemetry. Drop in the exporter you already know, or use one line of config in Next.js.

// instrumentation.ts
import { registerOTel } from "@vercel/otel";

export function register() {
  registerOTel({
    serviceName: "your-agent",
    traceExporter: "otlp",
  });
}

// .env
OTEL_EXPORTER_OTLP_ENDPOINT=https://ingest.convoylabs.com
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer ${CONVOY_API_KEY}
  • Standard OpenTelemetry — no Convoy SDK required.
  • Works with Vercel AI SDK, Mastra, OpenAI, Anthropic, and LangChain out of the box.
  • Bring your own tracer; point the exporter at Convoy.

Find out what your agent is getting wrong this week.

15-minute demo. We'll show you the silent failures your evals are missing — and walk through pricing for your team.