Self-healing AI agents

Catch what your agent is getting wrong — before users complain.

Convoy learns what good and bad look like from your code and from the runs you flag as bad. It watches every agent run in production, surfaces silent failures with their root cause, and ships the fix.


OpenTelemetry · Vercel AI SDK · Mastra · OpenAI · Anthropic

What Convoy caught this week

12,847 runs analyzed

Agent promised a 60-day return window we don’t offer

Rule from your code

Your store policy says 30 days. Agent doubled it in 23 conversations.

Agent answered the wrong question on a cancellation

Like your last flag

User asked to cancel; agent replied about shipping. Same shape as Issue #138 you flagged Tuesday.

Agent looped on a failed knowledge search

Auto-validated fix

Agent got stuck retrying instead of moving on. Fix shipped to 5% of traffic on Tuesday; promoted to 100% on Wednesday.


Your tests pass. Your agent still gets it wrong.

The dangerous failures aren't exceptions. They're confident, well-formatted replies that are subtly wrong. Tests pass. Logs look clean. You only find out when a user complains.

Without Convoy

Ship to production

Wait for a user to complain

Read through traces one by one to find the cause

Guess at the prompt fix

With Convoy

Convoy learns good vs bad from your code and the runs you flag as bad

Silent failures surface as Issues with their root cause

A concrete fix to your prompt, tool, or workflow

Validated on real traffic before it is promoted

What Convoy does

From the first bad run to the fix that ships itself.

Most observability stops at “here's a trace.” Convoy watches every run, learns what good looks like in your domain, surfaces failures, and ships the fix.

Watches every run in production

Convoy ingests OpenTelemetry, so you keep the tracer you already use. Works across the Vercel AI SDK, Mastra, OpenAI, and Anthropic.

Knows what good looks like in your domain

Convoy learns the rules your agent should follow from your codebase, your prompts, and the runs you flag as bad in our UI. No eval suite to write — your code and your judgment are the eval.

Surfaces silent failures with root cause

Hundreds of bad runs collapse into a handful of named Issues, each with the actual reason your agent went off the rails: hallucinations, looping tools, mis-routed intents.

Ships the fix and validates it

Convoy proposes a concrete change to your prompt, tool, or workflow — then ships it to a slice of users, watches the same signals it learned from your code, and promotes or rolls back automatically.

Detect

Every silent failure, surfaced as an Issue.

Clustered by root cause

Hundreds of bad runs collapse into a handful of named issues — the same root cause grouped together, not one alert per request.

Catches what evals miss

Tool retries that don't error, intents misrouted to the wrong workflow, hallucinations against your own docs — the failures that pass type checks and unit tests.

Slack and email alerts

New issue detected? You hear about it in the channel where you already work, not on a dashboard you forgot to open.
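
To make "clustered by root cause" concrete, here is a minimal sketch of the grouping step. It assumes each bad run already carries a root-cause label; the FlaggedRun and Issue shapes and the label strings are hypothetical, and this is not a description of how Convoy assigns those labels.

```typescript
// Hypothetical shape of a bad run; the field names are assumptions for illustration.
interface FlaggedRun {
  runId: string;
  rootCause: string; // label assumed to come from an upstream classifier
}

interface Issue {
  rootCause: string;
  runIds: string[];
}

// Group runs that share a root cause into one issue, largest cluster first.
function clusterByRootCause(runs: FlaggedRun[]): Issue[] {
  const byCause = new Map<string, string[]>();
  for (const run of runs) {
    const ids = byCause.get(run.rootCause) ?? [];
    ids.push(run.runId);
    byCause.set(run.rootCause, ids);
  }
  return Array.from(byCause.entries())
    .map(([rootCause, runIds]) => ({ rootCause, runIds }))
    .sort((a, b) => b.runIds.length - a.runIds.length);
}

// Usage: three bad runs collapse into two issues instead of three alerts.
const exampleIssues = clusterByRootCause([
  { runId: "r1", rootCause: "invented-return-window" },
  { runId: "r2", rootCause: "retry-loop-on-search-timeout" },
  { runId: "r3", rootCause: "invented-return-window" },
]);
console.log(exampleIssues.map((i) => `${i.rootCause}: ${i.runIds.length} run(s)`));
```

The point of the sketch is the output shape: one entry per cause with the run IDs attached, so an alert can link back to every trace behind it.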

Silent failures — today

47 runs

Promised a 60-day return window we don't offer

Your store policy says 30 days. Agent doubled it in 23 conversations.

23×

Reply tone didn't match the customer's

Customer was frustrated; reply came back chipper and templated.

11×

Agent kept retrying a failed knowledge search

Search timed out; agent looped instead of moving on or asking for help.

8×

Issued a refund without verifying the user first

Your workflow requires user verification before any refund. Agent skipped it.

5×
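
The "kept retrying a failed knowledge search" failure above never throws, so it has to be read off the trace. A minimal sketch of one way to spot it, assuming a run is available as an ordered list of tool calls; the ToolCall shape and the maxRetries threshold are hypothetical, not Convoy's detection logic.

```typescript
// Hypothetical tool-call record taken from a trace; the field names are assumptions.
interface ToolCall {
  tool: string;
  args: string; // serialized arguments
  ok: boolean;  // false when the call timed out or returned a failure
}

// Longest streak of consecutive failed calls to the same tool with the same arguments.
function longestFailedRetryStreak(calls: ToolCall[]): number {
  let longest = 0;
  let current = 0;
  let prev: ToolCall | undefined;
  for (const call of calls) {
    const continuesStreak =
      prev !== undefined &&
      !prev.ok &&
      prev.tool === call.tool &&
      prev.args === call.args;
    if (call.ok) {
      current = 0;
    } else {
      current = continuesStreak ? current + 1 : 1;
    }
    longest = Math.max(longest, current);
    prev = call;
  }
  return longest;
}

// A run counts as "stuck" when the streak exceeds the allowed number of attempts.
function isStuckRetrying(calls: ToolCall[], maxRetries = 3): boolean {
  return longestFailedRetryStreak(calls) > maxRetries;
}
```

Comparing serialized arguments keeps the check cheap; an agent that rephrases its query on each attempt would need a looser similarity test.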

Define good vs bad

Your code is the eval. Your flags sharpen it.

Learned from your codebase

Convoy reads your tool definitions, workflow code, and system prompts to figure out what your agent is supposed to do — without you writing a single test case.

Trained on your flagged runs

Flag a bad run once, and Convoy applies that judgment to every run after.

Updates as you ship

Change a prompt, add a tool, edit your knowledge base — the definition of good moves with you. No drift.

Your codebase

agent.ts, tools/*, system prompts

Your prompts

what you tell the model good looks like

Manually flagged runs

the bad replies you flagged in the UI

Convoy's learned model

Refunds must cite a real entry from your knowledge base

Angry messages go to escalate_to_human, not the FAQ flow

An empty knowledge-base result caused by a search timeout is not the same as no results

verify_user must run before any refund or account change
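
Two of the rules above translate directly into checks over a run. The sketch below is illustrative only: verify_user comes from the rule itself, while issue_refund, update_account, and the SearchOutcome type are hypothetical names, and nothing here reflects Convoy's internal representation.

```typescript
// A timeout is modelled separately from an empty result set, so the two cannot be confused.
type SearchOutcome =
  | { kind: "results"; docs: string[] }
  | { kind: "no_results" }
  | { kind: "timeout" };

function describeOutcome(outcome: SearchOutcome): string {
  switch (outcome.kind) {
    case "results":
      return `found ${outcome.docs.length} document(s)`;
    case "no_results":
      return "search completed, nothing matched";
    case "timeout":
      return "search did not complete, result unknown";
  }
}

// Hypothetical names for the tools that change money or account state.
const SENSITIVE_TOOLS = new Set(["issue_refund", "update_account"]);

// Index of the first sensitive tool call made before verify_user, or -1 if the run complies.
function firstUnverifiedAction(toolCalls: string[]): number {
  let verified = false;
  let index = 0;
  for (const name of toolCalls) {
    if (name === "verify_user") {
      verified = true;
    } else if (!verified && SENSITIVE_TOOLS.has(name)) {
      return index;
    }
    index++;
  }
  return -1;
}

console.log(firstUnverifiedAction(["search_kb", "issue_refund"])); // 1
console.log(firstUnverifiedAction(["verify_user", "issue_refund"])); // -1
```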

Fix

From root cause to a concrete fix.

A specific edit, not a hint

Convoy proposes an actual change to the prompt, tool schema, or workflow step that's causing the failure — with the exact lines added or removed.

Wired into your coding agent

Open the suggested change in Cursor or Claude Code with one click. Review it like any other code change before it ships.

Backed by the runs that proved it

Every fix links to the cluster of failing traces it was derived from. You can see exactly why Convoy thinks this is the right change.

Issue #142 · Hallucinated refund policy

23 runs

Root cause

The system prompt asks the agent to cite the refund policy, but never lists which products are non-refundable. On those products the model invents a window that isn't in your knowledge base.

// system_prompt.md +3 -1
- Cite our refund policy when relevant.
+ Cite our refund policy when relevant.
+ Non-refundable: gift cards, digital downloads, custom orders.
+ If the product isn't in the knowledge base, escalate — do not invent a window.

Generated from 23 flagged runs

Validate

On the roadmap

Try the fix. Keep what works.

Convoy doesn't guess at one fix and call it done. It tries a few variants on real traffic, watches which one actually improves the runs, and surfaces the winner.

Variants run on a slice

Convoy routes proposed fixes to a small share of users, session-sticky so each user sees a consistent experience.

Judged on what you taught it

The same rules Convoy learned from your code decide whether a variant is winning or losing. No new evals to write.

Keep the winner. Drop the rest.

Variants that improve quality get promoted. Variants that don't get rolled back. You see which change actually moved the needle, and why.
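
Session-sticky routing to a small share of traffic is commonly done by hashing the session into a fixed bucket. The sketch below shows that general technique under stated assumptions: the hash is a simple FNV-1a-style function over UTF-16 code units, and the function names and experiment key are hypothetical rather than Convoy's API.

```typescript
// FNV-1a-style 32-bit hash: deterministic and dependency-free.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map a session to a stable bucket in [0, 100) for a given experiment.
function bucketFor(sessionId: string, experiment: string): number {
  return fnv1a(`${experiment}:${sessionId}`) % 100;
}

// Sessions whose bucket falls below the rollout percentage get the variant.
function assignVariant(
  sessionId: string,
  experiment: string,
  rolloutPercent: number
): "variant" | "baseline" {
  return bucketFor(sessionId, experiment) < rolloutPercent ? "variant" : "baseline";
}

// The same session always lands on the same side for a given experiment.
console.log(assignVariant("session-abc", "issue-142", 5) === assignVariant("session-abc", "issue-142", 5));
```

Because assignment depends only on the bucket, raising the rollout from 5 to 100 keeps every session that already had the variant on the variant.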

Issue #142 — Trying 2 fixes

1,318 runs analyzed

Variant A — add non-refundable list to prompt

92 · WINNER

Variant B — restructure refund step in workflow

81 · +7

Baseline (current prompt)

Variant A scores 18 points above baseline on your learned rules. Promoting to 100% in 2h unless you object.
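
The promote-or-roll-back decision in this panel can be sketched as a comparison against the baseline score. The baseline of 74 used below is inferred from the deltas shown (92 minus 18, and 81 minus 7) and is not displayed in the panel; the minGain threshold and all type names are hypothetical.

```typescript
interface VariantScore {
  name: string;
  score: number; // quality score on the learned rules
}

interface Decision {
  winner: string | null; // null when no variant beats the baseline by minGain
  delta: number;         // winner's gain over baseline, 0 when there is no winner
  rolledBack: string[];  // every variant that is not promoted
}

// Promote the best variant that beats baseline by at least minGain; roll back the rest.
function pickWinner(baseline: number, variants: VariantScore[], minGain = 1): Decision {
  let best: VariantScore | null = null;
  for (const v of variants) {
    const beatsBaseline = v.score - baseline >= minGain;
    if (beatsBaseline && (best === null || v.score > best.score)) {
      best = v;
    }
  }
  const winnerName = best === null ? null : best.name;
  return {
    winner: winnerName,
    delta: best === null ? 0 : best.score - baseline,
    rolledBack: variants.filter((v) => v.name !== winnerName).map((v) => v.name),
  };
}

// Scores from the panel, with the inferred baseline of 74.
const decision = pickWinner(74, [
  { name: "Variant A", score: 92 },
  { name: "Variant B", score: 81 },
]);
console.log(decision); // winner "Variant A", delta 18, rolledBack ["Variant B"]
```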

Point your traces at Convoy.

Convoy ingests OpenTelemetry. Drop in the exporter you already know, or use one line of config in Next.js.

// instrumentation.ts
import { registerOTel } from "@vercel/otel";

export function register() {
  registerOTel({
    serviceName: "your-agent",
    traceExporter: "otlp",
  });
}

// .env
OTEL_EXPORTER_OTLP_ENDPOINT=https://ingest.convoylabs.com
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer ${CONVOY_API_KEY}

Standard OpenTelemetry — no Convoy SDK required.
Works with Vercel AI SDK, Mastra, OpenAI, Anthropic, and LangChain out of the box.
Bring your own tracer; point the exporter at Convoy.
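
If traces don't arrive, it helps to check what the two variables above resolve to. The sketch below mirrors the OpenTelemetry conventions as I understand them, assuming OTLP over HTTP: headers are comma-separated key=value pairs whose values may be percent-encoded, and a base endpoint gets /v1/traces appended. The helper names are hypothetical and not part of any SDK.

```typescript
// Parse the "k1=v1,k2=v2" format used by OTEL_EXPORTER_OTLP_HEADERS.
function parseOtlpHeaders(raw: string): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const pair of raw.split(",")) {
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip entries with no key or no "="
    const key = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    try {
      headers[key] = decodeURIComponent(value);
    } catch {
      headers[key] = value; // keep the raw value if it is not valid percent-encoding
    }
  }
  return headers;
}

// For OTLP/HTTP, a base endpoint is extended with the per-signal path for traces.
function tracesUrl(baseEndpoint: string): string {
  return baseEndpoint.replace(/\/+$/, "") + "/v1/traces";
}

console.log(tracesUrl("https://ingest.convoylabs.com"));
console.log(parseOtlpHeaders("Authorization=Bearer%20example-key"));
```

Some exporters expect the space in "Bearer <key>" to be written as %20; if authentication fails, that encoding is worth checking first.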

FAQ

Questions, answered.

What is Convoy?

Convoy is an observability and self-healing platform for AI agents. It ingests your OpenTelemetry traces, learns what good and bad look like from your codebase and the runs you flag, surfaces silent failures with their root cause, proposes a concrete fix to your prompt, tool, or workflow, and validates it on real traffic before promoting it.

How is Convoy different from traditional observability and evals?

Traditional observability stops at showing you a trace, and evals only catch the failures you thought to write tests for. Convoy watches every production run, learns the rules your agent should follow directly from your code and your flagged runs, and clusters silent failures by root cause, then proposes the actual edit that fixes them. You don't write or maintain an eval suite; your code and your judgment are the eval.

Do I need to write an eval suite?

No. Convoy reads your tool definitions, workflow code, and system prompts to infer what your agent is supposed to do, and you sharpen that by flagging bad runs in the UI. Flag a bad reply once and Convoy applies that judgment to every run after, so there is no eval suite to write or keep up to date.

Which frameworks and tracers does Convoy work with?

Convoy ingests standard OpenTelemetry, so it works with the tracer you already use. It supports the Vercel AI SDK, Mastra, OpenAI, Anthropic, and LangChain out of the box, with no Convoy SDK required. You bring your own tracer and point the OTLP exporter at Convoy's ingest endpoint.

What kinds of failures does Convoy catch?

Convoy catches the silent failures that pass type checks and unit tests: hallucinations against your own docs, intents mis-routed to the wrong workflow, skipped verification steps, and multi-agent failures, where one agent's output quietly breaks the next one downstream. These are confident, well-formatted answers that are subtly wrong: the failures you normally only find when a user complains.

What does a fix look like?

For each issue Convoy proposes a specific edit (exact lines added or removed in the prompt, tool schema, or workflow step that's causing the failure), backed by the cluster of failing traces it was derived from. You can open the suggested change in Cursor or Claude Code with one click and review it like any other code change before it ships.

Find out what your agent is getting wrong this week.

15-minute demo. We'll show you the silent failures your evals are missing — and walk through pricing for your team.