Prompt Self-Evolution
Prompt self-evolution is a technique for systematically improving prompt text through automated evaluation, mutation, and selection — inspired by evolutionary algorithms. Instead of hand-tuning prompts through trial and error, this approach treats prompt optimization as a measurable search problem.
This document explains the principles behind the GEPA (Genetic-Pareto Prompt Evolution) algorithm as implemented in hermes-agent-self-evolution, and how RingClaw plans to apply these ideas to improve its prompts. See Plan 006 for the implementation roadmap.
Why Prompts Need Evolution
Prompts are the most impactful yet least rigorously tested component of an LLM-powered system. A single word change can shift classification boundaries, add or remove failure modes, and change output quality. Yet most teams iterate on prompts through:
- Manual tuning — change text, test a few examples, deploy
- Reactive fixes — wait for user bug reports, patch the prompt
- No regression testing — no way to know if a fix broke something else
Prompt self-evolution replaces this with a data-driven loop: evaluate each candidate prompt against test cases, mutate it based on observed failures, and keep only the variants that measurably improve.
The GEPA Algorithm
GEPA (Genetic-Pareto Prompt Evolution) combines ideas from genetic algorithms with LLM-powered mutation. Published at ICLR 2026, it operates entirely via API calls — no GPU training required.
Core Loop
Key steps:
- Population initialization — start with the baseline prompt plus N random perturbations
- Evaluation — run each variant against a dataset of test cases through the actual agent
- Scoring — LLM-as-judge rates each output on multiple dimensions
- Trace reflection — GEPA reads why things failed (the execution trace), not just that they failed
- Targeted mutation — propose changes that specifically address observed failures
- Constraint gates — reject variants that violate size, growth, or test suite constraints
- Pareto selection — keep variants on the Pareto frontier (no single variant dominates another on all dimensions)
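The Pareto selection step can be sketched in Go. This is a minimal sketch: the `Variant` type and the per-dimension score layout are illustrative assumptions, not GEPA's actual data model.

```go
package main

import "fmt"

// Variant holds one prompt candidate's scores on the judge's
// dimensions (e.g. correctness, procedure, conciseness).
type Variant struct {
	Name   string
	Scores []float64
}

// dominates reports whether a is at least as good as b on every
// dimension and strictly better on at least one.
func dominates(a, b Variant) bool {
	strictly := false
	for i := range a.Scores {
		if a.Scores[i] < b.Scores[i] {
			return false
		}
		if a.Scores[i] > b.Scores[i] {
			strictly = true
		}
	}
	return strictly
}

// paretoFrontier keeps every variant that no other variant dominates,
// so candidates with different strength profiles all survive.
func paretoFrontier(pop []Variant) []Variant {
	var front []Variant
	for _, v := range pop {
		dominated := false
		for _, other := range pop {
			if dominates(other, v) {
				dominated = true
				break
			}
		}
		if !dominated {
			front = append(front, v)
		}
	}
	return front
}

func main() {
	pop := []Variant{
		{"baseline", []float64{0.82, 0.90, 0.70}},
		{"mutant-1", []float64{0.88, 0.85, 0.75}}, // trades procedure for correctness
		{"mutant-2", []float64{0.80, 0.85, 0.65}}, // dominated by baseline
	}
	for _, v := range paretoFrontier(pop) {
		fmt.Println(v.Name)
	}
}
```

Keeping the whole frontier, rather than a single top scorer, is what lets the search explore trade-offs such as a more correct but less concise prompt.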
What Makes GEPA Different
Traditional prompt optimization (e.g. DSPy's MIPRO) mutates prompts blindly — random insertions, deletions, rewordings. GEPA's key innovation is reflective mutation: it reads the execution traces of failed test cases and proposes targeted fixes.
| Approach | Mutation Strategy | Cost per Iteration |
|---|---|---|
| Random rewrite | Replace random sections | Low |
| MIPRO | Generate instruction candidates | Medium |
| GEPA | Read failure traces → propose targeted fix | Medium |
Example: if the intent classifier misclassifies "总结 John 的代码" ("summarize John's code") as summarize instead of chat, GEPA would read the execution trace, identify that the prompt doesn't distinguish "summarize code" from "summarize chat messages", and propose adding a clarification rule.
LLM-as-Judge Scoring
Each agent output is scored on three dimensions by a separate LLM acting as a judge:
Dimensions
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 0.5 | Did the agent produce the correct output for this task? |
| Procedure following | 0.3 | Did the agent follow the prompt's instructions and workflow? |
| Conciseness | 0.2 | Was the response appropriately concise without missing key info? |
Composite Score Formula
```
raw_score   = 0.5 × correctness + 0.3 × procedure + 0.2 × conciseness
final_score = max(0, raw_score - length_penalty)
```

The length penalty discourages prompt bloat:
- 0% penalty at ≤90% of the size limit
- Linear ramp from 0% to 30% penalty between 90% and 100% of the limit
- Formula: `penalty = min(0.3, (ratio - 0.9) × 3.0)` where `ratio = size / limit`
Judge Feedback
Beyond numeric scores, the judge produces textual feedback — specific, actionable suggestions for improvement. This feedback is fed into GEPA's reflective mutation step, creating a feedback loop where the optimizer can read why a score was low and propose targeted fixes.
Execution Trace Reflection
The most important insight from GEPA: scores alone are not enough. Knowing that a test case scored 0.3 on correctness tells you that it failed, but not why. The execution trace reveals the failure mechanism.
What's in a Trace
A trace captures the full interaction:
```
Input: "总结 John 的代码"
Prompt used: [full IntentPrompt text]
Agent output: "summarize"
Expected: "chat"
Judge feedback: "The prompt says to classify as 'summarize' if the user
wants to summarize CHAT HISTORY, but the message asks about code,
not chat messages. The prompt lacks a rule for distinguishing
'summarize code' from 'summarize chat messages'."
```

How Reflection Works
- Collect all failing test cases + their traces from the current generation
- Group failures by category (e.g. "boundary cases", "compound intents")
- Feed grouped failures to the mutation LLM with the prompt: "Here is the current prompt, here are the failures grouped by type. For each failure group, propose a specific change to the prompt text that would fix it."
- The mutation LLM proposes changes grounded in real failure data, not random perturbation
This is fundamentally different from blind mutation — each proposed change has a reason backed by execution evidence.
Constraint Gates
Every evolved variant must pass hard constraints before being considered. A single constraint failure means immediate rejection.
Constraint Types
| Constraint | Limit | Rationale |
|---|---|---|
| Size limit | ≤ 15,000 chars (skill/prompt) | Prevents unbounded growth that hits context limits |
| Growth limit | ≤ 20% over baseline | Prevents runaway prompt expansion per iteration |
| Non-empty | > 0 chars | Ensures the optimizer didn't produce empty output |
| Test suite | 100% pass | Ensures the evolved prompt doesn't break existing behavior |
| Structural integrity | Valid format | Ensures required sections/fields are preserved |
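The gate logic in the table can be sketched as one check; the helper booleans stand in for the real test-suite and structural checks, and the error strings are illustrative.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// checkConstraints applies the hard gates in order; the first
// violated gate rejects the variant outright.
func checkConstraints(variant, baseline string, testSuitePasses, structurallyValid bool) error {
	const sizeLimit = 15000 // chars, per the table above
	n := utf8.RuneCountInString(variant)
	switch {
	case n == 0:
		return fmt.Errorf("empty variant")
	case n > sizeLimit:
		return fmt.Errorf("size %d exceeds limit %d", n, sizeLimit)
	case float64(n) > 1.2*float64(utf8.RuneCountInString(baseline)):
		return fmt.Errorf("grew more than 20%% over baseline")
	case !testSuitePasses:
		return fmt.Errorf("test suite failed")
	case !structurallyValid:
		return fmt.Errorf("structural check failed")
	}
	return nil
}

func main() {
	base := "classify the user's intent"
	// A small addition stays within the 20% growth limit.
	if err := checkConstraints(base+" now", base, true, true); err != nil {
		fmt.Println("rejected:", err)
	} else {
		fmt.Println("accepted")
	}
}
```

Counting runes rather than bytes matters here because the datasets contain multi-byte Chinese text.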
Why Constraints Matter
Without constraints, prompt optimizers tend to:
- Grow without bound — adding more and more examples and rules
- Overfit to eval data — adding hyper-specific rules that break generalization
- Destroy structure — rewriting sections in ways that break parsing
Constraints act as evolutionary pressure against these failure modes.
Evaluation Data Sources
The quality of evolution depends directly on the quality of evaluation data. There are three sources, each with different tradeoffs:
1. Golden Datasets (Hand-Curated)
Manually written test cases with known-correct expected behaviors.
```json
{
  "task_input": "总结 John 的代码",
  "expected_behavior": "chat",
  "difficulty": "hard",
  "category": "boundary"
}
```

| Pros | Cons |
|---|---|
| Highest quality | Labor-intensive to create |
| Covers known edge cases | Limited scale (typically 15-30 cases) |
| Traceable to real bugs (via source PRs) | May not cover unknown failure modes |
2. Synthetic Generation (LLM-Generated)
A strong LLM reads the prompt text and generates diverse test cases automatically.
| Pros | Cons |
|---|---|
| Scalable — generate hundreds of cases | May not reflect real usage patterns |
| Covers diverse scenarios | Can be too easy or too hard |
| Cheap and fast | Quality depends on generator model |
3. Session History Mining (Real Usage)
Extract real user messages from session logs and score them for relevance.
| Pros | Cons |
|---|---|
| Reflects actual usage patterns | Requires stored conversation history |
| Discovers unknown failure modes | Privacy concerns with user data |
| Most realistic evaluation | Cold-start problem for new systems |
Recommended Strategy
Start with golden datasets derived from historical bug-fix PRs (highest signal-to-noise ratio), then supplement with synthetic data for coverage breadth. Session mining is the strongest source but requires conversation logging infrastructure.
Application to RingClaw
RingClaw has 5 centralized prompts in `messaging/prompts.go`, each controlling a different aspect of the bot's behavior:
| Prompt | Purpose | Evolution Priority |
|---|---|---|
| ActionPrompt | Instruct agent to generate ACTION blocks | HIGH — most complex, most failure modes |
| IntentPrompt | Classify user intent (summarize/task/note/event/chat) | HIGH — boundary cases between summarize and chat |
| NameExtractPrompt | Extract person name from message | MEDIUM — time word pollution, compound sentences |
| SummaryPrompt | Generate chat message summaries | LOW — works well |
| HeartbeatPrompt | Scheduled health check responses | LOW — works well |
RingClaw Evolution Pipeline
Current Golden Datasets
RingClaw has 77 hand-curated test cases derived from historical bug-fix PRs:
| Dataset | Cases | Key Sources |
|---|---|---|
| `datasets/prompts/intent/golden.jsonl` | 34 | PR #34 (boundary), PR #62 (time words), PR #91 (absolute dates) |
| `datasets/prompts/name_extract/golden.jsonl` | 23 | PR #40 (compound), PR #62 (time word pollution), PR #91 (absolute dates) |
| `datasets/prompts/action/golden.jsonl` | 20 | PR #40 (pronouns), PR #68 (person ID misuse) |
Each test case is tagged with a `source_pr` for traceability and a `note` explaining why the case is challenging.
Baseline Results
Evaluated with `deepseek-chat` via Cloudflare AI Gateway:
| Prompt | Score | Scoring Method | Key Failures |
|---|---|---|---|
| IntentPrompt | 30/34 (88.2%) | Exact match | "总结代码/文档/PR" ("summarize code/docs/PR") misclassified as summarize |
| NameExtractPrompt | 21/23 (91.3%) | Exact match | Compound instructions ("总结 maxwell 并创建任务", "summarize maxwell and create a task") |
| ActionPrompt | 13/20 (65.0%) | LLM-as-judge | Pronoun handling, multi-action, card generation |
Automated Mutation Results
Using `--evolve`, the IntentPrompt improved from 88.2% → 91.2% in a single round. The mutation added one clarifying sentence: "The summarize intent applies ONLY to summarizing chat history or messages" — a targeted fix for the boundary cases.
CI Integration
The eval pipeline is integrated into CI with three mechanisms:
- Automatic eval on change — CI runs eval when `messaging/prompts.go`, `datasets/prompts/`, or `scripts/eval_prompt.go` change
- Job Summary + PR Comment — eval results posted to the GitHub Actions Summary and as a PR comment
- Manual evolve workflow — `evolve.yml` (workflow_dispatch) runs `--evolve` on all prompts and creates a PR if improvement ≥ 3%
Usage
```sh
# Evaluate all prompts
go run scripts/eval_prompt.go --prompt all

# Evaluate with markdown report
go run scripts/eval_prompt.go --prompt intent --markdown report.md

# Compare an alternative prompt
go run scripts/eval_prompt.go --prompt intent --compare path/to/new_prompt.md

# Evolve prompts (5 rounds of mutation)
go run scripts/eval_prompt.go --prompt all --evolve 5

# Evolve with minimum improvement threshold
go run scripts/eval_prompt.go --prompt intent --evolve 10 --min-improvement 3
```

Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `LLM_API_KEY` | Yes | — | API key for any OpenAI-compatible provider |
| `LLM_MODEL` | No | `deepseek-chat` | Model name |
| `LLM_BASE_URL` | No | `https://api.deepseek.com` | API endpoint (supports Cloudflare AI Gateway) |
Further Reading
- GEPA paper — Genetic-Pareto Prompt Evolution (ICLR 2026 Oral)
- DSPy — Programming framework for LLM pipelines
- hermes-agent-self-evolution — Reference implementation
- Plan 006 — RingClaw implementation roadmap