Building a Local-First Agent Framework in Rust (Part 19): Measuring Behavior With eval
See Part 0 for the latest table of contents and sample code. New chapters will be added over time.
Chapter 19: Measuring Behavior With eval
By Chapter 18, the model boundary has a clearer shape. The system prompt tells the model what envelope to emit. The parser tolerates the common markdown-fence wrapper. The recovery loop still gives the model another chance when the output is wrong.
That is useful, but it does not answer the next question.
Does it work reliably?
With normal software, we often want one deterministic answer: pass or fail. The function returns the expected value, or it does not. The test is green, or it is red.
A local model is different. The same prompt may work four times and fail once. It may call the right tool, but sometimes format the final answer differently. It may behave better with the Chapter 18 system prompt in place, but not perfectly. If we only look at one run, we can fool ourselves in both directions. A lucky pass can hide instability, and one unlucky failure can make a usable path look broken.
This chapter adds abcb eval.
It is not a replacement for unit tests. It is a characterization command. By characterization, I mean that it describes the behavior of the current system under the current model, rather than enforcing a fixed pass/fail contract. It runs a small set of fixtures against the configured model, repeats each fixture several times, and prints a scorecard.
The sample code for this chapter is in chapter19/abcb/.
19.1 A New Subcommand
The CLI gets one new command:
File: abcb/crates/abcb-cli/src/main.rs
/// How many times `abcb eval` runs each fixture by default. A nondeterministic
/// model needs repetition: the pass *rate* is the signal, not a single result.
const DEFAULT_EVAL_RUNS: usize = 5;
File: abcb/crates/abcb-cli/src/main.rs
#[derive(Debug, Subcommand)]
enum Command {
// ...
/// Run evaluation fixtures against the configured model and print a scorecard.
Eval {
/// How many times to run each fixture (the pass rate over repeated runs).
#[arg(long, default_value_t = DEFAULT_EVAL_RUNS)]
runs: usize,
},
}
The default is five runs per fixture. That number is not special. It is just enough to make the point: the command is looking for a rate, not a single answer.
If a fixture passes five out of five times, that means something different from one out of one. If it passes three out of five times, that is not "green," but it is also more informative than one vague failure. It tells us that the behavior exists, but the boundary is not stable enough yet.
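To make the 5/5 versus 1/1 point concrete, here is a minimal sketch. pass_rate is a hypothetical helper that is not part of the chapter's code. It only illustrates why the scorecard keeps the denominator visible instead of collapsing to a percentage.

```rust
// Hypothetical helper, not part of abcb: turns a passed/runs pair into a rate.
// Returns None when nothing ran (for example with `--runs 0`).
fn pass_rate(passed: usize, runs: usize) -> Option<f64> {
    if runs == 0 {
        None
    } else {
        Some(passed as f64 / runs as f64)
    }
}

fn main() {
    // Same rate, different amount of evidence: the denominator still matters.
    println!("{:?} from 1 run", pass_rate(1, 1));
    println!("{:?} from 5 runs", pass_rate(5, 5));
    // A partial pass is a signal in its own right, not just "red".
    println!("{:?} from 5 runs", pass_rate(3, 5));
}
```

A 1/1 and a 5/5 both reduce to the same fraction, which is why I prefer printing passed/runs rather than a single percentage.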
The command is wired into main in the usual way:
File: abcb/crates/abcb-cli/src/main.rs
match cli.command {
Command::Doctor => run_doctor().await?,
Command::Chat { message, mock, log } => run_chat(message, mock, log).await?,
Command::Replay { path } => run_replay(path)?,
Command::Run { message, mock } => run_run(message, mock).await?,
Command::Eval { runs } => run_eval_command(runs).await?,
}
This is the first command whose main job is not to do a task for the user. Its job is to measure the agent.
19.2 Report, Not Gate
It would be tempting to make eval behave like a test runner: return success if everything passes, return failure if anything fails.
For now, I do not want that.
eval is meant to report. It prints a scorecard and exits normally. The deterministic pieces inside it are still unit-tested, but the actual abcb eval command, running against the user's configured local model server, is not a CI gate.
That split matters. Unit tests are for code we expect to behave deterministically. Live model evaluation is for behavior that may vary across model versions, local runtimes, prompt changes, and even repeated runs. If eval immediately became a red/green gate, it would push us to pretend the model is more deterministic than it is.
So Chapter 19 keeps two layers separate.
- The judging logic is deterministic and testable.
- The live model scorecard is empirical and report-only.
That gives us useful data without turning the local model into a flaky build dependency.
19.3 Fixtures And Expectations
An eval case is a prompt plus an expectation:
File: abcb/crates/abcb-cli/src/main.rs
/// What a fixture expects of a run. Judged from the run's outcome + summary
/// (which already exist post-run), so a fixture is pure data.
enum Expectation {
/// The agent loop reached a final answer (didn't error or exhaust steps).
Completes,
/// The run invoked the named tool at least once.
CallsTool { name: String },
/// The final answer contains this substring.
FinalContains { text: String },
}
This is intentionally small. We are not building a general eval framework yet. We only need three questions:
- Did the loop complete?
- Did it call a particular tool?
- Did the final answer contain a particular piece of text?
Those three expectations are enough to probe the behavior we care about in this stage. Can the model stay inside the loop? Can it choose a tool? Can it use a tool and produce the expected answer?
The label method turns each expectation into text for the scorecard:
File: abcb/crates/abcb-cli/src/main.rs
impl Expectation {
/// A short label for the scorecard, e.g. `calls session_note_append`.
fn label(&self) -> String {
match self {
Expectation::Completes => "completes".to_string(),
Expectation::CallsTool { name } => format!("calls {name}"),
Expectation::FinalContains { text } => format!("final contains {text:?}"),
}
}
}
Then the case itself is just data:
File: abcb/crates/abcb-cli/src/main.rs
/// One evaluation case: a prompt and what a good run of it looks like.
struct EvalCase {
name: String,
prompt: String,
expectation: Expectation,
}
Notice what is not here. There is no provider. There is no file path. There is no event log writer. A fixture says what to ask and how to judge the run afterward. The runner decides how to execute it.
19.4 Built-In Fixtures, For Now
The fixtures are defined in code:
File: abcb/crates/abcb-cli/src/main.rs
fn default_fixtures() -> Vec<EvalCase> {
vec![
EvalCase {
name: "say-hi".to_string(),
prompt: "Say hi in one short sentence.".to_string(),
expectation: Expectation::Completes,
},
EvalCase {
name: "append-note".to_string(),
prompt: "Use a tool to remember this note: buy milk.".to_string(),
expectation: Expectation::CallsTool {
name: "session_note_append".to_string(),
},
},
EvalCase {
name: "add-numbers".to_string(),
prompt: "What is 3 plus 4? Use a tool, then give the number.".to_string(),
expectation: Expectation::FinalContains {
text: "7".to_string(),
},
},
]
}
This could have been a JSON or TOML file. I am not doing that yet.
Keeping fixtures in code has a few advantages at this stage. The Expectation enum type-checks the cases. Renaming an expectation variant breaks compilation instead of silently breaking a data file. The fixtures live next to the runner that understands them. The list is also still small enough that code is easier to read than a separate loader and schema.
The runner still accepts a slice of fixtures, not this exact Vec:
File: abcb/crates/abcb-cli/src/main.rs
async fn run_eval(
provider: &mut impl Provider,
registry: &ToolRegistry,
fixtures: &[EvalCase],
runs: usize,
) -> Vec<FixtureResult> {
// ...
}
That &[EvalCase] matters. default_fixtures() creates and owns a Vec<EvalCase>, but run_eval does not need ownership of the list. It only needs to read each case. A slice means "some borrowed list of cases." Today the list comes from default_fixtures(). Later, a file loader could return a Vec<EvalCase> and pass &fixtures to the same runner. The runner does not need to know where the cases came from.
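The slice idea can be shown in isolation. This is a small sketch under simplifying assumptions: Case is a hypothetical stand-in for EvalCase with a single field, and count_named is an illustrative function, not something from the chapter's code.

```rust
// Hypothetical stand-in for the chapter's EvalCase, reduced to one field.
struct Case {
    name: String,
}

// Borrowing a slice: the function reads the cases without taking ownership.
fn count_named(cases: &[Case], prefix: &str) -> usize {
    cases.iter().filter(|c| c.name.starts_with(prefix)).count()
}

// Plays the role of default_fixtures(): creates and owns the list.
fn built_in() -> Vec<Case> {
    vec![
        Case { name: "say-hi".to_string() },
        Case { name: "append-note".to_string() },
    ]
}

fn main() {
    // A borrowed Vec coerces to a slice.
    let owned = built_in();
    println!("{}", count_named(&owned, "say"));

    // A fixed-size array works with the same function.
    let fixed = [Case { name: "add-numbers".to_string() }];
    println!("{}", count_named(&fixed, "add"));

    // `owned` is still usable here because it was only borrowed.
    println!("{}", owned.len());
}
```

The callee never learns whether the cases came from a Vec, an array, or something loaded from disk, which is the property the runner relies on.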
19.5 Judging From The Outcome And Summary
The judge is pure. Given the same expectation, outcome, and summary, it always returns the same boolean:
File: abcb/crates/abcb-cli/src/main.rs
fn judge(
expectation: &Expectation,
outcome: &Result<String, LoopError>,
summary: &RunSummary,
) -> bool {
match expectation {
Expectation::Completes => outcome.is_ok(),
Expectation::CallsTool { name } => summary.tools_called.iter().any(|t| t == name),
Expectation::FinalContains { text } => {
outcome.as_ref().map(|a| a.contains(text)).unwrap_or(false)
}
}
}
This function does not call the model. It does not read files. It does not inspect the full event log directly. It receives two things that already exist after a run:
- outcome: did the loop return a final answer or an error?
- summary: the read-model from Chapter 17, projected from the event log.
That is a nice reuse. Chapter 17 built RunSummary so a human could understand what happened after abcb run. Chapter 19 uses the same summary to judge a fixture. We do not need a separate eval-only data structure when the run summary already captures the signals we care about.
Expectation::CallsTool is the clearest example. The loop does not need a new callback for eval. The event log already recorded tool results. RunSummary already collected tools_called. The judge only asks whether the expected tool name appears in that list.
19.5.1 Result::as_ref() In The Final-Text Judge
The final-text expectation has one compact Rust expression:
File: abcb/crates/abcb-cli/src/main.rs
Expectation::FinalContains { text } => {
outcome.as_ref().map(|a| a.contains(text)).unwrap_or(false)
}
outcome is borrowed as &Result<String, LoopError>. We do not want to move the String out of it. We only want to look at the successful answer, if there is one.
as_ref() turns:
&Result<String, LoopError>
into:
Result<&String, &LoopError>
Then map runs only on the Ok case. If the run completed, a.contains(text) checks the answer. If the run failed, the Err stays an error. Finally, unwrap_or(false) says that a failed run does not satisfy a final-answer text expectation.
This is a small pattern, but it is useful: borrow the inside of a Result, transform only the success case, and collapse it into a boolean.
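Here is the same pattern outside the judge, as a self-contained sketch. RunFailed is a hypothetical error type standing in for LoopError, and answer_contains is an illustrative name.

```rust
// Hypothetical error type standing in for the chapter's LoopError.
#[derive(Debug)]
struct RunFailed;

// Borrow the Result, look only at the Ok value, collapse to a bool.
fn answer_contains(outcome: &Result<String, RunFailed>, text: &str) -> bool {
    outcome
        .as_ref() // &Result<String, RunFailed> -> Result<&String, &RunFailed>
        .map(|answer| answer.contains(text)) // runs only for Ok
        .unwrap_or(false) // an Err never satisfies the check
}

fn main() {
    let ok: Result<String, RunFailed> = Ok("the answer is 7".to_string());
    let failed: Result<String, RunFailed> = Err(RunFailed);

    println!("{}", answer_contains(&ok, "7"));
    println!("{}", answer_contains(&failed, "7"));

    // The String was never moved out, so `ok` is still intact.
    println!("{ok:?}");
}
```

On recent toolchains, Result::is_ok_and expresses the same check in one call. I am keeping the map plus unwrap_or form here because it matches the chapter's code.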
19.6 One Trial Runs The Real Loop
An eval trial is not a fake shortcut. It drives the same agent loop:
File: abcb/crates/abcb-cli/src/main.rs
async fn eval_trial(
provider: &mut impl Provider,
registry: &ToolRegistry,
prompt: &str,
) -> (Result<String, LoopError>, RunSummary) {
let mut session = Session::start();
session.push_message(Message::new(Role::System, system_prompt(registry)));
session.push_message(Message::new(Role::User, prompt));
let mut events: Vec<u8> = Vec::new();
let outcome = run_loop(
provider,
registry,
&mut session,
DEFAULT_MAX_STEPS,
&mut events,
&mut AllowAll,
)
.await;
let logged = read_events(events.as_slice()).unwrap_or_default();
let summary = RunSummary::from_run(&logged, &outcome, "eval".to_string());
(outcome, summary)
}
This is the important part of the chapter.
The trial creates a session, injects the same system prompt from Chapter 18, adds the fixture prompt as the user message, and calls run_loop. The provider can be real or mock, but the loop itself is the same loop.
That means eval measures the actual path we care about: prompt, provider, parser, tools, event log, recovery, final answer.
The only difference is storage.
19.6.1 Why provider Is Passed Directly
One small Rust detail is easy to miss in the run_loop call. eval_trial receives provider as a mutable reference:
File: abcb/crates/abcb-cli/src/main.rs
provider: &mut impl Provider,
So inside this function, provider is already a mutable reference. That is why the call passes provider directly to run_loop:
File: abcb/crates/abcb-cli/src/main.rs
let outcome = run_loop(
provider,
registry,
&mut session,
DEFAULT_MAX_STEPS,
&mut events,
&mut AllowAll,
)
.await;
Writing &mut provider again would mean borrowing the local reference variable itself, which yields a reference to a reference (&mut &mut ...) instead of the plain mutable reference we already hold. Here, we only want to pass along the mutable access it already represents. Rust will reborrow that mutable reference for the duration of the call.
The other mutable arguments are different. session, events, and AllowAll are local values owned by eval_trial, so the call has to create mutable borrows of them:
File: abcb/crates/abcb-cli/src/main.rs
&mut session,
&mut events,
&mut AllowAll,
The &mut prefix means "lend this value mutably for the call." The callee can modify the value while it has the borrow, but ownership stays in eval_trial. After run_loop returns, eval_trial still owns session and events. In this function we only keep using events, because we read the in-memory log back into a summary.
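The difference between passing along an existing mutable reference and lending a local value can be sketched without any of the agent types. Counter, step, and trial are hypothetical names used only for this illustration.

```rust
// Hypothetical counter standing in for a provider that needs `&mut` access.
struct Counter {
    calls: usize,
}

// Innermost callee: needs mutable access for the duration of one call.
fn step(counter: &mut Counter) {
    counter.calls += 1;
}

// `counter` is already a `&mut Counter`, so it is passed along as-is.
// Rust reborrows it for each call, which is why it can be used twice.
fn trial(counter: &mut Counter) -> usize {
    step(counter);
    step(counter);

    // A local value is different: it has to be lent explicitly with `&mut`.
    let mut scratch = Counter { calls: 0 };
    step(&mut scratch);

    // Ownership of `scratch` never left this function.
    counter.calls + scratch.calls
}

fn main() {
    let mut counter = Counter { calls: 0 };
    println!("{}", trial(&mut counter));
    println!("{}", counter.calls);
}
```

The caller's counter keeps its updates after trial returns, while scratch is dropped at the end of the function. That is the same split as provider versus session and events in eval_trial.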
abcb run writes an event log under a session directory. eval_trial writes events into memory:
File: abcb/crates/abcb-cli/src/main.rs
let mut events: Vec<u8> = Vec::new();
This works because Chapter 14 made the event sink generic over Write. A file implements Write. A Vec<u8> also implements Write. The loop does not care which one it receives.
That earlier design choice pays off here. Eval trials are throwaway. They should not create session directories just to produce a temporary scorecard. But they still need an event stream so RunSummary can do its job. An in-memory Vec<u8> gives us both.
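A minimal sketch of that idea, assuming nothing about the real event format: log_events is a hypothetical function that is generic over Write, and the JSON lines it emits are made up for the example.

```rust
use std::io::{self, Write};

// Hypothetical sink function: writes one JSON line per event name.
// It only requires `Write`, so it does not care where the bytes end up.
fn log_events<W: Write>(sink: &mut W, names: &[&str]) -> io::Result<()> {
    for name in names {
        writeln!(sink, "{{\"event\":\"{name}\"}}")?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // In-memory sink: nothing touches the filesystem.
    let mut memory: Vec<u8> = Vec::new();
    log_events(&mut memory, &["start", "final"])?;
    print!("{}", String::from_utf8_lossy(&memory));

    // The same function works with any other writer, such as stdout.
    let mut out = io::stdout();
    log_events(&mut out, &["start"])?;
    Ok(())
}
```

Swapping the Vec<u8> for a std::fs::File would not change log_events at all, which is the property eval_trial depends on.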
19.6.2 Why events.as_slice() Works
After the loop finishes, the in-memory event bytes are read back:
File: abcb/crates/abcb-cli/src/main.rs
let logged = read_events(events.as_slice()).unwrap_or_default();
events is a Vec<u8>. read_events expects something that implements BufRead:
File: abcb/crates/abcb-core/src/lib.rs
pub fn read_events(reader: impl BufRead) -> Result<Vec<LoggedEvent>, EventLogError> {
// ...
}
A byte slice can act like a small in-memory reader. events.as_slice() gives a &[u8], and &[u8] implements the reading traits needed here.
Rust note: why not pass Vec<u8> directly? Vec<u8> is an owned growable buffer. It stores bytes, but it does not by itself represent "where we are" while reading through those bytes. A reader needs that position. A byte slice reader, &[u8], can advance by shortening the slice as bytes are consumed, so the standard library implements reading traits for it. If we wanted to read from an owned vector while keeping an explicit cursor, we could also use std::io::Cursor<Vec<u8>>. Here, events.as_slice() is the simplest choice because we only need to read the bytes once after writing them.
The last part, unwrap_or_default(), means "if reading succeeds, use the events; if it fails, use the default value." For Vec<LoggedEvent>, the default is an empty vector. This is a lenient choice, but a narrow one. The loop just wrote these bytes into memory, so a parse failure should be extremely unlikely. If it somehow happens, eval can still judge the trial from the outcome, with an empty summary.
This is another example of designing around standard library traits instead of concrete file types. Once the event reader accepts a generic reader, tests, replay, run summaries, and eval can all reuse it with different storage.
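The reading side can be sketched the same way. read_lines below is a hypothetical, much simpler stand-in for read_events: it accepts any BufRead and returns the non-empty lines. The example shows a byte slice and a Cursor feeding the same function, plus the lenient unwrap_or_default() fallback.

```rust
use std::io::{self, BufRead, Cursor};

// Hypothetical reader: collects non-empty lines from any buffered reader.
fn read_lines(reader: impl BufRead) -> io::Result<Vec<String>> {
    let mut lines = Vec::new();
    for line in reader.lines() {
        let line = line?; // fails on I/O errors or invalid UTF-8
        if !line.is_empty() {
            lines.push(line);
        }
    }
    Ok(lines)
}

fn main() {
    let events: Vec<u8> = b"start\nfinal\n".to_vec();

    // `&[u8]` implements BufRead, so a borrowed slice can be read directly.
    let from_slice = read_lines(events.as_slice()).unwrap_or_default();
    println!("{from_slice:?}");

    // Cursor wraps an owned buffer and tracks the read position itself.
    let from_cursor = read_lines(Cursor::new(events)).unwrap_or_default();
    println!("{from_cursor:?}");

    // Invalid UTF-8 makes reading fail; the fallback is an empty Vec.
    let bad: &[u8] = &[0xff, 0xfe, b'\n'];
    let broken = read_lines(bad).unwrap_or_default();
    println!("{}", broken.len());
}
```

The trade-off of unwrap_or_default() is visible in the last case: the error is swallowed, so it only makes sense where a failure is both unlikely and harmless, as in the throwaway eval trial.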
19.7 Running Each Fixture Several Times
The eval runner repeats each case:
File: abcb/crates/abcb-cli/src/main.rs
/// The result of running one fixture `runs` times.
struct FixtureResult {
name: String,
label: String,
passed: usize,
runs: usize,
}
File: abcb/crates/abcb-cli/src/main.rs
async fn run_eval(
provider: &mut impl Provider,
registry: &ToolRegistry,
fixtures: &[EvalCase],
runs: usize,
) -> Vec<FixtureResult> {
let mut results = Vec::new();
for case in fixtures {
let mut passed = 0;
for _ in 0..runs {
let (outcome, summary) = eval_trial(provider, registry, &case.prompt).await;
if judge(&case.expectation, &outcome, &summary) {
passed += 1;
}
}
results.push(FixtureResult {
name: case.name.clone(),
label: case.expectation.label(),
passed,
runs,
});
}
results
}
The aggregation is intentionally plain. For each case, start passed at zero. Run the trial runs times. Judge each result. Increment the counter when the expectation is met. Store the final passed/runs pair.
This is not statistically sophisticated. It is not meant to be. At this stage, I want something I can run locally and understand immediately.
If append-note is 5/5, that gives me confidence that the model can choose the note tool from the prompt and description alone. If it is 1/5, I should not guess. I should inspect the failures and decide whether the tool description, the system prompt, or the parser boundary needs to improve.
19.8 Generic Over Provider
The signature of run_eval is the same kind of seam we introduced much earlier:
File: abcb/crates/abcb-cli/src/main.rs
async fn run_eval(
provider: &mut impl Provider,
registry: &ToolRegistry,
fixtures: &[EvalCase],
runs: usize,
) -> Vec<FixtureResult> {
// ...
}
This is one of the reasons Chapter 4 introduced Provider as a trait. At that time, the abstraction may have looked like preparation for multiple model backends. Here it gives us something more immediate: the same eval runner can run in two different contexts. The real command passes an OpenAiCompatProvider. The tests pass a MockProvider. The runner does not know which one it received. It only knows the provider can complete a session.
That makes a nondeterministic feature testable. The live model is not deterministic, but the eval machinery is. We can test the judge, the repetition logic, the tallying, and the partial-pass case with scripted mock responses.
The provider abstraction is not just for swapping backends. It is also how we draw the line between deterministic infrastructure and probabilistic behavior.
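A stripped-down sketch of that seam follows. It is deliberately not the chapter's code: the Provider trait here is a hypothetical synchronous simplification with a single method, Scripted is an illustrative mock, and tally stands in for the repetition logic. The real trait is async and works on a session.

```rust
use std::collections::VecDeque;

// Hypothetical, synchronous stand-in for the chapter's async Provider trait.
trait Provider {
    fn complete(&mut self, prompt: &str) -> Result<String, String>;
}

// Scripted provider: returns queued replies in order, then errors when empty.
struct Scripted {
    replies: VecDeque<String>,
}

impl Scripted {
    fn new(replies: &[&str]) -> Self {
        Self {
            replies: replies.iter().map(|r| r.to_string()).collect(),
        }
    }
}

impl Provider for Scripted {
    fn complete(&mut self, _prompt: &str) -> Result<String, String> {
        self.replies
            .pop_front()
            .ok_or_else(|| "script exhausted".to_string())
    }
}

// Runs the prompt `runs` times and counts how many calls succeeded.
// A failed call does not stop the loop; it just does not add to the count.
fn tally(provider: &mut impl Provider, prompt: &str, runs: usize) -> usize {
    let mut passed = 0;
    for _ in 0..runs {
        if provider.complete(prompt).is_ok() {
            passed += 1;
        }
    }
    passed
}

fn main() {
    // Two scripted replies, three runs: the third call hits an empty script.
    let mut provider = Scripted::new(&["a", "b"]);
    println!("{}/3", tally(&mut provider, "p", 3));
}
```

Because tally only sees the trait, the deterministic script is enough to pin down the counting behavior, including the partial-pass case, without a model server.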
19.9 Isolating Tool Side Effects
run_eval_command builds the real provider and the default registry:
File: abcb/crates/abcb-cli/src/main.rs
async fn run_eval_command(runs: usize) -> Result<(), Box<dyn Error>> {
let config = load_required_config()?;
let mut provider = build_provider(&config)?;
let notes = std::env::temp_dir().join("abcb-eval-notes.jsonl");
let registry = default_registry(notes);
let fixtures = default_fixtures();
let results = run_eval(&mut provider, &registry, &fixtures, runs).await;
println!("abcb eval: {runs} runs each");
// ...
Ok(())
}
The notes path is important:
File: abcb/crates/abcb-cli/src/main.rs
let notes = std::env::temp_dir().join("abcb-eval-notes.jsonl");
let registry = default_registry(notes);
The eval fixtures may call tools. One fixture specifically asks the model to remember a note. I do not want that probe to write into the project's real .abcb/notes.jsonl file.
So eval points the file-backed tools at a temp-directory path instead. The agent still uses the real tools, but their side effects are isolated from the project.
The filename is fixed, so this temporary notes file may be reused across eval runs and it is not cleaned up by the command. That is acceptable for this version because the scorecard does not read note contents from that file. The file only exists so session_note_append and session_note_search have somewhere isolated to write and read if the model calls them. Eval judges whether a tool was called from the in-memory event summary, not from the note file itself. The important boundary is that eval does not write into the project's real notes.
This is a diagnostic command. It should observe behavior, not leave surprising project memory behind.
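The isolation itself needs very little machinery. The sketch below assumes a note tool that appends one line per note; eval_notes_path and append_note are hypothetical helpers for illustration, not the real tool implementation.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::{Path, PathBuf};

// Hypothetical helper: where an eval run keeps its throwaway notes.
// The path is under the OS temp directory, not under the project.
fn eval_notes_path() -> PathBuf {
    std::env::temp_dir().join("abcb-eval-notes.jsonl")
}

// Hypothetical append-only writer in the spirit of a note tool.
// It creates the file on first use and never truncates it.
fn append_note(path: &Path, note: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{note}")
}

fn main() -> std::io::Result<()> {
    let notes = eval_notes_path();
    append_note(&notes, "buy milk")?;
    println!("eval notes live at {}", notes.display());
    Ok(())
}
```

Because the tool only receives a path, redirecting it is a one-line decision at the call site. If the fixed filename ever became a problem, a per-run name would be a small change in the same place.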
19.10 The Scorecard
The command prints a small scorecard:
File: abcb/crates/abcb-cli/src/main.rs
println!("abcb eval: {runs} runs each");
let mut total_passed = 0;
let mut total = 0;
for result in &results {
println!(
" {:<14} {}/{} {}",
result.name, result.passed, result.runs, result.label
);
total_passed += result.passed;
total += result.runs;
}
println!("overall: {total_passed}/{total}");
The output is intentionally boring. A typical run looks like this in shape:
abcb eval: 5 runs each
 say-hi         5/5 completes
 append-note    4/5 calls session_note_append
 add-numbers    5/5 final contains "7"
overall: 14/15
The number is the point. If we change the system prompt, tool descriptions, or parser tolerance, we can run the same fixtures again and compare behavior.
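The row formatting and the overall tally can be pulled into pure functions, which also makes them testable. This is a sketch under assumptions: Row is a hypothetical borrowed mirror of FixtureResult, and format_row and overall are illustrative names. The format string is the one from the listing above.

```rust
// Hypothetical borrowed mirror of the chapter's FixtureResult.
struct Row<'a> {
    name: &'a str,
    label: &'a str,
    passed: usize,
    runs: usize,
}

// `{:<14}` left-aligns the name and pads it to 14 columns,
// so the passed/runs column lines up across rows.
fn format_row(row: &Row) -> String {
    format!(" {:<14} {}/{} {}", row.name, row.passed, row.runs, row.label)
}

// Sums passed and runs across all rows for the "overall" line.
fn overall(rows: &[Row]) -> (usize, usize) {
    rows.iter()
        .fold((0, 0), |(p, r), row| (p + row.passed, r + row.runs))
}

fn main() {
    let rows = [
        Row { name: "say-hi", label: "completes", passed: 5, runs: 5 },
        Row { name: "append-note", label: "calls session_note_append", passed: 4, runs: 5 },
    ];
    for row in &rows {
        println!("{}", format_row(row));
    }
    let (passed, total) = overall(&rows);
    println!("overall: {passed}/{total}");
}
```

Names longer than 14 characters are not truncated by this format. They simply push the rest of the row to the right, which is fine for a handful of short fixture names.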
This is also where a practical design question becomes answerable. Do we need a formal per-tool argument schema in the prompt, or is the tool name plus description enough for the local model we are using? We do not have to argue only from intuition. We can add a fixture, run it repeatedly, and look at the rate.
19.11 Testing The Judge
The pure judge function is easy to test:
File: abcb/crates/abcb-cli/src/main.rs
#[test]
fn judge_calls_tool_reads_the_summary() {
let expect = Expectation::CallsTool {
name: "session_note_append".into(),
};
assert!(judge(
&expect,
&Ok("ok".into()),
&summary_with(vec!["session_note_append".into()])
));
assert!(!judge(&expect, &Ok("ok".into()), &summary_with(vec![])));
}
There is no model here. The test constructs a summary, calls judge, and checks the boolean.
The final-answer expectation also makes the failure case explicit:
File: abcb/crates/abcb-cli/src/main.rs
#[test]
fn judge_final_contains_reads_the_answer() {
let expect = Expectation::FinalContains { text: "7".into() };
let s = summary_with(vec![]);
assert!(judge(&expect, &Ok("the answer is 7".into()), &s));
assert!(!judge(&expect, &Ok("the answer is six".into()), &s));
assert!(!judge(
&expect,
&Err(LoopError::MaxStepsExceeded { max_steps: 5 }),
&s
));
}
A failed run cannot satisfy a final-answer expectation. That is not a model judgment. It is just the rule encoded in judge.
19.12 Testing The Repetition Logic
The runner itself is tested with MockProvider:
File: abcb/crates/abcb-cli/src/main.rs
#[tokio::test]
async fn run_eval_tallies_the_pass_rate() {
let fixtures = vec![EvalCase {
name: "f".into(),
prompt: "p".into(),
expectation: Expectation::Completes,
}];
let mut provider = MockProvider::new([
r#"{"kind":"final","content":"a"}"#,
r#"{"kind":"final","content":"b"}"#,
]);
let registry = ToolRegistry::new();
let results = run_eval(&mut provider, &registry, &fixtures, 2).await;
assert_eq!(results.len(), 1);
assert_eq!(results[0].passed, 2);
assert_eq!(results[0].runs, 2);
}
Two scripted final envelopes produce two completed runs, so the fixture passes twice.
The partial-pass test is more interesting:
File: abcb/crates/abcb-cli/src/main.rs
#[tokio::test]
async fn run_eval_counts_failures_in_the_rate() {
let fixtures = vec![EvalCase {
name: "f".into(),
prompt: "p".into(),
expectation: Expectation::Completes,
}];
let mut provider = MockProvider::new([r#"{"kind":"final","content":"a"}"#]);
let registry = ToolRegistry::new();
let results = run_eval(&mut provider, &registry, &fixtures, 2).await;
assert_eq!(results[0].passed, 1);
assert_eq!(results[0].runs, 2);
}
The mock provider has only one response, but the fixture runs twice. The first trial completes. The second trial asks the mock provider for another response, and the provider is exhausted. That failed trial still counts in the denominator but not in the numerator, so the result is 1/2.
This is exactly the behavior we want from eval. It does not stop at the first failure. It records the rate.
19.13 Why This Is A CLI Command, Not #[ignore]
Rust has ignored tests. We could write a test that talks to the live model and mark it #[ignore].
I prefer a subcommand here.
An ignored test still looks like part of the test suite. It invites a red/green mindset. It also mixes local machine setup, model availability, and test semantics in a place where I want deterministic tests to stay deterministic.
abcb eval is more honest. It says: this is a tool for measuring the behavior of your configured model. It needs your local server. It may vary. It prints a scorecard.
The deterministic parts still live under cargo test. The live model probe lives under abcb eval.
That boundary keeps the project calmer.
Evaluation note: how small is this eval? abcb eval is a local characterization tool, not a general benchmark. Larger agent evaluations, such as GAIA, usually define a curated task set, hide or control expected answers, require capabilities like tool use, browsing, multi-step reasoning, or multimodal handling, and report aggregate scores across many tasks. Coding-agent benchmarks use a similar idea in a different domain: give the agent a task, run it in a controlled environment, and score the result with a verifier.
Our eval keeps only the smallest useful version of that shape. It has fixtures, repeated runs, expectations, and a scorecard. But it does not have a held-out dataset, human-validated tasks, rich environment simulation, cost tracking, latency tracking, or a public leaderboard. That gap is intentional. At this point, I do not need a benchmark for comparing all agents. I need a local instrument for asking whether this framework, this prompt, this tool registry, and this model are moving in the right direction.
19.14 What Changed
Chapter 19 adds abcb eval, a report-only command for measuring agent behavior across repeated runs.
The framework lesson is that model behavior should be measured as behavior, not assumed from a single run. A pass rate is more honest than a lucky pass.
The Rust lesson is that the earlier seams keep paying off. Provider lets the same runner work with a real model and a mock. Write lets the same loop write to a file or an in-memory Vec<u8>. &[EvalCase] lets the runner accept built-in fixtures today and a loaded fixture list later.
The design lesson is side-effect isolation. Eval uses the real loop and real tools, but it keeps event logs in memory and points note tools at a temp notes file. A diagnostic command should be able to probe behavior without quietly changing the project.
The next chapter moves from pedagogical tools toward more useful real-world tools. Once we can measure whether the agent follows the loop, we can give it more consequential things to do.
To be continued...