We built our own LLM eval harness in 200 lines of TS · here is the file
Frameworks are great until they get in the way. Here is a 200-line TS eval harness that runs in CI, blocks regressions and prints a diff.
We have a longer post about evals-as-code in CI. People read it and asked: OK, but what file do you actually paste in? Fair. Here is the harness we use as the starting point on every new RAG / agent project: about 200 lines of TypeScript, no dependencies beyond `zod` and the `openai` client.

It is intentionally small. promptfoo and Braintrust are great, but on a fresh project you do not need them yet. You need a gold set in version control, three judge functions, and a CI gate. Add a framework later, when you have hundreds of cases and 5+ engineers running evals.
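
For a sense of shape, `evals/gold.jsonl` holds one JSON object per line, validated by the `Case` schema in the file below. These rows are made-up examples, not our gold set:

```jsonl
{"id":"faq-refund-01","prompt":"What is the refund window?","expected":"30 days","judge":"exact","tags":["faq"]}
{"id":"sum-report-02","prompt":"Summarize the attached incident report in two sentences.","expected":"Two sentences covering the root cause and the fix.","judge":"similarity","threshold":0.85,"tags":["summarization"]}
{"id":"agent-cal-03","prompt":"Book a 30-minute sync with Anna next Tuesday.","expected":"Proposes a Tuesday slot and calls the calendar tool once.","judge":"llm","tags":["agent"]}
```

`judge` picks one of the three scoring paths below; `threshold` overrides the default 0.7 pass bar per case.
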
```ts
import { readFileSync, writeFileSync } from "node:fs";
import { z } from "zod";
import OpenAI from "openai";

const openai = new OpenAI();

// One gold case per line of evals/gold.jsonl.
const Case = z.object({
  id: z.string(),
  prompt: z.string(),
  expected: z.string(),
  judge: z.enum(["exact", "similarity", "llm"]),
  threshold: z.number().min(0).max(1).optional(), // per-case pass bar; defaults to 0.7 below
  tags: z.array(z.string()).default([]),
});
type Case = z.infer<typeof Case>;

type Result = {
  id: string;
  prompt: string;
  expected: string;
  actual: string;
  judge: Case["judge"];
  score: number;
  pass: boolean;
  tags: string[];
};

function loadGold(): Case[] {
  return readFileSync("evals/gold.jsonl", "utf8")
    .split("\n")
    .filter(Boolean)
    .map((l, i) => {
      const parsed = Case.safeParse(JSON.parse(l));
      if (!parsed.success) throw new Error(`bad case at line ${i + 1}: ${parsed.error.message}`);
      return parsed.data;
    });
}

async function callModel(prompt: string): Promise<string> {
  const r = await openai.chat.completions.create({
    model: process.env.MODEL ?? "gpt-4.1-mini",
    messages: [{ role: "user", content: prompt }],
    temperature: 0, // deterministic-ish output keeps runs comparable
  });
  return r.choices[0]?.message?.content ?? "";
}

async function embed(text: string): Promise<number[]> {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return r.data[0]!.embedding;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!;
    na += a[i]! ** 2;
    nb += b[i]! ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function judgeLLM(expected: string, actual: string): Promise<number> {
  const r = await openai.chat.completions.create({
    model: "gpt-4.1-mini",
    temperature: 0,
    messages: [{
      role: "system",
      content: "Score 0..1 how well ACTUAL satisfies EXPECTED. Reply only the number.",
    }, {
      role: "user",
      content: `EXPECTED:\n${expected}\n\nACTUAL:\n${actual}`,
    }],
  });
  const n = Number((r.choices[0]?.message?.content ?? "0").trim());
  return Number.isFinite(n) ? Math.max(0, Math.min(1, n)) : 0; // clamp; non-numeric replies score 0
}

async function score(c: Case, actual: string): Promise<number> {
  if (c.judge === "exact") return c.expected.trim() === actual.trim() ? 1 : 0;
  if (c.judge === "similarity") {
    const [a, b] = await Promise.all([embed(c.expected), embed(actual)]);
    return cosine(a, b);
  }
  return judgeLLM(c.expected, actual);
}

// Mean score per tag, plus an "all" rollup across every case.
function aggregate(results: Result[]): Record<string, number> {
  const tags = new Set(results.flatMap(r => r.tags));
  tags.add("all");
  const out: Record<string, number> = {};
  for (const t of tags) {
    const sub = t === "all" ? results : results.filter(r => r.tags.includes(t));
    out[t] = sub.length === 0 ? 0 : sub.reduce((s, r) => s + r.score, 0) / sub.length;
  }
  return out;
}

function loadBaseline(): Record<string, number> {
  // A missing baseline (first run) is treated as all-zero, so everything reads as "up".
  try { return JSON.parse(readFileSync("evals/baseline.json", "utf8")); }
  catch { return {}; }
}

function renderReport(
  current: Record<string, number>,
  baseline: Record<string, number>,
  threshold: number
): { md: string; failed: boolean } {
  let md = `| metric | baseline | current | delta | status |\n|---|---|---|---|---|\n`;
  let failed = false;
  const keys = new Set([...Object.keys(current), ...Object.keys(baseline)]);
  for (const k of keys) {
    const b = baseline[k] ?? 0;
    const c = current[k] ?? 0;
    const d = c - b;
    const drop = d < -threshold; // only drops beyond the threshold fail the build
    if (drop) failed = true;
    const status = drop ? "FAIL" : d > 0.005 ? "up" : "flat";
    md += `| ${k} | ${b.toFixed(3)} | ${c.toFixed(3)} | ${d.toFixed(3)} | ${status} |\n`;
  }
  return { md, failed };
}

(async () => {
  const cases = loadGold();
  const results: Result[] = [];
  for (const c of cases) {
    const actual = await callModel(c.prompt);
    const s = await score(c, actual);
    const pass = s >= (c.threshold ?? 0.7);
    results.push({ id: c.id, prompt: c.prompt, expected: c.expected, actual, judge: c.judge, score: s, pass, tags: c.tags });
    process.stdout.write(pass ? "." : "x"); // progress: one dot per pass, x per fail
  }
  console.log("\n");
  const current = aggregate(results);
  const baseline = loadBaseline();
  const { md, failed } = renderReport(current, baseline, 0.05);
  writeFileSync("evals/report.md", md);
  console.log(md);
  if (failed) {
    console.error("Regression beyond 0.05 threshold. Failing build.");
    process.exit(1);
  }
})();
```

A GitHub Actions step runs `npx tsx evals/run.ts`; the script exits non-zero when a regression is detected, which fails the build. A second step posts `evals/report.md` as a PR comment via `actions/github-script`. Total CI time on a 50-case gold set: 90-120 seconds.
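
For reference, the wiring looks roughly like this. Treat it as a sketch rather than a drop-in workflow; the secret name, Node version, and action versions are assumptions to adapt:

```yaml
# .github/workflows/evals.yml -- a sketch; adjust paths and secret names
name: evals
on: pull_request

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # The harness exits 1 on regression, which fails this step and the build.
      - run: npx tsx evals/run.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Post the report even when the gate fails, so the PR shows the table.
      - uses: actions/github-script@v7
        if: always()
        with:
          script: |
            const fs = require("fs");
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: fs.readFileSync("evals/report.md", "utf8"),
            });
```

The `if: always()` matters: the regression table shows up on exactly the PR that caused the failure.
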
When should you graduate to a framework? When you have 200+ cases, 5+ engineers running evals, or you start needing trace replay, A/B model comparison, and dashboards. Until then, the file above wins on every axis: zero vendor lock-in, every line auditable, and it runs on whatever CI you already have.
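
One gap you will hit on day one: the script reads `evals/baseline.json` but nothing writes it. A minimal way to promote an accepted run to the new baseline is a flag checked in the main function, just before the `if (failed)` gate. The `--update-baseline` name is a hypothetical convention, not part of the file above:

```ts
// Hypothetical: `npx tsx evals/run.ts --update-baseline` promotes this run's
// aggregates to the new baseline instead of gating against the old one.
if (process.argv.includes("--update-baseline")) {
  writeFileSync("evals/baseline.json", JSON.stringify(current, null, 2));
  console.log("Baseline updated from this run. Commit evals/baseline.json.");
  process.exit(0);
}
```

Keeping the baseline as a committed file is the point: scores only change through a PR where the delta is visible in review.
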
Copy this. Strip what you do not need. Add what you do. Stop letting framework choice be the reason your evals are not in CI yet.

Founder, DField Solutions
I've shipped production products from fintech to creator tooling · for startups and enterprises, from Budapest to San Francisco.
Let's talk about your project. 30 minutes, no strings.