
We have a longer post about evals-as-code in CI. People read it and asked, 'OK, but what file do you actually paste in?' Fair. Here is the harness we use as the starting point on every new RAG / agent project: about 200 lines of TypeScript, with no dependencies beyond `zod` and the `openai` SDK.

It is intentionally small. promptfoo and Braintrust are great, but on a fresh project you do not need them yet. You need a gold set in version control, three judge functions, and a CI gate. Add the framework later, when you have hundreds of cases and 5+ engineers running evals.

What the harness does

  • Loads a gold set from `evals/gold.jsonl` (one JSON per line, schema validated by zod).
  • Runs each case through your model under test.
  • Scores with three judge modes · exact match, embedding similarity, LLM-as-judge.
  • Aggregates per metric, compares to `evals/baseline.json`, fails CI when any metric drops by more than the threshold.
  • Prints a markdown diff report. Friendly to GitHub PR comments.
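
For concreteness, here are two illustrative gold-set lines. The ids, prompts, and tags are made up, but the fields match the zod `Case` schema in the file below (`threshold` is optional, `tags` defaults to `[]`):

```jsonl
{"id": "refund-policy-1", "prompt": "What is our refund window?", "expected": "30 days", "judge": "exact", "tags": ["policy"]}
{"id": "onboarding-summary-1", "prompt": "Summarise the onboarding doc in one sentence.", "expected": "New hires complete setup, security training and a starter ticket in week one.", "judge": "llm", "threshold": 0.6, "tags": ["summarisation"]}
```

Reserve `exact` for short, deterministic answers; anything free-form goes through `similarity` or `llm`.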

The full file · `evals/run.ts`

import { readFileSync, writeFileSync } from "node:fs";
import { z } from "zod";
import OpenAI from "openai";

const openai = new OpenAI();

const Case = z.object({
  id: z.string(),
  prompt: z.string(),
  expected: z.string(),
  judge: z.enum(["exact", "similarity", "llm"]),
  threshold: z.number().min(0).max(1).optional(),
  tags: z.array(z.string()).default([]),
});
type Case = z.infer<typeof Case>;

type Result = {
  id: string;
  prompt: string;
  expected: string;
  actual: string;
  judge: Case["judge"];
  score: number;
  pass: boolean;
  tags: string[];
};

function loadGold(): Case[] {
  return readFileSync("evals/gold.jsonl", "utf8")
    .split("\n")
    .filter(Boolean)
    .map((l, i) => {
      const parsed = Case.safeParse(JSON.parse(l));
      if (!parsed.success) throw new Error(`bad case at entry ${i + 1}: ${parsed.error.message}`);
      return parsed.data;
    });
}

async function callModel(prompt: string): Promise<string> {
  const r = await openai.chat.completions.create({
    model: process.env.MODEL ?? "gpt-4.1-mini",
    messages: [{ role: "user", content: prompt }],
    temperature: 0,
  });
  return r.choices[0]?.message?.content ?? "";
}

async function embed(text: string): Promise<number[]> {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return r.data[0]!.embedding;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!;
    na += a[i]! ** 2;
    nb += b[i]! ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function judgeLLM(expected: string, actual: string): Promise<number> {
  const r = await openai.chat.completions.create({
    model: "gpt-4.1-mini",
    temperature: 0,
    messages: [{
      role: "system",
      content: "Score 0..1 how well ACTUAL satisfies EXPECTED. Reply only the number.",
    }, {
      role: "user",
      content: `EXPECTED:\n${expected}\n\nACTUAL:\n${actual}`,
    }],
  });
  const n = Number((r.choices[0]?.message?.content ?? "0").trim());
  return Number.isFinite(n) ? Math.max(0, Math.min(1, n)) : 0;
}

async function score(c: Case, actual: string): Promise<number> {
  if (c.judge === "exact") return c.expected.trim() === actual.trim() ? 1 : 0;
  if (c.judge === "similarity") {
    const [a, b] = await Promise.all([embed(c.expected), embed(actual)]);
    return cosine(a, b);
  }
  return judgeLLM(c.expected, actual);
}

function aggregate(results: Result[]): Record<string, number> {
  const tags = new Set(results.flatMap(r => r.tags));
  tags.add("all");
  const out: Record<string, number> = {};
  for (const t of tags) {
    const sub = t === "all" ? results : results.filter(r => r.tags.includes(t));
    out[t] = sub.length === 0 ? 0 : sub.reduce((s, r) => s + r.score, 0) / sub.length;
  }
  return out;
}

function loadBaseline(): Record<string, number> {
  try { return JSON.parse(readFileSync("evals/baseline.json", "utf8")); }
  catch { return {}; }
}

function renderReport(
  current: Record<string, number>,
  baseline: Record<string, number>,
  threshold: number
): { md: string; failed: boolean } {
  let md = `| metric | baseline | current | delta | status |\n|---|---|---|---|---|\n`;
  let failed = false;
  const keys = new Set([...Object.keys(current), ...Object.keys(baseline)]);
  for (const k of keys) {
    const b = baseline[k] ?? 0;
    const c = current[k] ?? 0;
    const d = c - b;
    const drop = d < -threshold;
    if (drop) failed = true;
    const status = drop ? "FAIL" : d > 0.005 ? "up" : "flat";
    md += `| ${k} | ${b.toFixed(3)} | ${c.toFixed(3)} | ${d.toFixed(3)} | ${status} |\n`;
  }
  return { md, failed };
}

(async () => {
  const cases = loadGold();
  const results: Result[] = [];
  for (const c of cases) {
    const actual = await callModel(c.prompt);
    const s = await score(c, actual);
    const pass = s >= (c.threshold ?? 0.7);
    results.push({ id: c.id, prompt: c.prompt, expected: c.expected, actual, judge: c.judge, score: s, pass, tags: c.tags });
    process.stdout.write(`${pass ? "." : "x"}`);
  }
  console.log("\n");
  const current = aggregate(results);
  const baseline = loadBaseline();
  const { md, failed } = renderReport(current, baseline, 0.05);
  writeFileSync("evals/report.md", md);
  console.log(md);
  if (failed) {
    console.error("Regression beyond 0.05 threshold. Failing build.");
    process.exit(1);
  }
})();
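
One gap worth calling out: nothing in the file ever writes `evals/baseline.json`, so on a fresh repo every metric compares against zero. A minimal sketch of how we seed and refresh it (hypothetical; the `UPDATE_BASELINE` env var and helper names are not part of the file above) is to dump the current aggregates on demand and commit the result:

```typescript
import { writeFileSync } from "node:fs";

// Stable serialization so baseline diffs stay readable in git.
function serializeBaseline(current: Record<string, number>): string {
  return JSON.stringify(current, null, 2) + "\n";
}

// Hypothetical: call this at the end of the main loop; run with
// UPDATE_BASELINE=1 on main after an intentional improvement,
// then commit evals/baseline.json.
function maybeWriteBaseline(current: Record<string, number>): void {
  if (process.env.UPDATE_BASELINE === "1") {
    writeFileSync("evals/baseline.json", serializeBaseline(current));
  }
}
```

Gating the write behind an explicit env var keeps CI runs read-only; the baseline only moves when a human decides it should.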

How it ties to CI

A GitHub Actions step calls `npx tsx evals/run.ts`. The script exits non-zero when a regression is detected, and the same job uploads `evals/report.md` as a PR comment via `actions/github-script`. Total CI time on a 50-case gold set: 90-120 seconds.
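
A workflow sketch of that step, assuming the secrets and action versions shown (adapt to your repo; the job and step names are illustrative):

```yaml
name: evals
on: pull_request
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsx evals/run.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Post the markdown report even when the eval step failed the build.
      - uses: actions/github-script@v7
        if: always()
        with:
          script: |
            const md = require("fs").readFileSync("evals/report.md", "utf8");
            await github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body: md,
            });
```

The `if: always()` matters: a regression exits 1, and that is exactly the run whose report you want on the PR.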

Where this falls short

  • No retry / rate limit handling. Add when your gold set goes over 200 cases.
  • No cost tracking. We add it when the eval bill goes over 1 EUR / run.
  • No concurrency. Every case runs sequentially. For larger sets, parallelise in `Promise.all` chunks of 5.
  • LLM judge is only as good as its prompt. Treat it as a coarse gate, not as a peer reviewer.
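
The chunked parallelism mentioned above can be sketched as a small helper (hypothetical, not in the file; the name `mapInChunks` is ours): run the async mapper over at most `size` items at a time, so you stay under rate limits while still being roughly `size`× faster than the sequential loop.

```typescript
// Runs `fn` over `items` in chunks: each chunk's calls are concurrent,
// chunks themselves run one after another. Results keep input order.
async function mapInChunks<T, R>(
  items: T[],
  size: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const out: R[] = [];
  for (let i = 0; i < items.length; i += size) {
    const chunk = items.slice(i, i + size);
    out.push(...(await Promise.all(chunk.map(fn))));
  }
  return out;
}
```

In the main loop, `for (const c of cases) { … }` would become something like `const results = await mapInChunks(cases, 5, runCase)` with the per-case logic pulled into a `runCase` function.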

When to graduate to a framework

When you have 200+ cases, 5+ engineers running evals, or you start needing fancy things like trace replay, A/B model comparison, and dashboarding. Until then, the file above wins on every axis: zero vendor lock-in, every line auditable, and it integrates with whatever CI you have.

Copy this. Strip what you do not need. Add what you do. Stop letting framework choice be the reason your evals are not in CI yet.

By Dezso Mezo
Founder, DField Solutions

I've shipped production products from fintech to creator-tooling · for startups and enterprises, from Budapest to San Francisco.
