
We have a longer post about evals-as-code in CI. People read it and asked, 'OK, but what file do you actually paste in?' Fair. Here is the harness we use as the starting point on every new RAG / agent project: about 200 lines of TypeScript, with no dependencies beyond `zod` and the `openai` SDK.

It is intentionally small. promptfoo and Braintrust are great, but on a fresh project you do not need them yet. You need a gold set in version control, three judge functions, and a CI gate. Add the framework later, when you have hundreds of cases and 5+ engineers running evals.

What the harness does

  • Loads a gold set from `evals/gold.jsonl` (one JSON per line, schema validated by zod).
  • Runs each case through your model under test.
  • Scores with three judge modes · exact match, embedding similarity, LLM-as-judge.
  • Aggregates per metric, compares to `evals/baseline.json`, fails CI when any metric drops by more than the threshold.
  • Prints a markdown diff report. Friendly to GitHub PR comments.
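
For concreteness, here are two illustrative gold-set lines. The ids, prompts, and tags are made up, but the fields match the zod `Case` schema in the file below (`threshold` is optional, `tags` defaults to `[]`):

```jsonl
{"id": "refund-policy-1", "prompt": "What is our refund window?", "expected": "30 days", "judge": "exact", "tags": ["policy"]}
{"id": "onboarding-summary-1", "prompt": "Summarise the onboarding doc in one sentence.", "expected": "New hires complete setup, security training and a starter ticket in week one.", "judge": "llm", "threshold": 0.6, "tags": ["summarisation"]}
```

Reserve `exact` for short, deterministic answers; anything free-form goes through `similarity` or `llm`.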

The full file · `evals/run.ts`

import { readFileSync, writeFileSync } from "node:fs";
import { z } from "zod";
import OpenAI from "openai";

const openai = new OpenAI();

const Case = z.object({
  id: z.string(),
  prompt: z.string(),
  expected: z.string(),
  judge: z.enum(["exact", "similarity", "llm"]),
  threshold: z.number().min(0).max(1).optional(),
  tags: z.array(z.string()).default([]),
});
type Case = z.infer<typeof Case>;

type Result = {
  id: string;
  prompt: string;
  expected: string;
  actual: string;
  judge: Case["judge"];
  score: number;
  pass: boolean;
  tags: string[];
};

function loadGold(): Case[] {
  return readFileSync("evals/gold.jsonl", "utf8")
    .split("\n")
    .filter(Boolean)
    .map((l, i) => {
      const parsed = Case.safeParse(JSON.parse(l));
      if (!parsed.success) throw new Error(`bad case at entry ${i + 1}: ${parsed.error.message}`);
      return parsed.data;
    });
}

async function callModel(prompt: string): Promise<string> {
  const r = await openai.chat.completions.create({
    model: process.env.MODEL ?? "gpt-4.1-mini",
    messages: [{ role: "user", content: prompt }],
    temperature: 0,
  });
  return r.choices[0]?.message?.content ?? "";
}

async function embed(text: string): Promise<number[]> {
  const r = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return r.data[0]!.embedding;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!;
    na += a[i]! ** 2;
    nb += b[i]! ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function judgeLLM(expected: string, actual: string): Promise<number> {
  const r = await openai.chat.completions.create({
    model: "gpt-4.1-mini",
    temperature: 0,
    messages: [{
      role: "system",
      content: "Score 0..1 how well ACTUAL satisfies EXPECTED. Reply only the number.",
    }, {
      role: "user",
      content: `EXPECTED:\n${expected}\n\nACTUAL:\n${actual}`,
    }],
  });
  const n = Number((r.choices[0]?.message?.content ?? "0").trim());
  return Number.isFinite(n) ? Math.max(0, Math.min(1, n)) : 0;
}

async function score(c: Case, actual: string): Promise<number> {
  if (c.judge === "exact") return c.expected.trim() === actual.trim() ? 1 : 0;
  if (c.judge === "similarity") {
    const [a, b] = await Promise.all([embed(c.expected), embed(actual)]);
    return cosine(a, b);
  }
  return judgeLLM(c.expected, actual);
}

function aggregate(results: Result[]): Record<string, number> {
  const tags = new Set(results.flatMap(r => r.tags));
  tags.add("all");
  const out: Record<string, number> = {};
  for (const t of tags) {
    const sub = t === "all" ? results : results.filter(r => r.tags.includes(t));
    out[t] = sub.length === 0 ? 0 : sub.reduce((s, r) => s + r.score, 0) / sub.length;
  }
  return out;
}

function loadBaseline(): Record<string, number> {
  try { return JSON.parse(readFileSync("evals/baseline.json", "utf8")); }
  catch { return {}; }
}

function renderReport(
  current: Record<string, number>,
  baseline: Record<string, number>,
  threshold: number
): { md: string; failed: boolean } {
  let md = `| metric | baseline | current | delta | status |\n|---|---|---|---|---|\n`;
  let failed = false;
  const keys = new Set([...Object.keys(current), ...Object.keys(baseline)]);
  for (const k of keys) {
    const b = baseline[k] ?? 0;
    const c = current[k] ?? 0;
    const d = c - b;
    const drop = d < -threshold;
    if (drop) failed = true;
    const status = drop ? "FAIL" : d > 0.005 ? "up" : "flat";
    md += `| ${k} | ${b.toFixed(3)} | ${c.toFixed(3)} | ${d.toFixed(3)} | ${status} |\n`;
  }
  return { md, failed };
}

(async () => {
  const cases = loadGold();
  const results: Result[] = [];
  for (const c of cases) {
    const actual = await callModel(c.prompt);
    const s = await score(c, actual);
    const pass = s >= (c.threshold ?? 0.7);
    results.push({ id: c.id, prompt: c.prompt, expected: c.expected, actual, judge: c.judge, score: s, pass, tags: c.tags });
    process.stdout.write(`${pass ? "." : "x"}`);
  }
  console.log("\n");
  const current = aggregate(results);
  const baseline = loadBaseline();
  const { md, failed } = renderReport(current, baseline, 0.05);
  writeFileSync("evals/report.md", md);
  console.log(md);
  if (failed) {
    console.error("Regression beyond 0.05 threshold. Failing build.");
    process.exit(1);
  }
})();
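
One gap worth calling out: nothing in the file ever writes `evals/baseline.json`, so on a fresh repo every metric compares against zero. A minimal sketch of how we seed and refresh it (hypothetical; the `UPDATE_BASELINE` env var and helper names are not part of the file above) is to dump the current aggregates on demand and commit the result:

```typescript
import { writeFileSync } from "node:fs";

// Stable serialization so baseline diffs stay readable in git.
function serializeBaseline(current: Record<string, number>): string {
  return JSON.stringify(current, null, 2) + "\n";
}

// Hypothetical: call this at the end of the main loop; run with
// UPDATE_BASELINE=1 on main after an intentional improvement,
// then commit evals/baseline.json.
function maybeWriteBaseline(current: Record<string, number>): void {
  if (process.env.UPDATE_BASELINE === "1") {
    writeFileSync("evals/baseline.json", serializeBaseline(current));
  }
}
```

Gating the write behind an explicit env var keeps CI runs read-only; the baseline only moves when a human decides it should.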

How it ties to CI

A GitHub Actions step calls `npx tsx evals/run.ts`. The script exits non-zero when a regression is detected, and the same job uploads `evals/report.md` as a PR comment via `actions/github-script`. Total CI time on a 50-case gold set: 90-120 seconds.
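
A workflow sketch of that step, assuming the secrets and action versions shown (adapt to your repo; the job and step names are illustrative):

```yaml
name: evals
on: pull_request
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsx evals/run.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Post the markdown report even when the eval step failed the build.
      - uses: actions/github-script@v7
        if: always()
        with:
          script: |
            const md = require("fs").readFileSync("evals/report.md", "utf8");
            await github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body: md,
            });
```

The `if: always()` matters: a regression exits 1, and that is exactly the run whose report you want on the PR.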

Where this falls short

  • No retry / rate limit handling. Add when your gold set goes over 200 cases.
  • No cost tracking. We add it when the eval bill goes over 1 EUR / run.
  • No concurrency. Every case runs sequentially. For larger sets, parallelise in `Promise.all` chunks of 5.
  • LLM judge is only as good as its prompt. Treat it as a coarse gate, not as a peer reviewer.
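
The chunked parallelism mentioned above can be sketched as a small helper (hypothetical, not in the file; the name `mapInChunks` is ours): run the async mapper over at most `size` items at a time, so you stay under rate limits while still being roughly `size`× faster than the sequential loop.

```typescript
// Runs `fn` over `items` in chunks: each chunk's calls are concurrent,
// chunks themselves run one after another. Results keep input order.
async function mapInChunks<T, R>(
  items: T[],
  size: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const out: R[] = [];
  for (let i = 0; i < items.length; i += size) {
    const chunk = items.slice(i, i + size);
    out.push(...(await Promise.all(chunk.map(fn))));
  }
  return out;
}
```

In the main loop, `for (const c of cases) { … }` would become something like `const results = await mapInChunks(cases, 5, runCase)` with the per-case logic pulled into a `runCase` function.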

When to graduate to a framework

When you have 200+ cases, 5+ engineers running evals, or you start needing fancy things like trace replay, A/B model comparison, and dashboarding. Until then, the file above wins on every axis: zero vendor lock-in, every line auditable, and it integrates with whatever CI you have.

Copy this. Strip what you do not need. Add what you do. Stop letting framework choice be the reason your evals are not in CI yet.

By Dezso Mezo
Founder, DField Solutions

I've shipped production products from fintech to creator-tooling · for startups and enterprises, from Budapest to San Francisco.
