DField Solutions · Engineering studio · Budapest

Eval harness

Related service: AI solutions

DEFINITION

An eval harness is the runnable infrastructure that, on every model bump, every prompt change, and every release, automatically runs a fixed test set, computes metrics (accuracy, factuality, refusal rate, latency, cost), stores results as a time series, and blocks the release if any metric falls below its threshold. Saying "we tested it" usually means a developer played a few prompts through once by hand. An eval harness is the CI-wired, regression-catching, version-comparing version of that. Without one, every model bump is flying blind. A serious LLM stack today consists of a dataset, a runner (Promptfoo, Inspect, or in-house), a scoring layer (LLM-as-judge plus deterministic asserts), and a dashboard where yesterday's run sits next to the new one.
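The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real runner: the golden cases, the `THRESHOLDS` values, the `stub_model` function, and the helper names (`run_suite`, `gate`, `append_run`) are all hypothetical, and scoring uses only a deterministic substring assert rather than an LLM-as-judge.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str  # substring the answer must contain (deterministic assert)


# Hypothetical release thresholds; real values depend on the product.
THRESHOLDS = {"accuracy": 0.9, "max_latency_s": 2.0}

# Hypothetical fixed test set ("golden" questions).
GOLDEN = [
    Case("What is the capital of Hungary?", "Budapest"),
    Case("What is 2 + 2?", "4"),
    Case("Which river flows through Budapest?", "Danube"),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; it knows only two answers."""
    answers = {
        "What is the capital of Hungary?": "The capital is Budapest.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return answers.get(prompt, "I don't know.")


def run_suite(model: Callable[[str], str], cases: list[Case]) -> dict:
    """Run every case once and compute accuracy and worst-case latency."""
    passed = 0
    latencies = []
    for case in cases:
        start = time.perf_counter()
        answer = model(case.prompt)
        latencies.append(time.perf_counter() - start)
        if case.expected.lower() in answer.lower():
            passed += 1
    return {"accuracy": passed / len(cases), "max_latency_s": max(latencies)}


def gate(metrics: dict, thresholds: dict) -> list[str]:
    """Return the list of violated thresholds; an empty list means 'ship'."""
    failures = []
    if metrics["accuracy"] < thresholds["accuracy"]:
        failures.append(f"accuracy {metrics['accuracy']:.2f} < {thresholds['accuracy']}")
    if metrics["max_latency_s"] > thresholds["max_latency_s"]:
        failures.append(f"latency {metrics['max_latency_s']:.2f}s > {thresholds['max_latency_s']}s")
    return failures


def append_run(path: str, run_id: str, metrics: dict) -> None:
    """Append one run to a JSONL file so results form a time series."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({"run_id": run_id, "ts": time.time(), **metrics}) + "\n")


metrics = run_suite(stub_model, GOLDEN)
failures = gate(metrics, THRESHOLDS)
# In CI, a non-empty `failures` list would translate into a non-zero exit code.
print("BLOCK RELEASE" if failures else "OK", failures)
```

Here the stub misses the third question, so accuracy lands at 2 out of 3 and the gate reports a violation. Calling `append_run` after each run would give the dashboard its history; wiring the gate's result to the CI job's exit status is what actually blocks a release.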

RELATED TERMS
  • RAG (Retrieval-Augmented Generation)

    An AI architecture where the model retrieves relevant documents from your own data before answering, and only reasons over that context. Kills ~80% of hallucinations.

  • LLM (Large Language Model)

    A neural model with billions of parameters (GPT-4, Claude, Mistral) that generates text. In production we never use one bare; it is always wrapped in retrieval and guardrails.

  • Embedding

    A vector representation of text (e.g. 1536 floats). If two embeddings are close, the meanings are close. In RAG we use this to pick relevant chunks.

  • Vector database

    A database specialised for fast approximate-nearest-neighbour search over embedding vectors (pgvector, Qdrant, Weaviate). The engineering base of RAG retrieval.

  • Eval (LLM evaluation)

    An automated test suite that runs ~50–200 'golden' questions against the model before every release and checks that quality metrics (accuracy, factuality, latency) clear the threshold.

  • Guardrail

    An input-side or output-side layer that filters the model's prompt or response (PII scrubbers, prompt-injection detectors, JSON-schema validation, topic blocks). It sits not just before or after the model, but around it.
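To make the "around the model" idea from the Guardrail entry concrete, here is a minimal Python sketch in which one wrapper owns both the input filter and the output check. Everything named here is an assumption for illustration: the email regex is a deliberately crude PII scrubber, `REQUIRED_KEYS` is a hypothetical output contract checked by key presence (a stand-in for full JSON-schema validation), and `echo_model` replaces a real model call.

```python
import json
import re
from typing import Callable

# Crude email pattern, for illustration only; real PII scrubbing needs more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Hypothetical output contract: the response must be a JSON object with these keys.
REQUIRED_KEYS = {"answer", "sources"}


class GuardrailError(ValueError):
    """Raised when a prompt or response fails a guardrail check."""


def scrub_pii(text: str) -> str:
    """Input guardrail: replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def validate_output(raw: str) -> dict:
    """Output guardrail: require a JSON object containing the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GuardrailError("model output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise GuardrailError("model output is not a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise GuardrailError(f"missing keys: {sorted(missing)}")
    return data


def guarded_call(model: Callable[[str], str], prompt: str) -> dict:
    """Wrap the model: filter the prompt going in, validate the response coming out."""
    safe_prompt = scrub_pii(prompt)
    raw = model(safe_prompt)
    return validate_output(raw)


def echo_model(prompt: str) -> str:
    """Stand-in model that echoes the prompt back as structured JSON."""
    return json.dumps({"answer": prompt, "sources": []})


result = guarded_call(echo_model, "Contact jane.doe@example.com about the invoice")
print(result["answer"])  # Contact [EMAIL] about the invoice
```

Because callers only ever see `guarded_call`, there is no code path that reaches the model without passing through both checks.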

MENTIONED IN THE BLOG