EvalLedger

EvalLedger is a web app (with a lightweight CLI) for teams that keep "evaluating" LLMs in spreadsheets and Slack threads. It stores every evaluation run as an immutable, queryable record: prompt/template version, model/provider, parameters, retrieval context, tool calls, datasets, and human ratings. You define eval suites (unit tests, regression sets, safety checks) and run them in CI to catch silent model drift, prompt changes, or provider updates before they hit production.

The product prioritizes auditability and reproducibility over flashy dashboards: deterministic replays, signed artifacts, and clear diffs of what changed between runs. It supports both automated metrics (exact match, rubric-based LLM-as-judge with calibration) and human review workflows with sampling and inter-rater agreement.

This combines an AI app with a traditional one: AI for judging and clustering failures; conventional software for the ledger, access control, and CI integration.
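The core ledger idea can be sketched in a few lines of Python. This is an illustrative mock-up, not EvalLedger's actual API: `RunRecord`, `content_hash`, and `diff_runs` are hypothetical names, and the fields are a subset of what the description lists. The content hash over canonical JSON is one plausible way to get immutable, verifiable records; the diff function shows what "clear diffs of what changed between runs" could mean at the data level.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunRecord:
    """One immutable evaluation run (hypothetical schema, field subset)."""
    prompt_version: str
    model: str
    provider: str
    params: dict
    dataset: str
    metrics: dict

    def content_hash(self) -> str:
        # Canonical JSON (sorted keys) so identical runs hash identically;
        # the hash could then be signed to produce a tamper-evident artifact.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

def diff_runs(a: RunRecord, b: RunRecord) -> dict:
    """Return {field: (old, new)} for every field that differs between runs."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}
```

Usage might look like comparing a baseline run against a candidate after a prompt change:

```python
base = RunRecord("v1", "gpt-4o", "openai", {"temperature": 0.0},
                 "regression-set-1", {"exact_match": 0.92})
cand = RunRecord("v2", "gpt-4o", "openai", {"temperature": 0.0},
                 "regression-set-1", {"exact_match": 0.88})
diff_runs(base, cand)  # only prompt_version and metrics changed
```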
