EvalForge
EvalForge is a web app (with an optional CLI) for teams shipping LLM features who quietly regress quality without noticing. It lets you define realistic test suites (prompts, expected traits, rubrics), run them automatically on every model or prompt change, and track drift over time.

Instead of vague "looks good" reviews, it produces scored reports: factuality checks, policy compliance, tone constraints, and task-specific metrics, computed with a mix of deterministic checks and judge models. It also supports dataset versioning, golden conversations, and A/B comparisons between providers (OpenAI, Anthropic, etc.) so you can switch models without flying blind.

This is not a magic quality button: you still need domain-specific test cases, and judge models can be biased. The value is operational: evaluation becomes repeatable, auditable, and tied to releases.
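To make the workflow concrete, here is a minimal sketch of what a test case with deterministic trait checks plus a judge-model rubric could look like, and how a suite run might produce a scored report. All names here (`Case`, `run_suite`, `deterministic_score`, `judge_score`) and the schema are hypothetical, invented for illustration; they are not EvalForge's actual API, and the model and judge calls are stubbed out.

```python
from dataclasses import dataclass, field

# Hypothetical shapes for illustration only; EvalForge's real schema may differ.

@dataclass
class Case:
    name: str
    prompt: str
    must_contain: list[str] = field(default_factory=list)  # deterministic trait check
    rubric: str = ""                                       # free-text rubric for a judge model

def call_model(prompt: str) -> str:
    """Stand-in for a provider call (OpenAI, Anthropic, etc.)."""
    return f"stub answer to: {prompt}"

def deterministic_score(output: str, case: Case) -> float:
    """Fraction of required substrings present in the output."""
    if not case.must_contain:
        return 1.0
    hits = sum(1 for s in case.must_contain if s.lower() in output.lower())
    return hits / len(case.must_contain)

def judge_score(output: str, rubric: str) -> float:
    """Stand-in for a judge-model call grading `output` against `rubric`.

    A real implementation would prompt a second LLM and parse its verdict;
    judge models can be biased, so keep these scores separate from
    deterministic ones rather than blending them.
    """
    return 0.5  # placeholder score

def run_suite(cases: list[Case]) -> dict[str, dict[str, float]]:
    """Run every case and return a scored report keyed by case name."""
    report = {}
    for case in cases:
        output = call_model(case.prompt)
        report[case.name] = {
            "deterministic": deterministic_score(output, case),
            "judge": judge_score(output, case.rubric) if case.rubric else 1.0,
        }
    return report

if __name__ == "__main__":
    suite = [
        Case(
            name="refund-policy",
            prompt="A customer asks for a refund after 45 days. What do we tell them?",
            must_contain=["30 days"],
            rubric="Polite, cites the policy, offers a next step.",
        ),
    ]
    for name, scores in run_suite(suite).items():
        print(name, scores)
```

Keeping deterministic and judge scores as separate columns in the report, as sketched above, is what makes drift tracking useful: a regression in a hard constraint is unambiguous, while a shift in judge scores may reflect the judge as much as the model under test.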