TrainBudget
TrainBudget is a web app (with an optional CLI) that enforces cost and time guardrails on model training. Rather than acting as another experiment tracker, it answers one blunt question: “Should this run continue?” It monitors live training signals (loss curves, gradient stats, throughput, eval metrics) alongside cloud spend in near real time, and when progress stalls it triggers an action: early-stop, downscale, pause, or switch to cheaper instances. Teams set policies such as a maximum dollar cost per run, a maximum dollar cost per unit of metric gain, or “stop if validation hasn’t improved in N steps.” It integrates with common stacks (PyTorch Lightning, Hugging Face Trainer, Ray Train) and major clouds, and it produces a simple postmortem report that highlights wasted spend and the exact moment a run went off the rails. It won’t magically improve your model; it will reduce avoidable waste and enforce discipline.
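
To make the policy idea concrete, here is a minimal sketch of how the three example policies (a cost cap per run, a cost cap per unit of metric gain, and a patience rule on validation) could be evaluated against a run's history. This is an illustration under stated assumptions, not TrainBudget's actual API: the names Policy, Sample, and decide, the field names, and the choice of validation loss as the tracked metric are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Policy:
    """Hypothetical guardrail policy; field names are illustrative only."""
    max_cost_usd: float                       # "max $/run"
    patience_steps: int                       # "stop if validation hasn't improved in N steps"
    max_usd_per_gain: Optional[float] = None  # "max $/metric gain" (optional)
    min_delta: float = 0.0                    # smallest loss drop that counts as improvement


@dataclass
class Sample:
    """One monitoring snapshot of a run (assumed shape)."""
    step: int
    val_loss: float
    cost_usd: float  # cumulative spend so far


def decide(policy: Policy, history: List[Sample]) -> str:
    """Return 'continue' or 'stop: <reason>' for the run described by history."""
    if not history:
        return "continue"
    first, latest = history[0], history[-1]

    # Rule 1: hard cap on total spend for the run.
    if latest.cost_usd > policy.max_cost_usd:
        return "stop: cost cap exceeded"

    # Find the most recent snapshot that improved on the best loss seen so far.
    best = first
    for sample in history[1:]:
        if sample.val_loss < best.val_loss - policy.min_delta:
            best = sample

    # Rule 2: patience, measured in training steps since the last improvement.
    if latest.step - best.step >= policy.patience_steps:
        return "stop: no validation improvement"

    # Rule 3: dollars spent per unit of loss reduction since the first snapshot.
    # Runs with no gain at all are left to the patience rule above.
    if policy.max_usd_per_gain is not None:
        gain = first.val_loss - best.val_loss
        spend = latest.cost_usd - first.cost_usd
        if gain > 0 and spend > 0 and spend / gain > policy.max_usd_per_gain:
            return "stop: cost per unit of gain too high"

    return "continue"


history = [
    Sample(step=0, val_loss=1.00, cost_usd=0.0),
    Sample(step=100, val_loss=0.80, cost_usd=5.0),
    Sample(step=200, val_loss=0.79, cost_usd=10.0),
    Sample(step=300, val_loss=0.81, cost_usd=15.0),
]
print(decide(Policy(max_cost_usd=100.0, patience_steps=100), history))
# prints "stop: no validation improvement"
```

In this sketch the rules are checked in a fixed order, so the hard cost cap always wins over the softer signals. A real deployment would also need to map the returned decision onto one of the actions described above (early-stop, downscale, pause, or switch instances), which is left out here.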