RLGuardrail
RLGuardrail is a web app (with an optional CLI) that stress-tests reinforcement learning agents for reward hacking, unsafe exploration, and brittle policies before deployment. You upload a Gymnasium-compatible environment wrapper and a trained policy checkpoint, then run a battery of adversarial evaluations: randomized reward perturbations, observation noise, action delays, and constraint-violation probes.

The app produces a “failure atlas” mapping where the agent exploits loopholes, along with reproducible test seeds and minimal counterexamples. It also generates regression tests so future training runs don’t reintroduce the same exploits. RLGuardrail combines a traditional app with AI components: automated search finds policy failures, and an LLM-assisted report writer summarizes issues and suggests environment and reward fixes. Its goals are realistic: it cannot guarantee safety, but it reliably surfaces many common RL failure modes that teams otherwise miss.
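To make the probe idea concrete, here is a minimal sketch of one adversarial evaluation in the style described above: an observation-noise wrapper around a Gymnasium-style environment. The class names (`ObservationNoiseWrapper`, `ToyEnv`) and parameters are hypothetical illustrations, not RLGuardrail's actual API; the wrapper only assumes the standard `reset()`/`step()` interface.

```python
import random

class ObservationNoiseWrapper:
    """Hypothetical probe: inject Gaussian noise into observations.

    Wraps any Gymnasium-style env exposing reset() and step(); the
    fixed seed keeps the perturbation reproducible, mirroring the
    "reproducible test seeds" the tool emits.
    """

    def __init__(self, env, sigma=0.05, seed=0):
        self.env = env
        self.sigma = sigma
        self.rng = random.Random(seed)

    def _noisy(self, obs):
        # Add independent Gaussian noise to each observation component.
        return [x + self.rng.gauss(0.0, self.sigma) for x in obs]

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._noisy(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._noisy(obs), reward, terminated, truncated, info

# Minimal stand-in environment for demonstration only (not part of the tool).
class ToyEnv:
    def reset(self, **kwargs):
        return [0.0, 0.0], {}

    def step(self, action):
        return [1.0, 1.0], 1.0, False, False, {}

env = ObservationNoiseWrapper(ToyEnv(), sigma=0.1, seed=42)
obs, info = env.reset()
```

A real probe battery would sweep `sigma` (and the other perturbation knobs) across a grid of seeds, recording any seed where the policy's return or constraint compliance degrades sharply.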