FailoverForge

FailoverForge is a web app (with an optional CLI agent) that continuously tests database replication and failover the way production actually breaks: network partitions, packet loss, clock skew, disk pressure, primary crashes, and misconfigured replication slots. Instead of only showing lag metrics, it runs controlled, scheduled chaos drills against staging or a guarded production canary, then scores RPO/RTO outcomes and highlights the exact step where recovery failed. It generates runbooks from successful drills and flags drift when topology, parameters, or versions change. The MVP focuses on Postgres streaming replication and MySQL async replication, with safe-guards like blast-radius limits, approval gates, and automatic rollback. This is not glamorous: it’s operational hygiene. But teams routinely discover their “tested” failover is untested, and the first real outage is when they learn the truth.

← Back to idea list