TraceTriage

TraceTriage is a web app that helps teams debug serverless incidents by turning noisy distributed traces and logs into a prioritized “what broke first” timeline. It ingests OpenTelemetry traces plus cloud-provider logs, then groups failures by likely root cause (timeouts, throttling, downstream latency, IAM/permission errors, bad deploys). Instead of hunting across CloudWatch, X-Ray, and dashboards, engineers get a single incident view: the first anomalous span, the blast radius (functions, queues, APIs), and the request attributes that correlate with failures. It also flags common serverless footguns like hidden retries, concurrency spikes, and mis-sized timeouts. This is not a magic autopilot; it’s a practical triage tool meant to shorten on-call investigations by narrowing the search space and producing actionable hypotheses with supporting evidence.
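
To make the "what broke first" idea concrete, here is a minimal sketch of the grouping step. It is an illustration under stated assumptions, not TraceTriage's actual method: the `Span` shape, the `RULES` keyword table, and the `classify`, `first_anomalous`, and `triage` names are all hypothetical, and a real tool would likely use far richer signals than substring matching on error messages.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical, simplified span record (not the OpenTelemetry data model)."""
    name: str        # function, queue, or API that emitted the span
    start_ns: int    # start time in nanoseconds
    is_error: bool   # whether the span ended in an error
    message: str = ""  # error text, if any


# Assumed keyword rules, checked in order; the first match wins.
RULES: List[Tuple[str, Tuple[str, ...]]] = [
    ("timeout", ("timed out", "timeout")),
    ("throttling", ("throttl", "rate exceeded")),
    ("permission", ("accessdenied", "not authorized", "forbidden")),
]


def classify(span: Span) -> str:
    """Map an error span to a likely root-cause bucket via keyword match."""
    text = span.message.lower()
    for cause, needles in RULES:
        if any(needle in text for needle in needles):
            return cause
    return "unknown"


def first_anomalous(spans: List[Span]) -> Optional[Span]:
    """Return the earliest error span, or None if nothing failed."""
    errors = [s for s in spans if s.is_error]
    return min(errors, key=lambda s: s.start_ns, default=None)


def triage(spans: List[Span]) -> Tuple[Optional[Span], Dict[str, List[str]]]:
    """Return (first failing span, span names grouped by likely cause)."""
    groups: Dict[str, List[str]] = {}
    for span in sorted(spans, key=lambda s: s.start_ns):
        if span.is_error:
            groups.setdefault(classify(span), []).append(span.name)
    return first_anomalous(spans), groups
```

Calling `triage(spans)` on one incident's spans returns the earliest failing span together with a cause-to-span-names mapping, which is roughly the raw material for the incident view described above (first anomalous span plus blast radius).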
