Show HN: An Incident Intelligence Layer that learns from real oncall work

I’ve worked in operations and SRE for years, mostly in environments where outages, oncall fatigue and recurring incidents were normal. Like many here, I’ve been through the 3 AM pages, the tribal knowledge, the fixes that disappear in scrolling Slack threads, and the runbooks nobody touches until it’s too late.

I also live with Type-1 diabetes. That forces me to run extremely disciplined systems in my personal life: continuous monitoring, feedback loops, automated corrections, stability under stress. It shaped how I think about infrastructure in an unexpected way.

I quite literally have oncall in my blood: my blood sugar is basically a live monitoring system.

And that made me notice something strange about current observability stacks:

We measure everything except the one thing that actually resolves incidents: the human problem-solving process.

Every outage generates knowledge, but most of it evaporates:

- shell history disappears
- Slack conversations drift away
- senior engineers fix silently
- runbooks rot
- context is lost
- the same incident happens again and is solved again

So I’m exploring a new layer for the SRE stack: an Incident Intelligence Layer.

High-level idea (no deep tech here):

- troubleshooting sessions become structured, anonymous traces
- each incident type gets a shared knowledge feed
- engineers upvote or downvote solutions
- a local LLM summarizes recurring patterns
- a sanitized layer allows safe use of a public LLM
- repeated successful solutions gradually become recommended actions or potential automation candidates
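To make two of these bullets concrete, here is a rough sketch of what the sanitizing step and the "votes become recommendations" step could look like. Everything in it is an assumption for illustration: the redaction patterns, the `sanitize` and `recommended` names, and the `min_votes` / `min_ratio` thresholds are hypothetical, and a real sanitizer would need far broader coverage than three regexes.

```python
import re
from collections import defaultdict

# Hypothetical redaction rules, applied in order. IPs go first so that
# "user@10.0.0.5" is not mistaken for an email address.
_PATTERNS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "<EMAIL>"),
    (re.compile(r"(?i)\b(token|password|secret)=\S+"), r"\1=<REDACTED>"),
]


def sanitize(text):
    """Replace obviously sensitive substrings before text leaves the local environment."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def recommended(votes, min_votes=5, min_ratio=0.8):
    """Return solution ids with enough votes and a high enough upvote share.

    votes: iterable of (solution_id, vote) pairs, where vote is +1 or -1.
    The thresholds are placeholder values, not tuned numbers.
    """
    up = defaultdict(int)
    total = defaultdict(int)
    for solution_id, vote in votes:
        total[solution_id] += 1
        if vote > 0:
            up[solution_id] += 1
    return sorted(
        sid for sid in total
        if total[sid] >= min_votes and up[sid] / total[sid] >= min_ratio
    )
```

The vote filter is deliberately conservative: a solution with two upvotes and no downvotes stays out of the recommended set until more engineers have confirmed it.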

The goal is simple: every outage should make the system smarter, not just the engineer who fixed it.

I’m working on an early MVP:

- a minimal session recorder that emits structured JSON
- basic incident-type feeds
- voting
- a first pass of local LLM summarization
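As a rough idea of what "a session recorder that emits structured JSON" might mean, here is a minimal sketch. The schema is an assumption, not the actual format: the `Session` and `Step` classes and every field name (`incident_type`, `offset_s`, `resolved`, and so on) are hypothetical. Note that the session id is random rather than tied to a user, in keeping with the anonymous-trace idea.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    command: str      # what the engineer ran
    exit_code: int    # how it ended
    offset_s: float   # seconds since the session started


@dataclass
class Session:
    incident_type: str
    # Random id: identifies the trace, not the person.
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    steps: list = field(default_factory=list)
    resolved: bool = False

    def record(self, command, exit_code):
        """Append one troubleshooting step with its time offset."""
        offset = round(time.time() - self.started_at, 3)
        self.steps.append(Step(command, exit_code, offset))

    def to_json(self):
        """Serialize the whole trace as one JSON document."""
        return json.dumps(asdict(self), sort_keys=True)
```

Usage would look something like this:

```python
session = Session("disk-full")
session.record("df -h", 0)
session.record("journalctl --vacuum-size=500M", 0)
session.resolved = True
print(session.to_json())
```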

Not a full product. Just exploring the space and validating whether others see the same gap.

Would love to talk with people who:

- work in SRE or oncall
- build observability or internal tooling
- have tried to reduce repeated incidents
- think about AI-assisted remediation
- or have built infra startups before

If this resonates, feel free to DM me here on HN. Happy to share more privately.