I’ve worked in operations and SRE for years, mostly in environments where outages, oncall fatigue and recurring incidents were normal. Like many here, I’ve been through the 3 AM pages, the tribal knowledge, the fixes that disappear in scrolling Slack threads, and the runbooks nobody touches until it’s too late.
I also live with Type-1 diabetes. That forces me to run extremely disciplined systems in my personal life: continuous monitoring, feedback loops, automated corrections, stability under stress. It shaped how I think about infrastructure in an unexpected way.
I literally have oncall in my blood: my blood sugar is basically a live monitoring system.
And that made me notice something strange about current observability stacks:
We measure everything except the one thing that actually resolves incidents: the human problem-solving process.
Every outage generates knowledge, but most of it evaporates:
- shell history disappears
- Slack conversations drift away
- senior engineers fix things silently
- runbooks rot
- context is lost
- the same incident happens again and gets solved from scratch again
So I’m exploring a new layer for the SRE stack: an Incident Intelligence Layer.
High-level idea (no deep tech here):
- troubleshooting sessions become structured, anonymous traces
- each incident type gets a shared knowledge feed
- engineers upvote or downvote solutions
- a local LLM summarizes recurring patterns
- a sanitized layer allows safe use of a public LLM
- repeated successful solutions gradually become recommended actions or potential automation candidates
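To make the last bullet concrete, here is a rough sketch of how votes and repeated successes could promote a solution. Everything in it is an assumption for illustration: the Solution fields, the promotion_status function, and the thresholds are made up, not something I have settled on.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    """One proposed fix in an incident-type feed (hypothetical shape)."""
    summary: str
    upvotes: int = 0
    downvotes: int = 0
    successful_uses: int = 0  # times this fix actually resolved the incident


def promotion_status(sol, min_score=2, recommend_after=3, automate_after=10):
    """Classify a solution by net votes and repeated success.

    The thresholds are placeholders; real values would need tuning per team.
    """
    score = sol.upvotes - sol.downvotes
    if score < min_score:
        return "unverified"
    if sol.successful_uses >= automate_after:
        return "automation_candidate"
    if sol.successful_uses >= recommend_after:
        return "recommended"
    return "unverified"


if __name__ == "__main__":
    fix = Solution("restart the stuck worker pool", upvotes=5, downvotes=1, successful_uses=4)
    print(promotion_status(fix))
```

The point of requiring both votes and successful uses is that a popular fix that rarely works, or a working fix that people distrust, should not get promoted.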
The goal is simple: every outage should make the system smarter, not just the engineer who fixed it.
I’m working on an early MVP:
- a minimal session recorder that emits structured JSON
- basic incident-type feeds
- voting
- a first pass of local LLM summarization
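As an illustration of the first bullet, a session recorder could be as small as the sketch below. The field names (session_id, incident_type, steps) and the helper functions are hypothetical, not a fixed schema; the only design choice baked in is that no user or host identity is stored, just a random session ID.

```python
import json
import time
import uuid


def new_session(incident_type):
    """Start a trace identified only by a random ID (no user or host info)."""
    return {
        "session_id": uuid.uuid4().hex,
        "incident_type": incident_type,
        "steps": [],
    }


def record_step(session, command, exit_code, ts=None):
    """Append one troubleshooting step; ts defaults to the current time."""
    session["steps"].append({
        "ts": time.time() if ts is None else ts,
        "command": command,
        "exit_code": exit_code,
    })


def emit(session):
    """Serialize the trace as a single JSON document."""
    return json.dumps(session, sort_keys=True)


if __name__ == "__main__":
    s = new_session("disk-full")
    record_step(s, "df -h", 0)
    record_step(s, "journalctl --vacuum-size=500M", 0)
    print(emit(s))
```

Commands can still leak hostnames, paths, or secrets, so a real recorder would need a redaction pass before anything leaves the machine.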
Not a full product. Just exploring the space and validating whether others see the same gap.
Would love to talk with people who:
- work in SRE or oncall
- build observability or internal tooling
- have tried to reduce repeated incidents
- think about AI-assisted remediation
- or have built infra startups before
If this resonates, feel free to DM me here on HN. Happy to share more privately.