
AI for On-Call: Route Alerts, Suppress Noise, Add Runbook Context

On-call engineers aren’t burning out because they don’t care. They’re burning out because modern systems page too often, with too little context.

If you’ve ever been woken up to an alert that turns out to be “known,” “expected,” or “not actionable,” you already know the problem:

  • Too many alerts

  • Not enough signal

  • No shared context

  • Humans doing correlation at 3:00am

This page explains what “AI for on-call” should actually mean in practice, what works (and what doesn’t), and a simple approach you can implement without turning your ops team into an R&D lab.

The real issue isn’t alerting, it’s decision quality

 

Most alerting systems are good at one thing: detecting that something changed.

 

They’re usually bad at the things humans actually need:

  • Is this real or noise?

  • Is this urgent or can it wait?

  • Who owns it?

  • What changed upstream?

  • What’s the safest next step?

 

“AI for on-call” is valuable when it improves those decisions, not when it produces more notifications with fancier wording.

What “AI for on-call” should do 

 

1) Suppress noise

 

Before waking a human up, the system should remove:

  • duplicates (“same alarm, 30 times”)

  • flapping (“on/off/on/off”)

  • expected events (maintenance windows, scheduled curtailment, planned restarts)

  • low-actionability alerts (informational, “FYI”)

  • alerts that don’t correlate to impact

 

2) Route to the right person (or team)


Routing isn’t “who’s on call.” It’s:

  • who owns this asset / subsystem

  • who has the right access

  • who can actually take action

  • who is best suited (domain + skills)

  • what the escalation path is if they don’t respond

3) Attach a context pack before paging

If you do wake someone up, the page should already include:

  • what changed (not just the alarm name)

  • relevant telemetry around the event

  • related alarms grouped into one incident

  • recent changes (deploys, config edits, vendor interventions)

  • the most likely causes (ranked)

  • the correct runbook section + “first 3 safe checks”

  • whether this is happening elsewhere (fleet pattern)

 

The goal is simple: reduce time-to-understanding and time-to-action.

Why most “AI alerting” fails

 

A lot of vendors pitch “AI for ops,” but it falls apart when the inputs are messy.

 

Common failure modes:

  • AI is bolted onto raw tags with inconsistent naming

  • alerts have no asset model (no topology, no hierarchy, no ownership)

  • runbooks live in PDFs and SharePoint with no linking to signals

  • every customer/site is “unique,” so the model can’t generalize

  • AI can summarize alerts, but can’t prove why it thinks something is urgent


If your data foundation is fragmented, AI doesn’t create clarity; it creates confident noise.

A practical architecture that actually works

 

You don’t need a “sci-fi autonomous NOC.” You need a simple pipeline that turns raw alerts into one clean decision.


Step 1: Normalize signals (so AI isn’t guessing)

 

Standardize:

  • alarm types and severity

  • asset identifiers

  • subsystem names

  • units, timestamps, and states

  • site / fleet context

 

If a site changes an OPC tag or Modbus register, the normalization layer should absorb that change so downstream logic stays stable.
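A minimal sketch of that normalization layer, in Python. The tag map, severity codes, and field names here are illustrative assumptions, not a fixed schema; the point is that a site-level tag rename only touches the mapping table, never downstream logic.

```python
from dataclasses import dataclass

# Hypothetical per-site mapping from raw tag names (OPC tags, Modbus
# registers) to stable canonical signal IDs. If a site renames a tag,
# only this table changes; everything downstream stays stable.
TAG_MAP = {
    "site12": {
        "PCS1_HB": "pcs.heartbeat",
        "MODBUS_40013": "pcs.status",
    },
}

# Normalize vendor-specific severity codes into one vocabulary.
SEVERITY_MAP = {"CRIT": "critical", "MAJ": "major", "MIN": "minor", "INFO": "info"}

@dataclass
class NormalizedAlarm:
    site: str
    asset: str      # canonical asset identifier (derived from the signal)
    signal: str     # canonical signal name
    severity: str   # normalized severity
    ts: float       # epoch seconds, UTC

def normalize(site: str, raw_tag: str, raw_severity: str, ts: float) -> NormalizedAlarm:
    # Unmapped tags are flagged rather than dropped, so gaps in the
    # mapping table are visible instead of silent.
    signal = TAG_MAP.get(site, {}).get(raw_tag, f"unmapped.{raw_tag}")
    asset = signal.split(".")[0]
    return NormalizedAlarm(site, asset, signal, SEVERITY_MAP.get(raw_severity, "unknown"), ts)
```

Downstream grouping and routing can then key off `asset` and `signal` without ever seeing raw vendor tags.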

 

Step 2: Convert alerts into incidents

 

Instead of 50 pages, create one incident:

  • group alarms by time window + subsystem + shared cause

  • dedupe repeated alarms

  • detect flapping and suppress until stable

  • apply maintenance/override rules
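The grouping step can be sketched in a few lines. This assumes a simple shared-time-window heuristic per subsystem (the 5-minute window and the tuple shape are illustrative choices, not recommendations):

```python
from collections import defaultdict

WINDOW_S = 300  # group alarms within a 5-minute window (illustrative default)

def group_into_incidents(alarms):
    """Group (ts, subsystem, alarm_name) tuples into incidents.

    Alarms on the same subsystem whose timestamps fall within WINDOW_S
    of the incident's first alarm join that incident; repeats of the
    same alarm name are deduped into a count instead of a new page.
    """
    incidents = []          # each: {"subsystem", "start", "alarms": {name: count}}
    open_by_subsystem = {}  # most recent open incident per subsystem
    for ts, subsystem, name in sorted(alarms):
        inc = open_by_subsystem.get(subsystem)
        if inc is None or ts - inc["start"] > WINDOW_S:
            inc = {"subsystem": subsystem, "start": ts, "alarms": defaultdict(int)}
            incidents.append(inc)
            open_by_subsystem[subsystem] = inc
        inc["alarms"][name] += 1  # dedupe: "same alarm, 30 times" -> one entry
    return incidents
```

In a real pipeline you would also fold in topology (shared cause) and maintenance overrides before anything pages.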

 

Step 3: Attach context

 

Build an automated “vitals report” per incident:

  • last 15–60 minutes of key signals

  • alarms + events in the same window

  • last known good state

  • “similar incident” links (previous occurrences)

  • runbook link + relevant section
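One way to represent that context pack is a single structure assembled before anyone is paged. Every field name below is an assumption for illustration; the invariant is that the page carries telemetry, recent changes, and a runbook link, not just an alarm name.

```python
from dataclasses import dataclass, field

@dataclass
class VitalsReport:
    """Context pack attached to an incident before paging (illustrative schema)."""
    summary: str
    window_signals: dict                 # last N minutes of key signals
    related_alarms: list                 # alarms grouped into this incident
    recent_changes: list                 # deploys, config edits, vendor sessions
    similar_incidents: list              # links to previous occurrences
    runbook_url: str
    first_safe_checks: list = field(default_factory=list)

def build_report(incident, telemetry, changelog, runbooks) -> VitalsReport:
    key = incident["subsystem"]
    return VitalsReport(
        summary=f"{key}: {', '.join(incident['alarms'])}; started {incident['start']}",
        # "mw_cmd"/"mw_actual" stand in for whatever your key signals are.
        window_signals={s: telemetry.get(s, []) for s in ("mw_cmd", "mw_actual")},
        related_alarms=list(incident["alarms"]),
        recent_changes=[c for c in changelog if c.get("subsystem") == key],
        similar_incidents=[],  # lookup of past incidents omitted in this sketch
        runbook_url=runbooks.get(key, "runbook://unmapped"),
        first_safe_checks=["verify network path", "check UPS status",
                           "confirm PCS heartbeat"],  # hypothetical checks
    )
```

The "unmapped" fallback matters: an incident with no runbook linkage is itself a gap worth surfacing.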

 

Step 4: Route + escalate

 

Use rules first, AI second:

  • ownership map (asset / vendor / contract boundary)

  • severity thresholds

  • “wake up” policy (what pages after-hours vs what waits)

  • follow-the-sun routing if you have coverage

  • escalation paths and timeouts

 

AI can assist by ranking likely causes and summarizing, but routing should stay deterministic unless you have very high confidence.
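"Rules first, AI second" can be made concrete with a deterministic router. The ownership map shape, the after-hours window, and the severity threshold below are all placeholder assumptions for whatever your on-call tool actually uses:

```python
def route(incident_severity: str, asset: str, hour_local: int,
          ownership: dict, wake_threshold: str = "major") -> dict:
    """Rules-first routing: ownership map + wake-up policy, no ML involved.

    `ownership` maps asset -> {"owner": ..., "escalation": [...]};
    names are placeholders, not a real tool's schema.
    """
    rank = {"info": 0, "minor": 1, "major": 2, "critical": 3}
    entry = ownership.get(asset, {"owner": "catch-all", "escalation": []})
    after_hours = hour_local < 7 or hour_local >= 19
    # Wake-up policy: after hours, only severities at/above the threshold page.
    page_now = (not after_hours) or rank[incident_severity] >= rank[wake_threshold]
    return {
        "owner": entry["owner"],
        "escalation": entry["escalation"],
        "action": "page" if page_now else "queue-for-morning",
    }
```

AI-ranked likely causes can ride along in the context pack, but nothing in this function depends on a model being right.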

Noise suppression patterns that reliably reduce pages

 

These are boring on purpose.

  • Deduplication: collapse identical alarms within a window

  • Hysteresis: require a condition to persist before paging

  • Flap detection: suppress on/off transitions until stable

  • Topology-aware grouping: treat “downstream alarms” as symptoms

  • Maintenance windows: don’t page on planned actions

  • Impact gating: page on impact to output, safety, SLA, compliance

  • Fleet context: if it’s happening everywhere, it’s probably upstream
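Hysteresis and flap detection in particular fit in one small class. The thresholds below (2-minute hold, 4 flips per 10 minutes) are illustrative defaults, not tuning advice:

```python
class FlapSuppressor:
    """Suppress a single alert condition until it is stable.

    Hysteresis: the condition must persist `hold_s` seconds before paging.
    Flap detection: more than `max_flips` state changes inside
    `flap_window_s` means stay quiet until it settles.
    """
    def __init__(self, hold_s=120, flap_window_s=600, max_flips=4):
        self.hold_s = hold_s
        self.flap_window_s = flap_window_s
        self.max_flips = max_flips
        self.transitions = []   # timestamps of state changes
        self.state = False
        self.since = 0.0

    def observe(self, active: bool, ts: float) -> bool:
        """Feed one observation; return True only when a page should go out."""
        if active != self.state:
            self.state = active
            self.since = ts
            self.transitions.append(ts)
        # Forget transitions that fell out of the flap window.
        self.transitions = [t for t in self.transitions
                            if ts - t <= self.flap_window_s]
        if len(self.transitions) > self.max_flips:
            return False        # flapping: suppress until stable
        if not self.state:
            return False        # condition cleared
        return ts - self.since >= self.hold_s   # hysteresis: must persist
```

Deduplication and topology-aware grouping then operate on what survives this filter, not on the raw stream.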

If you implement only two things, do:

  1. incident grouping

  2. context packs with runbook links

 

That alone often cuts pages dramatically.

What the “vitals report” should look like (example using battery energy storage systems)

 

When the page goes out, include:

 

Incident summary

  • “BESS Site 12: PCS comms loss; started 02:13; ongoing”


Why this matters

  • “Impacts availability + can prevent dispatch; SLA risk if persists > 30 min”


What changed

  • “PCS heartbeat dropped; subsequent alarms consistent with upstream comms loss”


Key signals

  • MW command vs MW actual, DC bus voltage, Aux Power Feeder Status, PCS status, network health, UPS status


Recent changes

  • “Firmware update 2 days ago” / “new firewall rule yesterday” / “vendor remote session”


Runbook

  • Link to exact section: “PCS comms loss – first checks”

  • Bullet list: first 3 safe steps

 

Escalation

  • Who owns it (internal vs OEM vs site ops)

  • Who was paged + next escalation time

 

This turns a 3am page from “what is this?” into “I know exactly what to do next.”
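As a sketch, the sections above collapse into a small formatter. The field names mirror the example and are assumptions, not a fixed schema:

```python
def render_page(r: dict) -> str:
    """Format an incident's context pack as the page body.

    Expects keys: summary, impact, what_changed, signals, recent_changes,
    runbook_url, first_checks, owner, next_escalation (illustrative names).
    """
    lines = [
        f"INCIDENT: {r['summary']}",
        f"WHY IT MATTERS: {r['impact']}",
        f"WHAT CHANGED: {r['what_changed']}",
        "KEY SIGNALS: " + ", ".join(r["signals"]),
        "RECENT CHANGES: " + "; ".join(r["recent_changes"]),
        f"RUNBOOK: {r['runbook_url']}",
        "FIRST CHECKS: " + " -> ".join(r["first_checks"]),
        f"ESCALATION: {r['owner']}; next escalation {r['next_escalation']}",
    ]
    return "\n".join(lines)
```

However you render it, the ordering matters: summary and impact first, so the responder can triage before reading anything else.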

Build in-house vs. build with a data layer

 

You can build this yourself, but you need the normalized data layer to make it scalable.


The hidden work isn’t “AI.” It’s:

  • keeping integrations alive across many sites/vendors

  • handling edge cases and drift (tags, firmware, network changes)

  • governance + access controls across multiple stakeholders

  • security patching and monitoring

  • normalizing datasets across sites, OEMs, and integrators

  • maintaining runbook linkage and ownership maps as orgs change


A pragmatic approach is:

  • build your incident policies + ownership logic (your domain)

  • use a stable data foundation so everything downstream stays sane (Phaseshift)


At Phaseshift, our point of view is simple: AI for on-call only works when your alert streams are normalized, noise is suppressed, and every page includes the right runbook and telemetry context.

 

If your team is dealing with alert fatigue and on-call burnout, especially in energy or industrial operations, and you want to compare notes on a practical pilot, reach out. We’re happy to walk through what’s worked and what hasn’t.
