Anomaly Detection Benchmark on Real Sensor Data

Most anomaly detection tutorials use toy datasets with obvious, synthetic anomalies. This one doesn't. I benchmarked six methods against real-world machine sensor data from the Numenta Anomaly Benchmark (NAB) — a dataset where anomalies are rare, messy, and genuinely hard to find.

The short version: Prophet generalises better across stream types, but a simple z-score wins on clean, high-amplitude anomalies. Here's what that means and why it matters.

The Dataset

The NAB machine_temperature_system_failure stream is 22,683 rows of 5-minute temperature readings from an industrial machine. There are 4 anomaly events in the entire series — equipment failure signatures buried in months of normal operation. That's a 0.02% positive rate.

This matters because most tutorials inflate their anomaly rates to make F1 look good. At 0.02%, F1 is near-zero for every method regardless of quality. The right metrics here are AUROC (how well the method ranks anomalies above normals, threshold-free) and Average Precision (area under the precision-recall curve, which respects class imbalance).
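Both metrics are one call each in scikit-learn (assuming `sklearn` is available; the toy numbers below are illustrative, not from the benchmark):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy example: 1000 points, 2 true anomalies that the scorer ranks on top.
rng = np.random.default_rng(0)
y_true = np.zeros(1000, dtype=int)
y_true[[200, 700]] = 1
scores = rng.uniform(0, 0.5, size=1000)
scores[[200, 700]] = [0.9, 0.8]        # both anomalies outrank every normal point

print(roc_auc_score(y_true, scores))            # 1.0: perfect ranking
print(average_precision_score(y_true, scores))  # 1.0: perfect precision-recall
```

Neither metric needs a threshold, which is exactly why they stay informative at a 0.02% positive rate.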

I also had to be careful with the labels. NAB ships two label files: combined_windows.json marks wide time windows (up to 47 hours) around each event for partial-credit scoring. combined_labels.json marks the actual event timestamps. Using windows as point labels inflates the anomaly rate to ~10% and makes every metric meaningless. I used point labels throughout.


The Methods

The notebook includes a few baselines for context, but the core showcase is Prophet residual scoring:

Rolling MAD — Median absolute deviation over a rolling window. Robust to outliers in the baseline. Warm-started across split boundaries so the window has history from the previous split at evaluation time.
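A minimal sketch of this scorer in pandas, including the warm start (the window size and exact MAD formulation here are illustrative; the notebook's implementation may differ in detail):

```python
import numpy as np
import pandas as pd

def rolling_mad_scores(train: pd.Series, test: pd.Series, window: int = 48) -> pd.Series:
    """Score each test point by its deviation from the rolling-window median,
    in units of the window's MAD. Warm-started with train history."""
    # Prepend the last (window - 1) train rows so the very first test point
    # already has a full window behind it (avoids the cold-start artefact).
    warm = pd.concat([train.iloc[-(window - 1):], test], ignore_index=True)

    def score_last(x: np.ndarray) -> float:
        med = np.median(x)
        mad = np.median(np.abs(x - med)) + 1e-9   # guard against zero MAD
        return abs(x[-1] - med) / mad

    scores = warm.rolling(window).apply(score_last, raw=True)
    return scores.iloc[window - 1:].reset_index(drop=True)  # drop warm-up rows
```

The returned series is exactly `len(test)` long, with no partial-window noise at the start.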

STL — Seasonal-Trend decomposition via LOESS. Separates the series into trend, seasonality, and residual. Anomaly score is the residual magnitude. Offline method — it sees the full series, so it's retrospective, not causal.

Mean-Window Discord — Divides the series into overlapping windows, z-normalises each, computes distance to the mean train window. High distance = unusual pattern. This is not the matrix profile (which uses nearest-neighbour distance) — it's a simpler discord baseline fitted on train windows only.
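A sketch of this baseline in NumPy (function names are mine; the notebook's implementation may differ in detail):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def znorm_windows(x: np.ndarray, w: int) -> np.ndarray:
    """Overlapping length-w windows, each z-normalised (shape-only comparison)."""
    wins = sliding_window_view(x, w).astype(float)
    mu = wins.mean(axis=1, keepdims=True)
    sd = wins.std(axis=1, keepdims=True) + 1e-9
    return (wins - mu) / sd

def discord_scores(train: np.ndarray, test: np.ndarray, w: int = 48) -> np.ndarray:
    ref = znorm_windows(train, w).mean(axis=0)   # mean reference window, train only
    diffs = znorm_windows(test, w) - ref
    return np.linalg.norm(diffs, axis=1)         # one score per test window
```

Distance to a single mean window is much cheaper than the matrix profile's nearest-neighbour search, which is the whole point of this baseline.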

Prophet — Facebook's forecasting library. Fits an additive model with trend and seasonality components on the training split, then forecasts on the test split. Anomaly score is the absolute forecast residual. It explicitly models daily and weekly cycles.
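Prophet itself is a library call, but the scoring idea can be shown library-free: fit an expected baseline of trend plus a seasonal profile on the training split, then score test points by the absolute residual. This stand-in is far simpler than Prophet's actual model and is only meant to illustrate the mechanism:

```python
import numpy as np

def seasonal_residual_scores(train_y: np.ndarray, test_y: np.ndarray,
                             period: int = 288) -> np.ndarray:
    """Fit linear trend + one seasonal profile on train; score test by |residual|.
    A deliberately simple stand-in for Prophet's richer additive model.
    period=288 is one day of 5-minute samples."""
    t_train = np.arange(len(train_y))
    slope, intercept = np.polyfit(t_train, train_y, 1)        # linear trend
    detrended = train_y - (slope * t_train + intercept)
    # Seasonal profile: mean detrended value at each phase of the cycle.
    profile = np.array([detrended[i::period].mean() for i in range(period)])

    t_test = np.arange(len(train_y), len(train_y) + len(test_y))
    expected = slope * t_test + intercept + profile[t_test % period]
    return np.abs(test_y - expected)   # "how far from what we should see right now?"
```

Everything here is fitted on train only; the test split is scored against a forecast, never against itself.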

LSTM — A single-layer LSTM forecaster. Trained on the training split with early stopping on a held-out portion. Standardisation uses train statistics only. Anomaly score is the absolute difference between predicted and actual value.
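A compact PyTorch sketch of this setup (hidden size, epoch count, and the lack of early stopping are simplifications of the notebook's actual training loop):

```python
import numpy as np
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Single-layer LSTM that predicts the next value from a window."""
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])          # predict the next value

def make_windows(series: np.ndarray, w: int):
    X = np.stack([series[i:i + w] for i in range(len(series) - w)])
    y = series[w:]
    return (torch.tensor(X, dtype=torch.float32).unsqueeze(-1),
            torch.tensor(y, dtype=torch.float32).unsqueeze(-1))

def lstm_scores(train: np.ndarray, test: np.ndarray, w: int = 48,
                epochs: int = 20) -> np.ndarray:
    # Standardise with *train* statistics only (no test leakage).
    mu, sd = train.mean(), train.std() + 1e-9
    train_n, test_n = (train - mu) / sd, (test - mu) / sd

    model, loss_fn = LSTMForecaster(), nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    X, y = make_windows(train_n, w)
    for _ in range(epochs):                      # full-batch; no early stopping here
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

    # Score the test split: |predicted - actual|, windows warm-started from train's tail.
    Xt, yt = make_windows(np.concatenate([train_n[-w:], test_n]), w)
    with torch.no_grad():
        return (model(Xt) - yt).abs().squeeze(-1).numpy()
```

The warm start from the train tail mirrors the Rolling MAD treatment: the first test prediction already has a full window of history behind it.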


Evaluation Protocol

Split: time-ordered train/val/test. Thresholds (when used) are tuned on the validation split and then frozen on test. Because point anomalies are extremely sparse in NAB, AUROC and AP are the most reliable metrics; F1 is highly threshold-sensitive and should be treated as secondary.

Scaling: each method's raw scores are min-max scaled using train statistics, then applied to test. This keeps the score scale consistent so a threshold set on one split is actually meaningful on another.
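The key point is that the scaler is fitted once, on train scores, and then reused. A minimal sketch:

```python
import numpy as np

def fit_minmax(train_scores: np.ndarray):
    """Min-max scaler fitted on train scores only; reuse it on val and test."""
    lo, hi = train_scores.min(), train_scores.max()
    span = (hi - lo) + 1e-9
    # Scores outside the train range can fall below 0 or above 1 -- that is
    # fine, and exactly what keeps thresholds comparable across splits.
    return lambda s: (s - lo) / span
```

Fit once on train scores, then call the returned function on val and test scores alike.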


Results

| Method | Val AUROC | Val AP | Test AUROC | Test AP |
| --- | --- | --- | --- | --- |
| Prophet | 0.861 | 0.440 | 0.890 | 0.825 |
| LSTM | 0.570 | 0.229 | 0.559 | 0.143 |
| STL | 0.581 | 0.237 | 0.449 | 0.089 |
| Rolling MAD | 0.506 | 0.209 | 0.516 | 0.132 |
| Mean-Window Discord | 0.503 | 0.178 | 0.346 | 0.074 |


Why Prophet's Win Matters for Anomaly Detection

In real monitoring, the hardest anomalies are often contextual: the value might not be extreme in absolute terms, but it is wrong for the time and operating regime. That’s exactly what Prophet is designed to handle.

Prophet builds an explicit expected baseline — trend + seasonality — and scores anomalies using the absolute forecast residual.

This changes the question from “how far is this from the overall mean?” to:

“How far is this from what we should be seeing right now?”

That framing is why Prophet is so effective on production streams with daily/weekly cycles (cloud CPU, request rate, business KPIs). It reduces false positives from predictable peaks and troughs and concentrates alerts on genuinely unexpected behavior.

On the machine temperature failure stream, Prophet remains strong (Test AUROC 0.890, Test AP 0.825) while providing an interpretable story: the observed temperature diverged sharply from the model’s expected baseline.

The practical lessons for anomaly detection:

  • Model the baseline first. Prophet separates trend and seasonality so anomalies are scored as residual surprises, not raw magnitude.
  • Reduce alert fatigue. By accounting for predictable cycles, Prophet avoids flagging routine peaks and focuses alerts on genuinely unexpected behavior.
  • Prefer ranking metrics on sparse labels. AUROC/AP are the most reliable summary metrics when anomalies are rare.

Where Prophet Shines in Practice

Prophet is especially useful on metrics with strong daily/weekly cycles — cloud CPU, request rate, and business KPIs — where the goal is to detect deviations from an expected rhythm.

The key idea is simple: a value can look “normal” in absolute terms but still be anomalous because it occurs at the wrong time or in the wrong regime. Prophet captures that by forecasting the expected baseline and scoring residuals.

If you want to extend this notebook beyond machine_temp, add a seasonal cloud metric (e.g., an EC2 CPU stream) and compare residual plots. Prophet’s residuals typically look cleaner than raw-value thresholds because the cycle is modeled away.


What Didn’t Work

LSTM underperformed significantly (AUROC 0.559 on machine_temp). With only 13,000 training rows and a 48-step window, the model doesn’t have enough signal to learn a useful forecast. It also trains on clean data and sees the failure regime for the first time at test time — a distribution shift the architecture has no mechanism to handle gracefully.

STL is offline: it decomposes the full test split, so it has future context the causal methods lack. Despite that advantage it scores 0.449 AUROC, worse than random ranking (0.5). The residuals are simply too noisy relative to the anomaly magnitude.

Mean-Window Discord scores 0.346 — the worst result. The mean reference window is computed on train data where the machine was operating normally. When the failure pattern is genuinely unlike anything in train, the method should score high. It doesn’t, likely because the distance metric is too sensitive to short-term variation during the normal operating period.


Implementation Notes Worth Knowing

A few details that actually changed the results:

Off-grid label timestamps. NAB’s combined_labels.json timestamps don’t align to the 5-minute CSV grid — anomaly events are logged at times like 02:04 while the series rows are at 00:00, 00:05, 00:10.... Exact timestamp matching silently drops all labels. The fix is merge_asof with a 5-minute tolerance.
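A small reproduction of the failure mode and the fix (timestamps here are illustrative):

```python
import pandas as pd

# 5-minute grid covering 00:00-02:55, and one label logged off-grid at 02:04.
series = pd.DataFrame(
    {"timestamp": pd.date_range("2014-02-01 00:00", periods=36, freq="5min")}
)
series["row_id"] = range(len(series))
labels = pd.DataFrame({"timestamp": pd.to_datetime(["2014-02-01 02:04"]),
                       "label": [1]})

# Exact join: 02:04 matches no grid row, so the label is silently dropped.
exact = series.merge(labels, on="timestamp", how="left")
print(int(exact["label"].fillna(0).sum()))    # 0 -- every label lost

# As-of join: snap each label to the nearest grid row within 5 minutes.
snapped = pd.merge_asof(labels, series, on="timestamp",
                        direction="nearest", tolerance=pd.Timedelta("5min"))
print(snapped["row_id"].tolist())             # [25] -- the 02:05 row
```

Both frames must be sorted on the join key for `merge_asof`, and the tolerance stops a label from snapping to an arbitrarily distant row.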

Per-split score scaling breaks frozen thresholds. If you min-max scale train, val, and test scores independently, a threshold of 0.3 means different things in each split. The threshold is nominally “frozen” but the score it’s applied to has been rescaled. The fix: fit the scaler on train scores only, apply to all splits.

Rolling window cold start. A rolling MAD window computed from scratch at the start of the test split has no history — the first 48 rows use partial windows and produce noisy scores. The fix: prepend the last 47 rows of train before computing test scores, then drop them from the output.

None of these are dramatic bugs. All of them silently degrade results without error messages.


What to Take Away

Use AUROC and AP as your primary metrics on sparse anomaly data. F1 is unreliable when positive rate is below 1%.

Prophet is a strong default for real operational metrics because it models the expected baseline (trend + seasonality) and flags unexpected deviations via residuals. When anomalies are rare, focus on ranking metrics (AUROC/AP) and the residual plots — that’s what determines whether an alert is useful in practice.

The hardest part of this project wasn’t the models — it was getting the evaluation right. Wrong label files, leaking scale parameters, cold-start artefacts: all of these are invisible unless you go looking for them.

bottom of page