turingtower/blog

From Threshold Alerts to Information-Theoretic Stability: Quantifying Behavioral Drift

Modern vulnerability management measures exposure; it rarely measures behavioral change. By applying information-theoretic divergence scoring (e.g., Jensen-Shannon or KL divergence), security teams can quantify how far an asset’s current alert distribution has shifted from its historical norm. In simple terms: instead of asking “Did alerts spike?”, this approach asks “Did the mix of behavior change meaningfully?” A bounded divergence score becomes a stability index that tracks regime shifts over time.

The argument we want to defend here is narrower than this preamble suggests, and we want to put it directly. Alert volume is the wrong object of measurement, and most security telemetry pipelines are optimizing it because it is easy to count, not because it is informative. The right object is the distribution of behavior, meaning what categories of activity an entity produces, in what proportions, over rolling time windows. Divergence measures over those distributions produce something volume cannot: a quantitative signal of structural change. What matters is the shift from counting events to scoring distributions.

Two things this argument is not. It is not event-level anomaly detection in a new wrapper. And it is not contingent on machine learning, despite the math; the substrate is information theory, which is older, formal, and easier to defend.

How a Stability Score Is Constructed

Consider a stylized asset that historically produces alerts in a 70/20/10 split across authentication, process execution, and network categories. This week the total alert count is unchanged, but the mix becomes 30/20/50. No category necessarily spikes beyond a threshold, and no individual event is anomalous. The behavioral composition has changed materially, and a detector keyed to total volume or to individual event severity will miss it.

Running Jensen-Shannon divergence (base-2 logarithm convention) on the two distributions yields a score of approximately 0.16 on the [0, 1] scale. If this asset’s score previously oscillated below 0.05 from window to window, that jump is the signal. The distribution has moved structurally, and the score makes that movement legible without any reference to volume or individual event severity.
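The arithmetic behind that figure is short enough to check by hand. The sketch below is a minimal pure-Python version of Jensen-Shannon divergence with base-2 logarithms. The function names are ours, chosen for illustration. One caveat worth knowing: some libraries, SciPy's `scipy.spatial.distance.jensenshannon` among them, return the Jensen-Shannon distance (the square root of the divergence), which would read about 0.41 for the same inputs.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; bins where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    # The mixture m is nonzero wherever p or q is nonzero,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Order: authentication, process execution, network
baseline = [0.70, 0.20, 0.10]
current = [0.30, 0.20, 0.50]

score = js_divergence(baseline, current)
print(f"{score:.3f}")  # 0.164
```

Identical distributions score 0 and distributions with no overlapping categories score 1, which is what makes the number usable as an index.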

We use JSD here for the illustration because the distribution is sparse and nominal, and JSD is an easy divergence measure to reason about on the page: it is bounded, symmetric, and well-behaved on zero bins. On ordinal or structured supports, where bins have meaningful distance between them, Wasserstein distance is the more honest choice because it accounts for that structure. The argument is about the family of divergence-based measurements, not about JSD specifically.
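A small contrast shows why the support matters. The sketch below assumes a hypothetical five-level ordinal scale (say, severity 1 to 5) with equally spaced bins, which is our assumption and not part of the example above. It moves all mass either one level or four levels away from the baseline. JSD sees two non-overlapping distributions in both cases and returns the same score; the 1-D Wasserstein distance, computed here as the sum of absolute CDF differences for unit-spaced bins, tells them apart.

```python
import math
from itertools import accumulate

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl_to_mixture(x):
        return sum(a * math.log2(a / b) for a, b in zip(x, m) if a > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

def wasserstein_1d(p, q):
    """W1 between histograms on ordinal bins with unit spacing (assumed)."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

# Hypothetical ordinal bins: severity levels 1..5
baseline = [1.0, 0.0, 0.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0, 0.0, 0.0]   # mass moved one level
far = [0.0, 0.0, 0.0, 0.0, 1.0]    # mass moved four levels

print(js_divergence(baseline, near), js_divergence(baseline, far))    # 1.0 1.0
print(wasserstein_1d(baseline, near), wasserstein_1d(baseline, far))  # 1.0 4.0
```

On a nominal schema such as alert categories there is no meaningful distance between bins, so this distinction does not arise and JSD loses nothing.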

This is also where the approach separates from anomaly detection, the technique it is most often compared to. Anomaly detection evaluates events against a learned distribution of events; the object of measurement is a point. Stability scoring evaluates a distribution against another distribution; the object of measurement is the shape. These operations can return contradictory verdicts: an entity can have zero anomalous events in a window while its overall distribution has shifted materially.

The baseline matters more than the divergence measure. A global baseline averages away the asset-specific patterns that carry the signal; a static baseline decays as legitimate operational change accumulates. The baseline must be entity-scoped and rolling. Scores must also be persistence-filtered: a single elevated score is a data point, not a finding. Without persistence filtering, the noise floor that makes threshold alerting exhausting reappears in divergence form.
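One way to make "entity-scoped and rolling" concrete is sketched below. Everything named here is hypothetical: the `CATEGORIES` schema, the `RollingBaseline` class, the four-window default, and the entity names. Each entity keeps its own bounded history of per-window counts; the baseline is the pooled distribution over that history, and each new window is scored before it joins the history.

```python
import math
from collections import Counter, defaultdict, deque

# Hypothetical schema; a real one is the auditable modeling decision.
CATEGORIES = ("authentication", "process_execution", "network")

def to_distribution(counts):
    """Normalize category counts to proportions; None if there is no data."""
    total = sum(counts.get(c, 0) for c in CATEGORIES)
    if total == 0:
        return None
    return [counts.get(c, 0) / total for c in CATEGORIES]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl_to_mixture(x):
        return sum(a * math.log2(a / b) for a, b in zip(x, m) if a > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

class RollingBaseline:
    """Per-entity baseline pooled over the last `windows` windows."""

    def __init__(self, windows=4):
        self._history = defaultdict(lambda: deque(maxlen=windows))

    def score_and_update(self, entity, window_counts):
        history = self._history[entity]
        pooled = Counter()
        for past in history:
            pooled.update(past)
        baseline = to_distribution(pooled)
        current = to_distribution(window_counts)
        history.append(Counter(window_counts))  # oldest window drops off
        if baseline is None or current is None:
            return None  # nothing to compare against yet
        return js_divergence(baseline, current)

tracker = RollingBaseline(windows=4)
stable = {"authentication": 70, "process_execution": 20, "network": 10}
shifted = {"authentication": 30, "process_execution": 20, "network": 50}
warmup = [tracker.score_and_update("web-01", stable) for _ in range(4)]
score = tracker.score_and_update("web-01", shifted)
print(warmup, round(score, 3))  # [None, 0.0, 0.0, 0.0] 0.164
```

Because the shifted window is appended after scoring, a sustained shift is gradually absorbed into the baseline. How quickly that should happen is a tuning decision this sketch does not settle.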

The Real Objection

The strongest critique of this approach lives at the schema layer. Distributions are themselves a modeling choice. By selecting what categories define an entity’s distribution (alert types, operation types, MITRE techniques, table-access patterns), you have smuggled human judgment back into the system at the schema level. Divergence measures do not give you objectivity. They give you a different bias, hidden behind nicer math.

This is correct, and we want to engage with it rather than dispatch it.

The defense is not that the categorization scheme is bias-free. It is that the categorization scheme concentrates the bias into one visible, auditable decision. With threshold-based detection, the same human judgment is distributed across hundreds of individually tuned rules, each with its own implicit assumptions about what matters. The bias exists; it is just diffuse and ungovernable. With distributional drift, the bias is consolidated into the schema. You can review the schema. You can version it. You can challenge it. A single auditable decision is not the same as objectivity, but it is a strictly better governance surface than the rule-by-rule tuning it replaces.

This trade, diffuse implicit bias for concentrated explicit bias, is the actual case we are making.

Reframing Vulnerability Management As Continuous Behavioral Integrity

Traditional vulnerability programs optimize for discovery and remediation velocity. Proactive security requires continuous validation that controls and entities are behaving as expected. Exposure without behavioral context produces static risk; exposure plus drift produces dynamic risk insight.

Exposure tells you what could be exploited. It is a snapshot. Behavioral drift tells you what is actually changing inside that surface while it operates. The interesting assets are not the ones with the most CVEs. They are the ones where the behavioral distribution is shifting while the exposure surface remains open.

Detecting Regime Change Before Incidents Emerge

Regime change detection focuses on sustained shifts in system behavior rather than isolated anomalies. When the distribution of alerts for an asset materially diverges from its historical state, it may indicate attack progression, misconfiguration, telemetry failure, or policy regression. Persistent divergence signals structural change.

Persistence is what separates regime change from operational noise. Deployments and maintenance windows produce divergence that reverts. For instance, suppose a web server’s drift score normally sits around 4%. On day 11 it spikes to 18%, then drops back to 5% the next day; this is the pattern a deployment or similar event produces. When the score instead holds at 17% across days 13, 14, and 15, that persistence is the signal of a structural shift. Ranking entities by magnitude and persistence selects for the assets whose behavior has structurally changed and stayed changed.
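The web server example can be run through a simple persistence filter. The sketch below is one possible filter, not a prescribed one: the 0.10 threshold, the three-window minimum, and the function name are assumptions, and the scores for days 1 to 10 are filler values at the stated norm. Scores are written as fractions (0.04 for 4%).

```python
def persistent_shift_days(scores, threshold=0.10, min_windows=3):
    """Return 1-based days on which an elevated run has lasted min_windows."""
    flagged = []
    run = 0
    for day, score in enumerate(scores, start=1):
        # Extend the run while the score stays elevated; reset on reversion.
        run = run + 1 if score > threshold else 0
        if run >= min_windows:
            flagged.append(day)
    return flagged

# Days 1-10 near the norm, a one-day spike on day 11, reversion on day 12,
# then a sustained shift across days 13-15.
drift = [0.04] * 10 + [0.18, 0.05, 0.17, 0.17, 0.17]

print(persistent_shift_days(drift))  # [15]
```

The day-11 spike never produces a finding because the run resets on day 12. The sustained shift is flagged on day 15, the third consecutive elevated window.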

Operationalizing Proactive Security With Stability Scoring

Behavioral stability monitoring can be layered onto existing telemetry without new sensors. By visualizing drift over time and correlating it with vulnerability and control posture data, security teams gain a defensible, measurable proactive capability.

The same construction generalizes. Population Stability Index has done analogous work in credit risk monitoring and ML feature drift detection for decades: the same family of measure, applied to the same purpose. Latency distributions in SRE, trace span composition in observability, fraud event taxonomies in financial monitoring: each is, or can be binned into, a categorical telemetry stream subject to the same measurement primitive, with different schemas and different baseline rhythms. Each deserves its own treatment, and we will return to them separately.
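For comparison, here is a minimal sketch of Population Stability Index applied to the same 70/20/10 and 30/20/50 distributions used earlier. PSI is the sum over bins of (q - p) * ln(q / p), which is the symmetrized form of KL divergence. Unlike JSD it is unbounded, and it is undefined on empty bins, so the sketch floors proportions at a small `epsilon`; that floor and its value are our assumptions, not part of the definition.

```python
import math

def population_stability_index(p, q, epsilon=1e-6):
    """PSI = sum((q_i - p_i) * ln(q_i / p_i)), with empty bins floored."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi, epsilon)  # assumed guard against log(0) and division by zero
        qi = max(qi, epsilon)
        total += (qi - pi) * math.log(qi / pi)
    return total

baseline = [0.70, 0.20, 0.10]
current = [0.30, 0.20, 0.50]

psi = population_stability_index(baseline, current)
print(f"{psi:.2f}")  # 0.98
```

The two measures rank this shift the same way but on different scales, which is why thresholds do not transfer between them.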

The implementation surface is small. Normalize the telemetry you already collect into categorical distributions per entity. Compute divergence against rolling baselines. Persist-filter. Trend. Dashboards before alerts. You cannot threshold what you have not seen.
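The first step, normalizing raw telemetry into per-entity categorical distributions, might look like the sketch below. The event shape (dicts with `entity` and `category` keys), the schema, and the choice to route unrecognized categories into an explicit `other` bin are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical schema, with a catch-all bin so unknown categories stay visible.
OTHER = "other"
SCHEMA = ("authentication", "process_execution", "network", OTHER)

def distributions_by_entity(events):
    """Map each entity to its proportions over SCHEMA for one time window."""
    counts = defaultdict(Counter)
    for event in events:
        category = event["category"] if event["category"] in SCHEMA else OTHER
        counts[event["entity"]][category] += 1
    result = {}
    for entity, counter in counts.items():
        total = sum(counter.values())
        result[entity] = {c: counter[c] / total for c in SCHEMA}
    return result

# Illustrative events for a single window
events = [
    {"entity": "web-01", "category": "authentication"},
    {"entity": "web-01", "category": "authentication"},
    {"entity": "web-01", "category": "authentication"},
    {"entity": "web-01", "category": "network"},
    {"entity": "db-01", "category": "process_execution"},
    {"entity": "db-01", "category": "dns_tunnel"},  # not in SCHEMA
]

window = distributions_by_entity(events)
print(window["web-01"]["authentication"], window["db-01"][OTHER])  # 0.75 0.5
```

Keeping the bin order fixed by the schema means every window for every entity yields a vector of the same length, which is what the divergence step needs.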