Beyond 99.9%: A Deep Dive into a User-Centric Weighted Sampling Algorithm for Availability

Published: 2026-06-26
Author: DP
## The Problem: Limitations of Traditional Availability

In service monitoring, we often use "availability" to measure a service's stability. The most common calculation is `successful_requests / total_requests`. However, this simple metric can be misleading in many scenarios:

1. **The Averaging Trap**: If a service goes down for an hour at the beginning of the month but runs perfectly for the rest, its monthly availability will still be very high (e.g., 99.8%). This figure fails to capture the terrible user experience during that one-hour outage.
2. **The Performance Black Hole**: A request might return a successful status code (like HTTP 200) but take 30 seconds to complete. To the system, it's a "success," but to the user, it's virtually equivalent to the service being unavailable.

To address these issues, we need a calculation model that better reflects the actual user experience. This article analyzes an internal practice from the `wiki.lib00.com` project, which employs an advanced algorithm based on sampling and weighted calculation.

---

## Core Algorithm Explained

The core idea of this algorithm is: **the *current* state of a service is more important than its historical average, and service *quality* (performance) is as important as its *availability***. This is achieved through three main steps:

### 1. Time Window Selection: Focusing on the "Recent 20%"

To make the status calculation more timely, the algorithm discards historical data and only uses the most recent 20% of data blocks from the timeline as its sample. This ensures that the result quickly reflects the latest changes in the service, whether it's recovering from a failure or just encountered an issue.

```php
// Get the most recent 20% of time blocks (always at least one block)
$recentCount = max(1, (int)ceil($totalBlocks * 0.2));
$recentBlocks = array_slice($timeline, -$recentCount);
```

This approach is crucial for real-time Status Pages or monitoring dashboards because it focuses on whether the service is operational *right now*.

### 2. Weighted Availability: Introducing a Service Quality Penalty

This is the most innovative part of the algorithm. It redefines "availability" by introducing "slow requests" as an intermediate state and penalizing them.

```php
// Calculate weighted availability (number_format returns a string such as '96.00')
$uptimePercent = $totalTests > 0
    ? number_format((($totalSuccess * 1.0 + $totalSlow * 0.8) / $totalTests) * 100, 2)
    : '0.00';
```

The formula can be broken down as:

`Availability = (Successful_Requests * 1.0 + Slow_Requests * 0.8) / Total_Requests`

- **Successful Requests (`totalSuccess`)**: Contribute with a weight of `1.0`, representing a perfect service.
- **Slow Requests (`totalSlow`)**: Contribute with a weight of `0.8`, indicating the service is available but the experience is degraded. It acknowledges accessibility but deducts points for poor performance. This `0.8` weight is a business decision, set by the `DP@lib00` team based on user tolerance.
- **Failed Requests (`totalFail`)**: Contribute with a weight of `0`, representing a completely unavailable service.

This way, the calculated `uptime_percent` is no longer just "uptime" but a more comprehensive "Service Health Index."

### 3. Status Determination: Mapping to Discrete States

Finally, the algorithm maps the continuous failure rate metric to discrete, human-readable states: Normal, Degraded, and Outage.

```php
$status = 1; // Normal by default
if ($totalTests > 0) {
    $failRate = $totalFail / $totalTests;
    if ($failRate >= 0.9) {
        $status = 3; // Outage
    } elseif ($failRate > 0 || $totalSlow > 0) {
        $status = 2; // Degraded
    }
}
```

- **Outage**: Failure rate is 90% or higher; the service is essentially down.
- **Degraded**: Any single failure or slow request is enough to mark the service quality as degraded.
- **Normal**: No failures and no slow requests. This is also the default when the sample contains no tests at all.

---

## Authoritativeness and Best Practices Evaluation

The design philosophy of this algorithm aligns closely with modern concepts like Google's **SRE (Site Reliability Engineering)** and **SLOs (Service Level Objectives)**. Modern SLOs have long moved beyond simple availability to include metrics like latency and quality that impact user satisfaction.

**Advantages:**

* **User-Experience Oriented**: It factors in performance issues, bringing the metric closer to the user's actual perception.
* **High Timeliness**: Using a time-window sampling approach makes the metric sensitive and quick to reflect the current state.
* **Computationally Efficient**: The logic is clear and lightweight, making it suitable for high-frequency real-time monitoring systems, such as the monitoring module at `wiki.lib00`.

**Potential Improvements:**

* **Risks of a Fixed Ratio Window**: When the total data volume is very low, `20%` can result in a sample size that is too small, leading to status flapping. A fixed time window (e.g., "the last 15 minutes") might be a more robust choice.
* **Justification for Weights**: The definition of "slow" and the `0.8` weight value should be backed by clear business or technical reasoning and tied to the product's SLOs.
* **Sample Size Issues**: During periods of very low traffic, judging a service as having an "outage" based on just a few requests might be too sensitive. Introducing a "minimum sample size" check could improve decision stability.

---

## Conclusion

This weighted, sampling-based availability calculation method is an excellent practice that aligns with modern monitoring principles. By focusing on recent data and quantifying the impact of performance, it provides a far more accurate and actionable view of service health than traditional binary (success/failure) models.
For any team, like the `lib00` team, looking to build a user-centric service monitoring system, this algorithm offers a valuable reference.
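
To see how the three steps fit together, here is a minimal end-to-end sketch. It is not the project's actual code: the function name `summarizeHealth`, the block shape (`success`, `slow`, `fail` counters that do not overlap), and the `$minSamples` guard are all assumptions made for illustration. The guard is one possible reading of the "minimum sample size" idea from the Potential Improvements list: an Outage is only declared once enough tests have been observed, and smaller samples fall back to Degraded.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper combining the three steps described in this article.
 * Assumed block shape: ['success' => int, 'slow' => int, 'fail' => int],
 * with the three counters disjoint. This is an illustration, not the real schema.
 */
function summarizeHealth(
    array $timeline,
    float $ratio = 0.2,       // share of most recent blocks to sample
    float $slowWeight = 0.8,  // penalty weight for slow requests
    int $minSamples = 20      // assumed guard: tests required before declaring an outage
): array {
    $totalBlocks = count($timeline);
    if ($totalBlocks === 0) {
        return ['uptime_percent' => '0.00', 'status' => 1, 'tests' => 0];
    }

    // Step 1: keep only the most recent share of blocks (at least one).
    $recentCount = max(1, (int)ceil($totalBlocks * $ratio));
    $recentBlocks = array_slice($timeline, -$recentCount);

    // Sum the counters over the sampled blocks.
    $totalSuccess = array_sum(array_column($recentBlocks, 'success'));
    $totalSlow = array_sum(array_column($recentBlocks, 'slow'));
    $totalFail = array_sum(array_column($recentBlocks, 'fail'));
    $totalTests = $totalSuccess + $totalSlow + $totalFail;

    // Step 2: weighted availability, formatted as a string with two decimals.
    $uptimePercent = $totalTests > 0
        ? number_format((($totalSuccess + $totalSlow * $slowWeight) / $totalTests) * 100, 2)
        : '0.00';

    // Step 3: map to 1 = Normal, 2 = Degraded, 3 = Outage.
    $status = 1;
    if ($totalTests > 0) {
        $failRate = $totalFail / $totalTests;
        if ($failRate >= 0.9 && $totalTests >= $minSamples) {
            $status = 3; // Outage, only with enough samples
        } elseif ($failRate > 0 || $totalSlow > 0) {
            $status = 2; // Degraded, including under-sampled outages
        }
    }

    return ['uptime_percent' => $uptimePercent, 'status' => $status, 'tests' => $totalTests];
}

// Usage: 8 old blocks of pure failures, then 2 recent blocks with some slow requests.
$timeline = array_merge(
    array_fill(0, 8, ['success' => 0, 'slow' => 0, 'fail' => 10]),
    array_fill(0, 2, ['success' => 8, 'slow' => 2, 'fail' => 0])
);
$result = summarizeHealth($timeline);
echo $result['uptime_percent'], ' / status ', $result['status'], PHP_EOL; // prints: 96.00 / status 2
```

In the usage example, the old failures fall outside the sampled 20%, so only the two recent blocks count: 16 fast and 4 slow requests give (16 + 4 × 0.8) / 20 = 96%. Whether an under-sampled outage should fall back to Degraded or to some separate "unknown" state is a product decision this sketch does not settle.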