PHP Log Aggregation Performance Tuning: Database vs. Application Layer - The Ultimate Showdown for Millions of Records

Published: 2026-01-06
Author: DP
Category: PHP
## Background

In developing monitoring or logging systems, data aggregation is a core task: massive amounts of raw data must be turned into meaningful statistical metrics. Recently, a developer on an internal project at `wiki.lib00.com` ran into a typical problem: how can a PHP script efficiently aggregate monitoring logs on a server with limited resources (1-core CPU, 2GB RAM), and how many logs can safely be processed in a single aggregation batch?

The user's initial aggregation query was as follows:

```sql
SELECT
    service_id,
    node_id,
    profile_id,
    ? as hour_start,
    COUNT(*) as count_total,
    SUM(CASE WHEN result_id = ? THEN 1 ELSE 0 END) as count_success,
    SUM(CASE WHEN result_id = ? THEN 1 ELSE 0 END) as count_slow,
    SUM(CASE WHEN result_id = ? THEN 1 ELSE 0 END) as count_fail,
    (SUM(CASE WHEN result_id IN (?, ?) THEN 1 ELSE 0 END) / COUNT(*)) as availability,
    AVG(latency_ms) as latency_avg,
    MIN(latency_ms) as latency_min,
    MAX(latency_ms) as latency_max,
    MAX(latency_ms) as latency_p95
FROM monitor_logs
WHERE timestamp >= ? AND timestamp < ?
GROUP BY service_id, node_id, profile_id;
```

Based on this scenario, we explored three common approaches, analyzed their pros and cons, and arrived at a set of best practices.

---

## Approach 1: Optimized Database Aggregation (The Right Way)

This is the most direct and efficient solution: let the database do what it does best, filtering and aggregating data. However, for it to run efficiently, several key conditions must be met.

#### 1. Indexing Is the Soul of Performance

The core of the query is `WHERE timestamp >= ? AND timestamp < ?`. If the `timestamp` column is not indexed, MySQL is forced into a full table scan, which is a disaster for a table with millions or even tens of millions of rows. Therefore, **the first priority is to ensure a proper index exists**. A composite index that leads with the time column and includes the grouping columns serves both the filter and the `GROUP BY`:

```sql
-- Add a composite index to the monitor_logs table
ALTER TABLE monitor_logs
ADD INDEX idx_ts_service_node (timestamp, service_id, node_id, profile_id);
```

With this index, the database can locate the rows in the target time range extremely quickly, significantly reducing I/O and CPU consumption.

#### 2. Control Cardinality and Batch Size

* **Cardinality**: the number of unique combinations produced by `GROUP BY service_id, node_id, profile_id`. The more combinations there are, the more memory the database needs to maintain intermediate results; if memory runs short, it falls back to on-disk temporary tables and performance drops sharply.
* **Batch size**: don't try to aggregate a very long time range (a full day or week) in one pass. The best practice is to **process in small time-window batches** (e.g., one run every 5 or 10 minutes), so the amount of data touched by a single query stays within a manageable range.

With these optimizations, on a 1-core, 2GB server, processing **500,000 to 1,000,000** logs per aggregation run is a relatively safe range. For stricter real-time requirements, lower this to **100,000 to 300,000**.

#### 3. Correct the P95 Calculation

It's worth noting that `MAX(latency_ms) as latency_p95` in the original SQL is incorrect: `MAX` returns the maximum value (P100), not the 95th percentile. Computing a precise percentile such as P95 in MySQL is relatively complex and resource-intensive, typically requiring window functions (MySQL 8.0+).
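If an approximate P95 is good enough, one pragmatic option is to let MySQL return a random sample of latencies for the window and compute the percentile in PHP. Below is a minimal sketch under stated assumptions: the helper name, the PDO connection, the 10,000-row sample cap, and the use of `ORDER BY RAND()` are illustrative choices, not details from the original project.

```php
<?php
// Sketch: estimate P95 latency from a random sample instead of MAX().
// Assumes the monitor_logs schema from the query above and a PDO connection;
// the sample cap of 10,000 rows is an illustrative, unbenchmarked value.
function sampleP95(PDO $pdo, string $from, string $to, int $sampleSize = 10000): ?float
{
    // ORDER BY RAND() still sorts every row in the window inside MySQL,
    // which is only tolerable because the batched time windows are small.
    $stmt = $pdo->prepare(
        'SELECT latency_ms FROM monitor_logs
         WHERE timestamp >= ? AND timestamp < ?
         ORDER BY RAND()
         LIMIT ' . $sampleSize
    );
    $stmt->execute([$from, $to]);
    $latencies = $stmt->fetchAll(PDO::FETCH_COLUMN);

    if (empty($latencies)) {
        return null; // no logs in this window
    }

    sort($latencies);
    // Nearest-rank method: the value at the 95% position of the sorted sample.
    $index = (int) ceil(0.95 * count($latencies)) - 1;

    return (float) $latencies[$index];
}
```

Only the sampled rows cross the network to PHP, so the memory footprint stays small even on a 2GB machine.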
In this scenario, it's acceptable to use `MAX` as a rough performance indicator or to calculate the percentile by sampling in the application layer, as in the sketch above.

---

## Approach 2: Splitting the Query (The Classic N+1 Trap)

A common misconception is: if there are only 10-20 `node_id`s, can I split one `GROUP BY` query into 10-20 individual queries, each with a `WHERE node_id = ?` condition?

**The answer is: Absolutely not!** This is a classic anti-pattern, also known as the "N+1 query problem."

| Feature | Single Aggregation Query (Recommended) | Multiple Split Queries (Not Recommended) |
| :--- | :--- | :--- |
| **Data Scan** | **Once**, very efficient | **10-20 times**, with significant redundant scanning |
| **Database Overhead** | **Low** (1 parse & plan) | **High** (10-20 parses & plans) |
| **Network Latency** | **Low** (1 round-trip) | **High** (10-20 round-trips) |
| **Code Complexity** | **Low** (one SQL statement) | **High** (requires looping and SQL concatenation) |
| **Resource Utilization** | **Excellent** (DB can perform global optimization) | **Poor** (no global optimization possible) |

**Conclusion**: A single query is far superior in performance to multiple queries. The database's query optimizer can make globally optimal decisions when it sees the entire task, an ability it loses when the query is split. We should optimize the aggregation SQL, not try to work around it.

---

## Approach 3: Processing in PHP (The Memory and Efficiency Nightmare)

An even more extreme idea is: what if I fetch all the raw data into PHP memory at once and then use PHP code (e.g., in CLI mode) to loop and calculate? Would that be faster?

**The answer is: This is the worst choice in almost every scenario.**

#### Why is this an anti-pattern?

1. **Massive Data Transfer**: Database aggregation returns only a few dozen result rows, whereas PHP processing requires transferring **millions of raw log entries** from the database to the PHP process over the network. This is the biggest bottleneck.
2. **Horrific Memory Consumption**: PHP arrays carry significant per-element overhead. On a 2GB server, loading even a few hundred thousand rows is enough to exhaust PHP's `memory_limit` and crash the script. This is a lesson learned from `DP@lib00`'s practical experience.
3. **Low Computational Efficiency**: The database executes aggregations in compiled C/C++ code, which is **orders of magnitude faster** than a `foreach` loop in an interpreted language like PHP.

**A fitting analogy**: It's like letting a professional chef (the database) prepare ingredients in the kitchen, instead of bringing all the raw meat and vegetables (raw data) to your dining table (PHP memory) to process yourself with a small knife. The difference in efficiency is night and day.

---

## Final Conclusion

For tasks like PHP log aggregation, the best practices are:

1. **Trust Your Database**: Let the database handle what it excels at: data filtering and aggregation.
2. **Create Efficient Indexes**: Build appropriate composite indexes for the `WHERE` and `GROUP BY` columns in your queries, especially time-based columns.
3. **Process in Batches**: Break large tasks into smaller, time-windowed batches to keep system load stable and predictable (see the sketch at the end of this post).
4. **Avoid Anti-patterns**: Firmly avoid common performance traps like "N+1 queries" and processing massive amounts of raw data in the application layer.

By following these principles, validated by the team at `wiki.lib00`, you can build an efficient, stable, and scalable log aggregation system, even on servers with limited resources.
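---

As a closing illustration of points 1-3, here is a minimal sketch of a batched aggregation runner. The PDO connection settings, the `monitor_summary` destination table, the result-ID constants, and the assumption that "success" and "slow" both count toward availability are all illustrative, not details from the original project.

```php
<?php
// Sketch of a batched aggregation runner (e.g., triggered by cron).
// Assumptions: a monitor_summary table matching the SELECT list below,
// result_id constants 1/2/3, a 10-minute window, and that success + slow
// rows count as "available" -- none of this is confirmed by the original post.
const RESULT_SUCCESS = 1;
const RESULT_SLOW    = 2;
const RESULT_FAIL    = 3;

function aggregateWindow(PDO $pdo, string $from, string $to): int
{
    // INSERT ... SELECT keeps every row scan and aggregation inside MySQL;
    // PHP only orchestrates the time windows, so its memory usage stays flat.
    $sql = '
        INSERT INTO monitor_summary
            (service_id, node_id, profile_id, hour_start, count_total,
             count_success, count_slow, count_fail, availability,
             latency_avg, latency_min, latency_max, latency_p95)
        SELECT
            service_id, node_id, profile_id,
            ? as hour_start,
            COUNT(*) as count_total,
            SUM(CASE WHEN result_id = ? THEN 1 ELSE 0 END) as count_success,
            SUM(CASE WHEN result_id = ? THEN 1 ELSE 0 END) as count_slow,
            SUM(CASE WHEN result_id = ? THEN 1 ELSE 0 END) as count_fail,
            (SUM(CASE WHEN result_id IN (?, ?) THEN 1 ELSE 0 END) / COUNT(*)) as availability,
            AVG(latency_ms) as latency_avg,
            MIN(latency_ms) as latency_min,
            MAX(latency_ms) as latency_max,
            MAX(latency_ms) as latency_p95  -- rough indicator; see the P95 note above
        FROM monitor_logs
        WHERE timestamp >= ? AND timestamp < ?
        GROUP BY service_id, node_id, profile_id';

    $stmt = $pdo->prepare($sql);
    $stmt->execute([
        $from,                                        // window start stored as hour_start
        RESULT_SUCCESS, RESULT_SLOW, RESULT_FAIL,     // per-result counters
        RESULT_SUCCESS, RESULT_SLOW,                  // result IDs treated as "available"
        $from, $to,                                   // time window
    ]);

    return $stmt->rowCount(); // summary rows written for this window
}

// Usage: walk one hour in 10-minute windows.
$pdo   = new PDO('mysql:host=localhost;dbname=monitoring', 'user', 'pass');
$start = new DateTimeImmutable('2026-01-06 10:00:00');

for ($i = 0; $i < 6; $i++) {
    $from = $start->add(new DateInterval('PT' . ($i * 10) . 'M'));
    $to   = $from->add(new DateInterval('PT10M'));
    aggregateWindow($pdo, $from->format('Y-m-d H:i:s'), $to->format('Y-m-d H:i:s'));
}
```

PHP never touches a raw log row here; it only decides which 10-minute window to aggregate next, which is exactly the division of labor this article recommends.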