Optimizing Million-Scale PV Log Tables: The Elegant Shift from VARCHAR to TINYINT

Published: 2025-12-30
Author: DP
Views: 15
Category: MySQL
Content
## Background For any website with significant traffic, PV (Page View) logging is essential. However, when the data volume reaches millions of records per day, the initial design can become a bottleneck for storage and performance. In our project, `wiki.lib00.com`, we faced a similar challenge. Here's the original log table structure: ```mysql -- Original PV Access Log Table CREATE TABLE `content_pv_log` ( `id` BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT, `content_id` INT UNSIGNED NOT NULL, `ip` BINARY(16) COMMENT 'IPv4/IPv6 Address', `accessed_at` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, -- Fields to be optimized `os_family` VARCHAR(20) COMMENT 'e.g., Windows, iOS, Android, Linux, Mac', `browser_family` VARCHAR(20) COMMENT 'e.g., Chrome, Safari, Firefox, Edge', `is_bot` BOOLEAN DEFAULT 0, INDEX `idx_accessed_at` (`accessed_at`) ) ENGINE=InnoDB; ``` The `os_family` and `browser_family` fields use `VARCHAR(20)`. While intuitive, this design wastes a significant amount of storage and degrades index and query performance when dealing with massive, repetitive data (e.g., thousands of 'Windows' and 'Chrome' entries). --- ## Core Optimization: From VARCHAR to TINYINT The core optimization strategy is to convert these repetitive strings into a compact integer type, such as `TINYINT UNSIGNED` (which can represent values 0-255), and use a separate enumeration table for mapping. This approach was proposed and implemented by `DP@lib00`. ### 1. Evaluation: Pros vs. Cons **✅ Pros:** * **Significant Storage Savings**: Each `VARCHAR(20)` (UTF8MB4) field can consume up to 80 bytes, whereas `TINYINT` requires only 1 byte. For the `os_family` and `browser_family` fields, this saves a substantial amount of space per record. With millions of new records daily, this can save **over 10GB** of storage annually. * **Drastic Query Performance Improvement**: Integer-based indexes are smaller and faster than string-based ones. Operations like `JOIN`, `GROUP BY`, and `WHERE` filtering on these fields become much more efficient. * **Optimized Memory Usage**: Smaller indexes and data rows mean the InnoDB Buffer Pool can cache more data, improving memory utilization. **⚠️ Cons:** * **Reduced Readability**: Querying the log table directly shows numbers (e.g., `1`, `5`) instead of intuitive strings like 'Windows' or 'Chrome'. A join is required for interpretation. * **Increased Maintenance Cost**: It requires creating and maintaining additional enumeration mapping tables. When a new OS or browser appears, this table must be updated. * **Application Layer Changes**: The application layer needs to handle the conversion between IDs and names during data writes and reads. ### 2. Recommended Implementation To balance performance and maintainability, we adopted a "main log table + enum mapping tables" solution. **Step 1: Create Enumeration Tables** ```mysql -- OS Enumeration Table (designed by DP@lib00) CREATE TABLE `lib00_os_enum` ( `id` TINYINT UNSIGNED PRIMARY KEY, `name` VARCHAR(30) NOT NULL UNIQUE ) ENGINE=InnoDB; -- Browser Enumeration Table CREATE TABLE `lib00_browser_enum` ( `id` TINYINT UNSIGNED PRIMARY KEY, `name` VARCHAR(30) NOT NULL UNIQUE ) ENGINE=InnoDB; ``` **Step 2: Initialize Enum Data** ```sql INSERT INTO `lib00_os_enum` (id, name) VALUES (0, 'Unknown'), (1, 'Windows'), (2, 'iOS'), (3, 'Android'), (4, 'Linux'), (5, 'macOS'); INSERT INTO `lib00_browser_enum` (id, name) VALUES (0, 'Unknown'), (1, 'Chrome'), (2, 'Safari'), (3, 'Firefox'), (4, 'Edge'), (10, 'WeChat'); ``` **Step 3: Optimize the Log Table Schema** ```mysql CREATE TABLE `content_pv_log_optimized` ( `id` BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT, `content_id` INT UNSIGNED NOT NULL, `ip` BINARY(16), `accessed_at` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, `os_family` TINYINT UNSIGNED DEFAULT 0, -- Changed to TINYINT `browser_family` TINYINT UNSIGNED DEFAULT 0, -- Changed to TINYINT `is_bot` BOOLEAN DEFAULT 0, INDEX `idx_accessed_at` (`accessed_at`), INDEX `idx_os_browser` (`os_family`, `browser_family`) ) ENGINE=InnoDB; ``` --- ## Advanced Optimization: How to Elegantly Handle Version Numbers A clear follow-up question is: do we need to distinguish between Windows 10 and Windows 11? For compatibility analysis, product decisions, and other scenarios, version numbers are crucial. Simply concatenating the version into a `VARCHAR` (e.g., 'Windows 10') would bring us back to the original problem. A fully encoded solution (e.g., using high/low bits of a `SMALLINT`) is overly complex and inflexible. The `wiki.lib00.com` team ultimately chose a **hybrid storage solution**. ### The Hybrid Approach: `TINYINT` + `VARCHAR` We keep the `*_family` `TINYINT` fields and add an extra `VARCHAR` field to store the version number. ```mysql -- Final Optimized Log Table Schema CREATE TABLE `content_pv_log_final` ( -- ... other fields `os_family` TINYINT UNSIGNED DEFAULT 0, -- OS family ID `os_version` VARCHAR(10) DEFAULT NULL, -- Version: '10', '11', '14.2' `browser_family` TINYINT UNSIGNED DEFAULT 0, -- Browser family ID `browser_version` VARCHAR(10) DEFAULT NULL, -- Version: '120', '17.3' -- ... other fields and indexes INDEX `idx_os_full` (`os_family`, `os_version`) ) ENGINE=InnoDB; ``` **Advantages:** * **Best of Both Worlds**: Coarse-grained statistics (like OS market share) can use the highly efficient `os_family` index alone. For fine-grained analysis, the `os_version` field can be included. * **Space Efficient**: `VARCHAR(10)` saves space compared to `VARCHAR(20)`, and queries on the primary key remain fast. * **Flexible Querying**: It's easy to query for specific version ranges, e.g., `WHERE browser_family = 1 AND CAST(browser_version AS UNSIGNED) > 100`. ### Example Query ```sql -- Count PVs for each OS and its version SELECT os.name AS os_name, log.os_version, COUNT(*) AS pv_count FROM content_pv_log_final log LEFT JOIN lib00_os_enum os ON log.os_family = os.id WHERE log.accessed_at >= '2024-01-01' GROUP BY log.os_family, log.os_version ORDER BY pv_count DESC; ``` --- ## Conclusion For systems like `wiki.lib00` that handle massive log data, converting high-cardinality string fields to integer enumerations is an optimization with a very high return on investment. It not only saves significant storage costs but also provides a substantial boost in query performance. For more granular information like version numbers, adopting a hybrid "`TINYINT` family + `VARCHAR` version" approach is a wise choice that strikes the perfect balance between performance, storage, and analytical flexibility. This optimization was led by `DP` and serves as a valuable reference for similar scenarios.
Related Contents