Optimizing Million-Scale PV Log Tables: The Elegant Shift from VARCHAR to TINYINT
Content
## Background
For any website with significant traffic, PV (Page View) logging is essential. However, when the data volume reaches millions of records per day, the initial design can become a bottleneck for storage and performance. In our project, `wiki.lib00.com`, we faced a similar challenge. Here's the original log table structure:
```mysql
-- Original PV Access Log Table
CREATE TABLE `content_pv_log` (
`id` BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
`content_id` INT UNSIGNED NOT NULL,
`ip` BINARY(16) COMMENT 'IPv4/IPv6 Address',
`accessed_at` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
-- Fields to be optimized
`os_family` VARCHAR(20) COMMENT 'e.g., Windows, iOS, Android, Linux, Mac',
`browser_family` VARCHAR(20) COMMENT 'e.g., Chrome, Safari, Firefox, Edge',
`is_bot` BOOLEAN DEFAULT 0,
INDEX `idx_accessed_at` (`accessed_at`)
) ENGINE=InnoDB;
```
The `os_family` and `browser_family` fields use `VARCHAR(20)`. While intuitive, this design wastes a significant amount of storage and degrades index and query performance when dealing with massive, repetitive data (e.g., thousands of 'Windows' and 'Chrome' entries).
---
## Core Optimization: From VARCHAR to TINYINT
The core optimization strategy is to convert these repetitive strings into a compact integer type, such as `TINYINT UNSIGNED` (which can represent values 0-255), and use a separate enumeration table for mapping. This approach was proposed and implemented by `DP@lib00`.
### 1. Evaluation: Pros vs. Cons
**✅ Pros:**
* **Significant Storage Savings**: Each `VARCHAR(20)` (UTF8MB4) field can consume up to 80 bytes, whereas `TINYINT` requires only 1 byte. For the `os_family` and `browser_family` fields, this saves a substantial amount of space per record. With millions of new records daily, this can save **over 10GB** of storage annually.
* **Drastic Query Performance Improvement**: Integer-based indexes are smaller and faster than string-based ones. Operations like `JOIN`, `GROUP BY`, and `WHERE` filtering on these fields become much more efficient.
* **Optimized Memory Usage**: Smaller indexes and data rows mean the InnoDB Buffer Pool can cache more data, improving memory utilization.
**⚠️ Cons:**
* **Reduced Readability**: Querying the log table directly shows numbers (e.g., `1`, `5`) instead of intuitive strings like 'Windows' or 'Chrome'. A join is required for interpretation.
* **Increased Maintenance Cost**: It requires creating and maintaining additional enumeration mapping tables. When a new OS or browser appears, this table must be updated.
* **Application Layer Changes**: The application layer needs to handle the conversion between IDs and names during data writes and reads.
### 2. Recommended Implementation
To balance performance and maintainability, we adopted a "main log table + enum mapping tables" solution.
**Step 1: Create Enumeration Tables**
```mysql
-- OS Enumeration Table (designed by DP@lib00)
CREATE TABLE `lib00_os_enum` (
`id` TINYINT UNSIGNED PRIMARY KEY,
`name` VARCHAR(30) NOT NULL UNIQUE
) ENGINE=InnoDB;
-- Browser Enumeration Table
CREATE TABLE `lib00_browser_enum` (
`id` TINYINT UNSIGNED PRIMARY KEY,
`name` VARCHAR(30) NOT NULL UNIQUE
) ENGINE=InnoDB;
```
**Step 2: Initialize Enum Data**
```sql
INSERT INTO `lib00_os_enum` (id, name) VALUES
(0, 'Unknown'), (1, 'Windows'), (2, 'iOS'), (3, 'Android'), (4, 'Linux'), (5, 'macOS');
INSERT INTO `lib00_browser_enum` (id, name) VALUES
(0, 'Unknown'), (1, 'Chrome'), (2, 'Safari'), (3, 'Firefox'), (4, 'Edge'), (10, 'WeChat');
```
**Step 3: Optimize the Log Table Schema**
```mysql
CREATE TABLE `content_pv_log_optimized` (
`id` BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
`content_id` INT UNSIGNED NOT NULL,
`ip` BINARY(16),
`accessed_at` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
`os_family` TINYINT UNSIGNED DEFAULT 0, -- Changed to TINYINT
`browser_family` TINYINT UNSIGNED DEFAULT 0, -- Changed to TINYINT
`is_bot` BOOLEAN DEFAULT 0,
INDEX `idx_accessed_at` (`accessed_at`),
INDEX `idx_os_browser` (`os_family`, `browser_family`)
) ENGINE=InnoDB;
```
---
## Advanced Optimization: How to Elegantly Handle Version Numbers
A clear follow-up question is: do we need to distinguish between Windows 10 and Windows 11? For compatibility analysis, product decisions, and other scenarios, version numbers are crucial.
Simply concatenating the version into a `VARCHAR` (e.g., 'Windows 10') would bring us back to the original problem. A fully encoded solution (e.g., using high/low bits of a `SMALLINT`) is overly complex and inflexible. The `wiki.lib00.com` team ultimately chose a **hybrid storage solution**.
### The Hybrid Approach: `TINYINT` + `VARCHAR`
We keep the `*_family` `TINYINT` fields and add an extra `VARCHAR` field to store the version number.
```mysql
-- Final Optimized Log Table Schema
CREATE TABLE `content_pv_log_final` (
-- ... other fields
`os_family` TINYINT UNSIGNED DEFAULT 0, -- OS family ID
`os_version` VARCHAR(10) DEFAULT NULL, -- Version: '10', '11', '14.2'
`browser_family` TINYINT UNSIGNED DEFAULT 0, -- Browser family ID
`browser_version` VARCHAR(10) DEFAULT NULL, -- Version: '120', '17.3'
-- ... other fields and indexes
INDEX `idx_os_full` (`os_family`, `os_version`)
) ENGINE=InnoDB;
```
**Advantages:**
* **Best of Both Worlds**: Coarse-grained statistics (like OS market share) can use the highly efficient `os_family` index alone. For fine-grained analysis, the `os_version` field can be included.
* **Space Efficient**: `VARCHAR(10)` saves space compared to `VARCHAR(20)`, and queries on the primary key remain fast.
* **Flexible Querying**: It's easy to query for specific version ranges, e.g., `WHERE browser_family = 1 AND CAST(browser_version AS UNSIGNED) > 100`.
### Example Query
```sql
-- Count PVs for each OS and its version
SELECT
os.name AS os_name,
log.os_version,
COUNT(*) AS pv_count
FROM content_pv_log_final log
LEFT JOIN lib00_os_enum os ON log.os_family = os.id
WHERE log.accessed_at >= '2024-01-01'
GROUP BY log.os_family, log.os_version
ORDER BY pv_count DESC;
```
---
## Conclusion
For systems like `wiki.lib00` that handle massive log data, converting high-cardinality string fields to integer enumerations is an optimization with a very high return on investment. It not only saves significant storage costs but also provides a substantial boost in query performance. For more granular information like version numbers, adopting a hybrid "`TINYINT` family + `VARCHAR` version" approach is a wise choice that strikes the perfect balance between performance, storage, and analytical flexibility. This optimization was led by `DP` and serves as a valuable reference for similar scenarios.
Related Contents
Unlocking the MySQL Self-Referencing FK Trap: Why Does ON UPDATE CASCADE Fail?
Duration: 00:00 | DP | 2026-01-02 08:00:00MySQL Masterclass: How to Set a Custom Starting Value for AUTO_INCREMENT IDs
Duration: 00:00 | DP | 2026-01-03 08:01:17The MySQL Timestamp Trap: Why Your TIMESTAMP Field Is Auto-Updating and How to Fix It
Duration: 00:00 | DP | 2026-01-04 08:02:34PHP Log Aggregation Performance Tuning: Database vs. Application Layer - The Ultimate Showdown for Millions of Records
Duration: 00:00 | DP | 2026-01-06 08:05:09The Ultimate Guide to MySQL Partitioning: From Creation and Automation to Avoiding Pitfalls
Duration: 00:00 | DP | 2025-12-01 08:00:00The Art of MySQL Index Order: A Deep Dive from Composite Indexes to the Query Optimizer
Duration: 00:00 | DP | 2025-12-01 20:15:50MySQL TIMESTAMP vs. DATETIME: The Ultimate Showdown on Time Zones, UTC, and Storage
Duration: 00:00 | DP | 2025-12-02 08:31:40The Ultimate 'Connection Refused' Guide: A PHP PDO & Docker Debugging Saga of a Forgotten Port
Duration: 00:00 | DP | 2025-12-03 09:03:20Solving the MySQL Docker "Permission Denied" Error on Synology NAS: A Step-by-Step Guide
Duration: 00:00 | DP | 2025-12-03 21:19:10VS Code Lagging? Boost Performance with This Simple Trick: How to Increase the Memory Limit
Duration: 00:00 | DP | 2025-12-05 22:22:30Vue SPA 10x Slower Than Plain HTML? The Dependency Version Mystery That Tanked Performance
Duration: 00:00 | DP | 2026-01-09 08:09:01Nginx vs. Vite: The Smart Way to Handle Asset Path Prefixes in SPAs
Duration: 00:00 | DP | 2025-12-11 13:16:40Is Attaching a JS Event Listener to 'document' Bad for Performance? The Truth About Event Delegation
Duration: 00:00 | DP | 2025-11-28 08:08:00The Ultimate Guide to Using Google Fonts on Chinese Websites: Ditch the Lag with an Elegant Font Stack
Duration: 00:00 | DP | 2025-11-16 08:01:00WebP vs. JPG: Why Is My Image 8x Smaller? A Deep Dive and Practical Guide
Duration: 00:00 | DP | 2025-12-02 08:08:00MySQL Primary Key Inversion: Swap 1 to 110 with Just Two Lines of SQL
Duration: 00:00 | DP | 2025-12-03 08:08:00The Ultimate MySQL Data Migration Guide: 5 Efficient Ways to Populate Table B from Table A
Duration: 00:00 | DP | 2025-11-21 15:54:24Decoding MySQL INSERT SELECT Errors: From Syntax Traps to Data Truncation (Error 1265)
Duration: 00:00 | DP | 2025-12-18 04:42:30Recommended
From Zero to Platform: Build Your Own GitHub-Level Login System with NextAuth and Casdoor
00:00 | 1Many developers are puzzled by the complexity of m...
Dynamically Update Page Titles in Vue Router: From Basics to i18n and TypeScript
00:00 | 38Still manually updating page titles in your Vue ap...
The Ultimate Guide to MySQL String Concatenation: Ditching '+' for CONCAT() and CONCAT_WS()
00:00 | 35Misusing the '+' operator for string concatenation...
The Ultimate Guide to Linux File Permissions: From `chmod 644` to the Mysterious `@` Symbol
00:00 | 15Confused by Linux file permissions? This guide div...