# The Ultimate Guide to Robots.txt: From Beginner to Pro (with Full Examples)
## What is Robots.txt?
A `robots.txt` file is a plain text file located in the root directory of your website. It follows the Robots Exclusion Protocol and tells search engine crawlers (such as Googlebot) which pages or files they may or may not crawl. Configuring it correctly is a fundamental part of technical SEO: it steers crawl budget toward your important content and keeps crawlers out of unnecessary or sensitive areas. Keep in mind that `robots.txt` controls crawling, not indexing; a blocked URL can still appear in search results if other sites link to it.
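As a quick taste, here is a minimal illustrative file for the example domain used throughout this guide; each directive is explained in detail below.
```txt
# Minimal example: let every crawler in, keep them out of /admin/,
# and point them at the sitemap
User-agent: *
Disallow: /admin/
Sitemap: https://wiki.lib00.com/sitemap.xml
```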
---
## Key Point: Placement Matters
The location of the `robots.txt` file is critical. It **must** be placed in the root directory of your website. If placed incorrectly, search engines will not be able to find and follow its rules.
- **Correct Location**: `https://wiki.lib00.com/robots.txt`
- **Incorrect Location**: `https://wiki.lib00.com/blog/robots.txt`
**Core Rules**:
- The filename must be `robots.txt` in all lowercase.
- Each domain (or subdomain) requires its own `robots.txt` file.
- It must be UTF-8 encoded.
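To make the rule concrete, the sketch below (Python, standard library only) shows how a crawler derives the robots.txt location from any page URL: it keeps only the scheme and host and always requests `/robots.txt` at the root. The `blog.lib00.com` subdomain is a hypothetical example.
```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL a crawler would fetch for this page."""
    parts = urlsplit(page_url)
    # Only the scheme and host are kept; path, query, and fragment are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://wiki.lib00.com/blog/some-post"))
# -> https://wiki.lib00.com/robots.txt
print(robots_txt_url("https://blog.lib00.com/some-post"))
# -> https://blog.lib00.com/robots.txt  (a subdomain needs its own file)
```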
---
## Core Syntax Explained
The syntax of `robots.txt` is simple and consists of a few main directives:
| Directive | Description | Example |
|---|---|---|
| `User-agent:` | Specifies which crawler the rule applies to. `*` means all crawlers. | `User-agent: Googlebot` |
| `Disallow:` | Prohibits crawlers from accessing the specified path. | `Disallow: /admin-lib00/` |
| `Allow:` | Permits crawling of the specified path; a more specific `Allow` can override a broader `Disallow`. | `Allow: /public/` |
| `Sitemap:` | Informs crawlers of the location of your sitemap(s) to help them discover all important pages. | `Sitemap: https://wiki.lib00.com/sitemap.xml` |
| `Crawl-delay:` | (Non-standard; honored by some crawlers such as Bingbot, ignored by Googlebot) Sets the minimum interval (in seconds) between fetches. | `Crawl-delay: 5` |
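To see how `Allow` and `Disallow` interact, consider the illustrative group below (the paths are hypothetical). Major crawlers such as Googlebot resolve conflicts by applying the most specific (longest) matching rule, which is what lets `Allow` carve an exception out of a broader `Disallow`.
```txt
User-agent: *
# Block the whole directory...
Disallow: /private/
# ...but allow one specific page inside it (the longer rule wins)
Allow: /private/press-kit.html
```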
### Using Wildcards
- `*`: Matches any sequence of characters.
- `$`: Matches the end of a URL.
For example, `Disallow: /*.pdf$` will block crawlers from fetching all files ending in `.pdf`.
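A few more illustrative patterns (the paths are hypothetical) show how these wildcards behave for crawlers that support them, such as Googlebot and Bingbot:
```txt
User-agent: *
# Block any URL that carries a query string
Disallow: /*?
# Block URLs ending exactly in .pdf; without the $, "/file.pdf?download=1" would also match
Disallow: /*.pdf$
# A trailing * is implicit, so these two rules are equivalent
Disallow: /temp/
Disallow: /temp/*
```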
---
## Complete Configuration Example (Recommended)
Here is a comprehensive `robots.txt` example, curated by DP@lib00, which you can adapt to your needs.
```txt
# robots.txt for wiki.lib00.com
# Allow all crawlers by default
User-agent: *
Allow: /
# 1. Disallow admin, private, and temporary directories
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Disallow: /api/
# 2. Disallow specific file types
Disallow: /*.zip$
Disallow: /*.log$
# 3. Disallow crawling of search results and user action pages
Disallow: /search
Disallow: /login
Disallow: /cart
Disallow: /checkout
# 4. Disallow dynamic URLs with specific parameters (cover both "?" and "&" positions)
Disallow: /*?sessionid=
Disallow: /*&sessionid=
Disallow: /*?sort=
Disallow: /*&sort=
# 5. Set special rules for specific bots (optional)
User-agent: BadBot
Disallow: /
User-agent: Googlebot
# Allow Google to access everything except one specific directory
Disallow: /nogoogle/
# 6. Specify Sitemap locations
Sitemap: https://wiki.lib00.com/sitemap.xml
Sitemap: https://wiki.lib00.com/sitemap-images.xml
```
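One subtlety worth knowing: under the Robots Exclusion Protocol (RFC 9309), a crawler obeys only the single group that best matches its user agent, not that group plus the `*` group. In the example above, Googlebot therefore follows only the `User-agent: Googlebot` group and is no longer bound by the `*` rules such as `Disallow: /admin/`. If Googlebot should keep those restrictions, repeat them inside its own group, as in this sketch:
```txt
User-agent: Googlebot
# Repeat the shared rules: this group replaces the * group for Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Disallow: /api/
# Google-specific restriction
Disallow: /nogoogle/
```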
---
## Common Pitfall: Sitemaps Require Absolute URLs
A very common mistake is using a relative path in the `Sitemap` directive. According to the official protocol, **a Sitemap must be specified using a full absolute URL**, including the protocol and domain name.
- **✅ Correct**: `Sitemap: https://wiki.lib00.com/sitemap.xml`
- **❌ Incorrect**: `Sitemap: /sitemap.xml`
**Why?**
1. **Protocol Requirement**: The `Sitemap:` directive is defined to take a fully qualified URL; crawlers are not required to resolve a relative path against your domain.
2. **Cross-Domain Support**: This allows you to host your sitemap file on a CDN or a different domain.
You can specify multiple sitemaps for a single site; just place each on a new line.
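For example, a single file can point to several sitemaps, including one hosted elsewhere (the CDN hostname below is purely illustrative):
```txt
Sitemap: https://wiki.lib00.com/sitemap.xml
Sitemap: https://wiki.lib00.com/sitemap-images.xml
Sitemap: https://cdn.example.net/wiki-lib00/sitemap-archive.xml
```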
---
## Practical Templates
### Template 1: Allow All
Suitable for simple websites where all content is intended for indexing.
```txt
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
```
### Template 2: Disallow All
Ideal for sites under development or staging environments. Note that `robots.txt` only blocks crawling, not indexing; to reliably keep a site out of search results, combine this with authentication or a `noindex` directive.
```txt
User-agent: *
Disallow: /
```
---
## How to Validate?
After configuring your file, always use a tool to validate it and ensure there are no syntax errors.
- **Google Search Console**: provides a robots.txt report showing which file Google fetched, when, and any parsing problems (it replaced the older standalone robots.txt Tester).
- **Bing Webmaster Tools**: Also offers similar functionality.
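Beyond the web-based tools, you can do a quick local sanity check with Python's standard-library `urllib.robotparser`. The sketch below parses an inline copy of part of the rules from the earlier example; note that this parser implements the original exclusion standard and does not understand the `*`/`$` wildcards, so treat it as a rough check rather than an exact simulation of Googlebot.
```python
from urllib import robotparser

# An inline copy of (part of) the rules from the example above; in practice you
# could instead call set_url("https://wiki.lib00.com/robots.txt") and read()
# to fetch the live file.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Disallow: /nogoogle/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "/nogoogle/page"))      # False: blocked by its own group
print(rp.can_fetch("Googlebot", "/admin/index.php"))    # True: the * group no longer applies to Googlebot
print(rp.can_fetch("SomeOtherBot", "/admin/index.php")) # False: falls back to the * group
```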
By following this guide, you can confidently create a robust and effective `robots.txt` file for your site (such as wiki.lib00.com) and improve its SEO performance.