The Ultimate Guide to Robots.txt: From Beginner to Pro (with Full Examples)

Published: 2025-11-28
Author: DP
Category: SEO
## What is Robots.txt?

A `robots.txt` file is a plain text file located in the root directory of your website. It follows the Robots Exclusion Protocol to tell search engine crawlers (like Googlebot) which pages or files they may or may not crawl. Correctly configuring `robots.txt` is a fundamental part of technical SEO: it steers crawlers toward your important content and keeps them away from sensitive or low-value pages. (Note that `robots.txt` controls crawling, not indexing; a blocked URL can still appear in search results if other sites link to it, so use a `noindex` meta tag or header if a page must stay out of the index.)

---

## Key Point: Placement Matters

The location of the `robots.txt` file is critical. It **must** be placed in the root directory of your website. If placed incorrectly, search engines will not find or follow its rules.

- **Correct Location**: `https://wiki.lib00.com/robots.txt`
- **Incorrect Location**: `https://wiki.lib00.com/blog/robots.txt`

**Core Rules**:

- The filename must be `robots.txt`, all lowercase.
- Each domain (or subdomain) requires its own `robots.txt` file.
- The file must be UTF-8 encoded.

---

## Core Syntax Explained

The syntax of `robots.txt` is simple and consists of a few main directives:

| Directive | Description | Example |
|---|---|---|
| `User-agent:` | Specifies which crawler the rule group applies to. `*` means all crawlers. | `User-agent: Googlebot` |
| `Disallow:` | Prohibits crawlers from accessing the specified path. | `Disallow: /admin-lib00/` |
| `Allow:` | Permits crawlers to access the specified path, overriding a broader `Disallow`. | `Allow: /public/` |
| `Sitemap:` | Tells crawlers where your sitemap(s) are located so they can discover all important pages. | `Sitemap: https://wiki.lib00.com/sitemap.xml` |
| `Crawl-delay:` | (Non-standard, but supported by some crawlers) Sets the minimum interval, in seconds, between fetches. | `Crawl-delay: 5` |

### Using Wildcards

- `*`: Matches any sequence of characters.
- `$`: Matches the end of a URL.

For example, `Disallow: /*.pdf$` blocks crawlers from fetching any URL ending in `.pdf`.

---

## Complete Configuration Example (Recommended)

Here is a comprehensive `robots.txt` example, curated by DP@lib00, which you can adapt to your needs.

```txt
# robots.txt for wiki.lib00.com

# Allow all crawlers by default
User-agent: *
Allow: /

# 1. Disallow admin, private, and temporary directories
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Disallow: /api/

# 2. Disallow specific file types
Disallow: /*.zip$
Disallow: /*.log$

# 3. Disallow crawling of search results and user action pages
Disallow: /search
Disallow: /login
Disallow: /cart
Disallow: /checkout

# 4. Disallow dynamic URLs with specific parameters
Disallow: /*?sessionid=
Disallow: /*?sort=

# 5. Set special rules for specific bots (optional)
User-agent: BadBot
Disallow: /

User-agent: Googlebot
# Allow Google to access everything except one specific directory
Disallow: /nogoogle/

# 6. Specify Sitemap locations
Sitemap: https://wiki.lib00.com/sitemap.xml
Sitemap: https://wiki.lib00.com/sitemap-images.xml
```

---

## Common Pitfall: Sitemaps Require Absolute URLs

A very common mistake is using a relative path in the `Sitemap` directive. According to the official protocol, **a Sitemap must be specified as a full absolute URL**, including the protocol and domain name.

- **✅ Correct**: `Sitemap: https://wiki.lib00.com/sitemap.xml`
- **❌ Incorrect**: `Sitemap: /sitemap.xml`

**Why?**

1. **Protocol Requirement**: The `Sitemap` directive is defined to take a full URL, so crawlers know the exact location of the sitemap without having to guess the scheme or host.
2. **Cross-Domain Support**: An absolute URL lets you host the sitemap file on a CDN or a different domain.

You can specify multiple sitemaps for a single site; just place each on its own line.
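If you want to sanity-check rules like these without waiting for a crawler, you can load them into Python's standard-library parser. The sketch below is a minimal illustration using a simplified subset of the example above, and the URLs it checks are hypothetical. Note that `urllib.robotparser` implements the original Robots Exclusion Protocol, so broad `Allow` rules and wildcard patterns such as `/*.zip$` may be evaluated differently than by Google's own matcher; treat this as a smoke test, not an authoritative check.

```python
from urllib.robotparser import RobotFileParser

# A simplified subset of the example file above (hypothetical site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search

User-agent: BadBot
Disallow: /

Sitemap: https://wiki.lib00.com/sitemap.xml
Sitemap: https://wiki.lib00.com/sitemap-images.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check how the rules apply to a few URLs for different crawlers.
for agent, url in [
    ("Googlebot", "https://wiki.lib00.com/guides/robots-txt"),
    ("Googlebot", "https://wiki.lib00.com/admin/login"),
    ("BadBot", "https://wiki.lib00.com/"),
]:
    print(f"{agent:10} {url:45} allowed={parser.can_fetch(agent, url)}")

# List the sitemaps declared in the file (Python 3.8+).
print("Sitemaps:", parser.site_maps())
```

Running this should show the guide page allowed for Googlebot, the admin URL blocked, everything blocked for BadBot, and both sitemap URLs listed.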
---

## Practical Templates

### Template 1: Allow All

Suitable for simple websites where all content is intended to be crawled and indexed.

```txt
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

### Template 2: Disallow All

Ideal for sites under development or sites that should not be crawled by any search engine.

```txt
User-agent: *
Disallow: /
```

---

## How to Validate?

After configuring your file, always use a tool to validate it and make sure there are no syntax errors.

- **Google Search Console**: includes a built-in robots.txt report for checking how your file is fetched and parsed.
- **Bing Webmaster Tools**: offers similar functionality.
- You can also run a quick local check before deploying; see the short sketch at the end of this post.

By following this guide, you can confidently create a robust and effective `robots.txt` file for your site (like wiki.lib00.com) and improve its SEO performance.
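For that quick local check, here is a rough linter sketch. It only knows about the directives covered in this guide (`User-agent`, `Disallow`, `Allow`, `Sitemap`, `Crawl-delay`), which is an assumption rather than an exhaustive specification, and it simply flags unknown directives and relative `Sitemap` URLs. It is an illustration of the checks described above, not a replacement for Google Search Console or Bing Webmaster Tools.

```python
from urllib.parse import urlparse

# Directives covered in this guide; real-world files may use others.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots_txt(text: str) -> list[str]:
    """Return a list of human-readable problems found in a robots.txt string."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {lineno}: missing ':' separator: {raw!r}")
            continue
        directive, value = (part.strip() for part in line.split(":", 1))
        if directive.lower() not in KNOWN_DIRECTIVES:
            problems.append(f"line {lineno}: unknown directive {directive!r}")
        elif directive.lower() == "sitemap":
            parsed = urlparse(value)
            if not (parsed.scheme and parsed.netloc):
                problems.append(
                    f"line {lineno}: Sitemap must be an absolute URL, got {value!r}"
                )
    return problems

if __name__ == "__main__":
    # A deliberately broken sample: a misspelled directive and a relative sitemap.
    sample = "User-agent: *\nDisalow: /admin/\nSitemap: /sitemap.xml\n"
    for problem in lint_robots_txt(sample):
        print(problem)
```

On the sample input it reports the misspelled `Disalow` directive and the relative sitemap path, which are exactly the two mistakes this guide warns about.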