The Ultimate Guide to Robots.txt: From Beginner to Pro (with Full Examples)

Published: 2025-11-28
Author: DP
Category: SEO
## What is Robots.txt?

A `robots.txt` file is a plain text file located in the root directory of your website. It follows the Robots Exclusion Protocol to tell search engine crawlers (like Googlebot) which pages or files they may or may not crawl. Correctly configuring `robots.txt` is a fundamental part of technical SEO: it steers crawlers toward your important content and keeps them away from sensitive or low-value pages. (Note that `robots.txt` controls crawling, not indexing; a blocked URL can still appear in search results if other sites link to it, so use a `noindex` meta tag or header if a page must stay out of the index.)

---

## Key Point: Placement Matters

The location of the `robots.txt` file is critical. It **must** be placed in the root directory of your website. If placed incorrectly, search engines will not find or follow its rules.

- **Correct Location**: `https://wiki.lib00.com/robots.txt`
- **Incorrect Location**: `https://wiki.lib00.com/blog/robots.txt`

**Core Rules**:

- The filename must be `robots.txt`, all lowercase.
- Each domain (or subdomain) requires its own `robots.txt` file.
- The file must be UTF-8 encoded.

---

## Core Syntax Explained

The syntax of `robots.txt` is simple and consists of a few main directives:

| Directive | Description | Example |
|---|---|---|
| `User-agent:` | Specifies which crawler the rule group applies to. `*` means all crawlers. | `User-agent: Googlebot` |
| `Disallow:` | Prohibits crawlers from accessing the specified path. | `Disallow: /admin-lib00/` |
| `Allow:` | Permits crawlers to access the specified path, overriding a broader `Disallow`. | `Allow: /public/` |
| `Sitemap:` | Tells crawlers where your sitemap(s) are located so they can discover all important pages. | `Sitemap: https://wiki.lib00.com/sitemap.xml` |
| `Crawl-delay:` | (Non-standard, but supported by some crawlers) Sets the minimum interval, in seconds, between fetches. | `Crawl-delay: 5` |

### Using Wildcards

- `*`: Matches any sequence of characters.
- `$`: Matches the end of a URL.

For example, `Disallow: /*.pdf$` blocks crawlers from fetching any URL ending in `.pdf`.

---

## Complete Configuration Example (Recommended)

Here is a comprehensive `robots.txt` example, curated by DP@lib00, which you can adapt to your needs.

```txt
# robots.txt for wiki.lib00.com

# Allow all crawlers by default
User-agent: *
Allow: /

# 1. Disallow admin, private, and temporary directories
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Disallow: /api/

# 2. Disallow specific file types
Disallow: /*.zip$
Disallow: /*.log$

# 3. Disallow crawling of search results and user action pages
Disallow: /search
Disallow: /login
Disallow: /cart
Disallow: /checkout

# 4. Disallow dynamic URLs with specific parameters
Disallow: /*?sessionid=
Disallow: /*?sort=

# 5. Set special rules for specific bots (optional)
User-agent: BadBot
Disallow: /

User-agent: Googlebot
# Allow Google to access everything except one specific directory
Disallow: /nogoogle/

# 6. Specify Sitemap locations
Sitemap: https://wiki.lib00.com/sitemap.xml
Sitemap: https://wiki.lib00.com/sitemap-images.xml
```

---

## Common Pitfall: Sitemaps Require Absolute URLs

A very common mistake is using a relative path in the `Sitemap` directive. According to the official protocol, **a Sitemap must be specified as a full absolute URL**, including the protocol and domain name.

- **✅ Correct**: `Sitemap: https://wiki.lib00.com/sitemap.xml`
- **❌ Incorrect**: `Sitemap: /sitemap.xml`

**Why?**

1. **Protocol Requirement**: The `Sitemap` directive is defined to take a full URL, so crawlers know the exact location of the sitemap without having to guess the scheme or host.
2. **Cross-Domain Support**: An absolute URL lets you host the sitemap file on a CDN or a different domain.

You can specify multiple sitemaps for a single site; just place each on its own line.
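If you want to sanity-check rules like these without waiting for a crawler, you can load them into Python's standard-library parser. The sketch below is a minimal illustration using a simplified subset of the example above, and the URLs it checks are hypothetical. Note that `urllib.robotparser` implements the original Robots Exclusion Protocol, so broad `Allow` rules and wildcard patterns such as `/*.zip$` may be evaluated differently than by Google's own matcher; treat this as a smoke test, not an authoritative check.

```python
from urllib.robotparser import RobotFileParser

# A simplified subset of the example file above (hypothetical site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search

User-agent: BadBot
Disallow: /

Sitemap: https://wiki.lib00.com/sitemap.xml
Sitemap: https://wiki.lib00.com/sitemap-images.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check how the rules apply to a few URLs for different crawlers.
for agent, url in [
    ("Googlebot", "https://wiki.lib00.com/guides/robots-txt"),
    ("Googlebot", "https://wiki.lib00.com/admin/login"),
    ("BadBot", "https://wiki.lib00.com/"),
]:
    print(f"{agent:10} {url:45} allowed={parser.can_fetch(agent, url)}")

# List the sitemaps declared in the file (Python 3.8+).
print("Sitemaps:", parser.site_maps())
```

Running this should show the guide page allowed for Googlebot, the admin URL blocked, everything blocked for BadBot, and both sitemap URLs listed.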
---

## Practical Templates

### Template 1: Allow All

Suitable for simple websites where all content is intended to be crawled and indexed.

```txt
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

### Template 2: Disallow All

Ideal for sites under development or sites that should not be crawled by any search engine.

```txt
User-agent: *
Disallow: /
```

---

## How to Validate?

After configuring your file, always use a tool to validate it and make sure there are no syntax errors.

- **Google Search Console**: includes a built-in robots.txt report for checking how your file is fetched and parsed.
- **Bing Webmaster Tools**: offers similar functionality.
- You can also run a quick local check before deploying; see the short sketch at the end of this post.

By following this guide, you can confidently create a robust and effective `robots.txt` file for your site (like wiki.lib00.com) and improve its SEO performance.
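For that quick local check, here is a rough linter sketch. It only knows about the directives covered in this guide (`User-agent`, `Disallow`, `Allow`, `Sitemap`, `Crawl-delay`), which is an assumption rather than an exhaustive specification, and it simply flags unknown directives and relative `Sitemap` URLs. It is an illustration of the checks described above, not a replacement for Google Search Console or Bing Webmaster Tools.

```python
from urllib.parse import urlparse

# Directives covered in this guide; real-world files may use others.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots_txt(text: str) -> list[str]:
    """Return a list of human-readable problems found in a robots.txt string."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {lineno}: missing ':' separator: {raw!r}")
            continue
        directive, value = (part.strip() for part in line.split(":", 1))
        if directive.lower() not in KNOWN_DIRECTIVES:
            problems.append(f"line {lineno}: unknown directive {directive!r}")
        elif directive.lower() == "sitemap":
            parsed = urlparse(value)
            if not (parsed.scheme and parsed.netloc):
                problems.append(
                    f"line {lineno}: Sitemap must be an absolute URL, got {value!r}"
                )
    return problems

if __name__ == "__main__":
    # A deliberately broken sample: a misspelled directive and a relative sitemap.
    sample = "User-agent: *\nDisalow: /admin/\nSitemap: /sitemap.xml\n"
    for problem in lint_robots_txt(sample):
        print(problem)
```

On the sample input it reports the misspelled `Disalow` directive and the relative sitemap path, which are exactly the two mistakes this guide warns about.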