HostMyPage - Low Cost Dedicated Server - Shared Hosting

Full Version: What is the robots.txt file?
Robots.txt is a text file webmasters create to instruct web robots (search engine crawlers) which pages on a website to crawl and which to skip.

The robots.txt file is primarily used to specify which parts of your website should be crawled by spiders or web crawlers. It can specify different rules for different spiders.

Googlebot is an example of a spider. It’s deployed by Google to crawl the Internet and record information about websites so it knows how high to rank different websites in search results.
  • Blocking all web crawlers from all content
User-agent: *

Disallow: /

Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages of the website, including the homepage.
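One way to check how a crawler would interpret these directives is Python’s standard-library urllib.robotparser module. The sketch below feeds it the block-everything rules above; the example.com URLs and the "AnyBot" user-agent name are placeholders for illustration:

```python
from urllib import robotparser

# Hypothetical robots.txt that blocks every crawler from the whole site
rules = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Every path is off-limits, including the homepage
print(rp.can_fetch("Googlebot", "https://example.com/"))        # False
print(rp.can_fetch("AnyBot", "https://example.com/page.html"))  # False
```

Because the rule applies to `User-agent: *` and disallows `/`, can_fetch returns False for any crawler and any path.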
  • Allowing all web crawlers access to all content
User-agent: *

Disallow:

Using this syntax in a robots.txt file tells web crawlers to crawl all pages of the website, including the homepage.
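The allow-everything case can be checked the same way with urllib.robotparser; an empty Disallow line blocks nothing (the example.com URLs are again placeholders):

```python
from urllib import robotparser

# Hypothetical robots.txt with an empty Disallow, i.e. nothing is blocked
rules = """\
User-agent: *
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/"))            # True
print(rp.can_fetch("AnyBot", "https://example.com/any/page.html"))  # True
```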
  • Blocking a specific web crawler from a specific folder
User-agent: Googlebot

Disallow: /xyz-subfolder/

This syntax tells only Google’s crawler (user-agent name Googlebot) not to crawl any pages inside the /xyz-subfolder/ directory.
  • Blocking a specific web crawler from a specific web page
User-agent: Bingbot

Disallow: /xyz-subfolder/blocked-page.html

This syntax tells only Bing’s crawler (user-agent name Bingbot) to avoid crawling that specific page.
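Combining the two per-crawler rules above into one hypothetical robots.txt, urllib.robotparser shows that each rule binds only the named crawler (the example.com URLs are placeholders):

```python
from urllib import robotparser

# Hypothetical robots.txt combining the two per-crawler rules above
rules = """\
User-agent: Googlebot
Disallow: /xyz-subfolder/

User-agent: Bingbot
Disallow: /xyz-subfolder/blocked-page.html
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is kept out of the whole subfolder...
print(rp.can_fetch("Googlebot", "https://example.com/xyz-subfolder/page.html"))  # False
# ...but may crawl everything else
print(rp.can_fetch("Googlebot", "https://example.com/other/"))                   # True

# Bingbot is blocked only from the one page
print(rp.can_fetch("Bingbot", "https://example.com/xyz-subfolder/blocked-page.html"))  # False
print(rp.can_fetch("Bingbot", "https://example.com/xyz-subfolder/other.html"))         # True
```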

There are two important considerations when using /robots.txt:

Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
The /robots.txt file is a publicly available file. Anyone can see what sections of your server you don’t want robots to use.