Importance Of Robots.txt File In SEO
I have explained how search engines work, and what crawling and indexing are, in this article. Simply put, search engine robots (also called web robots or crawlers) follow links to discover web pages. Those pages are then indexed so that they can appear in the SERP when people look for information.
Now, you may not want every page of your website to be crawled and indexed. For example, there is no point in crawling and indexing the members-only login page of a website; doing so only uses up the search engines' resources unnecessarily. How do we prevent that?
That's where the robots.txt file comes into the picture.
In this article, I will explain what a robots.txt file is and why it is important for SEO. Read on.
What Is Robots.txt File? What Is The Importance Of Robots.txt File In SEO?
Here is what robots.txt is, in bullet points:
- It is a text file.
- The file resides in the root directory/top-level directory of a site.
- It is used to tell web robots which pages of a particular website they may or may not crawl, which in turn affects what gets indexed.
- It is part of the Robots Exclusion Protocol (REP), which is basically a group of web standards used to communicate with robots.
- However, web robots may or may not follow the instructions. For example, the email address harvesters used by spammers, as well as malware robots, tend to ignore the robots.txt file.
- The robots.txt file is publicly available and accessible to everyone (just add /robots.txt to the end of a website's root domain). You shouldn't try to hide private information using robots.txt.
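To illustrate the last two points, here is a minimal Python sketch that works out where a site's robots.txt lives, starting from any page URL. The helper name robots_txt_url and the example.com address are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the root path,
    # so the page's own path, query string and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only for illustration.
print(robots_txt_url("https://www.example.com/blog/post?id=7"))
# prints https://www.example.com/robots.txt
```

Anyone can open that address in a browser, which is exactly why the file should never be relied on to hide private pages.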
Basic Format Of Robots.txt File
The basic format of robots.txt is as shown below:
User-agent: [user-agent]
Disallow: [URL not to be crawled]
Here, [user-agent] is the name of the web robot to which the directive on the next line applies.
Disallow: [URL not to be crawled] is the directive that asks the web robot not to crawl the given URL string.
These two lines, the name of the web robot concerned (1st line) and the directive for it (2nd line), together form a complete robots.txt file. However, the same robots.txt file can contain multiple sets of user agents and directives. In such cases, note the following points:
- The sets must be separated by a line break.
- One set can have multiple directives for that particular web robot.
User-agent: *
Disallow: /products/content/
Disallow: /local/enterprise/confirm
Disallow: /researchtools/ose/
When a “*” is used, it means the directives that follow apply to all web robots.
If the file contained this:
User-agent: msnbot
Disallow: /products/content/
Disallow: /local/enterprise/confirm
Disallow: /researchtools/ose/
It would mean that the three disallow directives apply only to msnbot and not to other web robots.
- Each directive, disallow or allow, is applicable to the web robot mentioned in that particular set only.
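The difference between a “*” set and an msnbot set can be checked with Python's standard urllib.robotparser module. This is only a sketch: the helper can_crawl, the example.com URL and the use of Googlebot as the "other" robot are assumptions for illustration, while the rules themselves are the ones shown above.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may crawl `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# The three directives from the article, addressed to all robots.
FOR_ALL = """\
User-agent: *
Disallow: /products/content/
Disallow: /local/enterprise/confirm
Disallow: /researchtools/ose/
"""

# The same directives, addressed to msnbot only.
FOR_MSNBOT = FOR_ALL.replace("User-agent: *", "User-agent: msnbot")

# Hypothetical URL under one of the disallowed paths.
URL = "https://www.example.com/products/content/page.html"

print(can_crawl(FOR_ALL, "Googlebot", URL))     # False: "*" covers every robot
print(can_crawl(FOR_ALL, "msnbot", URL))        # False
print(can_crawl(FOR_MSNBOT, "Googlebot", URL))  # True: the set names msnbot only
print(can_crawl(FOR_MSNBOT, "msnbot", URL))     # False
```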
How Does Robots.txt Work?
When a search engine bot arrives at a website, it first fetches the robots.txt file and reads the directives in it. It then crawls the site according to those directives: if a URL is covered by a disallow directive, a compliant bot will not crawl it. The common keywords or directives used in a robots.txt file are as follows:
- Disallow: This directive is used to tell the user agent not to crawl a particular URL.
- Allow: This directive tells the crawler that a particular page or sub-folder can be crawled even though its parent page or folder is disallowed. Googlebot and most other major crawlers support it.
- User-agent: It is the name of the crawler to which the directives are given.
- Crawl-delay: This tells the crawler how much time (in seconds) it should wait before crawling the content. Not every crawler honors it; Googlebot, for example, ignores this directive.
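The process described above, reading the directives first and then crawling only what they permit, can be sketched in Python with the standard urllib.robotparser module. The rules, the bot name examplebot, the crawlable helper and the example.com URLs are all assumptions for illustration. Note that Python's parser applies the first matching rule in file order, whereas Google picks the most specific (longest) matching rule, so the Allow line is placed before the Disallow line here to get the same result under both.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the members area except for the signup page.
ROBOTS_TXT = """\
User-agent: *
Allow: /members/signup.html
Disallow: /members/
"""


def crawlable(robots_txt: str, agent: str, urls: list) -> list:
    """Return only the URLs that `agent` may crawl under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(agent, u)]


# Hypothetical crawl queue.
queue = [
    "https://www.example.com/",
    "https://www.example.com/members/signup.html",
    "https://www.example.com/members/profile",
]

for url in crawlable(ROBOTS_TXT, "examplebot", queue):
    print(url)
# prints the home page and the signup page, but not /members/profile
```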
Importance Of Robots.txt File In SEO
As said above, the robots.txt file is used mainly to stop search engines from accessing certain pages of your website that need not be crawled and indexed. This is especially important for websites that have a huge amount of content. Crawling and indexing such big websites can have a negative impact on the website's performance. By using the robots.txt file, you can direct crawlers not to waste their resources on certain pages of the website.
The robots.txt file can also be used to prevent duplicate content issues by blocking search engines from accessing pages with duplicate content. Specific types of files on your site can also be kept out of the SERPs using robots.txt.
With robots.txt, you can also specify the location of your sitemaps and tell search engine bots how long they should wait before crawling. The latter helps prevent your servers from being overloaded.
All of this contributes to Technical SEO.
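As a rough illustration of how a crawler might read these two settings, the sketch below parses a sample file with Python's standard urllib.robotparser module (site_maps() needs Python 3.8 or later). The sitemap URL, the 10-second delay, the /search/ path and the bot name examplebot are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a sitemap location and a crawl delay.
ROBOTS_TXT = """\
Sitemap: https://www.example.com/sitemap.xml

User-agent: *
Crawl-delay: 10
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# List of sitemap URLs declared in the file (None if there are none).
sitemaps = parser.site_maps()

# Seconds the named robot is asked to wait (None if not specified).
delay = parser.crawl_delay("examplebot")

print(sitemaps)  # ['https://www.example.com/sitemap.xml']
print(delay)     # 10
```

A well-behaved crawler would pause for that many seconds between requests, while one that does not support the directive would simply ignore it.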
I hope I have been able to explain what a robots.txt file is and why it is important for SEO. If you need any clarification, drop a comment below and I will respond as soon as possible.