What is Robots.txt? A Comprehensive Guide and Introduction
What is robots.txt?
Robots.txt is a simple text file on a website that tells search engine crawlers and other web robots how to crawl the site’s pages. While most site owners probably want search engine crawlers to index their pages, some websites would rather avoid it - or keep certain pages out of the index.
Using the robots.txt file, webmasters can set site-wide, subdirectory-wide, or page-level restrictions for web robots. Pages matched by a Disallow rule won’t be crawled by compliant crawlers, which usually keeps them out of search results (although, as explained later, blocking crawling alone does not guarantee this). Every crawler identifies itself as a user agent, and different instructions can be given to specific user agents in the robots.txt file.
Which robots read the robots.txt file?
The robots.txt file is intended for all automated systems visiting the site, not only the search engine robots that are most obvious from an SEO point of view. Its directives are also addressed to automatic archiving services (such as Web Archive), programs that download a site to a local drive (e.g. HTTrack Website Copier), website analysis tools (including SEO tools such as Xenu, as well as the Majestic SEO and Ahrefs bots), etc.
Of course, it is easy to guess that in many cases the creators of such tools do not worry much about the directives. On the other hand, some robots let their users choose whether to comply with the directives they detect.
How important is a robots.txt file?
There are several functions that the robots.txt file fulfills. The first one is to prevent your website from overloading, since too many requests from crawlers could easily slow down your website or even render it inaccessible. The robots.txt file can be used for HTML-based web pages, PDF files, and all other non-media formats that Google can index. Unimportant, duplicate, and similar pages can also be disabled from crawling using robots.txt - however, the robots.txt file is not an effective way of preventing Google from indexing your pages. Instead, use the noindex meta directive.
Using the file, you can also prevent certain audio, image, and video files from being crawled, indexed and appearing in Google search results - while leaving the rest of your website perfectly visible.
Keep in mind that the robots.txt file might not block all web crawlers from accessing your website - it works only if the crawler is programmed to respect the file and read from it. Some crawlers might also interpret syntax differently from Google, and as such not read certain tags and directives.
How does a robots.txt file work?
Most search engines use web crawlers - scripts designed to gather information about websites on the Internet - to discover new content and index web pages for their platforms. Before users can find your website on Google, a crawler has to visit your website, scour it thoroughly, and index all the relevant information. It will then be added to Google’s search results, after being analyzed by Google’s algorithms for SEO and other factors that might influence your searchability.
Google’s crawlers are programmed to look for the robots.txt file, which they will always try to read before crawling the rest of the site. The file itself can contain a range of instructions for the crawler, dictating what exactly the crawling process will look like. The crawler must be actively forbidden from accessing certain files or pages by the robots.txt file, or else it will start the crawling process. Using specific commands, webmasters can disallow a specific user agent from a specific activity, most often accessing the entirety or parts of the site.
Before we proceed: Google warns against using Microsoft Word when editing the robots.txt file, as it tends to add hidden formatting marks that can prevent the file from working as expected. To avoid syntax issues, we recommend using software like Notepad, Visual Studio or Notepad++ for maximum consistency.
Basic robots.txt syntax and commands
Let’s start with an empty robots.txt file. If your website doesn’t have one, create it. Remember that the file needs to be specifically named robots.txt, lowercase, and needs to be located at the root level. Every one of your subdomains should also contain its own robots.txt file that can contain different instructions.
The basic command in robots.txt is User-agent. This command defines the web crawler you’re assigning a specific rule to, usually indicating a specific search engine. In the file, the line will look like this for defining a web crawler named example:
User-agent: example
Allow / Disallow Directive
After we have defined the user agent, we can attach instructions to it. The two basic types of instructions are Allow and Disallow - both of which can be followed with a specific URL, or left empty to affect the whole site. To allow or disallow a specific user agent from accessing the whole website, add one of the following lines:
Allow: /
Disallow: /
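To see how a User-agent line and a Disallow rule combine, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name example comes from the text above; otherbot and the example.com URL are placeholders chosen for illustration, and the parser's behavior is Python's own, not necessarily that of any particular search engine.

```python
from urllib.robotparser import RobotFileParser

# A group that blocks the crawler named "example" from the whole site.
rules = """\
User-agent: example
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "example" is blocked everywhere; a crawler with no matching group is allowed.
blocked = parser.can_fetch("example", "https://example.com/any/page.html")
allowed = parser.can_fetch("otherbot", "https://example.com/any/page.html")
print(blocked, allowed)
```

Replacing Disallow: / with Allow: / in the rules string would make both checks return True.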
Crawl-delay is an “unofficial” command, which is not used by Google or Yandex. Some other crawlers, however, might respect this command, which can be used to limit the speed at which they browse your website. You can try adding the following line to your robots.txt file, specifying the delay in seconds:
Crawl-delay: 5
Even 5 seconds is a lot, especially for websites with a lot of pages - but it can save you some bandwidth.
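For crawlers you write yourself, the delay can be read programmatically. The sketch below assumes Python 3.6 or newer, where urllib.robotparser exposes crawl_delay(); the crawler name examplebot is a placeholder, and the 5-second value mirrors the figure mentioned above.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns the delay in seconds that applies to the given
# user agent, or None when no Crawl-delay line applies to it.
delay = parser.crawl_delay("examplebot")
print(delay)
```

A polite crawler would then wait that many seconds between consecutive requests to the same site.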
When editing the robots.txt file, you can also add a Sitemap location. If your website uses a sitemap, it’s recommended to always note it down in the robots.txt file, making it easy to access for crawlers. If you use more than one sitemap for your website (different language versions, for example), you can use several Sitemap command lines to specify the URLs of multiple sitemaps.
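As a rough illustration of several Sitemap lines in one file, the sketch below parses a small robots.txt and lists the sitemap URLs it declares. It assumes Python 3.8 or newer (for site_maps()); the two sitemap file names are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with one sitemap per language version.
rules = """\
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap-en.xml
Sitemap: https://www.example.com/sitemap-de.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns the declared sitemap URLs in file order,
# or None if the file contains no Sitemap lines.
sitemaps = parser.site_maps()
print(sitemaps)
```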
Example robots.txt setups
Let’s take a look at the sample robots.txt file provided by Google in their documentation on creating a robots.txt file:
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
As you can see, the user agent specified in this robots.txt file is named Googlebot. Because of the /nogooglebot/ path under Disallow, Googlebot won’t be allowed to crawl any URL on the website whose path starts with /nogooglebot/, but it can still access the rest of the website.
All other bots, however, are fully allowed to crawl the website. The asterisk user agent stands for all crawlers, and the allow line provides access to the entire website directory. The Googlebot user agent rule acts as an exception to this, but all the other crawlers can access the /nogooglebot/ directory.
There is also a sitemap URL specified at the bottom of the robots.txt file, pointing the crawler to its location. While there is more to the syntax, this is, in essence, how the robots.txt file works - it’s very basic and shouldn’t pose much of a challenge to most users.
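The behavior described above can be checked with a short sketch based on Python's standard urllib.robotparser. The rules are the sample quoted above; the page paths and the crawler name otherbot are placeholders, and the result reflects Python's parser rather than Googlebot itself.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "https://www.example.com"
# Googlebot is blocked only under /nogooglebot/ ...
googlebot_blocked = parser.can_fetch("Googlebot", base + "/nogooglebot/page.html")
googlebot_home = parser.can_fetch("Googlebot", base + "/index.html")
# ... while any other crawler falls back to the "*" group.
other_bot = parser.can_fetch("otherbot", base + "/nogooglebot/page.html")
print(googlebot_blocked, googlebot_home, other_bot)
```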
Do you need robots.txt?
You technically don’t need a robots.txt file to run a website, or even to be crawled and indexed by web bots. When a crawler comes to your website and doesn’t find a robots.txt file, it continues to crawl your pages as if allowed full access. Adding a robots.txt file to your website gives you control over who can crawl your website and how. And what can you gain from it? First of all, you can keep automated systems out of sections of the website they should not visit for various reasons, and point them to the places where visits are most advisable.
Blocking specific areas of the page can be important for a variety of reasons:
- Security Issues - perhaps you just don't want robots (or accidental users who later use resources crawled by robots) to be able to get to sections that they shouldn't have access to too easily.
- Protection against duplicate content - if there is a large amount of internally duplicated content on the site, and the URL scheme allows it to be clearly identified, you can use the robots.txt file to signal to search engines that this part of the site should not be crawled.
- Saving transfer - with robots.txt entries you can try to remove entire subdirectories or specific types of files from the paths that robots travel - even a folder containing graphics or their high-resolution versions. For some websites, the transfer savings can be significant.
- Content protection against "leaking" outside - note that the protection suggested above for a folder with large-format graphics can also be used to present only smaller versions in image search. This can be important for photo banks (but not only for them).
- Crawl budget optimization - although I mention it at the end of the list, it is definitely not a trivial thing. The larger the website, the more emphasis should be placed on optimizing the paths along which search engine indexing bots move. By blocking pages that are irrelevant to SEO in robots.txt, you simply increase the likelihood that robots will go where they should.
As mentioned before, you can avoid server overloads with a proper robots.txt file. You can also block pages you don’t want indexed from being crawled (though you should still use the noindex tag). Indexing duplicate pages or multiple pages with similar content can have a negative influence on your SEO and lower your search results ranking.
Why isn’t the robots.txt file enough to prevent indexing of your content? Because if another indexed website links to your page, the crawlers will still index it! That’s why the noindex meta tag is so important, in addition to the robots.txt file.
Creating your own robots.txt file
To start off, simply create a new text file called “robots.txt” in the root directory of your website. WordPress generates a virtual robots.txt by default, and many other CMS platforms also provide one - if this is the case, you’ll only need to edit or override it (on typical hosting, a physical robots.txt placed in the site root, such as the public_html folder, takes its place).
If you don’t want to create the robots.txt file manually, you can use one of the trusted robots.txt generators available online. These let you customize your instructions using a simple UI and save them as a ready-made robots.txt file. You can then simply upload the file into the root directory of your website and enjoy a properly set-up robots.txt file. To make absolutely certain everything works as it should, we recommend using the Google robots.txt Tester - more on that later.
Robots.txt and SEO: best practices
There are several benefits of having a properly set-up robots.txt file for SEO, potentially boosting your position on Search Engine Results Pages (SERPs). The file helps keep duplicate content from being crawled, lets webmasters keep a portion of their site’s pages away from crawlers, limits poorly optimized pages’ influence on the website’s SEO, and provides easy access to the sitemap for web crawlers.
If you decide to use your own robots.txt file, be extremely careful - you don’t want to accidentally block any content on your website that you actually want people to see. If you’re not tech-savvy enough to create your own file from scratch, we recommend using a robots.txt generator instead.
When adjusting your robots.txt file for Google’s search engine, there are three user-agents you should keep in mind:
You can also choose to add the user-agents of Bing bot, Yandex bot, and any other crawlers you want. Remember to double-check every line of robots.txt: URL paths are case-sensitive and need to be written down exactly. Crawler names are matched case-insensitively by major crawlers such as Google’s, but it is safest to copy them exactly as documented. If Google’s crawler encounters confusing syntax, it most often prefers to restrict sections rather than leave them unrestricted, but the same can’t be said about all the other crawler bots.
Order of directives in robots.txt
The order in which the directives are written down in the robots.txt file is important. However, different crawlers handle that order in different ways. For example, Google’s and Bing’s crawlers treat longer directives with higher priority, which can allow them access to subdirectories of disallowed directories, while blocking access to the whole directory for most other crawlers. In case of conflicting rules, Google claims it uses the least restrictive one.
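The difference between order-based and length-based handling can be made concrete with a small sketch. Python's standard urllib.robotparser applies the first rule that matches, while the hypothetical helper longest_match_allowed below is a simplified stand-in for the longest-rule approach described above (plain prefixes only, no wildcards, a single group). The /blog/ paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a blocked directory with one allowed subdirectory.
rules = """\
User-agent: *
Disallow: /blog/
Allow: /blog/public/
"""

def longest_match_allowed(rule_lines, path):
    """Simplified longest-match evaluation (plain prefixes, no wildcards)."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for line in rule_lines:
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field not in ("allow", "disallow") or not value:
            continue
        if path.startswith(value):
            is_allow = field == "allow"
            # Longer rules win; on a tie, the less restrictive Allow wins.
            if len(value) > best_len or (len(value) == best_len and is_allow):
                best_len, allowed = len(value), is_allow
    return allowed

parser = RobotFileParser()
parser.parse(rules.splitlines())

path = "/blog/public/post.html"
first_match = parser.can_fetch("*", path)  # first matching rule: Disallow
longest_match = longest_match_allowed(rules.splitlines(), path)  # longest rule: Allow
print(first_match, longest_match)
```

With these rules the two strategies disagree on the same path, which is why listing the more specific Allow line first is a reasonable precaution when order-sensitive crawlers matter to you.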
When using the Disallow directive, crawlers will not access any pages that match the trigger. To avoid accidentally blocking crawlers from parts of your website you actually want to be crawled, be as specific as possible when writing down directives. Also, remember that for each crawler, only a single group of directives can be created - otherwise they will be confusing to the bot.
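Why specificity matters: rules match by URL prefix, so a short pattern can catch more than intended. The sketch below uses Python's standard urllib.robotparser with made-up paths; is_allowed is a small helper defined only for this example.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(rules, path):
    """Parse robots.txt text and check a path for a generic crawler."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", path)

broad = "User-agent: *\nDisallow: /private\n"
specific = "User-agent: *\nDisallow: /private/\n"

# "/private" is a prefix of "/private-offers.html", so the broad rule
# blocks a page that was probably meant to stay crawlable.
broad_result = is_allowed(broad, "/private-offers.html")
# The trailing slash limits the rule to the directory itself.
specific_result = is_allowed(specific, "/private-offers.html")
directory_result = is_allowed(specific, "/private/notes.html")
print(broad_result, specific_result, directory_result)
```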
Updating robots.txt file
Any changes to your website’s structure should be reflected in your robots.txt file as well. To make sure your pages are crawled properly, hold off on publishing changes to your site structure until the robots.txt file has been updated. Remember to update your sitemap as well.
Also, remember that any links from blocked parts of the website won’t count towards your SEO.
Validate your robots.txt with the Tester tool
Google provides a handy, free-to-use tool for testing your robots.txt file, called simply the robots.txt Tester. It checks whether your robots.txt file blocks Google’s crawlers from accessing any URLs on your website. The tool immediately highlights any syntax and logic errors, helping webmasters adjust their files easily.
You can test any URL on your website by typing it into the box at the bottom of the tester and selecting a user-agent from the dropdown list. After clicking the “Test” button, the tool will show you whether the request was accepted or blocked. You can even edit the file right on the tester page and copy the changes over to your robots.txt file. Keep in mind that the tool doesn’t directly edit your robots.txt file, and the changes won’t save by themselves.
The Google robots.txt Tester is a useful tool that webmasters should use to make sure their websites are crawled properly by Google. If there are problems with crawler access, your website might not show up in Google search results!
Look at other websites’ robots.txt
For all websites, the robots.txt document is a publicly accessible file - simply add /robots.txt after the domain name of any website! You can look at other websites’ robots.txt structure this way and compare it to yours.
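As a small illustration, the robots.txt address can be derived from any page URL by keeping the scheme and host and replacing the path. The helper name robots_url and the example URLs are hypothetical; the sketch only builds the address and does not download anything.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain); drop path and query.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/post?id=1"))
print(robots_url("https://shop.example.com/cart"))
```

Because the host is kept as-is, each subdomain resolves to its own robots.txt file, in line with the root-level rule described earlier.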
Webmasters often place little Easter eggs in the robots.txt file - a little joke or some ASCII art hidden in the document. You can try it out with any website you want. A couple of examples are:
- Nike’s website has “Just Crawl It” text on top of the robots.txt file and features a Nike logo ASCII art on the bottom,
- YouTube’s robots.txt file says “Created in the distant future (the year 2000) after the robotic uprising of the mid 90's which wiped out all humans.”
Is the page block in robots.txt sufficient?
Unfortunately, no. First of all, the main search engine robots don't always respect the bans (not to mention how some tools approach them). Secondly, even after reading the ban, Google may still add the page to the index, taking into account only its title and URL address, sometimes adding the following statement: "For this page information is not available."
So it is still possible to reach such a page from the search engine, although this is unlikely. What's more, bots may still pass through such pages when following links, even though the pages no longer pass on link juice and their ranking does not take their content into account.
Meta robots directives
In addition to the robots.txt file, webmasters can specify meta directives that give web crawlers further, more specific instructions. These meta tags are added to a page’s HTML, with different parameters to adjust their behavior. Meta directives let you give more specific instructions to the crawlers, since these parameters cannot be used in robots.txt.
Let’s take a look at some of the most useful parameters:
- Noindex - This parameter tells crawlers not to index this page,
- Index - Default behavior (you don’t need to add this parameter),
- Nofollow - Used for telling crawlers not to follow any links on the page. Prevents link equity from being passed through those links,
- Follow - Allows crawlers to pass link equity, even if the page isn’t indexed,
- None - Acts as both the nofollow and noindex parameters,
- Noimageindex - Prevents crawlers from indexing any images on the page,
- Noarchive - When used, this page will not display a cached link when displayed on search engine results pages,
- Unavailable_after - Tells crawlers to stop showing the page in search results after a set date.
Meta robots tags are placed in the <head> section of a page’s HTML code, and you can use more than one parameter, separated by commas. They are a much more reliable way of preventing a page or element from being indexed than robots.txt alone. However, to make sure crawlers can read them the page must not be disallowed in robots.txt. If you disallow a page inside your robots.txt, the meta directives can’t provide instructions to crawlers, as they won’t be able to access the page.
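To make the comma-separated format concrete, here is a minimal sketch that reads the robots meta tag from a page using Python's standard html.parser module. The class name MetaRobotsParser and the sample page are invented for illustration; real crawlers apply their own parsing rules.

```python
from html.parser import HTMLParser

class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Parameters are separated by commas; normalize case and spacing.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

# Hypothetical page that asks crawlers not to index it or follow its links.
page = """<html><head>
<title>Private page</title>
<meta name="robots" content="noindex, nofollow">
</head><body></body></html>"""

meta_parser = MetaRobotsParser()
meta_parser.feed(page)
print(sorted(meta_parser.directives))
```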
Apart from adding meta tags to the <head> section to influence indexing, you can send the same directives in the HTTP response headers, using the X-Robots-Tag header to instruct search engine crawlers, which gives you more control over the details.
To use this tag, add the following line to your HTTP response headers, and follow it with any meta directive parameters you want:
X-Robots-Tag:
Unlike meta directives, the X-Robots-Tag header can be used to prevent indexing of non-HTML content, block crawlers from accessing certain elements of a page, and add additional rules for the crawlers to follow.
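One way to attach the header to non-HTML files is to set it in the application that serves them. The sketch below is a minimal WSGI application, written under the assumption that PDF files should not be indexed; the app function, the file path, and the empty response body are all placeholders, and in practice the header is often configured in the web server instead.

```python
def app(environ, start_response):
    """Tiny WSGI app that marks PDF responses as noindex (illustrative only)."""
    path = environ.get("PATH_INFO", "/")
    is_pdf = path.lower().endswith(".pdf")
    content_type = "application/pdf" if is_pdf else "text/html; charset=utf-8"
    headers = [("Content-Type", content_type)]
    if is_pdf:
        # A PDF cannot carry a meta tag, so the directive goes in the header.
        headers.append(("X-Robots-Tag", "noindex, nofollow"))
    start_response("200 OK", headers)
    return [b""]  # placeholder body

# Call the app directly with a fake request to inspect the headers it sends.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

app({"PATH_INFO": "/files/report.pdf"}, fake_start_response)
print(captured["headers"].get("X-Robots-Tag"))
```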
Conclusion: should I worry about robots.txt?
While not all websites will need a robots.txt - in fact, it’s often safer not to have one, since you don’t risk accidentally blocking important crawlers - most could benefit from a properly set-up robots.txt file, enhancing SEO factors by quite significant margins in some cases. It doesn’t take a lot of effort to set such a file up, and you don’t need any programming knowledge to do so.
The bandwidth saved by limiting other crawlers’ access to your website, as well as by using the crawl-delay directive for some of them, can boost your SEO as well. Your website more likely than not also has pages and elements that could hurt your SEO, such as duplicate content and private pages that you do not want indexed. In such cases, using a combination of robots.txt, robots meta directives, and the X-Robots-Tag header allows website owners to fine-tune crawler behavior on their websites, choosing exactly what is indexed and what isn’t.
If you haven’t yet, try making your own robots.txt file - there are many potential benefits to setting one up, and with the available generators, it should only take a couple of minutes for smaller websites.