Robots.txt is a useful and relatively powerful tool for instructing search engine crawlers on how you want them to crawl your site.
It’s not a panacea (in Google’s own words, “It’s not a mechanism to exclude pages from Google”), but it can help prevent your site or server from being overloaded by crawler requests.
If you have crawl blocks in place on your website, you need to be sure they’re being used properly.
This is especially important if you use dynamic URLs or other methods of generating a theoretically unlimited number of pages.
In this guide, we’ll cover some of the most common problems with robots.txt files, how they can affect your site and your searches, and how to fix them if you think they’ve happened.
But first, let’s take a quick look at robots.txt and its alternatives.
What is Robots.txt?
Robots.txt is a plain text file that sits in the root directory of your website.
It must be in the topmost directory of your site; if you place it in a subdirectory, search engines will simply ignore it.
Despite its power, robots.txt is usually a relatively simple document, and a basic robots.txt file can be created in seconds using a plain text editor such as Notepad.
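As a sketch, a minimal robots.txt might look like the following; the domain and the /admin/ directory are placeholders, not recommendations for any particular site:

```
# Apply to all crawlers
User-agent: *
# Keep them out of a hypothetical admin area
Disallow: /admin/

# Optional: point crawlers at the sitemap
Sitemap: https://www.example.com/sitemap.xml
```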
There are other ways to achieve some of the same goals that robots.txt is commonly used for.
Individual pages can include a robots meta tag within the page code itself.
You can also use the X-Robots-Tag HTTP header to influence how (and whether) content is shown in search results.
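For comparison, the two alternatives look roughly like this; the values shown are illustrative:

```
# Robots meta tag, placed inside a page's <head> element:
<meta name="robots" content="noindex, nofollow">

# X-Robots-Tag, sent as an HTTP response header (handy for non-HTML files such as PDFs):
X-Robots-Tag: noindex
```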
What can Robots.txt do?
Robots.txt can achieve various results in a range of different content types:
It can prevent web pages from being crawled.
They may still appear in search results, but without a text description. Non-HTML content on the page will not be crawled either.
It can prevent media files from appearing in Google search results.
This includes images, video, and audio files.
If the file is public, it will still “exist” online and can be viewed and linked to, but this private content won’t show up in Google searches.
It can block resource files, such as unimportant external scripts.
However, this means that if Google crawls a page that requires such a resource to load, Googlebot will “see” a version of the page as if that resource didn’t exist, which may affect indexing.
You cannot use robots.txt to completely prevent a page from appearing in Google’s search results.
For that, you need another method, such as adding a noindex meta tag to the head section of the page.
How dangerous are Robots.txt errors?
Errors in robots.txt can have unintended consequences, but it’s usually not the end of the world.
The good news is that by repairing your robots.txt file, you can usually recover from any errors quickly and fully.
Google’s guidance for web developers says this on the subject of robots.txt errors:
“Web crawlers are generally very flexible and are usually not bogged down by small errors in the robots.txt file. In general, the worst that can happen is that incorrect [or] unsupported directives will be ignored.
Remember that Google can’t read minds when interpreting a robots.txt file; we have to interpret the robots.txt file we get. That said, if you know of problems in your robots.txt file, they are usually easy to fix.”
6 Common Robots.txt errors
- Robots.txt not in the root directory.
- Poor use of wildcards.
- Noindex in robots.txt.
- Blocked scripts and stylesheets.
- No sitemap URL.
- Access to development sites.
If your site is behaving strangely in the search results, your robots.txt file is a good place to look for any mistakes, syntax errors, and overreaching rules.
Let’s look at each of the above errors in more detail and see how to make sure you have a valid robots.txt file.
1. Robots.txt is not in the root directory
Search bots can only discover the file if it’s in your root folder.
That’s why there should be only a forward slash between the .com (or equivalent domain) of your website and the “robots.txt” filename in your robots.txt file’s URL.
If there’s a subfolder in there, your robots.txt file is probably not visible to search bots, and your website is probably behaving as if there were no robots.txt file at all.
To resolve this issue, move the robots.txt file to the root directory.
It’s worth noting that this requires you to have root access to the server.
Some content management systems upload files to the “media” subdirectory (or similar) by default, so you may need to circumvent this in order to place the robots.txt file in the correct location.
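To illustrate, a correctly placed robots.txt lives directly under the domain root; example.com is a placeholder here:

```
# Found and honored by crawlers:
https://www.example.com/robots.txt

# Ignored by crawlers, because it sits in a subdirectory:
https://www.example.com/media/robots.txt
```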
2. Improper use of wildcards
Robots.txt supports two wildcards:
- Asterisk (*): represents any instance of a valid character, like a Joker in a deck of cards.
- Dollar sign ($): denotes the end of a URL, allowing you to apply rules only to the final part of a URL, such as the filetype extension.
It’s sensible to adopt a minimalist approach to using wildcards, as they have the potential to apply restrictions to a much broader portion of your site than intended.
It’s also relatively easy to end up blocking bots from your entire site with a poorly placed asterisk.
To fix the wildcard problem, you need to find the incorrect wildcard and move or delete it so that your robots.txt file performs as expected.
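As a sketch of how the two wildcards behave in practice (all paths here are illustrative):

```
# $ anchors the rule to the end of the URL: blocks any path ending in .pdf
Disallow: /*.pdf$

# A plain prefix rule: blocks only URLs under /private/
Disallow: /private/

# A poorly placed asterisk: /* matches every path, blocking the whole site
Disallow: /*
```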
3. Noindex in Robots.txt
This is more common in sites that are several years old.
As of September 1, 2019, Google has stopped following the noindex rule in robots.txt files.
If your robots.txt file was created before that date, or it contains noindex instructions, you’re likely to see those pages indexed in Google’s search results.
The solution to this problem is to implement an alternative “noindex” method.
One option is the robots meta tag, which you can add to the head of any web page you want to prevent Google from indexing.
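A sketch of that meta tag in place; the surrounding markup is illustrative:

```
<head>
  <!-- Ask compliant crawlers not to index this page -->
  <meta name="robots" content="noindex">
</head>
```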
4. Blocked scripts and style sheets
It might seem logical to block crawler access to external JavaScript files and cascading style sheets (CSS).
However, keep in mind that Googlebot needs access to CSS and JS files to properly “view” your HTML and PHP pages.
If your pages behave strangely in Google’s search results, or Google doesn’t seem to be viewing them correctly, check that you’re preventing crawlers from accessing required external files.
A simple solution is to remove the line from your robots.txt file that is blocking access.
Or, if you really need to block certain files, insert an exception that restores access to the necessary CSS and JavaScript.
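One possible shape for such an exception, assuming a hypothetical /scripts/ directory and filename:

```
User-agent: *
# Block the scripts directory as a whole...
Disallow: /scripts/
# ...but let crawlers fetch the file needed to render pages correctly
Allow: /scripts/critical.js
```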
5. No sitemap URL
This one is more about SEO than anything else.
You can include the URL of your sitemap in the robots.txt file.
Because this is the first place Googlebot looks when crawling your site, this gives the crawler a head start in understanding your site’s structure and main pages.
While the absence of a sitemap shouldn’t negatively affect the actual core functionality and appearance of your site in the search results, it’s still worth adding your sitemap URL to robots.txt if you want to give your SEO efforts a boost.
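The sitemap reference is a single line that can go anywhere in the file; the URL below is a placeholder:

```
Sitemap: https://www.example.com/sitemap.xml
```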
6. Access to development sites
Blocking crawlers from your live site is a no-no, but so is allowing them to crawl and index pages you’re still developing.
It’s best practice to add a disallow instruction to the robots.txt file of a website under construction so the general public doesn’t see it until it’s finished.
Equally, it’s crucial to remove that disallow instruction when you launch the completed website.
Forgetting to remove this line from robots.txt is one of the most common mistakes among web developers; it can stop your entire website from being crawled and indexed correctly.
If your development site seems to be receiving real-world traffic, or your recently launched website is not performing at all well in search, look for a universal user agent disallow rule in your robots.txt file:
User-Agent: *
Disallow: /
If you see this when you shouldn’t (or don’t see it when you should), make the necessary changes to your robots.txt file and check that your site’s search appearance updates accordingly.
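If you want to check what a given set of rules actually does before (or after) deploying it, Python’s standard-library robots.txt parser offers a quick sanity check. A minimal sketch, using the blanket rule above with placeholder URLs:

```python
import urllib.robotparser

# Parse a robots.txt body directly, without fetching it over the network.
parser = urllib.robotparser.RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /",
])

# With the blanket disallow in place, no URL on the site may be crawled.
print(parser.can_fetch("Googlebot", "https://example.com/"))      # False
print(parser.can_fetch("Googlebot", "https://example.com/page"))  # False
```

Swapping in your own robots.txt lines lets you confirm that a rule blocks (or allows) exactly the paths you expect.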
How to recover from Robots.txt errors
If errors in robots.txt are adversely affecting your site’s search appearance, the most important first step is to correct robots.txt and verify that the new rules have the desired effect.
Some SEO crawling tools can help with this, so you don’t have to wait for the search engines to crawl your site again.
When you are confident that robots.txt is behaving as expected, you can try to recrawl your site as soon as possible.
Platforms such as Google Search Console and Bing Webmaster Tools can help.
Submit an updated sitemap and request a recrawl of any pages that were inappropriately removed.
Unfortunately, you’re at the whim of Googlebot; there’s no guarantee as to how long it might take for any missing pages to reappear in the Google search index.
All you can do is take the right steps to minimize that time and keep checking until Googlebot picks up the fixed robots.txt.
Final Thoughts
When it comes to robots.txt errors, prevention is better than cure.
On a large, revenue-generating site, a stray wildcard that removes your entire site from Google can have an immediate impact on earnings.
Edits to robots.txt should be made carefully by experienced developers, double-checked, and, where appropriate, subject to a second opinion.
If possible, test in a sandbox editor before pushing changes to your live server to ensure you don’t inadvertently create availability issues.
Remember, it’s important not to panic when the worst happens.
Diagnose the problem, make any necessary fixes to robots.txt, and resubmit the sitemap for a new crawl.
Your place in the search rankings should recover within a few days.
Featured Image: M-SUR/Shutterstock



