How and why to prevent bots from crawling your site

For the most part, robots and spiders are relatively harmless.

For example, you want Google’s robots to crawl and index your website.

However, bots and spiders can sometimes be a problem and generate unwanted traffic.

This unwanted traffic can result in:

Obscured traffic sources.
Confusing, hard-to-read reports.
Misattribution in Google Analytics.
Higher bandwidth costs.
Other nuisances.

There are good robots and bad robots.

Good bots run in the background and rarely attack other users or websites.

Malicious bots break the security behind a website or are used as part of large botnets to deliver DDoS attacks against large organizations (something a single machine could not take down).

Here’s what you should know about bots and how to prevent malicious bots from crawling your site.

What is a bot?

Knowing exactly what a bot is can help determine why we need to block it and stop it from crawling our site.

A bot, short for “robot,” is a software application designed to perform specific tasks repeatedly.

For many SEO professionals, using bots goes hand in hand with scaling SEO activities.

“Scaling” means you automate as much work as possible to get better results faster.

Common misconceptions about bots

You may be under the misconception that all bots are evil and must be explicitly banned from your site.

But this is far from the truth.

Google is a bot.

Can you guess what would happen to your search engine rankings if you blocked Google?

Some bots can be malicious and designed to create fake content or impersonate legitimate websites to steal your data.

However, bots are not always malicious scripts run by bad actors.

Some can be great tools to help SEO professionals do their jobs more easily, such as automating common repetitive tasks or scraping useful information from search engines.

Some common bots used by SEO professionals are Semrush and Ahrefs.

These bots scrape useful data from search engines, help SEO professionals automate and complete tasks, and can help make your SEO tasks easier.

Why do you need to stop bots from crawling your website?

While there are many good bots, there are also bad bots.

Bad bots can help steal your private data or take down otherwise working websites.

We want to block any bad bots we can spot.

Spotting every bot that might crawl your site isn’t easy, but with a little digging, you can find the malicious ones that you don’t want visiting your site anymore.

So why do you need to stop bots from crawling your site?

Some common reasons you may want to prevent bots from crawling your site include:

Protect your valuable data

Maybe you find that a plugin is attracting many malicious bots that want to steal your precious consumer data.

Or, you find that bots are exploiting security holes to add bad links on your website.

Or, someone keeps trying to use bots to spam your contact form.

Here, you need to take certain steps to protect your valuable data from bots.

Bandwidth overages

If you receive a lot of bot traffic, your bandwidth can also spike, resulting in unforeseen overages and charges you don’t want to have.

In these cases, you definitely want to prevent problematic bots from crawling your site.

You don’t want a situation where you pay thousands of dollars for bandwidth that only served bots.

What is bandwidth?

Bandwidth is the transfer of data from your server to the client (the web browser).

Every time data is sent over a connection, you use bandwidth.

When bots visit your site and waste bandwidth, you may incur overage charges for exceeding your monthly allotment.

You should have received at least some details about that allotment from your host when you signed up for your hosting plan.
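To get a rough idea of how much of that allotment bots consume, you could total the bytes served per user agent from your server’s access log. The sketch below assumes the log uses the common “combined” format written by Apache and Nginx; the sample lines and the name bytes_by_user_agent are hypothetical and only there for illustration.

```python
import re
from collections import Counter

# Tail of a "combined" log line: "request" status bytes "referer" "user-agent"
LOG_TAIL = re.compile(r'"[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"\s*$')

def bytes_by_user_agent(lines):
    """Return a Counter mapping each user-agent string to total response bytes."""
    totals = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        _status, size, agent = match.groups()
        totals[agent] += 0 if size == "-" else int(size)
    return totals

# Hypothetical sample lines, not real traffic.
sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /a HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 304 - "-" "Mozilla/5.0"',
]

# Largest consumers first.
for agent, total in bytes_by_user_agent(sample).most_common():
    print(f"{agent}: {total} bytes")
```

User-agent strings can be spoofed, so treat the totals as a starting point for digging rather than proof of who is behind the traffic.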

Limit bad behavior

If malicious bots somehow start targeting your site, it’s appropriate to take steps to control them.

For example, you want to ensure that the bot cannot access your contact form. You want to make sure that bots cannot access your website.

Do this before a bot corrupts your most critical files.

By making sure your website is properly locked and secured, you can stop these bots before they can do too much damage.

How to effectively block bots on your website

You can effectively block bots from visiting your website using two methods.

The first is through robots.txt.

This is a file located in the root directory of the web server. Often you won’t have one by default, so you will have to create it.

Here are some useful robots.txt rules you can use to block most spiders and bots from your site:

Block Googlebot from your server

If for some reason you wanted to completely block Googlebot from crawling your server, you would use the following code:

User-agent: Googlebot
Disallow: /

Only use this code if you want to keep your site from being indexed.

Don’t use it on a whim!

Have a specific reason for making sure you don’t want bots crawling your site at all.

For example, a common need is keeping your staging site out of the index.

You don’t want Google to crawl both the staging site and your real site, because that doubles your content and creates duplicate content issues.

Ban all bots from your server

If you want to completely block all bots from crawling your site, you can use the following code:

User-agent: *
Disallow: /

This is the code to ban all bots. Remember the staging site example above?

Maybe you want to exclude the staging site from all bots before fully deploying it.

Or, you might want to keep your website private for a while before releasing it to the world.

Either way, this will keep your site safe from prying eyes.

Prevent bots from crawling specific folders

If for some reason you want to prevent bots from crawling a specific folder, you can do that too.

Here is the code you will use:

User-agent: *
Disallow: /name-of-folder/

There are many reasons why someone might want to exclude bots from a folder. Maybe you want to make sure that some content on your site doesn’t get indexed.

Or maybe that particular folder is causing some type of duplicate content problem and you want to exclude it from crawling entirely.

Either way, this will help you do just that.
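Before deploying a rule like this, it can help to check what it actually blocks. The sketch below uses Python’s standard urllib.robotparser module to evaluate a folder rule against a couple of paths. The folder /private/, the sample paths, and the agent name ExampleBot are hypothetical, and Python’s parser does simple prefix matching, so real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every bot out of one folder.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def is_allowed(path, agent="ExampleBot"):
    """Return True if the rules let this agent fetch the given path."""
    return parser.can_fetch(agent, path)

print(is_allowed("/private/report.html"))  # path inside the blocked folder
print(is_allowed("/blog/post.html"))       # path outside the blocked folder
```

The first call should report the path as blocked and the second as allowed; if a path you care about comes back the wrong way, the rule needs another look.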

Common mistakes with robots.txt

SEO professionals make several mistakes when using robots.txt. The most common mistakes include:

Using both disallow in robots.txt and noindex.
Using a forward slash / (all folders down from root) when you really mean a specific URL.
Not including the correct path.
Not testing your robots.txt file.
Not knowing the correct name of the user agent you want to block.

Using both disallow in robots.txt and noindex on the page

Google’s John Mueller says you shouldn’t use disallow in robots.txt and noindex on the page itself at the same time.

If you do both, Google cannot crawl the page to see the noindex, so it may still index the page.

That’s why you should only use one, not both.

Using a forward slash when you really mean a specific URL

The forward slash after Disallow means “from this root folder on down, completely and entirely, for good.”

Every page on your site will be permanently blocked until you change it.

One of the most common issues I find in site audits is someone accidentally adding a forward slash to “Disallow:” and preventing Google from crawling their entire site.
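A small check can catch this mistake before it goes live. The sketch below scans robots.txt text and reports every user agent whose group contains a bare “Disallow: /”. The function name sitewide_blocks and the sample rules are hypothetical, and the grouping logic is a simplification of how crawlers read the file.

```python
def sitewide_blocks(robots_txt):
    """Return the user agents whose group contains a bare 'Disallow: /'."""
    blocked = []
    agents = []        # user agents of the group currently being read
    in_rules = False   # True once the current group has at least one rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.extend(a for a in agents if a not in blocked)
    return blocked

# Hypothetical file: the first group blocks the whole site by accident.
sample = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /staging/
"""

print(sitewide_blocks(sample))
```

An empty result means no group blocks the entire site; anything else is worth confirming as intentional.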

Not including the correct path

We understand. Sometimes writing robots.txt can be a tough job.

You couldn’t remember the exact path at first, so you went through the file and guessed.

The problem is that such near-miss paths all result in 404s because they are one character off.

That’s why it’s important to always double-check the paths you use for specific URLs.

You don’t want to risk adding a path to robots.txt that matches nothing on your site.

Not knowing the correct name of the user agent

If you want to block a specific user agent, but you don’t know the name of that user agent, that’s a problem.

Instead of using a name you think you remember, do some research and find out the exact name of the user agent you need.

If you’re trying to block a specific bot, that name becomes extremely important in your work.

Why stop robots and spiders?

There are other reasons SEO professionals want to stop bots from crawling their sites.

Maybe they’re deep in grey hat (or black hat) PBNs and they want to hide their private blog network from prying eyes (especially their competitors).

They can do this by using robots.txt to block the common bots that SEO professionals use to evaluate their competition.

For example, Semrush and Ahrefs.

If you want to block Ahrefs, use the following code:

User-agent: AhrefsBot
Disallow: /

This will prevent AhrefsBot from crawling your entire website.

If you want to block Semrush, this is the code to do so.

Semrush also publishes further instructions.

There are many lines to add, so take care when adding them:

To stop SemrushBot from crawling your site for different SEO and technical issues:

User-agent: SiteAuditBot
Disallow: /

To stop SemrushBot from crawling your site for the Backlink Audit tool:

User-agent: SemrushBot-BA
Disallow: /

To stop SemrushBot from crawling your site for the On Page SEO Checker tool and similar tools:

User-agent: SemrushBot-SI
Disallow: /

To stop SemrushBot from checking URLs on your website for the SWA tool:

User-agent: SemrushBot-SWA
Disallow: /

To stop SemrushBot from crawling your site for the Content Analyzer and Post Tracking tools:

User-agent: SemrushBot-CT
Disallow: /

To stop SemrushBot from crawling your site for Brand Monitoring:

User-agent: SemrushBot-BM
Disallow: /

To stop SplitSignalBot from crawling your site for the SplitSignal tool:

User-agent: SplitSignalBot
Disallow: /

To stop SemrushBot-COUB from crawling your site for the Content Outline Builder tool:

User-agent: SemrushBot-COUB
Disallow: /
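With this many near-identical groups, typing each one by hand invites typos. One option is to generate the groups from a list of user-agent names, as in the sketch below. The function name block_rules is hypothetical; the names in the list are the ones given above, and the output uses the standard Disallow directive.

```python
def block_rules(user_agents):
    """Build robots.txt groups that disallow the whole site for each agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    return "\n\n".join(groups) + "\n"

# User-agent names copied from the Semrush rules listed above.
SEMRUSH_AGENTS = [
    "SiteAuditBot",
    "SemrushBot-BA",
    "SemrushBot-SI",
    "SemrushBot-SWA",
    "SemrushBot-CT",
    "SemrushBot-BM",
    "SplitSignalBot",
    "SemrushBot-COUB",
]

print(block_rules(SEMRUSH_AGENTS))
```

The printed text can be pasted into robots.txt after you have checked each name against the vendor’s current documentation.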

Block bots with your .htaccess file

If you are on an Apache web server, you can block specific bots using your website’s .htaccess file.

For example, here’s how you would use code in .htaccess to block AhrefsBot.

Note: Use this code with care.

If you don’t know what you are doing, you could bring down your server.

We provide this code here for example purposes only.

Make sure you do your own research and practice before adding it to a production server.

# Apache 2.2-style directives (mod_access_compat on Apache 2.4)
Order Allow,Deny
Deny from 51.222.152.133
Deny from 54.36.148.1
Deny from 195.154.122
Allow from all

For this to work, make sure you block all the IP ranges listed in this article on the Ahrefs blog.
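To sanity-check which visitors a deny list would catch, you could model it with Python’s standard ipaddress module. The sketch below assumes that a partial address such as 195.154.122 covers the whole 195.154.122.0/24 subnet, which is how Apache treats partial addresses; the function name is_denied and the test addresses are hypothetical.

```python
import ipaddress

# Deny entries from the example above. The partial address 195.154.122
# is modeled as the /24 network it stands for.
DENIED = [
    ipaddress.ip_network("51.222.152.133/32"),
    ipaddress.ip_network("54.36.148.1/32"),
    ipaddress.ip_network("195.154.122.0/24"),
]

def is_denied(address):
    """Return True if the address falls inside any denied network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in DENIED)

print(is_denied("195.154.122.77"))  # inside the denied /24 subnet
print(is_denied("203.0.113.10"))    # not covered by any entry
```

This only mirrors the list for checking purposes; the server itself still enforces whatever is in the .htaccess file.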

If you want a comprehensive look at .htaccess, check out this tutorial on Apache.org.

If you need help blocking specific types of bots using your .htaccess file, you can follow the tutorial here.

Stopping bots and spiders may take some work

But it’s worth it in the end.

By making sure to stop bots and spiders from crawling your site, you won’t fall into the same traps as everyone else.

You can rest easy knowing that your website is protected against certain automated processes.

Things get better for you, the SEO professional, when you can control these specific bots.

If you need to, always make sure to block the necessary bots and spiders from crawling your site.

This will lead to greater security, a better overall online reputation, and better websites for years to come.

Featured image: Roman Samborskyi/Shutterstock
