6 min read

Robots.txt

Written by
Search Historian
Edited by
Emanuel Skrobonja
TL;DR: 
The robots.txt file is a set of instructions for all crawlers visiting your website. It informs them about pages that shouldn’t be crawled.

What is a Robots.txt File?

The robots.txt file is a set of instructions for all crawlers visiting your website. It informs them about pages that shouldn’t be crawled.

The robots.txt file is most commonly used to instruct search engine crawlers, but it can also be used to give instructions to any other type of crawler, e.g. the OpenAI crawler.

The robots.txt file for any website can be found by visiting:

https://www.your-domain.com/robots.txt

Robots.txt Importance for SEO

You can learn more about this in our crawling and indexing guide, but for now, it’s enough to know that search engines assign a crawl budget to each website based on how much trust it has earned.

This means that each website has a limited number of pages that search engine crawlers will crawl and index in any given day, week, or month.

Though a limited crawling budget should not be an issue for most smaller websites, it becomes a serious concern for massive websites. 

Non-SEO Page Blocking

Websites of any size will have plenty of pages that don’t need to be indexed by search engines; a sample set of blocking rules follows the list below.

Usually, pages that don’t need to be indexed are:

  • Staging pages
  • Internal search result pages
  • Duplicate pages
  • Landing pages for specific marketing campaigns
  • Gated content pages (that require a login)
  • Non-SEO content like client pages or case studies (it depends!)
  • Others
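As a rough sketch, blocking rules for a site with these kinds of pages might look something like the example below. The paths here are hypothetical placeholders, so swap in the slugs your site actually uses:

User-agent: *
Disallow: /search/
Disallow: /campaigns/
Disallow: /gated/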

Re-Crawling Benefits

By using robots.txt correctly, we can minimize the number of pages that search engines need to focus on.

A crawling budget is fundamentally the amount of resources that crawlers will spend analyzing your site.

This means that the same crawling budget can be spent on reindexing updated pages and keeping an up-to-date index on pages that you want to rank for.

Hiding Pages

Another great use case for robots.txt is to hide certain content (pages) on your website.

Keep in mind that by hiding pages from crawlers, you are also (in most cases) preventing those pages from showing up in search results.

So while you might want to hide protected pages, you should avoid hiding your login and signup pages, because some users will still search for “[your brand] create account”, which makes this search term a navigational keyphrase.
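As a minimal sketch, assuming your gated content lives under a made-up /members/ path, you could block that area while leaving /login and /signup untouched. Since allowing is the default behavior, those pages need no rule at all:

User-agent: *
Disallow: /members/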

How to Add Robots.txt in Webflow?

Adding your robots.txt file to your Webflow website is very simple. 

Go to Website Settings > SEO > Indexing and paste your robots.txt rules under “Robots.txt”.

Click on Save Changes and publish your website. That’s it!

Robots.txt Syntax

Wondering what you should add to your robots.txt instructions?

That depends on the goals and structure of each website, but let’s look at the basic syntax used in robots.txt files.

Disallow Crawling

Robots.txt syntax is quite simple.

It consists of two main parts:

  • User-agent identification (naming the crawler bot)
  • Rules and directives (giving orders to that bot)

It looks something like this:

User-agent: *
Disallow: /clients/

The instructions above are:

No crawler bot should crawl our /clients/ path.

Asterisk (*) Means All

Using User-agent: * doesn’t target a specific crawler; it gives directives to all crawler bots.

An asterisk (*) is a wildcard that stands for “all”.
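For comparison, to target one specific crawler instead of all of them, you name it in the User-agent line. A quick sketch, assuming you wanted to keep OpenAI’s GPTBot off your entire site:

User-agent: GPTBot
Disallow: /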

Allow Crawling

You don’t need to add anything to your robots instructions to allow web page crawling. It’s the default behavior of crawlers.

Therefore you should only provide instructions about pages you don’t want crawled.

However, explicit allow rules become useful when we disallow crawling for all bots but want to override that for specific bots.

Let’s make instructions so that Google can crawl our website, but other bots can’t.

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

We said we would allow Googlebot, but we still used a Disallow line… Why?

Empty Disallow = Allow

Not adding anything after Disallow: means we are allowing crawlers to crawl everything on this website:

Disallow: 

You could also use the Allow directive to reach the same goal:

Allow: /

This directive right here says “Crawl everything on this domain”.

URLs and Directories Inside Robots.txt

To fully understand how to use robots.txt for any use case, let’s use a real-world example that you have probably encountered.

Let’s say you have a blog in Webflow that consists of two CMS Collections, a static blog page, and a static folder of guides:

  • /article/article-name | CMS Collection for Articles
  • /blog/category-name | CMS Collection for Article Categories
  • /blog | Static Page for all blog posts
  • /guides/page-name | Static Page Folder with a few different pages inside it

How to Allow Everything?

To allow crawlers to crawl everything on this imaginary site of yours, we can use either of the following sets of rules:

User-agent: *
Disallow:

or

User-agent: *
Allow: /

Note that in robots.txt “/” stands for your whole domain. So the first directive was “disallow nothing” and the second directive was “allow everything”.

How to Disallow Just One Page?

To disallow just one specific page, be it a static or a CMS Collection page, you will need to know its relative path or slug. 

Let’s say we don’t want /article/january-update crawled, because it’s just an update for our clients and has no SEO value.

Then our instructions would look something like this:

User-agent: *
Disallow: /article/january-update

How to Disallow the Whole CMS Collection or Page Folder?

In the last example, we disallowed crawling for a specific single page, but we can also disallow groups of pages. 

To disallow crawling for a group of pages we will need to block the whole directory.

In web development, a directory means a folder. In Webflow, both static page folders and CMS Collection URLs act as directories.

To block a directory from crawling, our instructions would look like this:

User-agent: *
Disallow: /blog/

This instruction says: don’t crawl any pages inside the /blog/ directory, which in our example is the Blog Category CMS Collection. We define that directory by starting and ending our rule with a slash (/).

This means crawlers won’t crawl any pages in that directory, even new CMS Collection items you add to the collection later.

Combining Rules

Here is the full set of instructions we want:

  • Don’t crawl the January Update article
  • Don’t crawl any of the Blog Category pages
  • Don’t crawl any of the static pages inside the Guides folder

Let’s see how our robots.txt file looks when we put everything together:

User-agent: *
Disallow: /article/january-update
Disallow: /blog/
Disallow: /guides/

Sitemap: https://www.your-domain.com/sitemap.xml

You’ll notice that we also gave the location of our sitemap, using a full URL. The Sitemap line is optional, but when you do provide it, the full URL is mandatory.

Doing so will allow search engine crawlers to find your sitemap, even if you don’t submit it to them.

However, if you also submit your sitemap to webmaster tools like Google Search Console, your site might get crawled faster, since search engines don’t have to discover your website on their own first.

Most Common Robots.txt Mistakes

There are a few common issues that you should look out for when creating and managing your robots.txt file.

Difference Between Folder and Static Page

Choosing to add or not add a slash (/) at the end of your URL path determines exactly which URLs get blocked.

Using our previous website example, let’s look at the meanings behind two rules:

Disallow: /blog/

and

Disallow: /blog

The first rule blocks crawling of any page inside the /blog/ directory (your Blog Category CMS Collection), but not the /blog static page itself.

The second rule is a prefix match: it blocks the /blog static page and every URL whose path starts with /blog, including all of the /blog/ pages.
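Using the example site from earlier, here is roughly how the two rules match (standard prefix matching; individual crawlers can differ slightly in edge cases):

Disallow: /blog/
Blocked: /blog/category-name
Not blocked: /blog

Disallow: /blog
Blocked: /blog, /blog/category-name, and anything else whose path starts with /blog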

Values are Case Sensitive

The URL paths you add after the Allow and Disallow directives are case-sensitive.

This means that if you add this rule:

Disallow: /Blog/*

It will NOT block crawling of slugs that start with /blog/, because the path contains an uppercase letter.

In other words, /blog/ and /Blog/ are not the same thing. Be careful about it.
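If your site really does serve both lowercase and uppercase versions of a path (unusual, but it happens), you would need a separate rule for each variant, for example:

User-agent: *
Disallow: /blog/
Disallow: /Blog/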

Complex Patterns in Robots.txt

Let’s end this article with a small pro tip for more advanced robots.txt use cases.

You can use the * and $ operators to add more logic to your robots.txt file:

  • * represents any character sequence
  • $ marks the end of the URL (so you can match, for example, a specific file extension)

While these are not full regular expressions, the * and $ wildcards will allow you to create more complex robots.txt rules.

Learn more about regex or get help from ChatGPT if you ever need something more complex written.
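For instance, here are two commonly used patterns. These are sketches only, so adapt the paths and file types to your own site:

User-agent: *
# Block any URL that contains a query string (e.g. internal search results)
Disallow: /*?
# Block any URL that ends in .pdf
Disallow: /*.pdf$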

Using Meta Robots Tag in HTML

The last thing you need to know is that you can also give crawlers commands inside the <head> code of each page.

<meta name="robots" content="noindex">

This example would tell crawlers not to index that page. 

This becomes extremely important when you have a page that Googlebot already indexed, but you want it removed from search results.

Removing Indexed Pages Can Be Challenging

The problem with removing pages from search results is that the crawler needs to crawl the page once again. 

If you add a Disallow rule to robots.txt, the crawler will never crawl that page again. The page would stay indexed, and because it is never re-crawled, the crawler never sees your noindex tag.

You can learn more about this in our XML sitemap or indexing guides, but for now, it’s enough to know that bots take two steps:

  • Crawl (find)
  • Index (save)

That’s an oversimplification, but it’s enough to understand why already indexed pages need to be crawled again: robots.txt rules only tell bots how to find (crawl) pages; they don’t forbid indexing.

The <meta name="robots" content="noindex"> tag, on the other hand, asks them not to index the page and not to store it in their database.

How to Remove Indexed Pages from Search Results?

To remove any page from the search engine index, you should take these steps (a short sketch putting the pieces together follows the list):

  • Add meta robots noindex tag to the page
  • Wait for a search engine to crawl the page again
  • Check your Google Search Console to see if the page is not indexed anymore
  • Add robots.txt disallow rule that prevents bots from crawling the page
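Putting it together with the january-update example from earlier, step 1 goes in the page’s <head>:

<meta name="robots" content="noindex">

And only after Search Console confirms the page is out of the index does the final rule go into robots.txt:

User-agent: *
Disallow: /article/january-update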