How the Google search engine processes your site's robots.txt file

Automatic search engine robots follow the Robots Exclusion Protocol (REP): before crawling a site, the search engine reads the robots.txt file to determine which sections of the site are allowed or forbidden for indexing. The protocol does not apply to tools controlled by users or used for security purposes (for example, scanning for malicious software).

This article explains in detail how REP directives are interpreted. The original specification can be found in RFC 9309.

What the robots.txt file is for your site in Google

If you do not want certain parts of your site to be indexed by search engines, create a robots.txt file with the necessary rules. It is a plain text document that specifies which search bots are allowed access and which are denied. An example of the file structure:

User-agent: *
Disallow: /includes/
User-agent: Googlebot
Allow: /includes/
Sitemap: https://example.com/sitemap.xml

If you are new to robots.txt, start by studying the basics and practical tips for creating one.

File location and scope for your site in Google

The robots.txt file must be located in the root directory of the site and be accessible over a supported protocol. The search engine takes the protocol, port, and domain name into account: the file applies only to the host on which it is located, including that protocol and port.

Examples of valid URLs for the robots.txt file in Google

Here are examples of valid robots.txt file locations and their scope:

  • https://example.com/robots.txt - applies only to this domain, protocol, and port.
  • https://www.example.com/robots.txt - applies only to the www subdomain.
  • https://example.com/folder/robots.txt - not a valid location; the file must be in the root.
  • ftp://example.com/robots.txt - applies only to the FTP protocol.

How Google handles errors and response codes from your site's server

The search robot's behavior depends on the HTTP status code returned when the file is requested:

  • 2xx - the file is fetched and processed.
  • 3xx - redirects are followed; more than five redirects are treated as a 404.
  • 4xx (except 429) - the file is considered absent, and crawling is not restricted.
  • 5xx - crawling is suspended or postponed, depending on the circumstances.

How the Google search engine caches robots.txt

The contents are cached for up to 24 hours, and sometimes longer when the file cannot be fetched. The Cache-Control header may affect how long the cached copy is kept.
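For example, a server response header like the following (an illustrative value, not a requirement) signals that the cached copy may be treated as fresh for 12 hours:

Cache-Control: max-age=43200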

Robots.txt format for your site in Google

The file must be plain text in UTF-8 encoding. Line breaks in any format (CR, LF, CRLF) are accepted. Invalid lines are ignored, as are a BOM and unsupported characters.

The maximum allowed file size is 500 KiB; anything beyond that limit is ignored.

Robots.txt rule syntax for your site in Google

Each line consists of a field, a colon, and a value. The following fields are supported:

  • user-agent - specifies which bot the rule applies to;
  • disallow - forbids access to a given path;
  • allow - permits access to a path (even if disallow rules exist);
  • sitemap - indicates the location of the site's XML sitemap.

user-agent in the robots.txt file in Google

This is the name of the search bot to which the rules apply. The value is not case-sensitive.
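For example, the following group (with an illustrative path) applies to Googlebot no matter how the bot name is capitalized in the file:

User-agent: googlebot
Disallow: /drafts/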

disallow: blocking pages of your site in Google

Prohibits access to the specified path. If no path is given, the rule is ignored. The value is case-sensitive.
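A minimal sketch with illustrative paths: because values are case-sensitive, the first rule blocks /Private/ but not /private/, and the disallow line without a path is ignored:

User-agent: *
Disallow: /Private/
Disallow: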

allow: permitting crawling of site content in Google

Allows access to a URL. It works in conjunction with other rules; when rules conflict, the least restrictive one is applied.
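For example, in the sketch below (illustrative paths) the allow rule opens a single page inside a directory that is otherwise blocked; when both rules match a URL, the longer, less restrictive one wins:

User-agent: *
Disallow: /archive/
Allow: /archive/public-report.html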

sitemap: specifying the sitemap in Google

The sitemap URL must be specified in full. The field can be repeated, and the sitemap may be located on another domain. It is not tied to a specific bot.
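For example (with placeholder URLs), the field may appear more than once and may point to a different domain; it sits outside any user-agent group:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://cdn.example.net/sitemap-images.xml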

Grouping rules by user-agent for your site in Google

You can specify several groups with different or identical user-agent values. For example:

user-agent: a
disallow: /private
user-agent: b
disallow: /temp
user-agent: c
user-agent: d
disallow: /files

Priority of rules by user-agent for your site in Google

Each bot uses only one group of rules: the one whose user-agent name matches it most specifically. The general rules under * are used only when no more specific group exists.

An example of user-agent processing in robots.txt for your site in Google

user-agent: bot-news
disallow: /news-private
user-agent: *
disallow: /
user-agent: bot
disallow: /all

The bot-news bot uses the first group, bot uses the third, and all other bots use the second.

How URL paths in robots.txt rules are matched in Google

Matching the path against the URL is case-sensitive and supports special characters:

  • * - matches any number of characters;
  • $ - marks the end of the URL.

Examples of path matching in robots.txt for your site in Google

  • / - matches all pages;
  • /$ - matches only the root;
  • /fish - everything that begins with /fish;
  • /*.php$ - URLs ending in .php.
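Putting these patterns together, a minimal illustrative file might look like this:

User-agent: *
# /fish matches everything that begins with /fish, e.g. /fish, /fishing, /fish/salmon
Disallow: /fish
# /*.php$ matches any URL whose path ends in .php
Disallow: /*.php$
# /$ matches only the root URL of the site
Allow: /$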

The priority of allow and disallow rules in the robots.txt of your Google website

When rules with paths of different lengths conflict, the longer (more specific) path is used. When the lengths are equal, the less restrictive rule wins.

Examples:

  • Allow: /private
    Disallow: / - the allow rule is used;
  • Allow: /page
    Disallow: /*.htm - the disallow rule is used, since its path is longer.
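As a concrete sketch with illustrative paths, the comments indicate the outcome for specific URLs:

User-agent: *
Disallow: /
Allow: /private
# For /private/report.html both rules match; Allow: /private has the longer path, so the URL may be crawled.
# For /about only Disallow: / matches, so the URL is blocked.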

For any questions about configuring robots.txt for your site, as well as other aspects of SEO, you can contact the team of the SEO company seo.computer by email: info@seo.computer or via WhatsApp: +79202044461

Send a request and we will provide a consultation on SEO promotion of your website