How can I set up robots.txt?


In the file robots.txt you can specify how search robots should behave on your site.

Note: It is not possible with this technique to protect websites from being accessed by robots or people. You can only influence how your site appears in search results.
Note: There is no guarantee that search engines will adhere to the prohibitions in robots.txt. The vast majority of robots of modern search engines take the presence of a robots.txt into account, read it and follow its instructions. Robots maliciously crawling the web are unlikely to abide by it.

If you want to protect your content from unauthorized access, read the relevant sections on configuring web servers, for example.

General

The so-called Robots Exclusion Standard regulates how you can use a robots.txt file to influence the behavior of search engine robots on your domain. Even without an RFC, this protocol has grown into a quasi-standard.

It is true that the use of a page can also be controlled for search engines with the help of a meta element in individual HTML files, but this only applies to the individual HTML file and at most to the pages reachable from it through links, not to other resources such as images. In a central robots.txt, on the other hand, you can specify rules for directories and directory trees, regardless of the file and link structure of your web project. As there is no written RFC, the robots.txt and its syntax are not always interpreted uniformly by the robots. The additional use of meta elements in HTML files is therefore recommended in cases of undesired indexing, if a robot did not interpret the robots.txt or did not interpret it correctly.
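As an illustration of such a per-file instruction, a minimal sketch of an HTML head with a robots meta element could look like this (the combination noindex, nofollow is just one possible value; page title and content are made up for the sketch):

  <head>
    <!-- keep this single page out of the index and do not follow its links -->
    <meta name="robots" content="noindex, nofollow">
    <title>Example page</title>
  </head>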

Location

The robots.txt file (there can be at most one such file per (sub)domain) must be stored under exactly this name (all letters in lower case) in the root directory of the domain's web files. The URI for the domain example.org is therefore http://example.org/robots.txt. Only in this way can it be found by search engine robots that visit the project. This means that you can only use the robots.txt technique if you have your own domain, but not with web space offers where you only get a homepage directory on a server without access to the root directory of the domain.
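As an illustration of the one-file-per-(sub)domain rule, the following URIs (the subdomain blog.example.org is made up for this sketch) would each point to a separate, independent robots.txt:

  http://example.org/robots.txt         applies to example.org
  http://blog.example.org/robots.txt    applies only to the subdomain blog.example.org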

The robots.txt is a pure text file and can be edited with any text editor.

Structure of a robots.txt

  # robots.txt for http://www.example.org/

  User-agent: UniversalRobot/1.0
  User-agent: mein-Robot
  Disallow: /quellen/dtd/

  User-agent: *
  Disallow: /fotos/
  Disallow: /temp/
  Disallow: /fotoalbum.html
The first record forbids the robots UniversalRobot/1.0 and mein-Robot to index data from the directory /quellen/dtd/ and all of its subdirectories.

In the second record, all robots are forbidden to read the two subdirectories /fotos/ and /temp/. In addition, access to the file fotoalbum.html is forbidden.

The first line is just a comment line. Comments are introduced by a hash sign (#) and may also begin later within a line.

A robots.txt consists of records, which in turn basically consist of two parts. The first part specifies which robots (User-agent) the following statements apply to. In the second part, the instructions themselves are noted. The instructions forbid something to the previously named robots (Disallow).

Each line of a record begins with one of the two permitted keywords, User-agent or Disallow. This is followed by the relevant information, separated by a colon and a space. A blank line is placed between the records.

User agent

Within a record, at least one line must begin with User-agent. Only one entry is possible after each such line. If you want to address more than one particular robot, you have to write several lines beginning with User-agent one below the other, as in the first record of the example above.

Either the wildcard * (asterisk), which means “all robots”, or the name of a certain robot, whose name you must know, is allowed. No distinction is made between upper and lower case. More than one record for all robots is not allowed.

Disallow

The lines that begin with Disallow are noted below the lines that begin with User-agent. These entries are then taken into account by the robots that were specified in the same record. Empty lines within a record are not permitted.

After Disallow: in each such line you can write down a path. The robots will then not index any path on your site that begins with this path specification.

The statements are processed in sequence from the first to the last line. The first entry that matches the path to be checked wins.

  User-agent: *
  Disallow: /photos/
  Disallow: /photos/vacation/
The entry for /photos/vacation/ is superfluous, as this path is already covered by /photos/, which was excluded in the previous line.

Disallow does not limit you to entire directory or file names. Partial paths are also possible: /bild matches (besides /bild itself) /bild/urlaub just as well as /bilder or bild123.jpg. You should therefore make sure to write a trailing slash in directory paths.
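A small sketch of this difference (the two Disallow lines are shown together only for illustration; in practice the first already covers the second):

  User-agent: *
  Disallow: /bild      # matches /bild itself, /bild/urlaub, /bilder and /bild123.jpg
  Disallow: /bild/     # matches only the directory /bild/ and everything below it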

Placeholders such as * or $ are only known to some search engines and should therefore be avoided.

Examples

Disallow everything for all robots
  User-agent: *
  Disallow: /
With Disallow: / you block all data in the root directory and all subdirectories.
Exclude mein-Robot from all prohibitions
  User-agent: mein-Robot
  Disallow:

  User-agent: *
  Disallow: /
If there is no information after Disallow:, everything is allowed.
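Conversely, a record that explicitly allows all robots everything simply leaves the path after Disallow empty; a sketch:

  User-agent: *
  Disallow:

Such a file has the same effect as having no robots.txt at all.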

Extensions to the original protocol

Even the original protocol is merely a recommendation; the extensions presented here have been supported by Google, Microsoft and Yahoo! since 2008.[1]

Order of user agents

Originally, the robots.txt was processed strictly from top to bottom. Therefore, the instructions for all robots (User-agent: *) had to be placed at the very end. In addition, the name of a user agent had to be known exactly, including its upper and lower case.

The user agent information is now only matched against the beginning of the robot's user agent string; an entry such as google is therefore equivalent to google*.

Only one record of the robots.txt is applied to a robot. The robot must therefore determine the record that applies to it by finding the record with the most specific user agent information that still matches. All other records are ignored. The order of the records is therefore no longer important.

  User-Agent: *            # instructions 1
  User-Agent: google       # instructions 2
  User-Agent: google-news  # instructions 3
A robot named googlebot follows instructions 2, one named google-news-europe follows instructions 3; a robot named MyGoogle only follows the instructions for all robots (instructions 1).

Allow

The original protocol did not provide a way to explicitly allow individual files or directories to be indexed.

Allow was introduced in a draft from 1996[2] in order to enable individual exceptions within otherwise blocked paths. It is not necessary to explicitly allow objects that no other entry in the robots.txt matches.

  User-agent: *
  Allow: /pics/public/
  Disallow: /pics/
The directory named pics is excluded from indexing; images that are still allowed to appear in search results are located in /pics/public/.
  User-agent: *
  Allow: /public/
  Disallow: /
The entire domain is blocked here; only the directory public is accessible to search engine robots.

It should be noted that the entries in the robots.txt have traditionally been processed in sequence until the first match. Accordingly, Allow entries for paths that were previously excluded by Disallow are actually ineffective:

  User-agent: *
  Disallow: /pics/
  Allow: /pics/public/
/pics/public/ is not indexed, because the line before already pronounces a prohibition for the higher-level /pics/.

Google suspected that this procedure would unintentionally exclude paths from indexing and changed the processing sequence for its own index: first all Allow entries are checked one after the other, only then are the Disallow entries processed.

Since this deviation is only documented for Google, and since the order makes no difference to Google, you should still observe the line-by-line order with regard to other search engines.

Wildcards

The extended protocol recognizes two wildcard characters for the path information:

  • * : any number of characters
  • $ : end of the URL
  User-agent: *
  Disallow: /priv*/   # all subdirectories that start with "priv"
  Disallow: /*priv*/  # all subdirectories that contain "priv"
  Disallow: /*.jpg$   # all files that end in ".jpg"

Sitemap

A sitemap contains the structure of your website in machine-readable form.

In the robots.txt you can enter the complete URI of the sitemap, which, unlike the robots.txt itself, can be stored anywhere under any name.

Sitemap: http://example.com/sitemap.xml

Although a sitemap may contain useful additional information for search engines[3], it only makes sense to create one in a few cases, for example for very large or very complex sites. Make sure that your pages are linked to each other: if a human visitor can find all pages, every bot can be trusted to do so as well.
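For orientation, a minimal sitemap following the sitemaps.org protocol[3] could look like this sketch (URL and date are made up):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://example.com/</loc>
      <lastmod>2024-01-15</lastmod>
    </url>
  </urlset>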

Approach recommended by Google

To ensure that certain pages are not indexed by Google, a "ban" via robots.txt is very unreliable. If the Googlebot becomes aware of the page via an external link, for example, it will still pick it up.

In order to reliably prevent pages from ending up in the Google index, the meta element

<meta name="robots" content="noindex">

must be specified in the relevant page.

In order to remove pages from the Google index, access must not be forbidden in the robots.txt and the noindex tag must be set.[4]

However, this does not work for non-HTML resources, since a PDF file, for example, cannot contain such a meta element. In this case the X-Robots-Tag HTTP header can be used.[5]
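As a sketch of how such a header could be sent, assuming an Apache web server with mod_headers enabled (the file pattern is only an example):

  <FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
  </FilesMatch>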

Note: This recommendation comes from Google, for Google. It cannot be transferred 1:1 to other search engines.

Sources

  1. ↑ Google Webmaster Central Blog: Improvements to the Robots Exclusion Protocol
  2. ↑ robotstxt.org: Extended Draft 1996
  3. ↑ sitemaps.org: sitemaps protocol
  4. ↑ Hacker News: 20326445
  5. ↑ Google developer: robots meta tag
