Robots.txt is an exclusion protocol that tells web crawlers to ignore certain pages, folders, or files on a website, and it is used to improve search engine optimization.
The robots exclusion standard, also referred to as the robots exclusion protocol (robots.txt for short), is a file that tells search engine spiders which web pages or sections of a website not to crawl. It is important to set up robots.txt correctly, as a single mistake can get an entire website deindexed from search engines.
Robots.txt is an important part of SEO, as all major search engines recognize and obey this exclusion standard.
The majority of sites do not need this protocol, as Google will usually index only the important pages of a website and leave out the rest (e.g. duplicate pages), but there are cases in which using robots.txt is recommended. The robots exclusion standard can be used to keep multimedia resources (e.g. images) out of search results, to block pages that are not public (e.g. member login pages), and to maximize the crawl budget.
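For instance, a site that wants to keep an image folder and a member login area away from crawlers might use rules such as the following (the paths here are placeholders; the syntax itself is explained below):
User-agent: *
Disallow: /images/
Disallow: /members/login/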
The basic format for the robots.txt file is:
User-agent: ______
Disallow: ______
Here, the user-agent is the name of the robot being addressed, and the part that comes after “Disallow” contains the web page, folder or file that the robot must ignore while visiting the website. An asterisk (*) can be used instead of the name of a specific bot to address all the robots that might visit the website.
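For example, to address only Google’s main crawler, which identifies itself as Googlebot, a rule could be written like this (the directory name is a placeholder):
User-agent: Googlebot
Disallow: /private/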
In the following example, all crawlers are informed not to enter the listed directories:
User-agent: *
Disallow: /tmp/
Disallow: /junk/
While in this one, crawlers are informed to avoid a specific file:
User-agent: *
Disallow: /directory/file.html
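As a quick sanity check, rules like the ones above can be tested with Python’s standard urllib.robotparser module. The sketch below parses an in-memory robots.txt and asks whether a crawler may fetch a given URL; the bot name and example.com URLs are placeholders:
from urllib import robotparser
# Rules in the format discussed above (placeholder paths).
robots_txt = """\
User-agent: *
Disallow: /tmp/
Disallow: /junk/
Disallow: /directory/file.html
"""
parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())
# can_fetch() reports whether the named user-agent may crawl a URL.
print(parser.can_fetch("ExampleBot", "https://example.com/tmp/page.html"))     # False
print(parser.can_fetch("ExampleBot", "https://example.com/public/page.html"))  # True
This is only a convenient way to verify how the rules will be interpreted; the robots.txt file itself must still be uploaded to the root of the website so crawlers can find it.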