A search engine crawler, or spider, is a Web “robot” and will normally honor the robots.txt file (the Robots Exclusion Protocol) if one is present in the root directory of a website. The protocol was developed in 1994 and remains, to this day, the Internet’s de facto standard for controlling how search engine spiders access a particular website.
While the robots.txt file can be used to prevent access to certain parts of a website, if incorrectly implemented it can also prevent access to the whole site! On more than one occasion, I have found the Robots Exclusion Protocol (the robots.txt file) to be the main reason a site wasn't listed in certain search engines. If it isn't written correctly, it can cause all kinds of problems, and the worst part is that you will probably never discover this just by looking at your HTML code.
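A single misplaced character is enough to cause this. As a hypothetical illustration, a webmaster who intends to block only a single directory but writes:

```
User-agent: *
Disallow: /
```

has told every compliant robot to stay away from the entire site. The intended rule, blocking just one directory, would be something like `Disallow: /cgi-bin/` instead of the bare `/`, which matches every path on the site.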
As the name implies, the “Disallow” directive in a robots.txt file instructs a search engine’s robots to "disallow reading", but that certainly does not mean "disallow indexing". In other words, a disallowed resource may still be listed in a search engine’s index, even when the engine follows the protocol to the letter. Conversely, an allowed resource, such as the public (HTML) pages of a website, can be kept out of the index if the robots.txt file isn’t written carefully enough for the search engines to understand.
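The distinction is easy to demonstrate with Python's standard `urllib.robotparser` module (a minimal sketch using a made-up rule set): `can_fetch` only answers whether a robot may *read* a resource; it says nothing about whether the URL may appear in an index.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied as a list of lines so no
# network access is needed.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# "Disallow" governs reading (fetching), not indexing:
print(rp.can_fetch("*", "http://example.com/private/report.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))           # True
```

Nothing in the parser, or in the protocol itself, expresses "do not index"; that behavior comes entirely from how each search engine chooses to act on a disallowed URL.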
The most obvious demonstration of this is Google. Google can add files to its index without ever reading them, merely by considering the links that point to them. In theory, Google could build an index of an entire website without ever visiting that site or retrieving its robots.txt file.
In doing so, it is not violating the robots.txt protocol: it is not reading any disallowed resources, only other websites' links to those resources, links that Google constantly evaluates for its PageRank algorithm, among other things.
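The mechanism can be sketched as follows (a deliberately simplified illustration with made-up page data, not Google's actual algorithm): an indexer can record a URL and describe it using only the anchor text of pages that link to it, without ever fetching the target itself.

```python
# Simplified sketch: index a URL using anchor text found on OTHER
# pages, never fetching the target. All page data here is made up.
crawled_pages = {
    "http://partner-a.example/links.html": [
        ("http://target.example/secret.html", "annual sales report"),
    ],
    "http://partner-b.example/blog.html": [
        ("http://target.example/secret.html", "sales figures"),
    ],
}

index = {}  # url -> set of words describing it
for page, links in crawled_pages.items():
    for url, anchor_text in links:
        index.setdefault(url, set()).update(anchor_text.split())

# The target URL is now in the index, described purely by
# inbound anchor text, even though it was never read:
print(sorted(index["http://target.example/secret.html"]))
```

The disallowed page ends up searchable by the words others used to describe it, which is exactly why a robots.txt entry alone cannot keep a URL out of search results.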
Contrary to popular belief, a website does not necessarily need to be “read” by a robot in order to be indexed. As for how the robots.txt file can be used to keep a particular resource out of a search engine’s index: in practice, most search engines have layered their own interpretation onto the robots.txt file, which allows it to be used to prevent them from adding disallowed resources to their index.
Most modern search engines interpret a resource being disallowed by robots.txt as meaning they should not add it to their index. Conversely, if it is already in their index, placed there by previous crawling activity, they will normally remove it. This last point is important, and an example will illustrate this critical subject.
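Under that interpretation, the index-maintenance step might look like the following sketch (hypothetical index data and rules; real engines are of course far more involved): URLs that the current robots.txt disallows are purged from the index even though they were crawled and added in the past.

```python
from urllib.robotparser import RobotFileParser

# Current robots.txt for the site (hypothetical rules).
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /old-section/"])

# An index populated by earlier crawling activity (made-up data).
index = {
    "http://example.com/index.html": "home page",
    "http://example.com/old-section/page1.html": "stale page",
}

# Treat "disallowed" as "do not index": drop any entry the
# current robots.txt no longer permits.
for url in list(index):
    if not rp.can_fetch("*", url):
        del index[url]

print(sorted(index))  # only the allowed URL remains
```

The key design point is that the removal is driven by the *current* robots.txt, so adding a new Disallow line is enough to make a previously indexed page disappear on the next pass, for better or for worse.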
The inadequacies and limitations of the Robots Exclusion Protocol point to what can sometimes be a bigger problem: it is impossible to prevent a directly accessible resource on a site from being linked to by external sites, whether they are partner sites, affiliates, sites linked to competitors, or search engines.