Many people who run a search on a search engine don't understand how the results end up there. Some may think that sites are submitted, while others know that a piece of software finds the pages. This article explains one piece of that puzzle: the search engine crawler.

Today's search engines rely on software packages called spiders or robots. These automated tools search the web to discover new pages.

A brief history of search crawlers

The first crawler was the World Wide Web Wanderer, which appeared in 1993. It was developed at MIT, and its initial purpose was to measure the growth of the web. Soon after, however, an index was generated from the results, effectively the first "search engine."

Since then, crawlers have evolved and developed. Initially crawlers were simple creatures, only able to index specific bits of web page data such as meta tags. Soon, however, search engines realized that a truly effective crawler needs to be able to index other information, including visible text, alt tags, images and even non-HTML content such as PDFs, word processor documents and more.

How a crawler works

Generally, the crawler gets a list of URLs to visit and store. The crawler doesn't rank the pages; it only goes out and gets copies, which it stores or forwards to the search engine to index later and rank according to various criteria.

Search crawlers are also smart enough to follow links they find on pages. They may follow these links as they find them, or they may store them and visit them later.

To date there are literally dozens of crawlers regularly indexing the web. Some are specialized crawlers, such as image indexers, while others are more general and therefore better known.

Some of the best-known crawlers include Googlebot (from Google), MSNBot (from MSN) and Slurp (from Yahoo!). There is also the Teoma crawler (from Ask Jeeves), as well as an assortment of crawlers from other engines, such as shopping engines, blog search engines and more.

Generally, when a crawler comes to visit a site, it requests a file called "robots.txt." This file tells the search crawler which files it can request, and which files or directories it's not allowed to visit. The file can also be used to limit specific spiders' access to any or all of the site, and to control how often the crawler visits the site by limiting its speed or the times when it can visit. (Yahoo!'s Slurp and MSNBot both support the "Crawl-delay" directive, which tells the crawlers to slow down their crawling.)

It's not imperative that a site have a robots.txt file, however, as a crawler will assume it is OK to index the site if there isn't such a file.

Generally, today's crawlers are stripped-down versions of web browsers. Some, like Googlebot, see a site much as a text-based web browser such as Lynx would. Therefore one of the tools you can use to verify a site is the Lynx browser. By loading the site in the browser you can see essentially what the crawler "sees." You can then look for errors in the pages as well as any navigation problems the crawler may come up against.

One other thing you may notice, as you view your web server log reports, is that some crawlers visit many different times and with many different configurations.

Yahoo!'s Slurp, for example, emulates many different platforms, from Windows 98 to Windows XP, and many different browsers, from Internet Explorer to Mozilla. MSNBot also works like this, emulating different operating systems and browsers.

They do this to ensure compatibility; after all, the search engines want to be sure that the majority of their users find a site which they can use. Therefore, as a design tip, you should test your site against various platforms and browsers as well. You don't have to use the variety that the search engines use, but you should test against Internet Explorer, Netscape and Firefox. You should also try your site on other platforms, such as a Mac or Linux, just to ensure compatibility.

You may also notice, upon reviewing your reports, that crawlers like Googlebot will visit repeatedly and request the same page or pages over and over. This is common, as crawlers want to be sure the site is stable and also to measure how frequently the page changes.

If your site goes down temporarily when a crawler visits repeatedly like this, don't worry. The crawlers are smart enough to leave, come back later and try again. If, however, they continue to find the site down or slow to respond, they may opt to stay away for longer periods, or index the site more slowly. This can negatively impact your site's performance in the search engines.

As time goes on, we'd expect these spiders to become even more advanced. As new authoring technology or new indexing options become available, the search crawlers will be adapted. Remember, the goal of all the search engines is to have the most complete index of files found on the web. This means they want to be able to index more than just web pages.

So as you are designing your site, be sure to keep the crawlers in mind. Don't build your site for crawlers; build it for users, but be sure to test it thoroughly so that the crawlers see what you want them to without hindrances or roadblocks. Remember, the crawler is a site owner's best friend.
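
To make the description under "How a crawler works" more concrete, here is a minimal Python sketch of the fetch, store and follow-links loop. It is only an illustration, not how any particular search engine's crawler is built. The `FAKE_WEB` dictionary, the `LinkExtractor` class and the `crawl` function are hypothetical names invented for this example, and the "web" is an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" (an assumption for this sketch): URL -> HTML.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">Home</a> <a href="/missing.html">?</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit URLs breadth-first, store a copy of each page, queue new links.

    No ranking happens here; the crawler only collects copies.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    store = {}                # URL -> stored copy of the page
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page unavailable: skip it
            continue
        store[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)   # store the link and visit it later
    return store


pages = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(pages))
```

Here the links are queued and visited later, which is one of the two strategies the article mentions. A real crawler would replace `FAKE_WEB.get` with an HTTP fetch and hand the stored copies to a separate indexing and ranking stage.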
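
The robots.txt rules the article describes can also be checked programmatically. The sketch below uses Python's standard `urllib.robotparser` module to read a small robots.txt and ask what each crawler may fetch. The file contents, the `example.com` URLs and the `load_rules` helper are assumptions made up for this example; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Slurp
Crawl-delay: 10
Disallow: /private/
Disallow: /drafts/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Every crawler is kept out of /private/ by the "*" rule.
print(rules.can_fetch("Googlebot", "http://example.com/index.html"))
print(rules.can_fetch("Googlebot", "http://example.com/private/a.html"))

# Only Slurp is additionally kept out of /drafts/ and asked to slow down.
print(rules.can_fetch("Slurp", "http://example.com/drafts/a.html"))
print(rules.crawl_delay("Slurp"))
```

If a site has no robots.txt at all, a well-behaved crawler treats every URL as allowed, which matches the article's point that the file is optional.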