Anatomy Of A Search Engine Crawler

When people go to a search engine and perform a search, many don't understand how those results got there. Some think that sites are submitted, while others know that a piece of software finds the pages. This article explains one piece of that puzzle: the search engine crawler.

Today's search engines rely on software packages called spiders or robots. These automated tools search the web to discover new pages.

A brief history of search crawlers

The first crawler was the World Wide Web Wanderer, which appeared in 1993. It was developed at MIT, and its initial purpose was to measure the growth of the web. Soon after, however, an index was generated from the results: effectively the first "search engine."

Since then, crawlers have evolved. Initially crawlers were simple creatures, only able to index specific bits of web page data such as meta tags. Soon, however, search engines realized that a truly effective crawler needs to be able to index other information, including visible text, alt tags, images and even non-HTML content such as PDFs, word processor documents and more.

How a crawler works

Generally, the crawler gets a list of URLs to visit and store. The crawler doesn't rank the pages; it only goes out and gets copies, which it stores or forwards to the search engine to later index and rank according to various criteria.

Search crawlers are also smart enough to follow links they find on pages. They may follow these links as they find them, or store them and visit them later.

To date there are literally dozens of crawlers out regularly indexing the web. Some are specialized crawlers, such as image indexers, while others are more general and therefore better known.

Some of the best-known crawlers include Googlebot (from Google), MSNBot (from MSN) and Slurp (from Yahoo!). There is also the Teoma crawler (from Ask Jeeves), as well as an assortment of crawlers from other engines, such as shopping engines, blog search engines and more.

Generally, when a crawler comes to visit a site, it requests a file called "robots.txt." This file tells the search crawler which files it can request, and which files or directories it's not allowed to visit. The file can also be used to limit specific spiders' access to any or all of the site, and to control how often the crawler visits the site by limiting its speed or the times when it can visit. (Yahoo!'s Slurp and MSNBot both support the "Crawl-delay" directive, which tells the crawlers to slow down their crawling.)

It's not imperative that a site have a robots.txt file, however, as a crawler will assume it is OK to index the site if there isn't such a file.

Generally, today's crawlers are stripped-down versions of web browsers. Some, like Googlebot, see pages much as a text-based web browser such as Lynx does. Therefore, one of the tools you can use to verify a site is the Lynx browser. By loading the site in the browser you can see essentially what the crawler "sees." You can then look for errors in the pages as well as any navigation problems the crawler may come up against.

One other thing you may notice, as you view your web server log reports, is that some crawlers visit many times and with many different configurations.

Yahoo!'s Slurp, for example, emulates many different platforms, from Windows 98 to Windows XP, and many different browsers, from Internet Explorer to Mozilla. MSNBot also works like this, emulating different operating systems and browsers.

They do this to ensure compatibility. After all, the search engines want to be sure that the majority of their users find a site which they can use. Therefore, as a design tip, you should test your site against various platforms and browsers as well. You don't have to use the variety that the search engines use, but you should test against Internet Explorer, Netscape and Firefox. Also, you should try your site on other platforms such as a Mac or Linux just to ensure compatibility.

You may also notice, upon reviewing your reports, that crawlers like Googlebot will visit repeatedly and request the same page(s) each time. This is common, as crawlers want to be sure the site is stable and also to measure the page's change frequency.

If your site goes down temporarily when a crawler visits repeatedly like this, don't worry. The crawlers are smart enough to leave, come back later and try again. If, however, they continue to find the site down or slow to respond, they may opt to stay away for longer periods, or index the site more slowly. This can negatively impact your site's performance in the search engines.

As time goes on, we'd expect these spiders to become even more advanced. As new authoring technology or new indexing options become available, the search crawlers will be adapted. Remember, the goal of all the search engines is to have the most complete index of files found on the web. This means they want to be able to index more than just web pages.

So as you are designing your site, be sure to keep the crawlers in mind. Don't build your site for crawlers; build it for users. But be sure to test it thoroughly so that the crawlers see what you want them to, without hindrances or roadblocks. Remember, the crawler is a site owner's best friend.
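
As a rough illustration of the crawl loop and the robots.txt check described in this article, here is a minimal Python sketch. It is not how any particular search engine's crawler is implemented. The robots.txt contents, the in-memory pages on example.com, the "ExampleBot" user agent and the crawl function are all hypothetical and exist only for this example; the only library pieces used are Python's standard urllib.robotparser, html.parser and urllib.parse modules.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not from any real server).
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/":
        '<a href="/about.html">About</a> <a href="/private/notes.html">Notes</a>',
    "http://example.com/about.html":
        '<a href="/">Home</a> <a href="/tmp/cache.html">Cache</a>',
    "http://example.com/private/notes.html": "secret",
    "http://example.com/tmp/cache.html": "scratch",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, user_agent, robots):
    """Visit allowed pages breadth-first and return {url: stored copy}."""
    frontier = deque(seed_urls)  # URLs waiting to be visited
    seen = set(seed_urls)        # URLs already queued, to avoid repeats
    store = {}                   # copies a separate indexer would rank later
    while frontier:
        url = frontier.popleft()
        if not robots.can_fetch(user_agent, url):
            continue             # robots.txt disallows this path
        html = PAGES.get(url)
        if html is None:
            continue             # page missing; a real crawler might retry later
        store[url] = html        # store only; no ranking happens here
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

stored = crawl(["http://example.com/"], "ExampleBot", robots)
print(sorted(stored))               # pages the hypothetical bot was allowed to copy
print(robots.crawl_delay("Slurp"))  # delay in seconds requested for Slurp
```

In this sketch the pages under /private/ and /tmp/ are discovered through links but never stored, because the wildcard rules disallow them for "ExampleBot". Links are queued and visited later rather than followed immediately, which is one of the two strategies mentioned above.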