How to Control Search Engine Robots

Wouldn't it be nice to be able to leave some code in your web site to tell the search engine spider crawlers to make your site number one? Unfortunately, a robots.txt file or robots meta tag won't do that, but they can help the crawlers index your site better and block out the unwanted ones.

First, a few definitions:

Search Engine Spiders or Crawlers - A web crawler (also known as a web spider) is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches.

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.

Robots.txt - The robots exclusion standard, or robots.txt protocol, is a convention to prevent well-behaved web spiders and other web robots from accessing all or part of a website. The information specifying the parts that should not be accessed is placed in a file called robots.txt in the top-level directory of the website.

The robots.txt protocol is purely advisory and relies on the cooperation of the web robot, so marking an area of your site out of bounds with robots.txt does not guarantee privacy. Many web site administrators have been caught out trying to use the robots file to make private parts of a website invisible to the rest of the world. However, the file is necessarily publicly available and is easily checked by anyone with a web browser.

The robots.txt patterns are matched by simple substring comparisons against the start of the URL path, so care should be taken to make sure that patterns matching directories have the final '/' character appended; otherwise all files with names starting with that substring will match, rather than just those in the directory intended.

Meta Tag - Meta tags are used to provide structured data about data.

In the early 2000s, search engines veered away from reliance on meta tags, as many web sites used inappropriate keywords or were keyword stuffing to obtain any and all traffic possible. Some search engines, however, still take meta tags into some consideration when delivering results. In recent years, search engines have become smarter, penalizing websites that are cheating (by repeating the same keyword several times to get a boost in the search ranking). Instead of going up in the rankings, these websites will go down or, on some search engines, will be kicked off the search engine completely.

Index a site - The act of crawling your site and gathering information.

How can the robots.txt file and meta tag help you?

In the robots.txt file you can tell the harmful web crawlers to leave your website alone, and give helpful hints to the ones you want to crawl your site.

Here is an example of how to stop a web crawler from searching your site:

    # this identifies the wayback machine
    User-agent: ia_archiver
    Disallow: /

ia_archiver is the crawler name for the Wayback Machine, which you may have heard of, and the / after Disallow tells ia_archiver not to index any of your site. The #<message here> allows you to write comments to yourself so you can keep track of what you typed.

Type the above three lines into Notepad on your computer and save the file to the root directory of your web site as robots.txt. Web crawlers look for this document first at a web site before doing anything else. This helps the crawler do its job, and helps the web site owner tell the spider what to do.

Say, for instance, you have some data that you don't want the crawlers to see (like duplicate content for other browser referrer pages). You can deter crawlers from indexing the 'duplicate' directory by typing this into your robots.txt file:

    User-agent: *
    Disallow: /duplicate/

The * after User-agent says that this action applies to all crawlers, and /duplicate/ after Disallow tells all crawlers to ignore this directory and not search it. Each pair of User-agent and Disallow lines must be separated from the next pair by a blank line in order for the file to function correctly. So this is how you would combine the above two commands into a robots.txt file:

    # this identifies the wayback machine
    User-agent: ia_archiver
    Disallow: /

    User-agent: *
    Disallow: /duplicate/

If you would rather have the robots.txt file created for you, you can use an online robots.txt generator. To make sure your robots.txt file works properly, you can run it through an online validator.

One thing to note that is very important: anyone can access the robots.txt file of a site. So if you have information that you don't want anyone to see, don't list it in the robots.txt file. If the directory that you don't want anyone to see is not linked to from your web site, the crawlers won't index it anyway.

An alternative to blocking indexing of your site with robots.txt is to put a meta tag into the page. It looks like this:

    <meta name="robots" content="noindex,nofollow">

You put this into the <head> section of your web page. This line tells the robot crawlers not to index (search) the page and not to follow any of the hyperlinks on the page. As another example,

    <meta name="robots" content="noindex,follow">

tells the robot crawlers not to index the page, but to follow the hyperlinks on this page.

Did you know that Google has its own <meta> tag? It looks like this:

    <meta name="googlebot" content="noindex,nofollow,noarchive">

This tells the Google robot crawler not to index the page, not to follow any of the links, and not to store cached versions of your web site. You will want this done if you update the content on your site frequently. It prevents the web user from seeing outdated content that isn't refreshed because it is stored in the cache.

You can use the <meta> tag to talk specifically to Google's robots to avoid complications, or if you are optimizing your site for Google's search engine.

This concludes this month's article. Until the next article, have a great day!

Copyright Michael Rock
(You have permission to copy this article as long as it remains intact with the author's byline)

Web development contractor (Web Design and Hosting)
Internet Presence
The owner of this registered company has over twenty years' experience with DOS, Windows business applications, numerous programming languages, artistic development, and web design. Other areas of interest include web marketing, web promoting, and business marketing and development. After the persuasion of those praising his work, he decided to go into business himself and highly suggests everyone else do the same.
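
Addendum: checking the robots.txt rules yourself. If you would like to see how a well-behaved crawler might read the robots.txt examples in this article before you upload the file, the short Python sketch below may help. It uses the standard library's urllib.robotparser module. The site address www.example.com, the page paths, and the helper names build_parser and allowed are made up for illustration and are not part of the article; real crawlers may interpret some rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The combined robots.txt from the article.
ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

# Hypothetical site address, used only to build full URLs for the checks.
SITE = "http://www.example.com"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, agent: str, path: str) -> bool:
    """Return True if this parser would let `agent` fetch `path`."""
    return parser.can_fetch(agent, SITE + path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    checks = [
        ("ia_archiver", "/index.html"),        # blocked from the whole site
        ("Googlebot", "/index.html"),          # falls under the * group
        ("Googlebot", "/duplicate/page.html"), # inside the blocked directory
        ("Googlebot", "/duplicate.html"),      # not inside the directory
    ]
    for agent, path in checks:
        verdict = "allowed" if allowed(parser, agent, path) else "blocked"
        print(f"{agent:12} {path:24} {verdict}")

    # Without the final '/', the pattern also matches /duplicate.html.
    loose = build_parser("User-agent: *\nDisallow: /duplicate\n")
    print("no trailing slash:", allowed(loose, "Googlebot", "/duplicate.html"))
```

The last check illustrates the point about the final '/' character: with "Disallow: /duplicate" the parser also blocks /duplicate.html, while "Disallow: /duplicate/" only blocks pages inside the directory.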