How to Control Search Engine Robots

Wouldn't it be nice to be able to leave some code in your web site to tell the search engine spider crawlers to make your site number one? Unfortunately a robots.txt file or robots meta tag won't do that, but they can help the crawlers index your site better and block out the unwanted ones.

First, a few definitions:

Search Engine Spiders or Crawlers - A web crawler (also known as a web spider) is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches.

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.

Robots.txt - The robots exclusion standard or robots.txt protocol is a convention to prevent well-behaved web spiders and other web robots from accessing all or part of a website. The parts that should not be accessed are specified in a file called robots.txt in the top-level directory of the website.

The robots.txt protocol is purely advisory, and relies on the cooperation of the web robot, so marking an area of your site out of bounds with robots.txt does not guarantee privacy. Many web site administrators have been caught out trying to use the robots file to make private parts of a website invisible to the rest of the world. However, the file is necessarily publicly available and is easily checked by anyone with a web browser.

The robots.txt patterns are matched by simple prefix comparison against the URL path, so care should be taken to make sure that patterns matching directories have the final '/' character appended: otherwise all files with names starting with that string will match, rather than just those in the directory intended.

Meta Tag - Meta tags are used to provide structured data about data.

In the early 2000s, search engines veered away from reliance on Meta tags, as many web sites used inappropriate keywords, or were keyword stuffing to obtain any and all traffic possible. Some search engines, however, still take Meta tags into some consideration when delivering results. In recent years, search engines have become smarter, penalizing websites that are cheating (by repeating the same keyword several times to get a boost in the search ranking). Instead of going up in the rankings, these websites will go down or, on some search engines, will be kicked off the search engine completely.

Index a site - The act of crawling your site and gathering information.

How can the robots.txt file and meta tag help you?

In the robots.txt file you can tell the harmful 'web crawlers' to leave your website alone, and give helpful hints to the ones you want to crawl your site.

Here is an example of how to disallow a web crawler from searching your site:

    # this identifies the wayback machine
    User-agent: ia_archiver
    Disallow: /

ia_archiver is the crawler name for the wayback machine that you may have heard of, and the / after disallow tells ia_archiver not to index any of your site. The #<message here> allows you to write comments to yourself so you can keep track of what you typed.

Type the above three lines into Notepad on your computer and save the file to the root directory of your web site as robots.txt. Web crawlers look for this document first at a web site before doing anything else. This helps the crawler do its job, and helps the web site owner tell the spider what to do.

Say for instance you have some data that you don't want the crawlers to see (like duplicate content for other browser referrer pages). You can deter crawlers from indexing the 'duplicate' directory by typing this into your robots.txt file:

    User-agent: *
    Disallow: /duplicate/

The * after user-agent says that this action applies to all crawlers, and /duplicate/ after disallow tells all crawlers to ignore this directory and not search it. Each group of user-agent and disallow lines must be separated from the next group by a blank line in order for the file to function correctly. So this is how you would combine the above two commands into one robots.txt file:

    # this identifies the wayback machine
    User-agent: ia_archiver
    Disallow: /

    User-agent: *
    Disallow: /duplicate/

Or if you would like to have the robots.txt file created for you, visit [link]. To validate your robots.txt file and make sure it works properly, you can visit [link].

One thing to note that is very important: anyone can access the robots.txt file of a site. So if you have information that you don't want anyone to see, don't list it in the robots.txt file. If the directory that you don't want anyone to see is not linked to from your web site, the crawlers won't index it anyway.

An alternative way to block indexing of your site is to put a meta tag into the page. It looks like this:

    <meta name="robots" content="noindex,nofollow">

You put this into the <head> tag of your web page. This line tells the robot crawlers not to index (search) the page and not to follow any of the hyperlinks on the page. As another example, <meta name="robots" content="noindex,follow"> tells the robot crawlers not to index the page, but to follow the hyperlinks on this page.

Did you know that Google has its own <meta> tag? It looks like this:

    <meta name="googlebot" content="noindex,nofollow,noarchive">

This tells the Google robot crawler not to index the page, not to follow any of the links, and not to store cached versions of your web site. You will want this done if you update the content on your site frequently. This prevents the web user from seeing outdated content that isn't refreshed because of storage in the cache.

You can use the <meta> tag to talk specifically to Google's robots to avoid complications, or if you are optimizing your site for Google's search engine.

This concludes this month's article. Until the next article, have a great day!

Copyright Michael Rock
(You have permission to copy this article as long as it remains intact with the author's byline)

Web development contractor (Web Design and Hosting)
Internet Presence

The owner of this registered company has over twenty years of experience with DOS, Windows business applications, numerous programming languages, artistic development, and web design. Other areas of interest include web marketing, web promoting, and business marketing and development. After the persuasion of those praising his work, he decided to go into business himself and highly suggests everyone else do the same.
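Appendix: checking a robots.txt file by hand. The two-group robots.txt file from the article can be tested locally before you upload it. What follows is only a minimal sketch, assuming Python 3 and its standard urllib.robotparser module; the host www.example.com, the crawler name Googlebot, and the page paths are made-up examples, not part of the article. It also shows why the trailing '/' on a directory pattern matters.

```python
from urllib.robotparser import RobotFileParser

# The combined robots.txt from the article (blank line between the two groups).
ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

# Hypothetical site root, used only to build full URLs for the checks.
SITE = "http://www.example.com"


def build_parser(text):
    """Parse robots.txt text from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)


def allowed(agent, path):
    """Return True if a well-behaved crawler named `agent` may fetch `path`."""
    return parser.can_fetch(agent, SITE + path)


# ia_archiver is shut out of the whole site.
print(allowed("ia_archiver", "/index.html"))
# Any other crawler (Googlebot is just an example name) is kept out of /duplicate/ only.
print(allowed("Googlebot", "/duplicate/page.html"))
print(allowed("Googlebot", "/index.html"))
# The trailing slash keeps similarly named files outside the directory crawlable.
print(allowed("Googlebot", "/duplicate-old.html"))

# Without the trailing slash, the same pattern also catches /duplicate-old.html.
loose = build_parser("User-agent: *\nDisallow: /duplicate\n")
print(loose.can_fetch("Googlebot", SITE + "/duplicate-old.html"))
```

Keep in mind that this only models a crawler that chooses to obey the file; as the article notes, robots.txt is advisory and does not hide anything from a visitor or a badly behaved robot.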