At times, you may want some of your web pages to stay out of the search engine results page (SERP). Perhaps the page is under construction, or it's a semi-private page you would like to share only with a small community. Here I discuss ways to keep search engine robots (crawlers) from visiting a page, or the links on a page, using the Robots Exclusion Protocol (REP).
To block search engine robots from indexing a particular page, use the meta tag robots with the content value noindex.
<meta name="robots" content="noindex" />
Alternatively, if you'd like the web page to be indexed but want to suggest that the search robot not follow any of the links on the page, use the nofollow content value.
<meta name="robots" content="nofollow" />
More content values for the robots meta tag:

| Content Value | Description | Supported By |
|---|---|---|
| noindex | Do not index the web page | Google, Yahoo, Ask, MSN Live |
| index | Index the web page (the default) | |
| nofollow | Do not follow/visit any link on the web page | Google, Yahoo, Ask, MSN Live |
| follow | Follow all the links on the web page (the default) | |
| noarchive | Do not cache the web page | Google, Yahoo, Ask, MSN Live |
| nosnippet | Do not auto-generate the description based on page content | |
| noodp | Do not override the description or title with the Open Directory Project listing [home page only] | Google, Yahoo, MSN Live |
| noydir | Do not override the description or title with the Yahoo Directory listing | Yahoo |
You can also have combinations of content values (of course the combinations should make sense).
<meta name="robots" content="noindex, follow" />
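If you want to verify which robots directives a page actually declares, a short script can pull them out of the HTML. Here is a minimal sketch using Python's standard-library html.parser (the sample HTML string is just an illustration):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the content values of <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta" and a.get("name", "").lower() == "robots":
            # content may hold a comma-separated list, e.g. "noindex, follow"
            self.directives += [v.strip().lower()
                                for v in a.get("content", "").split(",")]

# Hypothetical page head used for illustration
html = '<html><head><meta name="robots" content="noindex, follow" /></head></html>'
p = RobotsMetaParser()
p.feed(html)
print(p.directives)  # ['noindex', 'follow']
```

The same parser instance can be fed any fetched page source; each robots meta tag it encounters appends its values to the list.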
Now, if you want to keep search robots away from multiple web pages of your website, you can make use of a robots.txt file, which is placed in the top-level (root) directory of your web site.
Here is the basic syntax of a robots.txt file:
User-Agent: *
Disallow: /
In the above syntax, User-Agent identifies the search robot, and * refers to all search robots. You can also specify a robot's name here to address a particular search engine robot; refer to the User-Agent strings published by the major search engines.
To restrict a certain directory of your website:
User-Agent: *
Disallow: /Songs
To restrict a particular robot from visiting a directory of your website:
User-Agent: Googlebot/2.1
Disallow: /Songs
If you are addressing multiple search robots in your robots.txt, make sure the record for a specific User-agent is specified before the catch-all (*) record.
# Disallow Googlebot from visiting any web page/content under /Songs/private
User-Agent: Googlebot/2.1
Disallow: /Songs/private
# Disallow all other bots from visiting any web page/content under /Songs
User-Agent: *
Disallow: /Songs
More robots.txt directives:

| Directive | Description | Supported By |
|---|---|---|
| Disallow | Do not visit the specified web page | All search robots |
| Allow | Allow visiting the particular web page | Google, Yahoo, Ask, MSN Live |
| Sitemap | Location of your sitemap; the location of a sitemap index file can also be included here | Google, Yahoo, Ask, MSN Live |
| Wildcards (*, $) | * matches any sequence of characters; $ matches the end of the URL | Google, Yahoo, MSN Live |
| Crawl-Delay | Specifies the minimum delay between two successive requests made by the search robot | Ask |
Robots.txt quick tips
Q. To allow all search engine spiders to index all the files of your website
Your robots.txt file should look like the below:
User-agent: *
Disallow:
Q. To disallow all spiders from indexing any file
Your robots.txt file should look like the below:
User-agent: *
Disallow: /
Note: The slash '/' here refers to your root directory; by adding it to the Disallow statement, you are restricting spiders from indexing all the files of your website.
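The two quick-tip files above can be verified the same way with Python's standard-library urllib.robotparser (a sketch; the URL is a placeholder):

```python
import urllib.robotparser

# Tip 1: empty Disallow lets every spider fetch everything
allow_all = urllib.robotparser.RobotFileParser()
allow_all.parse(["User-agent: *", "Disallow:"])

# Tip 2: Disallow: / blocks every spider from everything
block_all = urllib.robotparser.RobotFileParser()
block_all.parse(["User-agent: *", "Disallow: /"])

print(allow_all.can_fetch("AnyBot", "https://example.com/page.html"))  # True
print(block_all.can_fetch("AnyBot", "https://example.com/page.html"))  # False
```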
For specific questions related to writing robots.txt for your website, please reach me at bhawnablog@gmail.com