
Wednesday, March 11, 2009

Keeping search robots away!

At times, you may want some of your web pages not to appear in the search engine results page (SERP): perhaps a page is under construction, or it is a semi-private page you want to share only with a small community, or there is some other reason. Here I discuss ways, supported by the Robots Exclusion Protocol (REP), to keep search engine robots (crawlers) from visiting a page or from following a link on a page.

To block search engine robots from indexing a particular page, use the robots meta tag with the content value noindex.

<meta name="robots" content="noindex" />

Alternatively, if you would like the web page to be indexed but want to suggest that the search robot not follow any of the links on the page, use the nofollow content value.

<meta name="robots" content="nofollow" />

More content values for the robots meta tag -

noindex - Do not index the web page. Supported by Google, Yahoo, Ask, MSN Live.
index - Index the web page.
nofollow - Do not follow/visit any link on the web page. Supported by Google, Yahoo, Ask, MSN Live.
follow - Follow all the links on the web page.
noarchive - Do not cache the web page. Supported by Google, Yahoo, Ask, MSN Live.
nosnippet - Do not auto-generate the description (snippet) from the page content. Supported by Google.
noodp - Do not overwrite the description or title tag content with the Open Directory Project entry [home page only]. Supported by Google, Yahoo, MSN Live.
noydir - Do not overwrite the description or title tag content with the Yahoo! Directory entry. Supported by Yahoo.

You can also have combinations of content values (of course the combinations should make sense).

<meta name="robots" content="noindex, follow" />

Now, if you want to keep search robots away from multiple web pages of your website, you can make use of a robots.txt file, which is placed in the top-level (root) directory of your website.

Here is the syntax of the robots.txt file-

User-Agent: *
Disallow: /

In the above syntax, User-Agent identifies the search robot, and * refers to all search robots. You can also specify a search robot's name here to address a particular search engine robot. Refer to the User-Agent strings of the major search engines.

To restrict a certain directory of your website -

User-Agent: *
Disallow: /Songs

To restrict a particular robot from visiting a web directory -

User-Agent: Googlebot/2.1
Disallow: /Songs

If you are addressing multiple search robots in your robots.txt, make sure that the directives for a specific User-Agent are specified before the general (wildcard) ones.

#Disallow Google bot from visiting any webpage/ content under /Songs/private
User-Agent: Googlebot/2.1
Disallow: /Songs/private

#Disallow all other bots from visiting any web page/ content under /Songs
User-Agent: *
Disallow: /Songs

More robots.txt directives -

Disallow - Do not visit the specified web page or directory. Supported by all search robots.

Allow - Allow a particular page or subfolder inside a disallowed directory to be visited. For example,
Disallow: /Songs
Allow: /Songs/Favs
restricts search robots from visiting everything under Songs other than the Favs subfolder. Supported by Google, Yahoo, Ask, MSN Live.

Sitemap - Location of your sitemap, for example,
Sitemap: http://yourwebsite.com/sitemap_location.xml
The location of a sitemap index file can also be included here. Supported by Google, Yahoo, Ask, MSN Live.

Wildcards (* and $) - The wildcard * matches any sequence of characters, e.g. Disallow: /Songs/*personal*, and the wildcard $ matches the end of the URL, e.g. Disallow: /Songs/*.mp3$. Learn more on pattern matching. Supported by Google, Yahoo, MSN Live.

Crawl-Delay - Specifies the minimum delay between two successive requests made by a search robot. Supported by Ask.
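
Putting these directives together, here is a small illustrative robots.txt. The paths reuse the /Songs examples from above, and the 10-second crawl delay is just a placeholder value, not a recommendation.

User-Agent: *
# Block the private area of /Songs, but allow the Favs subfolder back in
Disallow: /Songs/private
Allow: /Songs/private/Favs
# Wildcards: block URLs under /Songs containing "personal" or ending in .mp3
Disallow: /Songs/*personal*
Disallow: /Songs/*.mp3$
# Minimum delay (in seconds) between two successive requests; honored by Ask
Crawl-Delay: 10
# Location of the sitemap (a sitemap index file can also be listed here)
Sitemap: http://yourwebsite.com/sitemap_location.xml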


robots.txt Quick Tips


Q. To allow all search engine spiders to index all the files of your website

Your robots.txt file should look like the one below:
User-agent: *
Disallow:

Q. To disallow all spiders from indexing any file

Your robots.txt file should look like the one below:
User-agent: *
Disallow: /

Note: The slash '/' here refers to your root directory; by adding it to the Disallow statement you are restricting spiders from indexing all the files of your website.

For specific questions related to writing robots.txt for your website, please reach me at bhawnablog@gmail.com

Sunday, March 8, 2009

Organic SEO Best Practices Checklist

The DOs

Title Tag

<title>keyword in the Title</title>

  • < 60-65 characters, including spaces.
  • The first three words, in any combination, should form the keyword phrase.

Image Tag

<img src="" alt="keyword in the alternate text"/>

  • Alt is another opportunity to add keywords
  • Add only image related text in Alt
  • Create short and meaningful alt text

Anchors

<a href="link to a related website">keyword</a>

  • Text based links
  • As long as it is important keyword text and the link is relevant, anchor it
  • Avoid broken links
  • Use anchor text for linking to relevant dynamic content (crawlers will favor you)

Meta Tags

<meta name="keyword" content="related keyword list"/>

<meta name="description" content="short description of your website with few keywords"/>

  • <200 characters (description)
  • Do not repeat exact title in description
  • Avoid keyword repetition

Header Tags

<h1></h1>,

<h2></h2>,

<h3></h3>,

<h4></h4>

  • <h1> is the most important, <h4> the least important
  • Use <h1> twice at most

URL

Parameter: http://www.yoursite.com/products.jsp?id=12356&category=7&type=42&size=6&batch=65

Depth: http://www.yoursite.com/products/category/batch/season/item

  • Max parameters: 2
  • Max depth: 4

Inbound Links

  • The more strong inbound links, the better
  • Request links, write articles that link to your site, PR, social network communities, paid links

Visible Body Text

<body>body text</body>

  • Use <strong></strong> for relevant keywords
  • Use <em></em> for relevant keywords

Sitemap

XML-based document at the root of your website, e.g.

http://www.yourwebsite.com/sitemap.xml

Learn more about writing a sitemap (a minimal example is sketched below).

  • < 50,000 URLs and 10 MB per sitemap
  • < 1,000 sitemaps per website, if multiple sitemaps are used
  • Submit the sitemap to the search engines
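
A minimal sketch of a sitemap.xml; the URL, date, change frequency, and priority are placeholder values:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- One <url> entry per page you want the search engines to find -->
    <loc>http://www.yourwebsite.com/products.html</loc>
    <lastmod>2009-03-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>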

Navigation

  • Text-based navigation on the left

Domain Name

  • If it is a new website, try using a keyword in the domain name to specifically define the content of your website

Bread-crumb trail

A trail on your web page that shows how deep the current page sits in the site hierarchy, e.g.

Home > kids > Toys > 4T – 5T

  • Use your website name instead of HOME (see the sketch below)
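
A minimal sketch of such a trail as plain text links, with the website name in place of HOME; the URLs and the YourSiteName label are placeholders:

<a href="http://www.yoursite.com/">YourSiteName</a> &gt; <a href="http://www.yoursite.com/kids/">Kids</a> &gt; <a href="http://www.yoursite.com/kids/toys/">Toys</a> &gt; 4T - 5T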

FAQs

FAQs of popular searches related to your product

  • 4-15 questions
  • 200-800 words for each FAQ.

Popular Search List

Maintain a list of popular / most frequently searched items or keywords related to your product or website on every web page.

  • The search list should be relevant and should link to pages in your website.

Robots.txt

Learn more about robots.txt

  • Suggests to crawlers which pages to crawl and which to avoid
  • Helps in logging search engine visits

Company Address

Company Name, Street Address, City, State, Zip, Country, Phone#

  • Provides visibility in location search

Contact Statement

If you need more info, please contact ……

  • Instead, write:

    If you need more info about <yourproductname>, please contact …..


The DON'Ts and the Fix

Duplicate URLs

  • Use the canonical link tag in the <head> section of all the duplicate pages to point to the original web page:

    <link rel="canonical" href="http://yourwebsite.com/product.html"/>

Broken Links

  • Fix them!
  • Add the rel="nofollow" attribute to a link to suggest that search crawlers not visit it (see the sketch below)
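
A minimal sketch of a nofollow link; the URL is a placeholder:

<a href="http://yourwebsite.com/old-page.html" rel="nofollow">old page</a>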

Cookies

  • Do not restrict content based on cookies
  • Serve some default content when the cookie is unavailable.

Session Ids

  • Generate a guest user and allow it to view the unrestricted content

Frames

  • Provide an alternative to the framed website using the <noframes> tag (see the sketch below).
  • The <noframes> content should be exactly the same as the framed site's content
  • Add a link to HOME, with the attribute target="_top"
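
A rough sketch of a frameset page with a <noframes> alternative; the file names are placeholders:

<frameset cols="25%,75%">
  <frame src="menu.html" />
  <frame src="content.html" />
  <noframes>
    <body>
      <!-- Same content as the framed pages, for crawlers and browsers without frame support -->
      <a href="http://yourwebsite.com/" target="_top">HOME</a>
      Body text identical to the framed content goes here.
    </body>
  </noframes>
</frameset>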

302 redirect

  • Avoid, if possible
  • Use robots.txt to keep crawlers away from the redirecting link, if possible
  • Add a delay of more than 15 seconds before redirecting (see the sketch below)
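
If the redirect is done on the page itself with a meta refresh, the delay can be added like this; the 16-second value and target URL are only placeholders:

<meta http-equiv="refresh" content="16;url=http://yourwebsite.com/new-page.html" />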