Thursday, August 02, 2007

Google can't index the entire web

It's sometimes hard for people to think about the Internet without automatically thinking of Google. But Dan Crow of Google's Crawl Infrastructure Group gave this sobering message last month in his interview with Jonathan Hochman:
"...the World Wide Web is very large, and Google is not even sure how large. We can only index a fraction of it. Google has plenty of capital to buy more computers, but there just isn't enough bandwidth and electricity available in the world to index the entire Internet."
That leaves Google with a massive dilemma: which pages should it index and which should it ignore? According to Dan, PageRank plays a large role. If your site has relatively few pages and they all have high PageRank, it's likely they'll all be indexed without any problem. However, if you have a large number of pages with low PageRank, you will probably find that many of them don't make the cut.

So that just leaves the $64,000 question: what can you do to give your web pages the best possible chance of being indexed? Jonathan was convinced that the following aspects have an impact on a page's indexability:

- Clean, valid HTML code
- Use of external CSS and external JavaScript files
- No code bloat

During the interview, Jonathan asked Dan outright whether these things would help a page get indexed, and Dan agreed that they would. Pages with clean code load faster and take less bandwidth to index.
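
As a rough illustration of the code-bloat point, here is a small Python sketch that estimates how much of a page's markup is inline CSS and JavaScript that could be moved to external files. It is just a quick check you might run on your own pages, not anything Google is known to use, and the names InlineCodeCounter and inline_code_ratio are made up for this example.

```python
from html.parser import HTMLParser


class InlineCodeCounter(HTMLParser):
    """Counts characters inside inline <style> and <script> blocks."""

    def __init__(self):
        super().__init__()
        self._inside = None      # tag name of the inline block we are in, if any
        self.inline_chars = 0    # running total of inline CSS/JS characters

    def handle_starttag(self, tag, attrs):
        # <script src="..."> points at an external file, so it is not counted.
        if tag == "style" or (tag == "script" and "src" not in dict(attrs)):
            self._inside = tag

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        if self._inside:
            self.inline_chars += len(data)


def inline_code_ratio(html):
    """Return the fraction of the page (by characters) that is inline CSS/JS."""
    if not html:
        return 0.0
    counter = InlineCodeCounter()
    counter.feed(html)
    counter.close()
    return counter.inline_chars / len(html)


if __name__ == "__main__":
    page = "<html><head><style>body{color:red}</style></head><body><p>Hi</p></body></html>"
    print(f"Inline CSS/JS share of page: {inline_code_ratio(page):.0%}")
```

A page that pulls its styles and scripts from external files scores 0 here, and a browser or crawler that has already fetched those files does not need to download them again with every page.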

Looks like it's time to go clean up that sloppy code!
