Tuesday, November 24, 2009

Search Engine for Dummies part 1

What would the Internet be without search engines? You have a humongous amount of data, but without a tool to find what you need, you would be lost.

Today you probably take it for granted that Google is the most popular search engine. But that wasn't always the case. If you've been around on the Internet long enough (i.e. a decade or more), you probably remember the days before Google became a household name. Before the turn of the new millennium, the most popular search engine was Altavista.

Back then Altavista was considered cutting-edge and the pioneer of a truly usable Internet search engine. It was one of the top web destinations. But by the early 2000s, its popularity quickly dropped due to the arrival of a new kid on the block: Google.

When Google entered the search engine market, many people thought the market was already saturated. There were already a bunch of search engines (you may recall names such as Lycos, Magellan, InfoSeek, Excite, etc.). But Google proved them wrong. It rose steadily in prominence, and soon enough it grabbed the top spot and has remained there ever since.

So how does Google do it? And where did Altavista and those other search engines fail?

All search engines begin with a web crawler. It is a piece of software which automatically crawls the web, collecting information about every page it encounters. This data is then stored and indexed in the search engine's database, making it available for search queries from users.
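To make that concrete, here is a minimal crawler sketch in Python. It's only an illustration of the idea, not how any real search engine is built: the page limit, the seed URL handling, and the error handling are assumptions I've made to keep it short.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl starting from seed_url.

    Returns a dict mapping each visited URL to its raw HTML, which a
    real engine would then index and store for later queries.
    """
    visited = {}
    queue = deque([seed_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip pages that fail to download
        visited[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return visited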

The quality of a search engine, from the user's point of view, depends on the relevance of the result set it returns for a given search term. The primary difference between old-generation search engines (Altavista et al.) and newer ones is the method used to determine the most relevant web pages to put in that result set.

The old search engines' method is based on the textual content of the web pages. In this method, a page's relevance to a search term is calculated from how many times that search term occurs in the page. For example, if you search for "Nuclear Weapon", the search engine would likely place a page where this term occurs many times at the top of the search results.
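Here is a toy version of that term-frequency ranking, again in Python. The example pages and query are made up for illustration; a real engine would score against its crawled index, not a hard-coded dict.

import re


def term_frequency(term, text):
    """Count case-insensitive occurrences of a term in a page's text."""
    return len(re.findall(re.escape(term), text, flags=re.IGNORECASE))


def rank_pages(query, pages):
    """Rank pages by how often the query appears in each one.

    `pages` maps URL -> page text; the result is a list of (url, score)
    pairs sorted from most to least "relevant" under this naive scheme.
    """
    scores = {url: term_frequency(query, text) for url, text in pages.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


pages = {
    "a.example/history": "The nuclear weapon was first developed during World War II...",
    "b.example/physics": "Fission and fusion are the principles behind a nuclear weapon...",
}
print(rank_pages("nuclear weapon", pages))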

That sounds reasonable, right? But it turns out that this simple method has serious flaws. Suppose you create a page which contains nothing but the term "Nuclear Weapon" repeated a dozen times. This page would rank high in the result set, despite the fact that it's a useless page. This is why results returned using this method are typically of low quality. How do you improve the quality then?
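You can see the flaw directly with the toy ranker sketched above: feed it a page stuffed with nothing but the search term, and it beats a genuinely informative article.

pages = {
    "good.example/article": (
        "A nuclear weapon derives its destructive force from nuclear reactions. "
        "This article covers the history, physics, and politics of these weapons."
    ),
    "spam.example/stuffed": "nuclear weapon " * 12,  # nothing but the term, repeated a dozen times
}
# The stuffed page scores 12 versus 1 for the real article, so it ranks first.
print(rank_pages("nuclear weapon", pages))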

to be continued...
