Search Engines
This chapter gives a general overview of how a search engine works and what it does. The following information holds, in general, for all search engines. The development of the internet led to the development of search engines, which is explained here with reference to the most important search engines.
How does a search engine work?
General
As a searcher, you simply type your query into the search interface of a search engine, e.g. Google, and expect it to return results that fit your query. Behind the scenes, however, a search engine has to carry out many processes in order to prepare these results.
Search Engine Spider
A search engine spider is a software program whose purpose is to download as many webpages as possible, as quickly as possible, for the search engine to index and rank.
For a search engine to offer a reasonable amount of searchable information, it has to index, or at least try to index, the entire web. For this purpose, every search engine has its own spider, which is the technical basis for downloading websites.
Search engine spiders identify themselves by their User-Agent strings and by the IP addresses they use. Google's spider, for example, is called “Googlebot”, and Yahoo! calls its spider “Slurp”.
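As a minimal sketch of how a web server could recognise a spider from its User-Agent string, the following function looks for a known substring. The table `KNOWN_SPIDERS` and the function name `identify_spider` are illustrative assumptions, and a User-Agent string can be set to anything by the client, which is one reason the IP address is also considered.

```python
# Hypothetical lookup table: substring of the User-Agent -> search engine.
# Only the two spiders named in the text are listed.
KNOWN_SPIDERS = {
    "googlebot": "Google",
    "slurp": "Yahoo!",
}


def identify_spider(user_agent):
    """Return the search engine a User-Agent string claims to belong to,
    or None if no known spider name occurs in it."""
    lowered = user_agent.lower()
    for token, engine in KNOWN_SPIDERS.items():
        if token in lowered:
            return engine
    return None
```

A call such as `identify_spider("Mozilla/5.0 (compatible; Googlebot/2.1)")` would report Google, while an ordinary browser string yields `None`.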
A search engine spider follows the links on the internet. It hardly matters where it starts, because almost every website contains links to other webpages, so the number of pages known to the spider grows quickly.
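The link-following idea can be sketched as a breadth-first traversal. This is a toy illustration, not how any particular spider is implemented: `TOY_WEB` is a made-up in-memory link graph that stands in for real downloads, and `crawl`, `toy_fetch_links` and `max_pages` are assumed names.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the pages it links to.
TOY_WEB = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/", "d.example/"],
    "c.example/": ["a.example/"],
    "d.example/": ["b.example/"],
}


def toy_fetch_links(page):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(page, [])


def crawl(start, fetch_links, max_pages=100):
    """Follow links breadth-first from `start` and return the pages
    in the order in which they were visited."""
    seen = {start}          # pages already queued, so none is visited twice
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

In this toy graph, `crawl("a.example/", toy_fetch_links)` and `crawl("c.example/", toy_fetch_links)` visit the same four pages, only in a different order.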
Problems for Search Engine Spiders
- The spider needs to download every webpage again and again to keep the index up to date
- A webpage may not be accessible to the spider, e.g. because it requires a login, is password protected or is only reachable through a database
- Fresh content, e.g. news, has to be picked up in time
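The first and last problems can be read as a scheduling question: which pages are due for another download, and which of them first? The sketch below is one simple way to express that, assuming each page carries a time of last download and a revisit interval (short for news, long for rarely changing pages). The tuple layout, the time units and the name `revisit_order` are assumptions made for illustration.

```python
def revisit_order(pages, now):
    """Return the URLs that are due for re-download, most overdue first.

    `pages` is a list of (url, last_download_time, revisit_interval)
    tuples; all times are plain numbers in the same (assumed) unit.
    """
    due = []
    for url, last_download, interval in pages:
        overdue_by = now - (last_download + interval)
        if overdue_by >= 0:
            due.append((overdue_by, url))
    # Largest overdue time first; ties are broken by URL for a stable result.
    due.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in due]
```

With a short interval, a news page becomes due again soon after each download, whereas a page with a long interval is left alone for most passes.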
The websites found by the search engine spider are saved in the so-called index. A lot of data is then available, but the problem with this data is that it has not yet been sorted into search results.
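The text does not say how the index is organised. One common structure is an inverted index, which maps each word to the pages that contain it. The following is a minimal sketch under that assumption; `build_index` and `lookup` are illustrative names, and the lookup deliberately returns an unordered set, since ranking is a separate step.

```python
from collections import defaultdict


def build_index(pages):
    """Build an inverted index from a dict of {url: page text}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the set of URLs that contain every word of the query.
    The result is not ranked."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

For example, after indexing two pages that both contain the words "the web", `lookup(index, "the web")` returns both URLs without saying which one should be shown first.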
Algorithm
This data is sorted into search results by a complicated algorithm whose details are known only to the engineers of the respective search engine.
Most algorithms take both on-the-page and off-the-page factors into consideration, but since on-the-page factors are easy to manipulate, the algorithm relies mainly on link analysis to determine the ranking on its results pages.
Link relevancy depends not only on the number of links, but also on the position, age, topic, link text and neighbourhood of each link. The part of the algorithm that deals with link relevancy is the key to good search engine results.
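Since the real algorithms are not public, the following is only a simplified, PageRank-style sketch of link analysis and not the method of any particular search engine. It scores a page by the links pointing to it, weighted by the score of the linking page, and ignores position, age, topic and link text entirely. The function name `link_scores` and the parameters `damping` and `iterations` are assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Compute a PageRank-style score for every page.

    `links` maps each page to the list of pages it links to.
    The scores of all pages sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base score regardless of its links.
        new_score = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to its targets.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                for other in pages:
                    new_score[other] += damping * score[page] / n
        score = new_score
    return score
```

In a graph where pages "a" and "b" both link to "c" and "c" links back to "a", page "c" ends up with the highest score and "b", which nobody links to, with the lowest.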
Hardware
To keep a search engine running smoothly, a lot of powerful hardware is needed. It is mainly used for calculating the search engine results pages (SERPs) with the given algorithm and for storing the websites the spider has collected.
Search engines typically work with a cluster of inexpensive computers and a large amount of storage.
If one or several of the computers fail, the information is still stored in other parts of the network and the algorithm continues to work, so the search results can still be delivered to the user.
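The idea of surviving a failed computer can be sketched with simple replication: every piece of data is stored on more than one node, and a read falls back to another copy if the first node is down. This is a toy model under stated assumptions, not a description of any real search engine's storage layer; the class `Cluster`, its methods and the placement rule based on `zlib.crc32` are all hypothetical.

```python
import zlib


class Cluster:
    """Toy cluster that keeps `replicas` copies of every stored item."""

    def __init__(self, node_count, replicas=2):
        if not 1 <= replicas <= node_count:
            raise ValueError("replicas must be between 1 and node_count")
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count
        self.replicas = replicas

    def homes(self, key):
        """Return the indices of the nodes that hold copies of `key`."""
        first = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def store(self, key, value):
        # Simplification: writes always reach every replica node.
        for node in self.homes(key):
            self.nodes[node][key] = value

    def fail(self, node):
        """Mark a node as failed; its data becomes unreachable."""
        self.alive[node] = False

    def fetch(self, key):
        """Read `key` from the first replica node that is still alive."""
        for node in self.homes(key):
            if self.alive[node] and key in self.nodes[node]:
                return self.nodes[node][key]
        raise KeyError(key)
```

With two replicas, a stored page can still be fetched after one of its two nodes fails; only when both nodes holding the copies are down does the read fail.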
Summary
A search engine first needs to spider the web and store the data in its index. During this process of creating and updating the index, the spider has to deal with several problems.
Once the index exists, the search engine has to provide an algorithm that computes search results that meet the expectations of its users, and it needs to maintain its hardware and handle its traffic.