
Search engines come in two basic forms: those powered by human editors and those powered by crawlers. Some search engines combine the two approaches, but most lean mainly on one or the other.
In crawler-based engines, spiders (or crawlers) do most of the work. The crawlers read gazillions of web pages so you don’t have to. On a site that features a program like the e-book manager Calibre, a spider would follow links around the site to discover its various pages. The spider returns regularly, maybe once every few months, because the web changes constantly. Whatever the spider reads on these visits goes into the search engine’s index, or catalog (two names for the same thing). Sometimes there’s a gap between when the spider reads a page and when that page is listed in the index. The last piece is the search software, basically an algorithm that decides what’s relevant to a given search based on the content the spiders have collected.
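The crawl-then-index loop described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine is built: the `SITE` dictionary is a made-up, in-memory stand-in for a website (no pages are actually downloaded), and the names `crawl` and `build_index` are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to (page text, outgoing links).
SITE = {
    "/": ("Calibre e-book manager home", ["/download", "/help"]),
    "/download": ("Download Calibre for your computer", ["/"]),
    "/help": ("Help and manual for managing e-book files", ["/", "/faq"]),
    "/faq": ("Frequently asked questions", []),
}


def crawl(site, start):
    """Follow links breadth-first from `start`; return {url: text} for each page reached."""
    seen = {start}
    queue = deque([start])
    pages = {}
    while queue:
        url = queue.popleft()
        text, links = site[url]
        pages[url] = text
        for link in links:
            # Skip links that leave the site or that we have already queued.
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


pages = crawl(SITE, "/")
index = build_index(pages)
```

Looking up `index["calibre"]` returns the set of pages that mention the word. A real spider would rerun the crawl on a schedule and refresh the index with whatever changed.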
Search software determines relevancy partly by the location and frequency of words related to the search on a given website. So a website about audio software like Audacity would need relevant keywords placed throughout the site. There are other factors, however. One is clickthrough: how often a certain result is clicked when it comes up for a given search. If a page ranks high but never gets clicked, it’s likely to be penalized in the rankings. Another factor is an analysis of the links connected to the page: how many there are, and of what quality. It’s something of an arms race, with webmasters eager to boost their rankings and search engines scrambling to keep their algorithms as accurate as possible at finding the quality content people want to see.
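To make "location and frequency" concrete, here is a minimal scoring sketch in Python. Everything in it is an assumption for illustration: the sample pages, the function names `score` and `rank`, and especially the title weight of 3.0, which is an arbitrary number. It ignores clickthrough and link analysis entirely, and real ranking algorithms are far more involved.

```python
# Assumed weight: a match in the title counts three times as much as one in the body.
TITLE_WEIGHT = 3.0

# Hypothetical sample pages.
PAGES = [
    {
        "url": "/audacity",
        "title": "Audacity audio editor",
        "body": "Audacity is free audio software for recording and editing audio",
    },
    {
        "url": "/recipes",
        "title": "Soup recipes",
        "body": "Audio books are nice to play while cooking soup",
    },
]


def score(page, query):
    """Toy relevance: word frequency in the body plus a bonus for title matches."""
    title_words = page["title"].lower().split()
    body_words = page["body"].lower().split()
    total = 0.0
    for term in query.lower().split():
        total += body_words.count(term)                  # frequency
        total += TITLE_WEIGHT * title_words.count(term)  # location
    return total


def rank(pages, query):
    """Return pages ordered from most to least relevant for the query."""
    return sorted(pages, key=lambda page: score(page, query), reverse=True)


results = rank(PAGES, "audio software")
```

For the query "audio software", the Audacity page comes out on top because the search words appear more often in its body and also in its title.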
Human-powered search engines, not surprisingly, are powered by humans. A person is given the task of submitting a synopsis of a website to the directory: either the person who runs the website or an editor who reviews websites. A search in this type of directory looks only at these summaries to decide what content is relevant.
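A directory search of this kind can be pictured as a lookup over the submitted summaries alone. The sketch below is a simplified illustration in Python: the directory entries, the example site names, and the `search_directory` function are all made up, and it assumes a listing matches only when every search word appears in its summary.

```python
# Hypothetical directory: each listing is a site plus its human-written summary.
DIRECTORY = [
    {"site": "audacity.example", "summary": "Free audio recording and editing software"},
    {"site": "calibre.example", "summary": "E-book manager for organizing a digital library"},
]


def search_directory(directory, query):
    """Return sites whose summary contains every word of the query."""
    terms = query.lower().split()
    matches = []
    for entry in directory:
        summary_words = entry["summary"].lower().split()
        # Only the summary is consulted; the site's own pages are never read.
        if all(term in summary_words for term in terms):
            matches.append(entry["site"])
    return matches


hits = search_directory(DIRECTORY, "audio software")
```

Because only the summary is searched, a site can be missed entirely if its synopsis leaves out a word that appears all over the site itself.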
