BUILDING SEARCH SMARTS
Search engines are stupid. Fast, cheap, simple ... and stupid.
BY SETH A. MAISLIN, Boston Chapter
Using
a search engine is like asking someone on the telephone how to fix your
car. It’s possible, but you need to be prepared to do a lot more than
say "My car isn’t working right."
With today’s
basic search engines, the demands of searching are placed on users. They
have to know precisely what they’re looking for, including spellings,
meanings, and languages. They need to understand higher-level search syntax,
if available. They have to know the scope of the search. Finally, they
have to be prepared to sort through innumerable results with no clear
guide to why certain results are valuable.
On the other
hand, programmers work furiously to overcome the failings of the simple
search to make search engines "smarter." The best algorithms
use multiple sources of additional input: context, dictionaries, thesauri,
grammar, search history, personal preferences, and third-party ratings.
Left alone, however, this pile of information is just that: a pile of
information.
To succeed,
search engines must emulate human judgment.
The more
"human-like" a search engine becomes, the more "intelligent"
it seems, and the better it performs. This article offers some basic approaches
to improving your search engines. All of them involve the application
of human intelligence to the search process.
Understanding the search engine
A search engine comprises two parts: the algorithm and the data. The algorithm
evaluates the data by performing a series of valuation steps, from which
the overall rating is determined.
The data,
often (unfortunately) called an index, are generated by separating
the total documentation into smaller chunks, each with a label
(unique identifier). For example, the World Wide Web can be subdivided
into individual web pages, each defined by a specific URL; the elements
of a telephone directory are the unique name entries; a user manual is
divided into sections or paragraphs. The size of these chunks determines
the granularity of the search. These chunks are then stored as records
in a database, along with other useful data like date-time information.
The search
algorithm scans the data, collecting the labels of any hits (search
matches). If the algorithm doesn’t incorporate a rating system, the labels
remain in database order. Sometimes alphabetical order is treated as a
valuation algorithm, such that labels starting with "C" appear
before labels starting with "D." Finally, the collated labels
are displayed accessibly to the user.
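To make this concrete, here is a minimal sketch in Python. The records and labels are hypothetical, but they show the two parts at work: a small database of labeled chunks, and an algorithm that scans it and collects the labels of any hits, in database order.

```python
# Hypothetical records: small chunks of documentation, each stored
# with a unique label and some useful metadata.
records = [
    {"label": "intro.html",   "text": "Welcome to our book store",    "modified": "2000-01-15"},
    {"label": "reviews.html", "text": "Reviews of this year's books", "modified": "1999-11-02"},
    {"label": "contact.html", "text": "How to reach our staff",       "modified": "1999-08-30"},
]

def scan(query, records):
    """Collect the labels of all hits. Without a rating system,
    the labels simply remain in database order."""
    query = query.lower()
    return [r["label"] for r in records if query in r["text"].lower()]

print(scan("book", records))  # ['intro.html', 'reviews.html']
```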
There are
two approaches for improving each of the search engine’s two parts (algorithm
and data): making overall improvements and making case-by-case improvements.
The next several sections address these approaches individually.
Improving the data by assumption
In everyday life, human beings respond to new situations by making immediate
categorical assumptions -- "judging a book by its cover" -- before delving into
specifics. Although never true in all circumstances, such assumptions provide
an excellent framework for later judgments. The generalizations below have
clear exceptions, but each holds in more than 50% of ordinary circumstances.
- Words in titles, headings, and meta tags are important.
- Words that appear frequently within a document are important.
- Italicized words are important.
- Words in bulleted lists are important.
- Uppercase words are important.
- All other words are not particularly important.
- Footnotes are unimportant.
- Conjunctions, prepositions, articles, and auxiliary verbs are particularly unimportant.
The algorithm
assigns auxiliary values to the records by preemptively applying assumptions
such as these. Then, when the search algorithm reads the database, these
values are used to help rate the hits. Consequently, a search for "books"
might determine that a bookstore web site is a better match than an online
book review.
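Here is one way such weighting might look in code, as a rough sketch. The field names and weight values below are hypothetical; the point is that the auxiliary values are assigned in advance and merely consulted at search time.

```python
# Hypothetical weights, applied in advance when each record is built.
WEIGHTS = {
    "title":    3.0,   # words in titles and headings are important
    "list":     1.5,   # words in bulleted lists are important
    "body":     1.0,   # all other words are not particularly important
    "footnote": 0.3,   # footnotes are unimportant
}

def rate(record, term):
    """Sum the weights of every field in which the term appears."""
    return sum(weight for field, weight in WEIGHTS.items()
               if term in record.get(field, "").lower())

bookstore = {"title": "books for sale", "body": "browse our store"}
review    = {"title": "weekly column",  "body": "a review of new books"}

# "books" appears in the bookstore's title but only in the review's
# body text, so the bookstore page earns the higher rating.
print(rate(bookstore, "books"), rate(review, "books"))  # 3.0 1.0
```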
There are
no limits regarding the quantity or complexity of these assumptions:
- Date-time information identifies newer or frequently updated documentation.
- Articles written in Spanish can be emphasized for Latin American audiences.
- Online documents with images can be devalued to discourage long downloads.
Touching up the data: Case-by-case improvement
Sometimes writers
and editors want to classify a database record as more relevant or irrelevant
than usual. Advertisers want their products to appear more relevant than
competitors’, and authors might think explanations are more desirable
than definitions. These opinions usually reveal themselves only after
reviewing the results of a few inadequate searches.
Writers can add editorial
values to the records for the algorithm to consider. For example, note
that text-based search engines cannot effectively rate graphics: important
diagrams and maps often are ignored inadvertently.
By individually "boosting" the ratings of graphics, authors
can guarantee graphics better placement in the results list.
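A boost of this kind might be as simple as a per-record multiplier, as in this hypothetical sketch, where an editor has hand-set a "boost" value on an important diagram so that it outranks a record with a higher raw score.

```python
# Hypothetical records: "score" is what the algorithm computed,
# and "boost" is a multiplier set by hand by a writer or editor.
records = [
    {"label": "network-diagram.png", "score": 0.4, "boost": 3.0},
    {"label": "glossary.html",       "score": 0.9, "boost": 1.0},
]

ranked = sorted(records, key=lambda r: r["score"] * r["boost"], reverse=True)
print([r["label"] for r in ranked])  # the boosted diagram ranks first
```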
This approach also
can be taken to an effective extreme, often called keywording or
(unfortunately again) indexing.
Editorially chosen
text, called keywords, is added into each database record. Then,
the search algorithm considers only these keywords, and not the documentation
itself. Further, the keywords can be collected into a standalone document
and scanned by users, as a replacement for searching. This is how indexes
are built for online help documentation. (When applied to entire web sites,
this is known as meta tagging.) With careful keyword selection,
a more accurate search is almost guaranteed. One interesting technique
is adding commonly misspelled words as keywords so that errors and ignorance
don’t interfere with a successful search.
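A keyword-only search might look something like this sketch. All of the records and keywords here are invented for illustration; note how a common misspelling, stored as a keyword, still leads to the right page.

```python
# Hypothetical records: the algorithm consults only the editorially
# chosen keywords, never the document text itself. Note the deliberate
# misspelling "seperate" stored alongside the correct spelling.
records = [
    {"label": "separating-data.html",
     "keywords": {"separate", "seperate", "split", "divide"}},
    {"label": "printing.html",
     "keywords": {"print", "printer", "hardcopy"}},
]

def keyword_search(query, records):
    terms = set(query.lower().split())
    return [r["label"] for r in records if terms & r["keywords"]]

# The misspelled query still finds the right page.
print(keyword_search("seperate columns", records))  # ['separating-data.html']
```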
Any case-by-case editorial
interpretation of data is resource-intensive. In fact, the entire process
is much more like book indexing than search programming.
Improving the algorithm overall: Query processing
Basic search algorithms
attempt to find simple equalities: When does the search term match a term
in a database record? Unfortunately, the human definition of "equality"
is not limited to identical words. Users might be interested in any different
word that has the same meaning, has a similar spelling, or shares the
same linguistic root. Several algorithms can use rules of language and
grammar to serve these needs.
For example, a user
searching for "computer" might also want records that contain
the word "computers." This can accomplished using a simple algorithm
that looks for singulars as well as plurals. A more complicated grammatical
algorithm known as stemming compares only linguistic roots: "Computer"
is equivalent to "compute," "computation," and "computational."
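A production stemmer (Porter's algorithm, for example) uses a large rule set; this naive sketch strips only a few suffixes, but it is enough to show how singulars, plurals, and related forms can collapse to one root.

```python
# A short, hypothetical suffix list, ordered longest-first.
SUFFIXES = ["ational", "ation", "ers", "er", "es", "s", "e"]

def stem(word):
    """Strip the first matching suffix, trying longer suffixes
    first and keeping a root of at least four letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

# All of these reduce to the root "comput", so a search for any
# one of them can match records containing the others.
for w in ["computer", "computers", "compute", "computation", "computational"]:
    print(w, "->", stem(w))
```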
A thesaurus identifies
words that share meanings. For example, a search for "computer"
finds examples of "desktops" and "laptops." Thesauri
have to be built by human beings, and some can be purchased. Understanding
the context and scope of the documentation is important. For many applications,
equating "acetaminophen" with its many brand names is not relevant.
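Expansion with a thesaurus can be as simple as a lookup table, as in this sketch. The synonym table is hypothetical and, as noted above, would have to be built by a human being with the documentation's scope in mind.

```python
# A hypothetical, human-built synonym table.
THESAURUS = {
    "computer": {"desktop", "laptop", "workstation"},
}

def expand(term):
    """Return a search term together with its known synonyms."""
    return {term} | THESAURUS.get(term, set())

print(sorted(expand("computer")))
# ['computer', 'desktop', 'laptop', 'workstation']
```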
The algorithm can
also interpret the user’s query before searching the database, such as
testing for common misspellings. (This is the opposite approach to using
misspellings as keywords, mentioned above.) The algorithm can ignore particularly
unimportant words, called stop words (usually prepositions, conjunctions,
and articles) that are too common to be limiting. If the user is allowed
to use search-related syntax such as wildcards, quotation marks, and Boolean
expressions, the algorithm must further interpret the query.
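A sketch of this kind of query preprocessing follows. Both tables are invented for illustration: one corrects common misspellings before the search runs, and the other drops stop words too common to be limiting.

```python
# Hypothetical tables of known misspellings and stop words.
CORRECTIONS = {"seperate": "separate", "recieve": "receive"}
STOP_WORDS = {"a", "an", "and", "do", "how", "i", "in", "of", "or", "the", "to"}

def preprocess(query):
    terms = []
    for word in query.lower().split():
        word = CORRECTIONS.get(word, word)  # fix known misspellings
        if word not in STOP_WORDS:          # drop unimportant words
            terms.append(word)
    return terms

print(preprocess("How do I seperate the columns"))  # ['separate', 'columns']
```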
Search engines that
accept natural language queries are an extreme case. These engines,
which attempt to accept queries like "How can I print graphics upside-down?"
and "How far is the Boston Aquarium from Faneuil Hall?" utilize
years of grammatical and syntactical research and continue to improve.
(In contrast, most humans perform this kind of full-sentence interpretation
easily, automatically, and unconsciously!) To use natural language algorithms
in documentation or on web sites, the only sensible option these days
is to license the technology.
Algorithms also can
process hits before displaying them to the user. Parental controls use
this technique to suppress content considered vulgar or otherwise inappropriate.
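Such post-processing might look like this sketch, where a hypothetical blocklist removes hits after the search but before display.

```python
# A hypothetical blocklist, applied after the search, before display.
BLOCKED = {"mature-content.html"}

def filter_hits(hits):
    return [hit for hit in hits if hit not in BLOCKED]

print(filter_hits(["faq.html", "mature-content.html", "index.html"]))
# ['faq.html', 'index.html']
```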
Query processing of
any kind relieves the user from the burden of knowing how to search and
what to search for.
Tweaking the algorithm: Handling specific searches
Users are often interested
in a particular word only when defined in a specific way or used in a
specific context. For example, the words "glasses" and "speakers" each
have at least two meanings. Because of this language-based subjectivity
and ambiguity, computers are unable to guess which interpretation the
user desires.
Instead of ignoring
the ambiguity, the search engine can respond by asking for more information.
For example, if a user searches for "glasses," the search algorithm
gives the user a choice between "drinking glasses" and "eyeglasses."
The response to this choice influences the final database search.
Iterative searching
is a natural human process:
1. Perform a query.
2. Interpret the results.
3. Perform a new query based on the results.
An interruptive model
of searching accommodates this process. Unfortunately, most users are
not interested in secondary choices and might think the engine is "showing
off." Thus, it’s important to provide results in addition to providing
these options. For example, users searching for "glasses" should
be presented with results for this ambiguous search and the option
to refine the search using either "drinking glasses" or "eyeglasses."
(Also remember that there might be additional interpretations of the query
words that cannot be predicted.)
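One possible shape for this interruptive model, sketched with a hypothetical ambiguity table: the engine returns ordinary results immediately and offers refinements alongside them.

```python
# A hypothetical table of ambiguous terms and their refinements.
AMBIGUOUS = {
    "glasses":  ["drinking glasses", "eyeglasses"],
    "speakers": ["audio speakers", "conference speakers"],
}

def search(query):
    results = ["(ordinary results for '%s')" % query]  # placeholder search
    refinements = AMBIGUOUS.get(query.lower(), [])
    return results, refinements

results, refinements = search("glasses")
print(results)      # shown immediately, so the user is never blocked
print(refinements)  # offered alongside: ['drinking glasses', 'eyeglasses']
```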
There is also the
opportunity to override the natural behavior of a search engine for specific
queries. Sometimes there is an advantage to not performing the
search, particularly when the user considers the search input box as the
only viable input opportunity. If a user searches for the word "home,"
it might make sense to send the user to the home page of the site instead
of displaying search results. Likewise, a user who types "lycos.com"
in the search box is probably interested in navigating to the Lycos site.
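An override of this kind might be nothing more than a lookup performed before the search, as in this hypothetical sketch.

```python
# Hypothetical overrides: some queries are better served by
# navigation than by a results list.
OVERRIDES = {
    "home": "/index.html",
    "lycos.com": "http://www.lycos.com/",
}

def handle(query):
    destination = OVERRIDES.get(query.lower().strip())
    if destination:
        return ("redirect", destination)   # skip the search entirely
    return ("search", query)               # fall back to a normal search

print(handle("home"))      # ('redirect', '/index.html')
print(handle("printers"))  # ('search', 'printers')
```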
Clarifying the search scope
After applying all
this "intelligence" to your search engine, there is still one
major element missing: context. Users often believe they are searching
something different from what is really being searched:
- After one search, users might think they can search the search results.
- After navigating into a section of documentation, they might expect the search to be limited to that section.
- Users might suspect that a search engine on a web site actually searches the entire Web.
It is important to
communicate the scope of the search. Instead of labeling the engine with
only the word "Search," be more explicit: "Search our site"
or "Search the Web." Consider allowing users to choose the scope
with radio buttons or drop-down menus: "Search: This Site (or) This
Section (or) This Page." In addition, if there are sections of a
site that are password-protected or otherwise restricted, identify whether
or not users can get search results for those sections without a password.
The more flexibility you allow with search scope, the more search databases
you need.
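In code, scope selection might amount to mapping each user-visible scope to its own database, as this hypothetical sketch suggests; the mapping also makes clear why more scope flexibility means more databases.

```python
# Hypothetical scopes, each backed by its own search database.
DATABASES = {
    "This Site":    "site_index.db",
    "This Section": "section_index.db",
    "This Page":    "page_index.db",
}

def search_in_scope(query, scope):
    database = DATABASES[scope]  # e.g. the value of a radio button
    return "searching %r in %s" % (query, database)

print(search_in_scope("printers", "This Section"))
# searching 'printers' in section_index.db
```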
Conclusion
Search
engines cannot handle language without human help. They perform no interpretation
without editorial assistance. Remember that the burden of effective searching
is often on the user, and that the user is rarely as familiar with the site
structure as the writers, editors, and programmers. Use your human language
skills to interpret the data in advance and to build a search algorithm
that incorporates language rules and considers exceptions.
Winter 2000, Volume 3, #1
Seth
Maislin is an indexer, information architect, and consultant. Formerly
the senior indexer at O'Reilly & Associates, Seth is a regular conference
presenter, a council member for the Boston STC Chapter, webmaster for
the Indexing SIG, and a member of the American Society of Indexers national
board.