BUILDING SEARCH SMARTS
Search engines are stupid. Fast, cheap, simple ... and stupid.
BY SETH A. MAISLIN, Boston Chapter
Using
a search engine is like asking someone on the telephone how to fix your
car. It’s possible, but you need to be prepared to do a lot more than
say "My car isn’t working right."
With today’s
basic search engines, the demands of searching are placed on users. They
have to know precisely what they’re looking for, including spellings,
meanings, and languages. They need to understand higher-level search syntax,
if available. They have to know the scope of the search. Finally, they
have to be prepared to sort through innumerable results with no clear
guide to why certain results are valuable.
On the other
hand, programmers work furiously to overcome the failings of the simple
search to make search engines "smarter." The best algorithms
use multiple sources of additional input: context, dictionaries, thesauri,
grammar, search history, personal preferences, and third-party ratings.
Left alone, however, this pile of information is just that: a pile of
information.
To succeed,
search engines must emulate human judgment.
The more
"human-like" a search engine becomes, the more "intelligent"
it seems, and the better it performs. This article offers some basic approaches
to improving your search engines. All of them involve the application
of human intelligence to the search process.
Understanding the search engine
A search engine comprises two parts: the algorithm and the data. The algorithm
evaluates the data by performing a series of valuation steps, from which
the overall rating is determined.
The data,
often (unfortunately) called an index, are generated by separating
the total documentation into smaller chunks, each with a label
(unique identifier). For example, the World Wide Web can be subdivided
into individual web pages, each defined by a specific URL; the elements
of a telephone directory are the unique name entries; a user manual is
divided into sections or paragraphs. The size of these chunks determines
the granularity of the search. These chunks are then stored as records
in a database, along with other useful data like date-time information.
The search
algorithm scans the data, collecting the labels of any hits (search
matches). If the algorithm doesn’t incorporate a rating system, the labels
remain in database order. Sometimes alphabetical order is treated as a
valuation algorithm, such that labels starting with "C" appear
before labels starting with "D." Finally, the collated labels
are displayed accessibly to the user.
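To make this concrete, here is a minimal sketch in Python. The records and labels are hypothetical, but they show the two parts at work: a small database of labeled chunks, and an algorithm that scans it and collects the labels of any hits, in database order.

```python
# Hypothetical records: small chunks of documentation, each stored
# with a unique label and some useful metadata.
records = [
    {"label": "intro.html",   "text": "Welcome to our book store",    "modified": "2000-01-15"},
    {"label": "reviews.html", "text": "Reviews of this year's books", "modified": "1999-11-02"},
    {"label": "contact.html", "text": "How to reach our staff",       "modified": "1999-08-30"},
]

def scan(query, records):
    """Collect the labels of all hits. Without a rating system,
    the labels simply remain in database order."""
    query = query.lower()
    return [r["label"] for r in records if query in r["text"].lower()]

print(scan("book", records))  # ['intro.html', 'reviews.html']
```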
There are
two approaches for improving each of the search engine’s two parts (algorithm
and data): making overall improvements and making case-by-case improvements.
The next several sections address these approaches individually.
Improving the data by assumption
In everyday life, human beings respond to new situations by making immediate
categorical assumptions -- "judging a book by its cover" -- before delving into
specifics. Although never true in all circumstances, such assumptions provide
an excellent framework for later judgments. The generalizations below have
clear exceptions, but each holds in more than 50% of ordinary circumstances.
- Words in titles, headings, and meta tags are important.
- Words that appear frequently within a document are important.
- Italicized words are important.
- Words in bulleted lists are important.
- Uppercase words are important.
- All other words are not particularly important.
- Footnotes are unimportant.
- Conjunctions, prepositions, articles, and auxiliary verbs are particularly unimportant.
The algorithm
assigns auxiliary values to the records by preemptively applying assumptions
such as these. Then, when the search algorithm reads the database, these
values are used to help rate the hits. Consequently, a search for "books"
might determine that a bookstore web site is a better match than an online
book review.
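Here is one way such weighting might look in code, as a rough sketch. The field names and weight values below are hypothetical; the point is that the auxiliary values are assigned in advance and merely consulted at search time.

```python
# Hypothetical weights, applied in advance when each record is built.
WEIGHTS = {
    "title":    3.0,   # words in titles and headings are important
    "list":     1.5,   # words in bulleted lists are important
    "body":     1.0,   # all other words are not particularly important
    "footnote": 0.3,   # footnotes are unimportant
}

def rate(record, term):
    """Sum the weights of every field in which the term appears."""
    return sum(weight for field, weight in WEIGHTS.items()
               if term in record.get(field, "").lower())

bookstore = {"title": "books for sale", "body": "browse our store"}
review    = {"title": "weekly column",  "body": "a review of new books"}

# "books" appears in the bookstore's title but only in the review's
# body text, so the bookstore page earns the higher rating.
print(rate(bookstore, "books"), rate(review, "books"))  # 3.0 1.0
```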
There are
no limits regarding the quantity or complexity of these assumptions:
- Date-time information identifies newer or frequently updated documentation.
- Articles written in Spanish can be emphasized for Latin American audiences.
- Online documents with images can be devalued to discourage long downloads.
Touching up the data: Case-by-case improvement
Sometimes writers
and editors want to classify a database record as more relevant or irrelevant
than usual. Advertisers want their products to appear more relevant than
competitors’, and authors might think explanations are more desirable
than definitions. These opinions usually reveal themselves only after
reviewing the results of a few inadequate searches.
Writers can add editorial
values to the records for the algorithm to consider. For example, note
that text-based search engines cannot effectively rate graphics: important
diagrams and maps often are ignored inadvertently.
By individually "boosting" the ratings of graphics, authors
can guarantee graphics better placement in the results list.
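A boost of this kind might be as simple as a per-record multiplier, as in this hypothetical sketch, where an editor has hand-set a "boost" value on an important diagram so that it outranks a record with a higher raw score.

```python
# Hypothetical records: "score" is what the algorithm computed,
# and "boost" is a multiplier set by hand by a writer or editor.
records = [
    {"label": "network-diagram.png", "score": 0.4, "boost": 3.0},
    {"label": "glossary.html",       "score": 0.9, "boost": 1.0},
]

ranked = sorted(records, key=lambda r: r["score"] * r["boost"], reverse=True)
print([r["label"] for r in ranked])  # the boosted diagram ranks first
```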
This approach also
can be taken to an effective extreme, often called keywording or
(unfortunately again) indexing.
Editorially chosen
text, called keywords, is added into each database record. Then,
the search algorithm considers only these keywords, and not the documentation
itself. Further, the keywords can be collected into a standalone document
and scanned by users, as a replacement for searching. This is how indexes
are built for online help documentation. (When applied to entire web sites,
this is known as meta tagging.) With careful keyword selection,
a more accurate search is almost guaranteed. One interesting technique
is adding commonly misspelled words as keywords so that errors and ignorance
don’t interfere with a successful search.
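A keyword-only search might look something like this sketch. All of the records and keywords here are invented for illustration; note how a common misspelling, stored as a keyword, still leads to the right page.

```python
# Hypothetical records: the algorithm consults only the editorially
# chosen keywords, never the document text itself. Note the deliberate
# misspelling "seperate" stored alongside the correct spelling.
records = [
    {"label": "separating-data.html",
     "keywords": {"separate", "seperate", "split", "divide"}},
    {"label": "printing.html",
     "keywords": {"print", "printer", "hardcopy"}},
]

def keyword_search(query, records):
    terms = set(query.lower().split())
    return [r["label"] for r in records if terms & r["keywords"]]

# The misspelled query still finds the right page.
print(keyword_search("seperate columns", records))  # ['separating-data.html']
```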
Any case-by-case editorial
interpretation of data is resource-intensive. In fact, the entire process
is much more like book indexing than search programming.
Improving the algorithm overall: Query processing
Basic search algorithms
attempt to find simple equalities: When does the search term match a term
in a database record? Unfortunately, the human definition of "equality"
is not limited to identical words. Users might be interested in any different
word that has the same meaning, has a similar spelling, or shares the
same linguistic root. Several algorithms can use rules of language and
grammar to serve these needs.
For example, a user
searching for "computer" might also want records that contain
the word "computers." This can accomplished using a simple algorithm
that looks for singulars as well as plurals. A more complicated grammatical
algorithm known as stemming compares only linguistic roots: "Computer"
is equivalent to "compute," "computation," and "computational."
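A production stemmer (Porter's algorithm, for example) uses a large rule set; this naive sketch strips only a few suffixes, but it is enough to show how singulars, plurals, and related forms can collapse to one root.

```python
# A short, hypothetical suffix list, ordered longest-first.
SUFFIXES = ["ational", "ation", "ers", "er", "es", "s", "e"]

def stem(word):
    """Strip the first matching suffix, trying longer suffixes
    first and keeping a root of at least four letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

# All of these reduce to the root "comput", so a search for any
# one of them can match records containing the others.
for w in ["computer", "computers", "compute", "computation", "computational"]:
    print(w, "->", stem(w))
```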
A thesaurus identifies
words that share meanings. For example, a search for "computer"
finds examples of "desktops" and "laptops." Thesauri
have to be built by human beings, and some can be purchased. Understanding
the context and scope of the documentation is important. For many applications,
equating "acetaminophen" with its many brand names is not relevant.
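Expansion with a thesaurus can be as simple as a lookup table, as in this sketch. The synonym table is hypothetical and, as noted above, would have to be built by a human being with the documentation's scope in mind.

```python
# A hypothetical, human-built synonym table.
THESAURUS = {
    "computer": {"desktop", "laptop", "workstation"},
}

def expand(term):
    """Return a search term together with its known synonyms."""
    return {term} | THESAURUS.get(term, set())

print(sorted(expand("computer")))
# ['computer', 'desktop', 'laptop', 'workstation']
```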
The algorithm can
also interpret the user’s query before searching the database, such as
testing for common misspellings. (This is the opposite approach to using
misspellings as keywords, mentioned above.) The algorithm can ignore particularly
unimportant words, called stop words (usually prepositions, conjunctions,
and articles) that are too common to be limiting. If the user is allowed
to use search-related syntax such as wildcards, quotation marks, and Boolean
expressions, the algorithm must further interpret the query.
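A sketch of this kind of query preprocessing follows. Both tables are invented for illustration: one corrects common misspellings before the search runs, and the other drops stop words too common to be limiting.

```python
# Hypothetical tables of known misspellings and stop words.
CORRECTIONS = {"seperate": "separate", "recieve": "receive"}
STOP_WORDS = {"a", "an", "and", "do", "how", "i", "in", "of", "or", "the", "to"}

def preprocess(query):
    terms = []
    for word in query.lower().split():
        word = CORRECTIONS.get(word, word)  # fix known misspellings
        if word not in STOP_WORDS:          # drop unimportant words
            terms.append(word)
    return terms

print(preprocess("How do I seperate the columns"))  # ['separate', 'columns']
```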
Search engines that
accept natural language queries are an extreme case. These engines,
which attempt to accept queries like "How can I print graphics upside-down?"
and "How far is the Boston Aquarium from Faneuil Hall?" utilize
years of grammatical and syntactical research and continue to improve.
(In contrast, most humans perform this kind of full-sentence interpretation
easily, automatically, and unconsciously!) To use natural language algorithms
in documentation or on web sites, the only sensible option these days
is to license the technology.
Algorithms also can
process hits before displaying them to the user. Parental controls use
this technique to suppress content considered vulgar or otherwise inappropriate.
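Such post-processing might look like this sketch, where a hypothetical blocklist removes hits after the search but before display.

```python
# A hypothetical blocklist, applied after the search, before display.
BLOCKED = {"mature-content.html"}

def filter_hits(hits):
    return [hit for hit in hits if hit not in BLOCKED]

print(filter_hits(["faq.html", "mature-content.html", "index.html"]))
# ['faq.html', 'index.html']
```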
Query processing of
any kind relieves the user from the burden of knowing how to search and
what to search for.
Tweaking the algorithm: Handling specific searches
Users are often interested
in a particular word only when defined in a specific way or used in a
specific context. For example, the words "glasses" and "speakers" each
have at least two meanings. Because of this language-based subjectivity
and ambiguity, computers are unable to guess which interpretation the
user desires.
Instead of ignoring
the ambiguity, the search engine can respond by asking for more information.
For example, if a user searches for "glasses," the search algorithm
gives the user a choice between "drinking glasses" and "eyeglasses."
The response to this choice influences the final database search.
Iterative searching
is a natural human process:
1. Perform a query.
2. Interpret the results.
3. Perform a new query based on the results.
An interruptive model
of searching accommodates this process. Unfortunately, most users are
not interested in secondary choices and might think the engine is "showing
off." Thus, it’s important to provide results in addition to providing
these options. For example, users searching for "glasses" should
be presented with results for this ambiguous search and the option
to refine the search using either "drinking glasses" or "eyeglasses."
(Also remember that there might be additional interpretations of the query
words that cannot be predicted.)
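One possible shape for this interruptive model, sketched with a hypothetical ambiguity table: the engine returns ordinary results immediately and offers refinements alongside them.

```python
# A hypothetical table of ambiguous terms and their refinements.
AMBIGUOUS = {
    "glasses":  ["drinking glasses", "eyeglasses"],
    "speakers": ["audio speakers", "conference speakers"],
}

def search(query):
    results = ["(ordinary results for '%s')" % query]  # placeholder search
    refinements = AMBIGUOUS.get(query.lower(), [])
    return results, refinements

results, refinements = search("glasses")
print(results)      # shown immediately, so the user is never blocked
print(refinements)  # offered alongside: ['drinking glasses', 'eyeglasses']
```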
There is also the
opportunity to override the natural behavior of a search engine for specific
queries. Sometimes there is an advantage to not performing the
search, particularly when the user considers the search input box as the
only viable input opportunity. If a user searches for the word "home,"
it might make sense to send the user to the home page of the site instead
of displaying search results. Likewise, a user who types "lycos.com"
in the search box is probably interested in navigating to the Lycos site.
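An override of this kind might be nothing more than a lookup performed before the search, as in this hypothetical sketch.

```python
# Hypothetical overrides: some queries are better served by
# navigation than by a results list.
OVERRIDES = {
    "home": "/index.html",
    "lycos.com": "http://www.lycos.com/",
}

def handle(query):
    destination = OVERRIDES.get(query.lower().strip())
    if destination:
        return ("redirect", destination)   # skip the search entirely
    return ("search", query)               # fall back to a normal search

print(handle("home"))      # ('redirect', '/index.html')
print(handle("printers"))  # ('search', 'printers')
```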
Clarifying the search scope
After applying all
this "intelligence" to your search engine, there is still one
major element missing: context. Users often believe they are searching
something different from what is really being searched:
- After one search, users might think they can search the search results.
- After navigating into a section of documentation, they might expect the search to be limited to that section.
- Users might suspect that a search engine on a web site actually searches the entire Web.
It is important to
communicate the scope of the search. Instead of labeling the engine with
only the word "Search," be more explicit: "Search our site"
or "Search the Web." Consider allowing users to choose the scope
with radio buttons or drop-down menus: "Search: This Site (or) This
Section (or) This Page." In addition, if there are sections of a
site that are password-protected or otherwise restricted, identify whether
or not users can get search results for those sections without a password.
The more flexibility you allow with search scope, the more search databases
you need.
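In code, scope selection might amount to mapping each user-visible scope to its own database, as this hypothetical sketch suggests; the mapping also makes clear why more scope flexibility means more databases.

```python
# Hypothetical scopes, each backed by its own search database.
DATABASES = {
    "This Site":    "site_index.db",
    "This Section": "section_index.db",
    "This Page":    "page_index.db",
}

def search_in_scope(query, scope):
    database = DATABASES[scope]  # e.g. the value of a radio button
    return "searching %r in %s" % (query, database)

print(search_in_scope("printers", "This Section"))
# searching 'printers' in section_index.db
```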
Conclusion
Search
engines cannot handle language without human help. They perform no interpretation
without editorial assistance. Remember that the burden of effective searching
is often on the user, and that the user is rarely as familiar with the site
structure as the writers, editors, and programmers. Use your human language
skills to interpret the data in advance and to build a search algorithm
that incorporates language rules and considers exceptions.
Winter 2000, Volume 3, #1
Seth
Maislin is an indexer, information architect, and consultant. Formerly
the senior indexer at O'Reilly & Associates, Seth is a regular conference
presenter, a council member for the Boston STC Chapter, webmaster for
the Indexing SIG, and a member of the American Society of Indexers national
board.