Why Indexing Fails, and How to Solve It

A report on Seth Maislin's presentation to the
Society for Documentation Professionals (SDP) in March 2001.

By Ed Marshall

Seth Maislin delivered a presentation on "Why Indexing Fails: Real-Life Problems and Solutions" at the March SDP meeting. Seth is a Directory Manager at Lycos, Inc., and sole proprietor of Focus Information Services. He provides indexing, information architecture, and consulting services to public and private audiences.

According to Seth, there are two approaches to indexing: information architecture/design and knowledge management. The first is more of an engineering approach, appropriate when you already know the subject matter and the topics that need to be covered. It resembles the top-down design approach common in software engineering: you flesh out the details of a subject working from the highest level to lower levels. For example, you might have a complete outline of the subject before you start writing, so you can plan your index entries in advance.

The second is a "touchy-feely" approach. You start writing with what you already know about the subject from previous projects and, as you learn more, you revise your outline and add information. This can be viewed as a bottom-up approach. You will probably add index entries as you write, and you will probably not be able to anticipate most of them before you start.

Whichever approach you use, you still need to structure information so users can find it.

Indexing: Past and Present

	The way we use the written word in our information sets or books today has changed how we present information. Indexing used to be done by librarians. Today, programmers, writers, or engineers commonly do it. The same skills are required to create indexes, but the problems are different.
	Seth said the following trends make it more difficult to produce good indexes: custom documentation, single-source production, time-variable content, multibook CD-ROMs, internationalization, and tools designed for embedded indexing.
	Historically, creating an index for a printed book involved writing the book first and then indexing it. You wrote index entries on index cards, alphabetized the index cards, and then collated them, placing similar entries in sub-entries. To enter the index, you picked up an index card and typed the entry. You repeated this process until you were done.
	With today's tools, you can have embedded indexes where the index doesn't have page numbers. You can also use tags to generate an index as you build the book. Writing the index is part of writing the book but sometimes it becomes more segmented as pieces of the book go to the printer.
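	As a rough illustration of tag-based index generation, the following Python sketch collects index tags embedded in the text and builds the index from them. The `{{index:term:subterm}}` tag syntax, the `generate_index` function, and the section identifiers are assumptions made for this example only; real authoring tools each have their own marker format.

```python
import re
from collections import defaultdict

# Hypothetical tag syntax for this sketch only:
#   {{index:term}}  or  {{index:term:subterm}}
TAG = re.compile(r"\{\{index:([^}:]+)(?::([^}]+))?\}\}")


def generate_index(sections):
    """Build {term: {subterm: [section ids]}} from tags embedded in the text.

    `sections` is a list of (section_id, text) pairs. An entry with no
    subterm is stored under the empty string.
    """
    index = defaultdict(lambda: defaultdict(list))
    for section_id, text in sections:
        for term, subterm in TAG.findall(text):
            locators = index[term.strip()][subterm.strip()]
            if section_id not in locators:  # avoid duplicate locators
                locators.append(section_id)
    return index


# Illustrative input: the tags travel with the text, so the index points
# at sections rather than page numbers.
sections = [
    ("intro", "Embedded {{index:indexing:embedded}} tags travel with the text."),
    ("tools", "Most {{index:tools}} support {{index:indexing:embedded}} entries."),
]
index = generate_index(sections)
```

	Because the locators are section identifiers rather than page numbers, the index can be regenerated each time the book is built.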
	You can also write the index before the book based on the organization or information structure. You can group similar topics or use a linear outline. As the structuring occurs, the indexer can see general themes.
	It is possible to design the structure before you write the book, but problems can occur with items found after you have selected a structure. In traditional books, the table of contents is linear and does not repeat entries. In contrast, the index can repeat entries and is non-linear. Also, in a printed book, the table of contents appears at the front of the book and the index at the back. But these are superficial structures that you lose when you put a book online.

Multibook CD-ROMs and Single Sourcing

Putting multiple books on a CD-ROM is common today for large documentation sets. For example, you might put five books on Java on a CD-ROM as a "virtual bookshelf". This approach can present indexing problems, such as formatting index entries consistently and including enough entries to make the bookshelf useful to customers. These problems can be compounded if multiple writers work on the books: how do you maintain consistency in the indexes for all of them? Often, you write your indexes before the book is done, which can lead to inconsistencies in the formatting of index entries. Methods used in writing indexes include using a controlled vocabulary and using style sheets (specifying plural vs. singular, the use of gerunds, and so on). In a controlled vocabulary, you use the same noun repeatedly and treat its synonyms as cross-references. For example, a user might look for "protection" under "castle" or "shield". A controlled vocabulary provides the big picture, but you lose details. This can result in poor searches by search engines.
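
A controlled vocabulary can be sketched as a small lookup table. The Python example below is a minimal illustration, not a description of any particular indexing tool: the `CONTROLLED_VOCABULARY` table and the `normalize` and `build_cross_references` helpers are hypothetical names, and the only data comes from the "protection" example above.

```python
# Hypothetical controlled vocabulary: each preferred term maps to the
# synonyms that appear in the index only as "See" cross-references.
CONTROLLED_VOCABULARY = {
    "protection": ["castle", "shield"],
}


def normalize(term, vocabulary):
    """Map a writer's term onto the preferred term, if one is defined."""
    lowered = term.lower()
    for preferred, synonyms in vocabulary.items():
        if lowered == preferred or lowered in synonyms:
            return preferred
    return lowered  # not in the vocabulary: keep the term as written


def build_cross_references(vocabulary):
    """Return one 'See' line per synonym, keyed by the synonym."""
    cross_refs = {}
    for preferred, synonyms in vocabulary.items():
        for synonym in synonyms:
            cross_refs[synonym] = f"{synonym}. See {preferred}"
    return cross_refs
```

If every writer on the bookshelf runs proposed entries through the same table, the main entries stay consistent across books, while the synonyms remain available as cross-references.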

Using a single source for documents in different output media usually doesn't work. For example, it is difficult to use the same source of information for both a printed document and online help. The media, and how people use them, are too different. Books tend to be linear in nature. That is, the information is structured so that people will read from the beginning to the end of the book. In online help systems, users tend to be looking for assistance with a particular task and aren't interested in reading about a tool from start to finish. Although you can look up information in printed books in any order, users typically use printed books to learn concepts and how those concepts are implemented in a particular tool. In online help systems, people aren't interested in learning the concepts behind a particular task. They only want to know how to perform that task as part of getting their job done, and the task might be only a subset of what they need to do. They view software products as tools that help them accomplish the tasks assigned by their employers. They are less interested in why you perform a particular task than in how to do it, and then they move on to the next task required for their job.

Indexing Guidelines

Seth said that, on average, you should have 5 index pages per 100 pages of text. This can vary depending on the density of the text. If your document is very repetitive or has many graphics, you might have a shorter index than the average.
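
As a quick worked example of this guideline, the following Python sketch estimates index length from the page count. The `estimate_index_pages` function and the sample page counts are illustrative assumptions; only the ratio of 5 index pages per 100 text pages comes from the presentation, and the result should be treated as a starting point to adjust for text density.

```python
import math


def estimate_index_pages(text_pages, index_pages_per_100=5.0):
    """Estimate index length, rounding up to whole pages.

    The default ratio is the rule of thumb from the presentation; lower it
    for repetitive or graphics-heavy documents.
    """
    if text_pages < 0:
        raise ValueError("text_pages must not be negative")
    return math.ceil(text_pages * index_pages_per_100 / 100)


# Hypothetical page counts, for illustration only.
manual_estimate = estimate_index_pages(320)       # 320 * 5 / 100 = 16 pages
tutorial_estimate = estimate_index_pages(250, 3)  # sparser text: 7.5, rounded up to 8
```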

Text aimed at beginners might have fewer index entries than highly technical material. Reference information usually has more index entries than introductory or tutorial information. A quick reference document will usually have a long index with short entries. With online documentation, you might want to use double-posting. For example, "teaching" appears as a main entry with "education" as its subentry, and "education" appears as a main entry with "teaching" as its subentry.
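
Double-posting can be sketched in a few lines of Python. In this minimal example, the `double_post` and `render` helpers and the `topics/training.html` locator are hypothetical; only the "teaching"/"education" pair comes from the text above.

```python
from collections import defaultdict


def double_post(index, first, second, locator):
    """File the same locator under 'first, second' and 'second, first'."""
    index[first][second].add(locator)
    index[second][first].add(locator)


def render(index):
    """Return the index as alphabetized lines, subentries indented."""
    lines = []
    for main in sorted(index):
        lines.append(main)
        for sub in sorted(index[main]):
            locators = ", ".join(sorted(index[main][sub]))
            lines.append(f"    {sub}: {locators}")
    return lines


# index[main entry][subentry] -> set of locators (topic files, not pages)
index = defaultdict(lambda: defaultdict(set))
double_post(index, "teaching", "education", "topics/training.html")
```

Whichever of the two words the reader thinks of first, the same topic is reachable from it.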

Testing Your Index

	Does the index work for you? Do a sampling of the index. Are the entries correct and useful? Read a chapter, put the book aside for a week, then go back to the index and try to find information on key topics.
	Another way to test an index is to simply pick a topic. Can you easily find it in the index?
	The American Society of Indexers (ASI) has a pamphlet on evaluating indexes. The information still applies to printed documentation but not to online documentation.

Searches vs. Indexes

	Seth said that searches tend to be literal while indexes help when you know a related name or topic. Intelligent searches try to mimic the way people look for information by assigning "weight values" to words based on where the words appear. For example, words in a title are more important than words found in paragraphs.
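	The idea of weighting words by where they appear can be sketched as follows. This Python example is a minimal illustration, not how any particular search engine works: the weights, the `score` and `search` functions, and the sample documents are all assumptions.

```python
import re

# Hypothetical weights: a match in the title counts three times as much
# as a match in the body text.
TITLE_WEIGHT = 3
BODY_WEIGHT = 1


def words(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(document, query):
    """Sum the weighted occurrences of each query word in the document."""
    title_words = words(document["title"])
    body_words = words(document["body"])
    total = 0
    for word in words(query):
        total += TITLE_WEIGHT * title_words.count(word)
        total += BODY_WEIGHT * body_words.count(word)
    return total


def search(documents, query):
    """Return matching documents, highest score first."""
    scored = [(score(doc, query), doc) for doc in documents]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits]


# Illustrative documents only.
documents = [
    {"title": "Indexing basics", "body": "Printing is covered later; printing needs setup."},
    {"title": "Printing reports", "body": "How to send a report to the printer."},
]
results = search(documents, "printing")
```

	Here the second document ranks first: a single match in its title outweighs the two matches in the body of the other document. The match is still literal, though, which is why "printer" contributes nothing to the score.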
	Another problem arises as your documentation set grows in size. How do you assign "weight values" to topics found across multiple books? One tip is to search iteratively; that is, provide the capability to search within the search results. Another is to allow searches only for large topics, or to restrict a search to a specific book.
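	Both tips can be sketched with one small function. In this Python example, the `search_topics` function, the book titles, and the topic records are hypothetical, and the matching is a simple literal substring test.

```python
def search_topics(topics, word, book=None):
    """Return topics whose text contains `word`, optionally within one book."""
    word = word.lower()
    return [
        topic
        for topic in topics
        if word in topic["text"].lower()
        and (book is None or topic["book"] == book)
    ]


# Illustrative topics from a hypothetical two-book bookshelf.
topics = [
    {"book": "Java Basics", "title": "Creating threads",
     "text": "Start a thread with the Thread class."},
    {"book": "Java Networking", "title": "Threaded servers",
     "text": "A server can handle each socket in its own thread."},
    {"book": "Java Networking", "title": "Sockets",
     "text": "Open a socket to a remote host."},
]

# Iterative search: search the whole set, then search the results.
first_pass = search_topics(topics, "thread")
refined = search_topics(first_pass, "socket")

# Restricted search: look in one book only.
networking_only = search_topics(topics, "socket", book="Java Networking")
```

	Because the function accepts any list of topics, the results of one search can be passed straight back in as the input to the next.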
	A good Web search engine is
	Seth offered the following techniques for creating an index: Consider indexing in chunks. Start indexing those topics that you know your book will cover. Other topics will arise as you write the book so you can index them later. Identify topics that won't change and use these to build pieces of your index.
	Ed Marshall is a software technical writer. Ed has over 14 years of experience writing for companies such as Digital Equipment Corporation, Progress Software Corporation, and Lernout & Hauspie Speech Products. He has produced printed documentation, Windows Help, HTML Help, JavaDoc, and documents for the Web. He has extensive experience with implementing single-sourcing solutions, archiving, and recreating product documentation from revived products. He is a senior member of the Northern New England chapter of the STC, where he is the secretary and co-programs coordinator. He can be reached at EdMofShirley@aol.com.
