Indexing SGML documents

	Where to Put Index Tags in SGML Documents



	(or, better yet, Where Never to Put SGML Index Tags)

By Seth A. Maislin

Note
This paper is based on DocBook, one common DTD for SGML. The names of the SGML tags used here come from that DTD. If you are using a different DTD, you should still find insight in this paper. If you have something you would like to add to this paper, please do; write me at seth@maislin.com and reference this page.

Informal Table of Contents
Introduction
Paragraphs
Long strings of formatted text
Section-long ranges
First paragraph of the chapter
Footnotes
Sidebars
Inline font tags
Titles
Tables
Code
Sourced-in material
Lists
Figures
Glossaries

Introduction

Ideally, all index tags will end up between <PARA> and </PARA> tags. However, the goal of using indexing tags is to get accurate page numbers in the final index with a minimum of effort. Constraining the index tags to remain within <PARA> tags leads to inaccuracy. Worse than that, index tags can mess up the book format, creating problems ranging from missing text and altered font styles to misaligned tables, new page breaks, and occasionally fatal errors.

Why are index tags such a problem? Here's my theory. When the document is being analyzed (parsed) and the opening index tag is discovered, the parser "learns" a few things. First, the text that follows <INDEXTERM> is not to show up in the document. Second, the text before the <INDEXTERM> tag and after the </INDEXTERM> tag should be formatted as if the index tags and their contents had never been there. Unfortunately, I have found that this second item is not handled for <INDEXTERM> in the same way that it is handled for other tags, such as <EMPHASIS> and <MONOSPACED> tags.

When a start-tag is located, no matter what start-tag it is (except index tags, because they are different), the parser alters the "format conditions" that were in effect for the text in front of the tag. For example, the format conditions of all the text in this paragraph so far are "plain" or "Roman."
(Actually, it's whatever the default text for your HTML browser is, but for me, it's just plain ordinary Roman text.) Each tag in this paragraph, however, would augment these format conditions, and how it augments them depends on what tag it is. For example, in SGML, an <EMPHASIS> tag simply "adds" the trait of emphasis to whatever text format immediately preceded that tag. When the </EMPHASIS> tag appears, the trait of emphasis is "subtracted" from the text style. This allows for tag embedding. Simply by stringing together a whole bunch of tags, it's possible to take ordinary text and add several formatting features, such as small caps, boldface, and italics. These traits can then be removed in any order.

The problem is that there really is no such thing as an "index trait." In order for the parser to react properly to an index tag, the formatting information and further instructions have to be hardcoded. Whereas all other tags are relative, adding and subtracting traits to whatever preceded the tag, the index tag sets a new format for the index tag text and then "resets" the text. Unfortunately, it will automatically reset the text to "plain old Roman text," or whatever the default is. In other words, putting in an <INDEXTERM> tag pair is the same as resetting the text to a default condition.

This means that an index term can never appear anywhere where the text immediately following it must be something other than plain old Roman text. It is for this reason that all the examples that follow are unacceptable.

Index tags within paragraphs

If you can help it, put the index tags within <PARA> and </PARA> tags, making sure that the term does not appear within any other formatting tags, such as <EMPHASIS> pairs. In fact, if the index tags can show up immediately following <PARA> or immediately preceding </PARA>, it makes it easier for someone reading the text as well.
(Of course, do not let the index tags stray too far from the text being indexed, or the page numbers can end up inaccurate.) For example, with ranges, the starting index tag can appear immediately after a <PARA> tag, and the ending index tag can appear immediately before a </PARA> tag.

Special case: Long strings of formatted text

Be careful around other text-formatting tags. Sometimes long sections of text are marked with far-apart tags. Accidentally inserting an index tag within other tags will reset the format. Remember that text-formatting tags do not reset at line breaks, so sometimes the space between a start tag and an end tag can run more than one line.

Special case: Section-long ranges

In general, it is okay to place an index tag immediately before an end section tag, just as an index tag might appear just before a </PARA> tag. (That is, an index tag can appear immediately before the tags </SECT1>, </SECT2>, </SECT3>, </SECT4>, and so on.) Although most sections end with both a </PARA> tag and the end section tag, it often makes sense to put the index tag just inside the </SECT#> tag to signify that the whole section is being indexed, not just some of the text.

In other circumstances, of course, the section ends with code or a figure or a table, and there is no </PARA> tag at the end. That's okay. Simply put the index tag immediately before the </SECT#> tag.

However, the DocBook DTD does not allow index tags to show up between two section end tags, such as between </SECT2> and </SECT1> tags. In this case, the index tags must precede the earliest of the consecutive section end tags. For example, if three levels of sections are ending at the same place, such that the tags </SECT3>, </SECT2>, and </SECT1> appear in a row, all index tags must come before the </SECT3> tag. In fact, if the tags </PARA>, </SECT3>, </SECT2>, and </SECT1> appear in a row, the index tags can go in front of the </PARA> tag.
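
As a rough illustration of the section-long range rules above, here is a minimal DocBook sketch. The section titles, the text, the listing contents, and the ID value "ix-printing" are all made up for this example, and it assumes your version of the DTD accepts an end-of-range <INDEXTERM> with no content. The range opens right after a <PARA> tag and closes before the first of three consecutive section end tags, since nothing may sit between them.

```sgml
<SECT1><TITLE>Printing</TITLE>
<!-- The range opens immediately after the <PARA> tag. -->
<PARA><INDEXTERM ID="ix-printing" CLASS="STARTOFRANGE"><PRIMARY>printing</PRIMARY></INDEXTERM>Printing starts from the File menu.</PARA>
<SECT2><TITLE>Duplex printing</TITLE>
<SECT3><TITLE>Binding edge</TITLE>
<PARA>Choose the binding edge first.</PARA>
<!-- Hypothetical listing: this section ends with code, not a paragraph. -->
<PROGRAMLISTING>print --duplex --edge=long</PROGRAMLISTING>
<!-- The range closes here, before the earliest of the consecutive
     end tags; it cannot go between </SECT3> and </SECT2>. -->
<INDEXTERM STARTREF="ix-printing" CLASS="ENDOFRANGE"></INDEXTERM>
</SECT3>
</SECT2>
</SECT1>
```

Had the last section ended with a paragraph instead of a listing, the closing index tag could just as well sit immediately before that final </PARA> tag.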
Special case: First paragraph of the chapter

Do not put an index tag immediately following the first <PARA> tag of a chapter if that tag immediately precedes the first text letter of that chapter. In a chapter that uses "drop caps" (i.e., a calligraphic or excessively large first letter), the design specifications for that drop cap will be "reset" if an index tag appears first. In this situation, put the index tags at the end of that first paragraph.

Special case: Footnotes

Footnote text has a special format. If you must index text within a footnote, and the footnote does not run over six lines or so, put the index tag at the end of the footnote, immediately before the closing </FOOTNOTE> tag. If the footnote starts to run long ("six lines" is an arbitrary figure, but footnotes that long can, under special circumstances, end up starting on one page and ending on another), then insert the index tag within the text, one line after the start <PARA> tag, but not immediately following it. This way you are guaranteed that the page number will match the start of the footnote text on the off chance that the footnote ends up across two or more pages.

Special case: Sidebars

Although all text within a sidebar is (generally) between <PARA> tags, there may be other formatting issues at stake. Since it is rare that sidebars will end up on multiple pages, it is best not to put index tags immediately after the first <PARA> tag of the sidebar. Instead, if what you are indexing begins at (or exists only within) the sidebar's initial paragraph, put the tags immediately before the first </PARA> tag (at the end of that first paragraph).

Index tags and inline font tags

Index tags should not be inserted within other font style formatting tags (inline font tags), such as <EMPHASIS> or <STRONG> tag pairs. It's just as easy to put the index tag immediately before the start tag or immediately after the end tag, so don't put it inside the pair.
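
To make the inline font tag rule concrete, here is a small sketch with invented text and an invented index term. The first paragraph shows the placement to avoid; the second moves the same index tag to just before the <EMPHASIS> start tag, where the text that follows the index tag needs no special format carried across it.

```sgml
<!-- Avoid: the index tag sits inside the EMPHASIS pair, so the text
     after it may be reset to the default format. -->
<PARA>Save your work <EMPHASIS>before<INDEXTERM><PRIMARY>saving files</PRIMARY></INDEXTERM> you quit</EMPHASIS>.</PARA>

<!-- Prefer: the index tag comes immediately before the start tag. -->
<PARA>Save your work <INDEXTERM><PRIMARY>saving files</PRIMARY></INDEXTERM><EMPHASIS>before you quit</EMPHASIS>.</PARA>
```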
Index tags within titles

Don't put index tags within headings, since headings are some of the most automatically formatted text in the document. If you absolutely have to drop an index tag within a heading, it must go just before the closing tag. In other words, if the index tag has to go in between <TITLE> tags, it should show up immediately preceding the closing </TITLE> tag, and thus it should follow all title text. This applies to all <TITLE> tag pairs, regardless of what tags surround them.

A reminder: it is never a good idea to index the title. I would consider this only if what the title is titling is unindexable, such as a figure or sourced-in material. Both of these examples are mentioned below.

Index tags within tables

Tables are complicated enough without throwing in index tags. Worse still, there are rarely any <PARA> tags in a table. Instead, "table paragraphs" are set off by <ENTRY> and </ENTRY> tags.

It is okay to put index tags in between certain <ENTRY> tags, preferably immediately preceding the </ENTRY> tag (and not immediately after the <ENTRY> tag). This is because the format of the table can be affected by the length of the text between the <ENTRY> tags, and an index tag can confuse the parser when calculating that length. Also, should there be special formatting information for a particular table entry or column or section, such as having every entry in the leftmost column in boldface type, an index tag will disturb that.

Do not put index tags anywhere between the <THEAD> and </THEAD> tags. The information here is crucial to the table's format. An index tag in here might cause the table not to print entirely.

In short, the only acceptable place for index tags in a table is between <ENTRY> and </ENTRY> tags that are not themselves between the <THEAD> tags.

Index tags within code

Do not put index tags within code. Doing so is just asking for the code to lose its formatting. Instead, place the index tag either immediately before the code, or immediately after.
(After is safer, but before is better for getting accurate page numbers.) It is possible, however, to put an index tag immediately before a closing </PROGRAMLISTING> tag if absolutely necessary. This is really only useful in a construct such as <PROGRAMLISTING><ULINK url="xxxxxxx"></ULINK></PROGRAMLISTING>, since the index tag should remain outside the <ULINK> tag pair.

Index tags and sourced-in material

Because it's impossible to know by looking what sourced-in material will look like, and because links can be fragile, it's safer to keep index tags away from any sourced-in material. Therefore, do not put index tags between <ULINK> and </ULINK> tags. (See the above paragraph concerning <ULINK> tags within <PROGRAMLISTING> tags.) If you have to, you can put the tag into the title, but only immediately preceding the </TITLE> tag and following any title text. See Index tags within titles.

Note that most figures qualify as sourced-in material. See Index tags and figures.

Index tags within lists

Lists often have <PARA> tags, so index tags can usually be inserted into lists. It is recommended to put the tags immediately before the closing </PARA> tags in case certain list items are given special formats.

Do not put index tags in lists anywhere other than in between <PARA> and </PARA> tags. For example, you will get an error if you put an index tag immediately after <VARIABLELIST> or immediately before </VARIABLELIST>.

Index tags and figures

Do not put index tags within <FIGURE> or <GRAPHIC> tags under any circumstances. If you absolutely do need to index a figure, it is possible to place the index tag immediately before the <FIGURE> (<GRAPHIC>) tag, or you can index the caption (title). If you are going to put an index tag in between <CAPTION> and </CAPTION> (<TITLE> and </TITLE>) tags, it must go immediately preceding the closing tag. (See Index tags within titles.) Note that even this can be dangerous.
It is better to index the text than the figure, especially because figures are usually explained in the text. (Whoever said a picture is worth a thousand words never tried indexing a picture.) Of course, if the figure holds information that is not anywhere else in the text, you still have to index the figure.

Figures and other graphics are often sourced in to the document. See Index tags and sourced-in material.

Index tags and glossaries

Surprisingly, there are many places within a glossary where you can get away with dropping index tags. However, the best place is between the <PARA> and </PARA> tags within the <GLOSSDEF> tags. If for some reason you can't do this (although I can't conceive of a good reason), it is acceptable to put the tags in any of the following less recommended places: immediately before the </TITLE> tag, immediately after the </TITLE> tag but before the <GLOSSTERM> tag, or immediately before the </GLOSSTERM> tag.

Some glossaries may have paragraphs within them, perhaps to separate items in the glossary by category. As long as these paragraphs have <PARA> tags in them, it's okay to drop index tags into them.

Copyright 1999 Seth A. Maislin
