Chapter 56

Indexing with CGI

by Jeffry Dwight


If you could see my desk, you would realize two things instantly: First, I'm highly computerized; second, I'm highly disorganized. I usually have two or three keyboards within reach and often type on more than one at a time. I'm surrounded by monitors. Modems, tape drives, CD-ROM drives, routers, hubs, printers, switches, mice, disks, pagers, and my coffee cup all vie for space on the desk.

Yet everywhere you look, you'll find bits of paper. I work within a growing mountain of flattened wood pulp. Anything will do: as long as it has a blank area, I'll write something on it.

Ask me two days after I write down your birthday, and I can reach into a pile, pull forth a crumpled napkin, turn it inside out, and triumphantly report success. The fact that I wrote it using a felt pen, at an angle across the only free space left on the napkin, and that the ink had spread throughout the fibers, making everything else on the napkin illegible, is beside the point. Out of thousands of pieces of paper, I was able to retrieve exactly the right one within moments. I remembered the napkin, the felt-tip pen, the part of the napkin I'd written on, and even the pile I'd put the napkin in. Pretty impressive, even if it had been simpler to just remember your birthday.

Ask me two weeks later rather than two days later and I'll scratch my head, narrow my eyes, and rummage through all the napkins, then all the papers with felt-tip pen marks, until I find your birthday. It might not be an instantaneous retrieval, but it's good enough. I've remembered two key items about your birthday-napkins and felt tips. If I couldn't find it using the first key, I'd use the second. At worst, I could iterate through all the papers matching either key, with the highest probability ranking given to those matching both keys.

But ask me two months afterward, and I'll say, "Who are you again?" or just pretend I didn't hear your question.

Clearly, my system is inadequate. Just as clearly, the solution is sitting on the same desk with all that confetti. The World Wide Web has the same problem, the same kind of system, and the same solution.

The Perfect Secretary

To solve my paperwork problem, I need a system to organize information, sort it by keyword, topic, concept, or phrase; order it using some hierarchical scheme; and then correlate it with everything else. I can imagine the world's best secretary-someone who would come into my office without ever disturbing me, take all of those bits of paper away, file them appropriately, and be on call 24 hours a day to retrieve anything instantly.

I might hit the intercom button and say, "I need to know the birthday of John, um, somebody-I forget his last name, but he's a member of the Elks-maybe the Moose-and he was in here sometime last week, or the week before. It might have been last month, but it was definitely after I had my wisdom teeth out."

"You mean John Peterson, 5'10", brown eyes, black hair, born on 25 December 1974? Married to your daughter?"

"Yeah, that's the one! Thanks!"

Well, maybe any good secretary could have answered that particular question, but only a robot could solve the general case. Fortunately, the Web has robots-called spiders, Web crawlers, or worms-that are on the job 24 hours a day, 365 days a year. They do nothing but wander around, picking up stray bits of paper, reading and cataloging whatever they find. They store the results of their searches in huge databases, which anyone can browse.

But, just as I'll never find the perfect secretary, you'll never find the perfect search engine. Each one has its strengths and weaknesses, its admirers and detractors. What you can do, however, is build a team of secretaries, each one doing the particular task that he or she does best.

The WAIS and Means Committee

WAIS (pronounced ways) stands for Wide Area Information Servers, a popular full-text indexing and retrieval system. Full-text means that each word in each scanned document becomes part of the index. Listings 56.1, 56.2, and 56.3 show three files that might be included in a WAIS index.


Listing 56.1  Holidays.txt-Sample Text File #1
Holiday Schedule
New Year's Day, Monday, January 2.
Memorial Day, Monday 29 May
July 4th, Independence Day, Thursday


Listing 56.2  Birthdays.txt-Sample Text File #2
John, Jan 17 (Thursday this year)
Mary, May 29


Listing 56.3  Taxes.txt-Sample Text File #3
Fiscal year ends 31 December
Expect big write-off in May or June
Estimates due July 1

These three files roughly correspond to things that I might have scrawled on slips of paper here and there. My wonderful secretary, Mr. Ways, has swept through the room, cleaned up all the papers, and organized the information for me. Mr. Ways keeps a careful catalog of everything that he finds, and can examine it on demand.

Suppose I ask Mr. Ways for anything with the word "Jan." Mr. Ways would instantly hand me Holidays.txt (which has "January") and Birthdays.txt (which has "Jan"). He wouldn't give me Taxes.txt because "Jan" doesn't appear anywhere in it. If I ask for "May," I'll get back all three files, because all three contain the word "May."

If I ask for anything containing either "February" or "tax," Mr. Ways will return Taxes.txt. Even though none of the files contains the word "February," the Taxes.txt file contains the word "tax" as part of the title. This satisfies my request for either the first word or the second word. This kind of search is called a Boolean OR.

If I ask Mr. Ways to search for both "May" and "29," he will hand me Birthdays.txt and Holidays.txt, at which point I'll find out that Mary's birthday is on Memorial Day this year. Taxes.txt contains the word "May" but not the number "29," so the file fails the "find files with the first term and the second term" test. This kind of search is called a Boolean AND.

I can stretch Mr. Ways a bit by asking him to produce only files that have both "May" and "29," but not "Mary." A Boolean expression might state this search as follows:

((May AND 29) AND (NOT Mary))

This search first finds files matching the first term (it must have both "May" and "29"), then excludes files having "Mary," leaving only Holidays.txt as the result. Suppose the search expression had been

((May AND 29) OR (NOT Mary))

Mr. Ways would have given me all three files under this search expression. The Holidays.txt file is included because it has both "May" and "29"; the Birthdays.txt file is included for the same reason; and the Taxes.txt file shows up because it doesn't have "Mary."
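
The searches above can be sketched in a few lines of Python. This is only an illustration of Boolean logic over the three sample files, not how WAIS itself is implemented. The has() helper is a hypothetical name, and I'm assuming a case-insensitive substring match against both the file name and the text, which is what lets "Jan" find "January" and "tax" find Taxes.txt.

```python
# The three sample files from Listings 56.1 through 56.3.
FILES = {
    "Holidays.txt": (
        "Holiday Schedule\n"
        "New Year's Day, Monday, January 2.\n"
        "Memorial Day, Monday 29 May\n"
        "July 4th, Independence Day, Thursday\n"
    ),
    "Birthdays.txt": "John, Jan 17 (Thursday this year)\nMary, May 29\n",
    "Taxes.txt": (
        "Fiscal year ends 31 December\n"
        "Expect big write-off in May or June\n"
        "Estimates due July 1\n"
    ),
}

ALL = set(FILES)


def has(term):
    """Return the set of files whose name or text contains the term.

    Assumption: matching is a case-insensitive substring test.
    """
    term = term.lower()
    return {
        name
        for name, text in FILES.items()
        if term in name.lower() or term in text.lower()
    }


# Boolean AND is set intersection, OR is union, NOT is the complement.
may_and_29 = has("May") & has("29")
and_not_mary = may_and_29 & (ALL - has("Mary"))
or_not_mary = may_and_29 | (ALL - has("Mary"))

print(sorted(has("Jan")))    # ['Birthdays.txt', 'Holidays.txt']
print(sorted(and_not_mary))  # ['Holidays.txt']
print(sorted(or_not_mary))   # ['Birthdays.txt', 'Holidays.txt', 'Taxes.txt']
```

Treating each term as a set of matching files keeps the expressions close to the way you would read them aloud, which is why I chose sets here rather than looping over the files for every query.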

A full-text index is obviously very powerful. Even in this limited example, you can clearly see the usefulness and flexibility of this kind of tool. Yet, in a large database of files, thousands might include the word "May." If the database includes source code files, there might be hundreds of thousands of references to "29." Wouldn't it be nice to find only dates that look like birthdays, or the word May, but only if it's near the word 29, and not in any source code files?

Fuzzy search engines go one step beyond Mr. Ways and give you the means to do more.

Warm Fuzzies

A fuzzy search is one that doesn't rely on exact matches. It is not based on Boolean algebra, with its mixture of AND, OR, and NOT operators, although these might come into play if appropriate. Instead, it tries to identify concepts and patterns, and deal with information rather than data.

Feel the Heat
Information is data that's been assigned meaning by a human. In a simple example, "It's 98 degrees" is data, whereas "It's hot" is information. As the amount of data on the Internet grows, the importance of distinguishing information from data skyrockets.
The ultimate artificial-intelligence machine would have a DWIM, or "Do What I Mean" command. Putting data in context with other data is one way to derive information. Human language abounds with contextual references and implied scopes.
For instance, when you say "It's hot," you probably don't mean "Somewhere in the world, the temperature is such that someone might refer to it as hot," or "The global distribution of thermal energy across the planet's surface gives rise to local anomalies, with perception of the relative differences being expressed by the relevant indigenous populations as either 'hot' or 'cold,' and the area to which I now refer is one of the former."
You mean that you're feeling hot right now, regardless of the actual temperature.
The context and scope of your original statement is implied; the concomitant associations derive from the context, your knowledge of human behavior in general, and your behavior in particular.
If I searched the Internet for "Hot Babes" (not that I would ever do so), I would be disappointed if I got back pointers to the National Weather Service's reports mingled with articles about infant care. How can search engines figure out what kind of "hot" I mean? Can DWIM ever be achieved?
This question is a hot topic, the basis for an ongoing and bitter debate among philologists, linguists, artificial-intelligence theorists, and natural language programmers. There are almost as many sides as there are participants in the debate, and no one view clearly outstrips the rest. If you are interested in this sort of debate, check out the comp.ai.fuzzy newsgroup on Usenet, or stop by your local library or favorite online search engine and find references to AI and natural language.

Much of the following material is adapted from Rod Clark's excellent discussion in Special Edition Using CGI (Que Corporation, 1996).

Suppose that a friend mentions a reference to "dogs romping in a field." It could be that what he actually saw, months ago, was the phrase "while three collies merrily romped in an open field." In a very literal search system, searching for "dogs romping" would turn up nothing at all. "Dogs" are not "collies," and "romping" is not "romped." But the query "romp field" might yield the exact reference if the search tool understands substrings. A substring is just part of a string, but figuring out which part is meaningful isn't easy.
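
To make the difference concrete, here is a small sketch that compares a literal whole-word match with a substring match on that phrase. The function names are my own, and real search tools are considerably more elaborate; this only shows why "romp field" succeeds where "dogs romping" fails.

```python
TEXT = "while three collies merrily romped in an open field"


def literal_match(query, text):
    """True only if every query word appears as a whole word in the text."""
    words = text.lower().split()
    return all(term in words for term in query.lower().split())


def substring_match(query, text):
    """True if every query term appears anywhere in the text,
    even inside a longer word."""
    lowered = text.lower()
    return all(term in lowered for term in query.lower().split())


print(literal_match("dogs romping", TEXT))    # False
print(literal_match("romp field", TEXT))      # False: "romp" is not a whole word
print(substring_match("romp field", TEXT))    # True: "romp" is inside "romped"
print(substring_match("dogs romping", TEXT))  # False: "dogs" appears nowhere
```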

People think and remember in imprecise terms. But conventional query syntax follows very precise rules, even for simple queries.

Concept-based engines can effectively find related information, even in files that don't contain any of the words that the user specified in a search query. These tools are particularly helpful for large collections of existing documents that were never designed to be searched.

Casual users seldom use the more advanced syntax that sophisticated search tools offer. Concept-based searching offers such users a broad, reasonable search by default. This is much easier for people than phrasing several specific queries and conducting multiple searches for them.

Concept-based search tools might combine several different searching techniques, some of which are described in the following sections. The most general of those techniques is pattern matching, which is used to find similar files.

Thesauri  One way to broaden the reach of a search is to use a thesaurus, a separate file that links large numbers of words with lists of their common equivalents. Most thesauri let you add special words and terms, either linked to a dictionary or directly to synonyms. A thesaurus-based search engine automatically looks up words related to the terms in your submitted query and then searches for those related words. For example, if you publish several technical briefs on cellular mitosis, a thesaurus-based search engine would show your articles under biology and physiology as well as cytology.
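
A thesaurus lookup of this kind might be sketched as follows. The THESAURUS table, the sample documents, and the function names are all invented for illustration; a real engine would load a much larger thesaurus from a separate file.

```python
# Hypothetical thesaurus: each word maps to a list of related terms.
THESAURUS = {
    "biology": ["cytology", "physiology", "mitosis"],
    "cytology": ["biology", "mitosis"],
    "physiology": ["biology", "mitosis"],
}

# Hypothetical document collection: title -> text.
DOCS = {
    "Brief 1": "Stages of cellular mitosis in onion root tips",
    "Brief 2": "Quarterly sales figures",
}


def expand_query(terms):
    """Return the original terms plus any related words, without duplicates."""
    expanded = []
    for term in terms:
        key = term.lower()
        for word in [key] + THESAURUS.get(key, []):
            if word not in expanded:
                expanded.append(word)
    return expanded


def search(terms, documents):
    """Return the titles of documents containing any expanded term."""
    wanted = expand_query(terms)
    return [
        title
        for title, text in documents.items()
        if any(word in text.lower() for word in wanted)
    ]


print(expand_query(["biology"]))  # ['biology', 'cytology', 'physiology', 'mitosis']
print(search(["biology"], DOCS))  # ['Brief 1']
```

The brief never mentions the word "biology," yet a search for it still finds the article because the query was widened before the search ran.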

Stemming  Some search engines, but by no means all, offer stemming. Stemming is trimming a word to its root and then looking for other words that match the same root. For example, "wallpaper" has "wall" as its root word. So does "wallboard," which the user might never have entered as a separate query. A stemmed search might serve up unwanted additional references to "wallflower," "wallbanger," "Wally," and "walled city," but would also catch "wall" and "wallboard" when the user entered "wallpaper," and probably provide useful information that way.

Stemming has at least two advantages over plain substring searching. First, it doesn't require the user to mentally determine and then manually enter the root words. Second, it allows assigning higher relevance scores to results that exactly match the entered query and lower relevance scores to the other stemmed variants.
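
Here is a rough sketch of stemming with relevance scores, following the description above. The ROOTS table is a stand-in I made up; real stemmers derive roots with language-specific rules rather than a lookup table, and the scores 1.0 and 0.5 are arbitrary values chosen only to show the ranking.

```python
# Hypothetical root table standing in for real, language-specific rules.
ROOTS = {"wallpaper": "wall", "wallboard": "wall"}


def stemmed_search(query, words):
    """Return (word, score) pairs: 1.0 for an exact match with the query,
    0.5 for any other word that begins with the query's root."""
    query = query.lower()
    root = ROOTS.get(query, query)
    hits = []
    for word in words:
        lowered = word.lower()
        if lowered == query:
            hits.append((word, 1.0))
        elif lowered.startswith(root):
            hits.append((word, 0.5))
    # Highest score first, then alphabetical order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0].lower()))


WORDS = ["Wally", "wallboard", "window", "wallpaper", "wall"]
for word, score in stemmed_search("wallpaper", WORDS):
    print(word, score)
```

The exact match, "wallpaper," comes back first, followed by "wall," "wallboard," and the unwanted "Wally." That mix of useful and spurious hits is the trade-off described above.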

But stemming is language-specific. Human languages are complex, and a search program can't simply trim English suffixes from words in another language.

Finding Similar Documents  Several newer search engines concentrate on some more general techniques that are not language-based. Some of these tools can analyze a file, even if it's in an unknown language or file format, and then search for similar files. The key to this kind of search is matching patterns within the files instead of matching the contents of the files.

Building specific language rules into a search engine is difficult. What happens when the program encounters documents in a language that it hasn't seen before, for which the programmers haven't included any language rules? There are people who've spent their whole adult lives formally recording the mathematics of the rules for using English and other languages-and they still aren't finished. In our daily experience, we hardly think of those rules because we've learned them in our everyday human way-by drawing conclusions from comparing and summing up a great many unconscious, unarticulated, pattern-matching events.

Even if you don't know or can't explain the rules for constructing the patterns that you see-whether those patterns are in human language, graphics, or binary code-you can still rank them for similarity. "Yes, this one matches." Or "No, that one doesn't. This one is very similar, but not exact. This one matches a little. This one is more exact than that one." To analyze files for content similarity, nearness, and other such qualities, some of the newer search engines look for patterns. Such engines use fuzzy logic and a variety of weighting schemes.

The theory behind sophisticated pattern analysis is far beyond the scope of this book. A good explanation of just the algorithms, sans theory, would cover several chapters. However, you should be aware that these techniques exist, and that some of the indexing engines you'll encounter use crude variants of these techniques to enhance their searching power.

Leveraging Commercial Indexes

Fortunately, you don't have to be a natural-language or artificial-intelligence expert to incorporate indexing into your home page or Web site. Many fine public search engines are available. You link to some and install others. In this section, you will learn about some of the more common commercial indexes and how you can use them.

Public indexes are just that: public. They are available for free through sponsoring corporations, groups, or individuals. These public indexes are accessed from an HTML form and usually consist of a paired set of Web-crawling robots to collect data and a CGI program to search the index.

You don't have to rely on a list of bookmarks or your browser's setting for a search page. You can make a page of your own that links directly to your favorite search engines. You can even tailor the form so that it comes preloaded with specific search terms. Listing 56.4 shows a generic form for invoking AltaVista's gigantic search engine. Listing 56.5 shows a modification to restrict the search to one of several predefined terms.

AltaVista

AltaVista provides a helpful index of Web sites and newsgroups. You can find AltaVista at the following address:

http://www.altavista.digital.com

Figure 56.1 shows the AltaVista Web page.

Figure 56.1 : The AltaVista Web page.

Notice how Listing 56.5 takes the same form fields defined in Listing 56.4 and hard-codes some of them. The result is that Listing 56.5 always searches the newsgroups, and only for CGI by Example, Using CGI, or Que Corporation.

NOTE
HTML examples are not provided for the other sites. The concept is the same for each site: you take the HTML used by the site itself to invoke its CGI script, and then modify the HTML to suit your needs.


Listing 56.4  A Generic AltaVista Search Form
<H1>Search Alta Vista</H1>
<FORM METHOD=get
      ACTION="http://www.altavista.digital.com/cgi-bin/query">
<INPUT TYPE=hidden name=pg value=q>
<B>Search
<SELECT name=what>
<OPTION value=web  SELECTED>the Web
<OPTION value=news >Usenet
</SELECT>
and Display the Results
<SELECT name=fmt>
<OPTION value="." SELECTED>in Standard Form
<OPTION value=c >in Compact Form
<OPTION value=d >in Detailed Form
</SELECT></B>
<input TYPE=text name=q size=55 maxlength=200 value="">
<INPUT TYPE=submit value=Submit>
<BR>
</FORM>


Listing 56.5  A Customized Alta Vista Search Form
<H1>Search Alta Vista</H1>
<FORM METHOD=get
      ACTION="http://www.altavista.digital.com/cgi-bin/query">
<INPUT TYPE=hidden name=pg value=q>
<INPUT TYPE=hidden name=what value=news>
<INPUT TYPE=hidden name=fmt value=d>
<B>Search Newsgroups for</B>
<SELECT name=q>
<OPTION>CGI by Example
<OPTION>Using CGI
<OPTION>Que Corporation
</SELECT><BR>
<INPUT TYPE=submit value=Submit>
<BR>
</FORM>
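
Because both forms use METHOD=get, the browser simply appends the field names and values to the ACTION URL as a query string. The following sketch builds the URL that Listing 56.5 would request for a given search term. The search_url() helper is my own name, and I'm assuming the script accepts the fields in the order the form lists them.

```python
from urllib.parse import urlencode

ACTION = "http://www.altavista.digital.com/cgi-bin/query"


def search_url(term):
    """Build the GET URL that the form in Listing 56.5 would submit."""
    fields = [
        ("pg", "q"),       # hidden field
        ("what", "news"),  # hidden field: search the newsgroups
        ("fmt", "d"),      # hidden field: detailed results
        ("q", term),       # the option chosen from the SELECT list
    ]
    return ACTION + "?" + urlencode(fields)


print(search_url("Using CGI"))
# http://www.altavista.digital.com/cgi-bin/query?pg=q&what=news&fmt=d&q=Using+CGI
```

The submit button has no NAME attribute, so it contributes nothing to the query string. A URL built this way can also be used directly as an ordinary hyperlink, with no form at all.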

Infoseek

Infoseek is one of my favorite search engines because it is fast, usually up, and processes search terms in ways that make sense to me. Figure 56.2 shows Infoseek's Web page.

Figure 56.2 : Infoseek's Web site.

One nice touch is that you don't have to write any HTML at all if you want to include a link to Infoseek's search engine on one of your pages. Simply send a blank e-mail message to html@infoseek.com, and 5 to 10 minutes later, you'll receive HTML ready to plug into any of your pages.

You can find Infoseek at the following site:

http://www.infoseek.com

Lycos

Lycos also provides HTML by e-mail. Figure 56.3 shows the Lycos Web page.

Figure 56.3 : The Lycos Web site.

Stop by http://www.lycos.com/backlink.html and fill out the online form. Within a day or so, you'll get back some sample HTML. The backlink service from Lycos enables you to incorporate your own company logo or custom graphics so that visitors see a nicely integrated package.

Lycos is available at the following site:

http://www.lycos.com

Starting Point

For many users, Starting Point is the starting point when they conduct Web searches. Figure 56.4 shows the Starting Point Web page.

Figure 56.4 : The Starting Point Web page.

When you visit Starting Point, you can add a link for your site. Starting Point responds in e-mail with suggested HTML for linking your site with the Starting Point site.

You can find Starting Point at the following site:

http://www.stpt.com

Excite

Excite is more than a public index. Excite makes search engines that you can install on your own system, and is working closely with Web server companies to provide integrated solutions. Figure 56.5 shows the Excite Web site, which you can find at the following address:

http://www.excite.com

Figure 56.5 : The Excite Web site.

Indexing Your Own Site

So far, you have learned the theory behind site indexing, and have seen some of the large commercial search engines at work. In this section, you study several of the smaller indexers and search engines-ones more appropriate for a single site.

The code examples, and much of the supporting text in this section, are adapted from Rod Clark's excellent discussion in Special Edition Using CGI (Que Corporation, 1996).

Keywords

Before you start studying indexing programs and individual search engines, you need to examine the kinds of information that you can provide for the indexers to index. The examples in this section are drawn from Rod Clark's discussion of indexing in Chapter 11 of Special Edition Using CGI.

Adding keywords to files is particularly important when using simple search tools, many of which are very literal. These tools need all the help they can get.

Manually adding keywords to existing files is a slow and tedious process. Doing so isn't particularly practical when you are faced with a blizzard of seldom-read archival documents. However, when you first create new documents that you know people will search online, you can stamp them with an appropriate set of keywords. This stamping (or keying) provides a consistent set of words that people can use to search for the material in related texts, in case the exact wording in each text doesn't happen to include some of the relevant general keywords. Using equivalent non-technical terminology that users are likely to understand also helps.

Sophisticated search engines can yield good results when searching documents with little or no intentional keying. But well-keyed files produce better and more focused results with these search tools. Even the best search engines, when they set out to catch all the random, scattered, unkeyed documents that you want to find, return information that's liberally diluted with noise-irrelevant data. Keying your files helps keep them from being missed in relevant lists for closely related topics.

Keywords in Plain Text  To help find HTML pages, you can add an inconspicuous line at the bottom of each page that lists the keywords you want, like this:

Poland Czechoslovakia Czech Republic Slovakia Hungary Romania 
Rumania

This line is useful, but ugly and distracting. Also, many search engines assign a higher relevance to words in titles, headings, emphasized text, <A NAME=...> tags and other areas that stand out from a document's body. The next few sections consider how to key your files in more sophisticated and effective ways.

Keywords in HTML META Tags  You can put more information than simply the page title in an HTML page's <HEAD>...</HEAD> section. Specifically, you can include a standard keywords list in a META tag. The keywords and expires values are officially defined components of HTTP headers, which is why the tag includes HTTP-EQUIV as part of the statement.

People sometimes use META tags for other, nonstandard information. But search engines often pay particular attention to a META Keywords list. Here's an example of using the META Keywords tag within an HTML header:

<HEAD>
<META HTTP-EQUIV="Keywords" CONTENT="George, Jungle">
<TITLE>George's Jungle Page</TITLE>
</HEAD>
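
To see how an indexer might pick up such a list, here is a minimal sketch using Python's standard html.parser module. The KeywordParser class is hypothetical, and I'm assuming the keywords are separated by commas, as in the example above. It also accepts the NAME form of the tag, which some pages use instead of HTTP-EQUIV.

```python
from html.parser import HTMLParser

PAGE = """<HEAD>
<META HTTP-EQUIV="Keywords" CONTENT="George, Jungle">
<TITLE>George's Jungle Page</TITLE>
</HEAD>"""


class KeywordParser(HTMLParser):
    """Collect the words listed in a META Keywords tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        label = attributes.get("http-equiv") or attributes.get("name") or ""
        if label.lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]


parser = KeywordParser()
parser.feed(PAGE)
print(parser.keywords)  # ['George', 'Jungle']
```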

Keywords in HTML Comments  This section presents some lines from an HTML file that lists links to English language newspapers. (These are just examples, not links to real places.) The lines aren't keyed; therefore, to find a match, you have to enter a query that exactly matches something in either a particular line's URL or its visible text. Such matches are not too likely with some of these example lines. Only one of them comes up in a search for "Sri Lanka." None of them comes up in a search for "South Asia," which is the section head just above them in the source file.

<B><A HREF="http://www.lanka.net/lakehouse/anclweb/dailynew/select.html">Sri Lanka Daily News</A></B><BR>

<B><A HREF="http://www.is.lk/is/times/index.html">Sunday Times</A></B><BR>

<B><A HREF="http://www.is.lk/is/island/index.html">Sunday Island</A></B><BR>

<B><A HREF="http://www.powertech.no/~jeyaramk/insrep/">Inside Report: Tamil Eelam News Review</A></B><I> - monthly</I><BR>

To improve the search results, you can key each line with one or more likely keywords. The keywords can be contained within <!--comments -->, in <A NAME=...> statements, or in ordinary visible text. Some of these approaches are more successful than others. The following are examples of each.

First, add some keywords as HTML comments on each line. The following example already looks better. Again, these are examples, not real URLs:

<!--South Asia Sri Lanka --><B><A HREF="http://www.lanka.net/lakehouse/anclweb/dailynew/select.html">Sri Lanka Daily News</A></B><BR>

<!--South Asia Sri Lanka --><B><A HREF="http://www.is.lk/is/times/index.html">Sunday Times</A></B><BR>

<!--South Asia Sri Lanka --><B><A HREF="http://www.is.lk/is/island/index.html">Sunday Island</A></B><BR>

<!--South Asia Sri Lanka --><B><A HREF="http://www.powertech.no/~jeyaramk/insrep/">Inside Report: Tamil Eelam News Review</A></B><I> - monthly</I><BR>
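
The effect of keying is easy to check with a very literal search, the kind that simple line-oriented scripts perform. In this sketch, literal_search() is a hypothetical stand-in for such a script; it returns every line that contains the query text, ignoring case.

```python
UNKEYED = [
    '<B><A HREF="http://www.lanka.net/lakehouse/anclweb/dailynew/select.html">'
    'Sri Lanka Daily News</A></B><BR>',
    '<B><A HREF="http://www.is.lk/is/times/index.html">Sunday Times</A></B><BR>',
    '<B><A HREF="http://www.is.lk/is/island/index.html">Sunday Island</A></B><BR>',
    '<B><A HREF="http://www.powertech.no/~jeyaramk/insrep/">'
    'Inside Report: Tamil Eelam News Review</A></B><I> - monthly</I><BR>',
]

# The same lines, each stamped with keywords in an HTML comment.
KEYED = ["<!--South Asia Sri Lanka -->" + line for line in UNKEYED]


def literal_search(query, lines):
    """Return the lines that contain the query text, ignoring case."""
    query = query.lower()
    return [line for line in lines if query in line.lower()]


print(len(literal_search("Sri Lanka", UNKEYED)))   # 1
print(len(literal_search("South Asia", UNKEYED)))  # 0
print(len(literal_search("Sri Lanka", KEYED)))     # 4
print(len(literal_search("South Asia", KEYED)))    # 4
```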

You could put the keywords in <A NAME=...> statements, too, but HTML prohibits spaces in <A NAME=...> statements. Therefore, keys in an <A NAME=...> statement are limited to single keywords, rather than phrases. This might suffice if you can always be sure of using an AND or OR search instead of searching for exact phrases. But many scripts don't support Boolean operators, and, even when Booleans are allowed, most users don't use them. So, overall, using <A NAME=...> statements for keying isn't the best choice. Nevertheless, here is an example of using an <A NAME=...> statement to provide a keyword:

<A NAME="Tamil">

SWISH (Simple Web Indexing System for Humans)

SWISH is easy to set up and offers fast, reliable searching for Web sites. Kevin Hughes wrote the program in C for UNIX Web servers. SWISH is freeware, available from EIT at the following site:

http://www.eit.com/goodies/software/swish/swish.html

You can download SWISH's source code from EIT's FTP site and compile it on your own system:

http://www.eit.com/software/swish/

Installing SWISH is straightforward. After uncompressing and untarring the source files, you edit the src/config.h file and compile SWISH for your system.

Configuring SWISH isn't very hard, either. You set up a configuration file, swish.conf, which the indexer uses. Listing 56.6 shows a sample SWISH configuration file.


Listing 56.6  swish.conf-A Sample SWISH Configuration File
# SWISH configuration file

IndexDir /home/rclark/public_html/
# This is a space-separated list of files and directories you 
# want indexed. You can specify more than one of these directives.

IndexFile index.swish
# This is what the generated index file will be.

IndexName "Index of Small Hours files"
IndexDescription "General index of the Small Hours Web site"
IndexPointer "http://www.aa.net/~rclark/"
IndexAdmin "Rod Clark (rclark@aa.net)"
# Extra information you can include in the index file.

IndexOnly .html .txt .gif .xbm .jpg
# Only files with these suffixes will be indexed.

IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.

FollowSymLinks yes
# Put "yes" to follow symbolic links in indexing, else "no".

NoContents .gif .xbm .jpg
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.

ReplaceRules replace "/home/rclark/public_html/" "http://www.aa.net/~rclark/"
# ReplaceRules allow you to make changes to file path names
# before they're indexed.

FileRules pathname contains test newsmap
FileRules filename is index.html rename chk lst bit
FileRules filename contains ~ .bak .orig .000 .001 .old old. .map .cgi .bit .test test log- .log
FileRules title contains test Test
FileRules directory contains .htaccess
# Files matching the above criteria will *not* be indexed.

IgnoreLimit 80 50
# This automatically omits words that appear too often in the files
# (these words are called stopwords). Specify a whole percentage
# and a number, such as "80 256". This omits words that occur in
# over 80% of the files and appear in over 256 files. Comment out
# to turn off autostopwording.

IgnoreWords SwishDefault

# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words should be separated 
# by spaces and may span multiple directives.
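
The IgnoreLimit rule in Listing 56.6 is worth a closer look, because both conditions must hold before a word is dropped. The sketch below follows the wording of the comment in the listing; it is my reading of that comment, not SWISH's actual source code, and auto_stopwords() is a name I made up.

```python
def auto_stopwords(documents, percent, count):
    """Return words that occur in more than `percent` percent of the
    documents AND in more than `count` documents."""
    doc_freq = {}
    for text in documents:
        # Count each word once per document, however often it repeats.
        for word in set(text.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    total = len(documents)
    return {
        word
        for word, n in doc_freq.items()
        if n * 100 > percent * total and n > count
    }


DOCS = ["the web index", "the web page", "the web site", "the web form", "the cat"]

print(auto_stopwords(DOCS, 80, 3))  # {'the'}
print(auto_stopwords(DOCS, 80, 5))  # set()
```

In the first call, "the" appears in all five files, which is more than 80 percent and more than three files, so it is dropped. "web" appears in exactly 80 percent of the files, which is not over the limit, so it stays. In the second call nothing is dropped, because no word appears in more than five files. The second number keeps a small site from losing words just because it has only a handful of pages.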

After you set up SWISH for your site, create the indexes by running SWISH from the command line:

swish -c swish.conf 

You can use cron to update the indexes regularly or just run the job manually when needed. Now that you have your indexes, you need some CGI to access them. You can use the WWWWAIS gateway, also available from EIT (http://www.eit.com/software/wwwwais/) or you can create your own script using the WWWWAIS gateway as your model. Figure 56.6 shows the results of a search using the WWWWAIS gateway.

Figure 56.6 : The results of a search using the WWWWAIS gateway against a SWISH index at EIT. Notice that the results are ranked in order of relevance and file size.

freeWAIS

Almost anytime you encounter a discussion of WAIS on the Internet, freeWAIS will also be mentioned. The term freeWAIS is fairly self-explanatory-it's a freeware version of WAIS. Much of the material in this section is adapted directly from Bill Schongar's comprehensive discussion of WAIS in Chapter 12 of Special Edition Using CGI (Que Corporation, 1996).

freeWAIS on UNIX  Most WAIS tools are still primarily designed for use on UNIX servers. These tools include the servers themselves, as well as the client scripts. So, it only makes sense that one of the most significant public extensions to original WAIS functions first appeared on UNIX servers. freeWAIS-SF, designed by the University of Dortmund, Germany, takes advantage of built-in document structures to make more sense out of queries. It even enables you to specify your own document types for its use.

In addition, freeWAIS-SF gives you more power to search the way you want to search. Wild cards, "sounds-like" searches, and more conditions for what does and doesn't match, are all components that make finding what you're looking for much easier. You no longer have to worry about whether the author wrote "Color" or "Colour," "Center" or "Centre."

Unlike many things that you use with your server, especially in the UNIX world, the freeWAIS-SF package is easy to install. A shell script leads you through the basic configuration by asking questions; when you finish answering the questions, you're finished installing freeWAIS-SF.

You can obtain the freeWAIS-SF package at the following site:

ftp://ftp.germany.eu.net/pub/infosystems/wais/Unido-LS6/freeWAIS-sf-2.0/freeWAIS-sf-2.0.65.tar.gz

If you want the original freeWAIS instead (which you can certainly use), you can get it from CNIDR. To get the main distribution directory, so that you can choose the appropriate build, visit the following site:

ftp://cnidr.org/pub/NIDR.tools/freewais/

Whichever freeWAIS build you download will be a tarred and gzipped file. Therefore, to unpack the build, you have to enter a command such as the following:

gunzip -c freeWAIS-0.X-whatever.tar.gz | tar xvf -

freeWAIS comes with its own longer set of installation instructions within the distribution, so double-check the latest information for the build that you obtain to make sure you don't skip any steps.

freeWAIS on Windows NT  A port of freeWAIS 0.3 is available for Windows NT from EMWAC (the European Microsoft Windows NT Academic Centre) in its WAIS Toolkit. EMWAC's current version of the toolkit is 0.7, but you should check with EMWAC before obtaining the toolkit to find out which version is the latest. Versions are available for all types of Windows NT: 386-based, Alpha, and PowerPC. You can obtain the toolkit from the following site:

ftp://emwac.ed.ac.uk/pub/waistool/

After you obtain the ZIP file, decompress it to retrieve the six files that comprise the distribution. Move them to an NTFS drive partition and then rename the file Waisindx.exe to Waisindex.exe.

If you plan to use the entire WAIS Toolkit with your server, put all three .Exe programs into the %SYSTEMROOT%\SYSTEM32 directory (usually C:\WINNT35\SYSTEM32).

TIP
If you are using UNIX, the WAIS program to query the WAIS indexes is called WAISQ. The query tool provided for Windows NT is called WAISLOOK. Keep this in mind when you see references to WAISQ, and simply substitute WAISLOOK if you are using Windows NT.

Building a WAIS Database  Now that you have the software installed and running, you're ready to make a database (a set of index files).

The WAISINDEX program looks through your files and creates an index that the WAIS query tool can use later. This index consists of seven distinct files that are either binary or plain text, as shown in Table 56.1.

Table 56.1  WAIS Index Database Files

File Extension  File Type  Purpose
.Cat            Text       A catalog of indexed files with a few lines of information about each one.
.Dct            Binary     A dictionary of indexed words.
.Doc            Binary     A document table.
.Fn             Binary     A file name table.
.Hl             Binary     A headline table, featuring the descriptive text used to identify documents that the search returns.
.Inv            Binary     An inverted file index.
.Src            Text       A structure for describing the source. The structure includes the creation date and other similar information.

The files with the extensions listed in Table 56.1 all share the same first name, as in Index.cat, Index.dct, Index.doc, and so on. You can name the first file anything you want, but if the file containing the HTML for the search form is called Index.html, then INDEX is what you should use for the database. If your HTML file is called Default.htm (as it would be using EMWAC's HTTP server), then DEFAULT is the correct first name for your database.
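
After you build an index, a quick way to confirm that WAISINDEX produced the whole set is to look for each of the seven files by name. The following shell function is only a sketch. The name check_wais_db is made up for this example, and it assumes that the extensions are lowercase on disk, which may not match your build.

```sh
#!/bin/sh
# check_wais_db BASE
# Print one line for each index file that is missing from the database
# whose files share the first name BASE. Returns 0 only when all seven
# files are present.
# Assumption: extensions are lowercase on disk (the table capitalizes them).
check_wais_db() {
    base=$1
    rc=0
    for ext in cat dct doc fn hl inv src; do
        if [ ! -f "$base.$ext" ]; then
            echo "missing: $base.$ext"
            rc=1
        fi
    done
    return "$rc"
}
```

For a database built as Data/database1, you might call it as check_wais_db Data/database1 and rebuild the index if it reports anything.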

TIP
Many Web servers have built-in support for WAIS databases and determine which files to look at by matching the first name of the HTML file with the first name of the database files. Therefore, naming your database files correctly is important if you expect the built-in support to function.

The command-line options that you use when executing WAISINDEX determine these database files' contents. Which options you want depends on your objective and the nature of the files that you want to index. The following is a simple command line to create an index:

waisindex -d Data\database1 Data\*.html

This command line uses only one option, the -d switch, which specifies that the next argument is the name that you want to give the index. The preceding command specifies that the name is database1, and that the database is to reside in the data directory. Arguments following the switches are the file names to index. In this example, the command indexes all the HTML files (those with an .html extension) in the data directory.

One of the more powerful features of WAISINDEX is that it enables you to index a variety of file types. To find out exactly which file types your version supports, check your version's documentation. The versions of WAISINDEX vary in the file type support they offer. In particular, freeWAIS-SF enables you to specify your own document types, and the EMWAC Toolkit supports such formats as Microsoft's Knowledge Base.

Accessing the WAIS Database  If your Web server has built-in support for WAIS (as many Web servers do), accessing the WAIS database is quite simple. You just create an HTML file to make the query and put the file in the same directory as the WAIS database files. (Remember that the first names of the HTML file and the database files must match.)

The HTML itself couldn't be simpler. Listing 56.7 shows a sample. All you have to do is include an <ISINDEX> tag somewhere on the form and the Web server does the rest.


Listing 56.7  A Sample WAIS Search HTML
<HTML>
<HEAD>
<TITLE>Sample WAIS Search</TITLE>
</HEAD>
<BODY>
<H1>Sample WAIS Search</H1>
This page has a built-in index.  Give it a whirl!
<P>
<ISINDEX>
</BODY>
</HTML>

If your Web server doesn't support WAIS directly, you must use a CGI script to access the data. You might also want to use a script when you need to format the output or filter the input.

Your script must gather data from a fill-in form and run a query against the WAIS index, then format the data appropriately for the visitor.

Your script can do the same thing that Web servers with built-in WAIS support do: call the WAISQ (or WAISLOOK) program. You can test this call from the command line:

waisq -d -http Data\database1 stuff

In this simple example, you run a query against the Data\database1 index files, using stuff as the query term. The result is written to STDOUT as properly formatted HTML, which makes it perfect for use in a CGI script.

WAIS is so popular that dozens of scripts are available in the public domain for managing your queries. Here are the three most generic and useful scripts:

ICE

Christian Neuss' ICE search engine is the easiest to install of the several programs mentioned in this section. ICE produces relevance-ranked results and lists the search words that it finds in each file. It is written in Perl.

There are two scripts. The indexing script, Ice-idx.pl, creates an index file that ICE can later search. The indexer runs from the UNIX command line, as a standard non-CGI program. The search script, Ice-form.pl, is a CGI script. It searches the index and displays the results on a Web page.

ICE can use an optional external thesaurus in Thesaurus Interchange Format. Christian Neuss notes that ICE has worked well with small thesauri of a few hundred technical terms, but that anyone who wants to use a large thesaurus should contact him for more information.

You can find the current version of ICE on the Net at the following two distribution sites:

http://www.informatik.th-darmstadt.de/~neuss/ice/ice.html
http://ice.cornell-iowa.edu/

Indexing Your Files with ICE  ICE searches the directories that you specify in the script's configuration section. When ICE indexes a given directory, it also indexes all of its subdirectories.

There are five configuration items at the top of the indexer script. You'll need to edit three of them:

@SEARCHDIRS=( 
  "/home/user/somedir/subdir/",
  "/home/user/thisis/another/",
  "/home/user/andyet/more_stuff/"
);
$INDEXFILE="/home/user/somedir/index.idx";

# Minimum length of word to be indexed
$MINLEN=3;

The first directory path in @SEARCHDIRS is the default that will appear on the search form. You can add more directory lines in the style of the existing ones, or you can include only one directory, if you want to limit what people can see of your files.

TIP
Remember that ICE automatically indexes and searches all the subdirectories of the directories you specify.

ICE's index is a plain ASCII text file. Here's a sample from the beginning of an ICE index file:

@f /./bookmark.htm
@t Rod Clark s Bookmarks
@m 823231844
1 ABC
1 AFGHANISTAN
1 AGREP
1 AIP
1 ALTNEWS
1 AND
1 ANIMAL
1 ANU
1 ATM
1 AUSTRALIA
1 AsiaLink  
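
Because the index is plain text, you can inspect it with ordinary UNIX tools. The sketch below looks up a single word and lists the files that contain it, highest count first. It is not part of ICE. The function name ice_lookup is invented here, and the record layout (an @f line with the path, an @t line with the title, then lines holding a count and a word) is inferred from the sample above rather than taken from ICE's documentation.

```sh
#!/bin/sh
# ice_lookup WORD INDEXFILE
# Print "count<TAB>path<TAB>title" for every indexed file whose word list
# contains WORD (compared without regard to case), highest count first.
# Assumed layout, inferred from the sample:
#   @f path, @t title, @m another per-file header, then "count WORD" lines.
ice_lookup() {
    awk -v want="$1" '
        BEGIN  { want = toupper(want) }
        /^@f / { path = substr($0, 4); title = ""; next }
        /^@t / { title = substr($0, 4); next }
        /^@/   { next }    # skip @m and any other header line
        NF >= 2 && toupper($2) == want {
            printf "%s\t%s\t%s\n", $1, path, title
        }
    ' "$2" | sort -rn
}
```

For example, ice_lookup agrep index.idx would list every indexed file that contains AGREP, with the file that uses the word most often at the top.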

Once you've set the configuration variables, run the script from the command line to create the index. Whenever you want to update the index, run the Ice-idx.pl script again. It will overwrite the existing index with the new one.

TIP
You can use the UNIX cron utility to schedule your index updates.

Searching from a Web Browser with ICE  The search form presents a choice of directories in a drop-down selection box. You can specify these directories in the script. Listing 56.8 shows how to accomplish this task.


Listing 56.8  A Sample ICE Indexing Script
# Title or name of your server:
local($title)="ICE Indexing Gateway";

# search directories to present in the search dialogue
local(@directories)=(
    "Public HTML Directory",
    "Another HTML Directory"
);

# Location of the indexfile:
#   Example: $indexfile="/usr/local/etc/httpd/index/index.idx";
$indexfile="/home/rclark/public_html/index.idx";

# Location of the thesaurus data file:
#   Example: $thesfile="/igd/a3/home1/neuss/Perl/thes.dat";

# URL Mappings (a.k.a Aliases) that your server does.
# map "/" to some path to reflect a "document root"
#   Example
#   %urltopath = (
#   '/projects',   '/usr/stud/proj', 
#   '/people',     '/usr3/webstuff/staff', 
#   '/',           '/usr3/webstuff/documents',
#   );

%urltopath = (
  '/~rclark',   '/home/rclark/public_html'
);

Now you can install the script in your cgi directory and call it from your Web browser.

Hukilau 2

The Hukilau search engine doesn't use a stored index, but instead searches live files in a specified directory. Because of this, it returns absolutely current results. But for the same reason, it's very slow.
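
To see why a live search is always current but slow, here is a minimal sketch of the idea as a shell function. It is not Hukilau's code (Hukilau is a Perl script). The name live_and_search is hypothetical, and the function simply assumes an AND of case-insensitive substrings across the .html files in one directory.

```sh
#!/bin/sh
# live_and_search DIR WORD...
# Print the name of every *.html file in DIR that contains all of the WORDs
# as case-insensitive substrings. No index is kept: each call reads the
# files again, so results are always current and every query pays the
# full cost of scanning the directory.
live_and_search() {
    dir=$1
    shift
    for file in "$dir"/*.html; do
        [ -f "$file" ] || continue      # the glob matched nothing
        hit=yes
        for word in "$@"; do
            # -F: literal string, -i: ignore case, -q: exit status only
            grep -F -i -q -e "$word" "$file" || { hit=no; break; }
        done
        if [ "$hit" = yes ]; then
            echo "$file"
        fi
    done
    return 0
}
```

Each query word costs another pass over each candidate file, which is tolerable for a few dozen pages and painful for a few thousand.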

Hukilau searches one directory, which you specify in the script. (The registered version lets you choose other directories from the search form.) Its search results page includes file names, relevance scores, and context samples. The files on a search results page are in directory order, not sorted by relevance. Relevance ranking is planned for the next version, which may be available by the time you read this. Check the Small Hours page on the Web for updated information.

There's an option to show text excerpts from all the files in a directory, listed alphabetically by file name. This is useful when you're looking for something ill-defined, or when you need a broad overview of what's in the directory.

A quick file list feature reads only the directory file itself, not the individual files in the directory. It's fast, but it includes only file names, not page titles or context samples.

Unlike SWISH or WAISQ, Hukilau doesn't let you group query words with parentheses, so you can't limit an operator to just part of a query.

Listing 56.9 shows the configuration variables for Hukilau.cgi. (The script includes more detailed explanations of all of these.) After you've edited these settings, install the script in the usual way for your system. The script is self-contained and prints its own form.


Listing 56.9  Configuration Variables for Hukilau.cgi
$FileEnding       = ".html";
$DirectoryPath    = "/home/rclark/public_html/";
$DirectoryURL     = "http://www.aa.net/~rclark/";
$HukilauCGI       = "http://www.aa.net/cgi-bin/rclark/hukilau.cgi";
$HukilauImage     = "http://www.aa.net/~rclark/hukilau.gif";
$BackgroundImage  = "http://www.aa.net/~rclark/ivory.gif";
$Copyright        = "Copyright 1995 Adams Communications. All rights reserved.";
$HomePageURL      = "http://www.aa.net/~rclark/";
$HomePageName     = "Home Page";
# You must place the "\" before the "@" sign in the e-mail address:
$MailAddress      = "rclark\@aa.net";

The defaults are to apply an AND operator to all the words, to search for substrings rather than whole words, and to conduct a case-insensitive search. If you'd like to change these defaults, you can edit the search form that the script generates. Listing 56.10 shows the part of the form that applies to the radio button and check box settings, edited a bit here for clarity.


Listing 56.10  Excerpt from a Hukilau Search Form
sub PrintBlankSearchForm
{
...
<INPUT TYPE="RADIO" NAME="SearchMethod" value="or"><B>Or</B>
<INPUT TYPE="RADIO" NAME="SearchMethod" 
value="and" CHECKED><B>And</B>
<INPUT TYPE="RADIO" NAME="SearchMethod" 
value="exact phrase"><B>Exact phrase</B> / 

<INPUT TYPE="RADIO" NAME="WholeWords" value="no" CHECKED><B>Sub</B>strings
<INPUT TYPE="RADIO" NAME="WholeWords" value="yes"><B>Whole</B> Words<BR>

<INPUT TYPE="CHECKBOX" NAME="CaseSensitive" value="yes">Case sensitive<BR>

<INPUT TYPE="RADIO" NAME="ListAllFiles" value="no" CHECKED><B>Search</B> 
(enter terms in search box above) <BR>
<INPUT TYPE="RADIO" NAME="ListAllFiles" value="yes">
List all files in directory (search box has no effect)<BR>
<INPUT TYPE="RADIO" NAME="ListAllFiles" value="quick">
Quick file list<BR>

<INPUT TYPE="RADIO" NAME="Compact" value="yes">
Compact display<BR>
<INPUT TYPE="RADIO" NAME="Compact" value="no" CHECKED>
Detailed display<BR>

<INPUT TYPE="CHECKBOX" NAME="ShowURL" value="yes">URLs<BR>
<INPUT TYPE="CHECKBOX" NAME="ShowScore" value="yes" CHECKED>Scores<BR>
<INPUT TYPE="CHECKBOX" NAME="ShowSampleText" value="yes"
CHECKED>Sample text<BR>
...

For example, to change the default from AND to OR in Listing 56.10, move the word CHECKED from one line to the other on these two lines:

<INPUT TYPE="RADIO" NAME="SearchMethod" value="or"><B>Or</B>
<INPUT TYPE="RADIO" NAME="SearchMethod" value="and" CHECKED><B>And</B>

The result should look like the following:

<INPUT TYPE="RADIO" NAME="SearchMethod" value="or" CHECKED><B>Or</B>
<INPUT TYPE="RADIO" NAME="SearchMethod" value="and"><B>And</B>

Changing the value of a check box is a little different. For example, to make searching case sensitive by default, add the word CHECKED to the statement that creates the unchecked box. Here's the original line:

<INPUT TYPE="CHECKBOX" NAME="CaseSensitive" 
value="yes">Case sensitive<BR>

Here is the same line, changed to display a checked box:

<INPUT TYPE="CHECKBOX" NAME="CaseSensitive" 
value="yes" CHECKED>Case sensitive<BR>

An unchecked box sends no value to the CGI program. It wouldn't matter if you changed "yes" to "no" (or even "blue elephants"), as long as the box remains unchecked. The quoted value never gets passed to the program unless the box is checked. In other words, an unchecked box is as good as a box that is not even on the form.

This is why the choice of default values matters. If you remove all the radio and check box fields from the form, leaving only the SearchText text-entry field, the hidden Command field, and the Submit button, the program falls back on a set of reasonable, commonly used defaults.

This makes it practical to use relatively simple hidden Hukilau forms as drop-in search forms on your pages. To change the defaults and still use a hidden form, you can include the appropriate extra fields, but hide them, as shown in Listing 56.11.


Listing 56.11  Hiding All Form Variables
<FORM METHOD="POST"
ACTION="http://www.substitute_your.com/cgi-bin/hukilau.cgi">
<INPUT TYPE="HIDDEN" NAME="Command" VALUE="search">
<INPUT TYPE="TEXT" NAME="SearchText" SIZE="48">
<INPUT TYPE="SUBMIT" VALUE=" Search "><BR>
<INPUT TYPE="HIDDEN" NAME="SearchMethod" value="and">
<INPUT TYPE="HIDDEN" NAME="WholeWords" value="yes">
<INPUT TYPE="HIDDEN" NAME="ShowURL" value="yes">
</FORM>
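
On the receiving end, a script can supply a default for any field that never arrived. The following shell sketch shows the idea; it is not how Hukilau itself is written. It assumes that some earlier step (not shown, and hypothetical) has already decoded the form data into shell variables named after the fields, and the function name apply_search_defaults is made up for the example.

```sh
#!/bin/sh
# apply_search_defaults
# Fill in any search option that the browser did not send.
# Assumption: an earlier, hypothetical decoding step has already copied each
# submitted form field into a shell variable of the same name.
apply_search_defaults() {
    SearchMethod=${SearchMethod:-and}     # AND of all the words
    WholeWords=${WholeWords:-no}          # substrings, not whole words
    CaseSensitive=${CaseSensitive:-no}    # unchecked box: nothing was sent
}
```

Because an unchecked CaseSensitive box sends nothing at all, the variable stays unset and the default of no applies. A value that the form did send, hidden or not, is left alone.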

The current version of the Hukilau Search Engine is available from Adams Communications at

http://www.adams1.com/

Updates regarding new features being added or tested may be found at the Small Hours site at

http://www.aa.net/~rclark/scripts/

GLIMPSE

GLIMPSE is a project of the University of Arizona's Computer Science Department. It's not trivial to install, in either disk space requirements or technical savvy, but it is powerful and useful once set up.

As the name implies, the program displays glimpses of context samples from the files. This makes it a particularly useful tool, even though it doesn't offer relevance ranking.

GLIMPSE can build indexes of several sizes, from tiny (about 1% of the size of the source files) to large (up to 30% of the size of the source files). Even small indexes are practical and offer good performance.

GLIMPSE isn't particularly easy to install unless you have fairly extensive experience with UNIX; it's more for UNIX administrator wannabes than for general users. The installation process can't be condensed well into a few paragraphs here, so you'll have to read the documentation, which isn't altogether friendly to beginners. GLIMPSE's companion Web gateway is called Glimpse-HTTP. You can find both at the following sites:

http://glimpse.cs.arizona.edu/
http://glimpse.cs.arizona.edu/ghttp/
ftp://ftp.cs.arizona.edu/glimpse/glimpse-3.0.src.tar.Z

Architext Excite for Web Servers

Architext's popular new search engine is available for SunOS, Solaris, HP-UX, SGI Irix, AIX, BSDI UNIX, and Windows NT.

Excite lets people enter queries in ordinary language, without using specialized query syntax. The user can choose either a concept-based search or a conventional keyword AND search. The results page presents links and context samples. Relevance ranking is the default, but a click of the mouse enables the user to see the same results grouped by subject or topic.

The software includes a Query by Example (QBE) feature, so that a user viewing a page can click a hypertext link to start a new search for similar pages. The user can specify a paragraph or sentence as a query, and search for information similar to that specific portion of the page.

Excite doesn't require a thesaurus to do concept-based searching, but the company indicates that an external thesaurus can improve results. Because a thesaurus is not necessary, adding support for new languages supposedly isn't as difficult as with some other software. Architext claims that independent software developers can also write modules to support additional data file formats, without facing too many obstacles.

Architext currently offers the software at no charge, and sells annual support contracts. Further information about Excite for Web Servers can be found at

http://www.excite.com/navigate/

Quite a few sites are running the Excite search engine. One good example is the Houston Chronicle search page at

http://www.chron.com/fronts3//interactive/search/

Built-In Search Tools in Web Servers

Several Web servers for UNIX and Windows NT include built-in utilities to index and search the files at a site. Some of these tools have fewer capabilities than the search engines mentioned above.

Navisoft NaviServer

Navisoft's NaviServer runs on Windows NT and UNIX. It includes Illustra's Text DataBlade search tool, which is an add-on module for Illustra's extensive database system. DataBlade's capabilities include both keyword and concept-based searching. Current information about NaviServer is available at

http://naviserver.navisoft.com/feature.html

An Illustra search page and an Illustra database tools page are available at

http://www.illustra.com/cgi-bin/Webdriver?Mlval=document_search
http://www.illustra.com/cgi-bin/Webdriver?Mlval=document_list&doc_type=Data+Sheet

Process Purveyor

Process Software's Purveyor Web server includes Verity's Topic Server search engine, or some core parts of it. Process notes that add-on modules are available for the Verity search tools that Process bundles with its server. More information about Purveyor and its included version of Topic Server is available at

http://www.process.com/

OraCom WebSite

O'Reilly's WebSite server for Windows NT includes the company's WebIndex indexing and WebFind searching tools. WebIndex can index the full text of every page in the server's directory structure, or only selected parts of the directories. WebFind runs as a CGI program and is a conventional search tool. It does keyword searches and supports AND and OR operators.

O'Reilly publishes a book (or manual) titled Building Your Own WebSite that goes into considerable detail about setting up and using their WebSite server. You can read all about it before you install their software at

http://www.ora.com/

Here's a site that is running WebSite and that has set up several search databases:

http://www.videoflicks.com/

Netscape Commerce Server

Netscape's Commerce Server runs on Windows NT and UNIX. It includes a built-in indexing and searching system, although Netscape's lower-priced Communications Server does not.

Microsoft Tripoli

Designed for zero maintenance and complete Web site indexing, Microsoft's search engine (code-named "Tripoli" while in beta test) supports multiple languages and attempts to index by content type as well as contents. For example, Tripoli knows the difference between a spreadsheet and an HTML document, and lets the user search using both keywords and content types. You may read about Tripoli and download a free copy at

http://www.microsoft.com/ntserver/search/

Tripoli requires NT 4.0 and is designed to work hand-in-hand with Microsoft's Internet Information Server (IIS).

CGI Programming Examples

Here are three examples of CGI scripts. One is a UNIX shell script, and two are Perl scripts. Perl is widely used for CGI programming, especially on UNIX systems.

Searching a File for Matching Links

We can scan a file (which can be an HTML page) and display all the matches found in it. This is what some of the code in the Hukilau 2 search engine does when it displays context samples from the files it searches. But, let me first introduce a simpler example.

The script below scans each line in a given file (ordinarily, an HTML page) and displays any lines from the file that contain a match for the search term. If the original file contains hypertext links that are contained all on one line (rather than spread over several lines), then each line on the search results page will contain a valid link that the user can click.

This is a UNIX shell script that uses the UNIX utility grep to look for matches in the file. A script like this, or a version of it in Perl, C, or any other language, is a handy tool if you have Web pages with long lists of links in them. This script uses the ISINDEX tag because there are still some browsers that don't support forms.

Listing 56.12 shows the code for a UNIX shell script to do a line-by-line search of a single page. You can edit it to include your own menu at the top of the page and your own return link to the page that the script searches.


Listing 56.12  A UNIX Shell Script to Search Using grep
#! /bin/sh
echo Content-type: text/html
echo
if [ $# = 0 ]
then
  echo "<HTML>"
  echo "<HEAD>"
  echo "<TITLE>Search the News Page</TITLE>"
  echo "</HEAD>"
  echo "<BODY background=\"http://www.aa.net/~rclark/ivory.gif\">"
  echo "<B><A HREF=\"http://www.aa.net/~rclark/\">Home</A></B><BR>"
  echo "<B><A HREF=\"http://www.aa.net/~rclark/news.html\">News Page</A></B><BR>"
  echo "<B><A HREF=\"http://www.aa.net/~rclark/search.html\">Search the Web</A></B><BR>"
  echo "<HR>"
  echo "<H2>Search the News Page</H2>"
  echo "<ISINDEX>"
  echo "<P>"
  echo "<dl><dt><dd>"
  echo "The search program looks for the exact phrase you specify.<BR>"
  echo "<P>"
  echo "You can search for <B>a phrase</B>, a whole <B>word</B> or a <B>sub</B>string.<BR>"
  echo "UPPER and lower case are equivalent.<BR>"
  echo "<P>"
  echo "This program searches only the news listings page itself.<BR>"
  echo "Matches may be in publication names, URLs or section headings.<BR>"
  echo "<P>"
  echo "To search the Web in general, use <B>Search the Web</B> in the menu above.<BR>"
  echo "<P>"
  echo "</dd></dl>"
  echo "<HR>"
  echo "</BODY>"
  echo "</HTML>"
else
  echo "<HTML>"
  echo "<HEAD>"
  echo "<TITLE>Result of Search for \"$*\".</TITLE>"
  echo "</HEAD>"
  echo "<BODY background=\"http://www.aa.net/~rclark/ivory.gif\">"
  echo "<B><A HREF=\"http://www.aa.net/~rclark/\">Home</A></B><BR>"
  echo "<HR>"
  echo "<H2> Search Results: $*</H2>"
  # -e keeps a query that starts with "-" from being taken as a grep option
  grep -i -e "$*" /home/rclark/public_html/news.html
  echo "<P>"
  echo "<HR>"
  echo "<B><A HREF=\"http://www.aa.net/cgi-bin/rclark/isindex.cgi\">Return to Searching the News Page</A></B><BR>"
  echo "</BODY>"
  echo "</HTML>"
fi
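
Listing 56.12 echoes the visitor's query straight into the results page. If the query contains characters such as < or &, the page can come out garbled, so you may want to escape the query first. Here is one possible helper; html_escape is a name chosen for this sketch, not something the script above defines.

```sh
#!/bin/sh
# html_escape TEXT
# Replace the four characters that are special in HTML text and attribute
# values. The ampersand must be handled first so that the entities
# produced by the later substitutions are not escaped a second time.
html_escape() {
    printf '%s\n' "$1" | sed \
        -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g'
}
```

In the else branch of the script, you could capture the result of html_escape "$*" in a variable and echo that variable in the title and heading instead of $*. The grep command should still receive the original, unescaped query.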

Hukilau 2 Search Engine

Hukilau is a search script that searches through all the files in a directory. It can be very slow, so it's not practical for every site. Because Hukilau is written in Perl, it's easy to install and modify. Perl is an appropriate language in which to write such tools because it includes a good set of text pattern matching capabilities.

The complete source code for the original Hukilau Search Engine is on the CD-ROM. There's also a modified version that includes some added routines that were written for this chapter. We'll refer to it as Hukilau 2.

Listing 56.13 is from some new routines added to Hukilau 2. These routines are from the part of the script that alphabetically lists all the files in the directory. Shown below is a routine that displays a text sample from each file, and another routine that displays a quick file list for the directory, without reading each file.


Listing 56.13  A Perl Script for Hukilau 2 Indexing
#----------------------------------------------------------------
# List Files

sub ListFiles {
   opendir (HTMLDir, $DirectoryPath);
   @FileList = grep (/$FileEnding$/, readdir (HTMLDir));
   closedir (HTMLDir);
   @FileList = sort (@FileList);

   $LinesPrinted = 0;
   foreach $FileName (@FileList) {
      $FilePath = $DirectoryPath.$FileName;
      $FileURL  = $DirectoryURL.$FileName;
      if ($ListAllFiles eq "quick") {
         print "<li><B><A HREF=\"$FileURL\">$FileName</A></B><BR>\n";
         $LinesPrinted ++;
      }
      else {
         if ($Compact eq "no") {
            &ListDetailedFileInfo;
         }
         else {
            &ListQuickFileInfo;
         }
      }
   }
}

#----------------------------------------------------------------
# List Detailed File Info

sub ListDetailedFileInfo {
   print "<li><B><A HREF=\"$FileURL\">$FileName</A>";
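   # look up the title first, so that the test below sees this file's
   # title rather than the one left over from the previous file
   &FindTitle;
   if (($ShowSampleText eq "yes") || ($Title ne $FileName)) {
      print " - $Title";
   }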
   print "</B><BR>\n";
   $LinesPrinted ++;
   if ($ShowURL eq "yes") {
      print "$FileURL<BR>\n";
      $LinesPrinted ++;
   }
   if ($ShowSampleText eq "yes") {
      &BuildSampleForList; 
      $SampleText = substr ($SampleText, 0, $LongSampleLength);
      print "$SampleText<BR>\n";
      print "<P>\n";
      # this is an approximation, as sample lines will vary
      # (if results long, add duplicate links at page end, later)
      $LinesPrinted = $LinesPrinted + $AvgLongSampleLines + 1;
   }
}

#----------------------------------------------------------------
# List Quick File Info

sub ListQuickFileInfo {
   print "<li><B><A HREF=\"$FileURL\">$FileName</A>";
   if ($ShowSampleText eq "no") {
      print "</B><BR>\n";
      $LinesPrinted ++;
   }
   else {
      # look up the title before comparing it with the file name
      &FindTitle;
      if ($Title ne $FileName) {
         print " - $Title";
      }
      print "</B><BR>\n";
      $LinesPrinted ++;
      &BuildSampleForList;
      $SampleText = substr ($SampleText, 0, $ShortSampleLength);
      print "$SampleText<BR>\n";
      print "<P>\n";
      $LinesPrinted = $LinesPrinted + $AvgShortSampleLines + 1;
   }
}

#----------------------------------------------------------------
# Find Title

sub FindTitle {
   # find the file's <TITLE>, if it has one
   # if not, put $FileName in $Title

   $Title = "";
   $ConcatLine = "";
   # look in the <HEAD> section of the file
   open (FILE, "$FilePath");
   foreach $IndivLine (<FILE>) {
      $ConcatLine = $ConcatLine.$IndivLine;
      last if ($IndivLine =~ m#</HEAD>#i);
      # "last" aborts loop at end of <HEAD> section
      # (use # instead of / as delimiter, because / is in string)
      # trailing i is for case insensitive match
   }
   close (FILE);

   # replace linefeeds with spaces
   $ConcatLine =~ s/\n/ /g;
   # replace possibly mixed-case <TITLE></TITLE> with fixed string
   $ConcatLine =~ s#</[tT][iI][tT][lL][eE]>#<XX>#;
   $ConcatLine =~ s#<[tT][iI][tT][lL][eE]>#<XX>#;
   # concatenated line is now "junk <XX>Page Title<XX> junk"
   @Lines = split (/<XX>/, $ConcatLine);
   # part [0] is junk, part [1] is page title, part [2] is junk
   $Title = $Lines[1];
   undef @Lines; # dispense with array, free a little memory

   # if file has no <TITLE>, use filename instead
   # (this test must come after the title has been extracted)
   if ($Title eq "") {
      $Title = $FileName;
   }
}

#----------------------------------------------------------------
sub BuildSampleForList {
   $SampleText = "";
   open (FILE, "$FilePath");
   foreach $Record (<FILE>) { 
      &BuildSampleText;
  }
  close (FILE);
}

#----------------------------------------------------------------
# Build Sample Text

sub BuildSampleText {
   # remove linefeed at end of line
   chop ($Record);
   # collapse any extended whitespace to single space
   $Record =~ s/\s+/ /g;
   # remove separator at end of existing sample text, if one exists
   $SampleText =~ s/$SampleSeparator$//;
   # add sample from current line, separate former lines visually
   $SampleText = $SampleText.$SampleSeparator.$Record;
   # remove everything inside <tags> in sample
   $SampleText =~ s/<[^>]*>//g;
}

The code samples above are only extracts from the full script.

TROUBLESHOOTING
If you make any changes in the script, you can test them for syntax errors before installing the script in your cgi-bin directory. Give the script execute permission for your account, and then type its file name at the command line. The output will be either the default form (if the syntax is correct) or a syntax error message (if it's not).

Swish-Web SWISH Gateway

Swish-Web is in the public domain. If you'd like to practice a little programming on it, here are a few ideas for additions to the script.

NOTE
The complete Perl source code for the Swish-Web gateway is on the CD-ROM. It's an example of a Web gateway for a UNIX command-line program.

SWISH provides relevance scores, but the scoring algorithm seems to favor small files with little text, among which keywords loom large. Since SWISH reports file sizes, it's possible to add a routine to Swish-Web to sort SWISH's output by file size. Another useful addition would be a second relevance ranking option that weights file size more heavily.

A selection box on the form to limit the results to the first 10, 25, 50, 100, or 250 (or all) results might be another useful addition.
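
You can try both ideas, the sort and the limit, from the command line before you touch the Perl script. The sketch below assumes that each SWISH result line starts with a numeric rank and ends with the file size, in the form rank, path, quoted title, size; check the output of your own SWISH version before relying on that. The function name swish_by_size is made up for the example.

```sh
#!/bin/sh
# swish_by_size LIMIT
# Read SWISH output on standard input and print at most LIMIT result
# lines, largest file first.
# Assumption: result lines begin with a digit and end with the file size;
# header, comment, and terminator lines do not begin with a digit.
swish_by_size() {
    grep '^[0-9]' |
        awk '{ print $NF "\t" $0 }' |   # put the size in front as a sort key
        sort -rn |                      # numeric sort, largest first
        cut -f2- |                      # drop the sort key again
        head -n "$1"
}
```

You would pipe the search program's output through it, for instance with a limit of 25. The sketch has no way to ask for all results, so the form's "all" choice would need a separate branch that skips the head command.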

The example routines shown in Listing 56.14 display some information on the screen about the SWISH index file that's being read.


Listing 56.14  Routines for SWISH Indexing
#--------------------------------------------------------------------
# PRINT INDEX DATA

sub PrintIndexData {
   # If entry field is blank, index isn't searched, hence no index data.
   # In that case, search the index to retrieve indexing data.
   if (!$Keywords) {
      &SearchFileForIndexData;
   }
   print "<HR>";
   print "<dl><dt><dd>";
   print "Index name: <B>$iname</B><BR>\n";
   print "Description: <B>$idesc</B><BR>\n";
   print "Index contains: <B>$icounts</B><BR>\n";
   if ($ShowIndexFilenames) {
      print "Location: <B>$IndexLocation</B><BR>\n";
      print "Saved as (internal name): <B>$ifilename</B><BR>\n";
   }
   print "SWISH Format: <B>$iformat</B><BR>\n";
   print "Maintained by: <B>$imaintby</B><BR>\n";
   print "Indexed on (day/month/year): <B>$idate</B><BR>\n";
   if ($ShowSwishVersion) {
      if (open (SWISHOUT, "-|") || exec $SwishLocation, "-V") {
         $SwishVersion = <SWISHOUT>;
         close (SWISHOUT);
      }
      print "Searched with: <B>$SwishVersion</B><BR>\n";
   }
   print "</dd></dl>";
}

#--------------------------------------------------------------------
# SEARCH FILE FOR INDEX DATA

# If the form's input field is blank, ordinarily no search is made,
# which prevents reading the index file for the index data. In that 
# case, the following subroutine is called.

sub SearchFileForIndexData {
  # use a keyword that definitely won't be found
  $Keywords = $GoofyKeyword;
  if (open (SWISHOUT, "-|") 
    || exec $SwishLocation, "-f", $IndexLocation, "-w", $Keywords) {
    while ($LINE=<SWISHOUT>) {
      chop ($LINE);
      &ScanLineForIndexData;
    }
    close (SWISHOUT);
  }
}