by Mike Ellsworth
One of the great strengths of the World Wide Web is the breadth of information available. One of its great weaknesses is the lack of organization of this information. Many search engines and indexes have sprung up on the Web to try to assist users in finding the needles in the huge haystack. But what about your site? Are users frustrated because they can't easily find what they're looking for? Are they complaining about having to dig through page after page in order to mine the information nugget they seek?
Even the best organized site can benefit from the addition of an online search engine to its tool set. No matter how well you organize your site to make it logical and complete, there's always a group of users who are too impatient to appreciate the journey through your carefully planned site. They want the goods and they want them now.
Implementing a search engine can provide a way for your users to quickly zero in on the information they seek. It's not only the large, complex sites that need such a facility; any site with more than 100 files can benefit from a search capability.
Adding a search engine to your site can be quite easy to do. A vast array of shareware, freeware, and commercial search engines is available. But how do you pick the best one?
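There are two main approaches to creating an online search facility for your Web site: grepping and indexing. A grepping search engine opens and scans the files on your site each time a user submits a query, much as the UNIX grep utility does. An indexing search engine consults an index that was built ahead of time.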
Indexing search engines predigest your Web site and create indexes containing all of its words. The major commercial Web search engines such as AltaVista, Lycos, Excite, and WebCrawler are all indexing engines. In fact, it is not practical to have a search engine that searches the whole Web with the grepping method. To accomplish this, the search engine would have to either add the full text of every site to a database or search every site in real time.
With an indexing search engine, when a user requests a search, the search engine only needs to refer to the index to find relevant pages. Because indexes are often a small fraction of the size of the documents indexed, this takes much less time. More importantly, such an approach makes the major commercial search engines practical by allowing them to store only the indexes of sites rather than site images.
Indexing search engines generally employ more sophisticated searching algorithms to improve their chances of returning relevant documents.
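To make the idea of an index concrete, here is a minimal sketch in Perl of an inverted index: a table that maps each word to the documents that contain it. It is an illustration only, not the method used by any particular engine. The build_index and lookup subroutines and the two sample pages are hypothetical, and a real engine would store the index on disk rather than in memory.

```perl
use strict;
use warnings;

# Build an inverted index: word => { document name => occurrence count }.
# %$docs maps a document name to its text (a hypothetical in-memory
# stand-in for the pages of a site).
sub build_index {
    my ($docs) = @_;
    my %index;
    foreach my $name (keys %$docs) {
        # lowercase the text and split it into words
        foreach my $word (lc($docs->{$name}) =~ /\w+/g) {
            $index{$word}{$name}++;
        }
    }
    return \%index;
}

# Look a single word up in the index; no document is opened at query time.
sub lookup {
    my ($index, $word) = @_;
    my $hits = $index->{lc $word} || {};
    return sort keys %$hits;
}

my %docs = (
    'a.htm' => 'Consumer panel research',
    'b.htm' => 'Marketing research and more research',
);
my $index = build_index(\%docs);
print join(', ', lookup($index, 'Research')), "\n";
```

The expensive step, reading every page, happens once in build_index. Each query afterward is a single hash lookup, which is why an indexed search needs so much less processor time than a grepping search.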
Although very easy to implement, most grepping search engines are somewhat limited in the types of search queries they support. Grepping, after all, is a rather brute force method of searching. Each file is opened and then scanned for the search terms. The amount of system resources consumed by these activities can limit the sophistication of the search strategies. Most grepping engines are limited to simple keyword searches, although some offer searching via regular expressions.
To determine which searching method to employ, you must first decide what kinds of search services you want to offer and how many resources, both disk space and processor time, to dedicate to those services.
As you might imagine, there's a big difference between the performance and efficiency of grepping and indexing search engines.
Performing a grepping search on one section of my site that contains about 600 average-sized files takes 8 CPU seconds and 40 elapsed seconds on a Sun Sparcstation 20. Because our user load is not very high, this is an acceptable amount of overhead for a search. However, if your site is very busy, with a hundred simultaneous users for example, it is probably not feasible to dedicate this amount of resources to user searching.
In contrast, performing an index-based search on the same site takes about 1 CPU second and 5 or 6 elapsed seconds. Because the size of the site is not large, about 90M, and indexes average between 10 percent and 20 percent of the total size of the site, the amount of disk overhead is acceptable. Because the information on our site doesn't change much day to day, I can run the indexing software overnight and provide a day-old index for our users to search.
One approach that adds no disk overhead and a small amount of processor overhead is to have someone else maintain your index and run the search process. An example of this approach is Pinpoint, from Netcreations (http://www.netcreations.com/pinpoint/). This commercial service sends its robot to your site about once a month. The site index is maintained on the Netcreations site, and Netcreations also maintains and runs the search engine. You maintain a query form on your site that points to the Pinpoint URL. Some trade-offs, of course, exist for this type of solution. You give up a lot of control over what is indexed, how it's indexed, and how often the index is updated. In addition, performance of search queries is likely to be slower when conducted over the Internet.
You may also worry about the security aspects of turning over so much information about your site to a third party. In reality, however, there are many other third parties who index your site (AltaVista, Lycos, Excite, and all the rest), so you might as well worry about them. However, when a third party provides an important service such as this for your site, you are giving up a lot of control. You are trusting the third party to maintain the search engine, as well as to only index those sections of the site that you want your users to see. You are also trusting them to maintain a timely index of your site. If these compromises work for you, then this approach is quick and easy. If, on the other hand, you don't want to give up control of such an important function of your site, then you should consider implementing your own search engine.
Your choice of a search engine depends, in part, on how complex the searches on your site are likely to be. If relevant documents can be found with the use of a simple keyword, there's not much difference between the grepping and the indexing approaches. However, if the average user wants to implement multiple-word searches or searches involving concepts rather than keywords, an indexing search engine is the better choice.
In general, indexing search engines can accomplish more complex searches than grepping engines. A grepping engine basically does string compares. It may support regular expressions, wildcards, fuzzy matching, or ordinary Boolean matching, but it is difficult to implement more sophisticated context matching or concept searching in this type of engine. The sheer overhead of a grepping engine makes it difficult to do multipass searching of any kind.
Using an indexing approach, a search engine can spend more time examining the relationships between search terms and found pages. Because the engine doesn't need to burn processor time churning through all the pages in a site, it can offer nice features such as relevancy ranking and concept searching.
Issues to consider when evaluating an indexing search engine include the following:
Typically, the larger the index, the longer it takes to search. Most indexing search engines create indexes that are a small fraction of the size of the material to be searched (usually between 10 percent and 20 percent). However, if your site is massive, you need to consider whether the index can fit in memory all at once or whether your server needs to swap it in and out as the engine does its searching. Excessive disk thrashing dramatically slows the search process and may even affect overall server performance.
If yours is a high volume site and your users do a lot of searches, you may need to consider holding the index in memory or even limiting the number of simultaneous searches you allow.
One way to reduce the size of the index on your site is to exclude certain common words from the index. By default, most indexing engines exclude words known as "stop" words: commonly occurring articles and pronouns, for example, but there may be additional "noise" words on your site that you may not want to include in your index (the name of your organization, for example). Excluding words such as these reduces the size of the index and improves searching efficiency.
Most indexing search engines have some ability to ignore stop words, also known as garbage or noise words. These are commonly occurring words such as articles, pronouns, and many adjectives. The indexing engine should ignore such words when indexing and the query engine should discard them from the search terms when performing a search. Table 57.1 is a list of commonly used stop words.
| after | can | her | may | on | them | way |
| all | come | hers | me | only | then | we |
| also | did | hid | more | onto | there | were |
| am | do | him | most | or | these | what |
| an | does | his | much | other | they | when |
| and | each | how | must | out | this | where |
| any | etc | however | my | over | those | whether |
| are | far | ie | near | per | til | which |
| as | few | if | new | put | to | who |
| at | fix | in | next | same | too | why |
| be | for | into | no | say | try | will |
| because | from | is | none | since | under | with |
| been | get | it | nor | so | unto | within |
| before | go | its | not | some | up | without |
| between | got | just | now | such | upon | yet |
| big | had | led | of | than | us | |
| both | has | less | off | that | very | |
| but | have | let | oh | the | vs | |
| by | he | many | old | their | was | |
One feature to look for in an indexing search engine is the ability to add words to the stop words list. For example, on my site, I'd like to add the name of our company, ACNielsen, to the list. Because this word is mentioned in almost every file on the site, it doesn't make sense to waste indexing space on it, nor do we need to allow users to search for this term.
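The following Perl sketch shows one way stop words might be filtered, both when indexing a page and when processing a query. It is only an illustration: the significant_words subroutine is hypothetical, and the list holds just a handful of the words from Table 57.1 plus one site-specific noise word.

```perl
use strict;
use warnings;

# A few stop words from Table 57.1, plus one site-specific noise word.
my %stop = map { $_ => 1 } qw(an and of the to acnielsen);

# Return the words of a query (or page) that are worth indexing.
sub significant_words {
    my ($text) = @_;
    return grep { !$stop{$_} } (lc($text) =~ /\w+/g);
}

print join(' ', significant_words('The history of ACNielsen consumer panels')), "\n";
```

Running the same filter over both the pages and the query keeps the two consistent: a word that never reaches the index is never searched for either.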
An important feature of any search engine is the ability to determine which files on your site to include in a search. If your site is like mine, there are various directories that are either password protected or developmental in nature and not linked to the main pages. Certainly you don't want files from these directories to turn up when users search your site.
Two approaches generally are used to control what material is indexed by the search engine. Either you specify all the directories you want to search, or you specify only those directories you want to exclude from searches. This latter approach usually results in less maintenance for the Webmaster. An even easier method is for the indexing engine to automatically skip directories protected with an access control file, such as the .htaccess file used by the NCSA Web server. This way, you don't have to remember to include or exclude new directories as they are added to your site.
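As a rough sketch of the automatic approach, the Perl below uses the standard File::Find module to collect HTML files while refusing to descend into any directory that holds an access control file. The public_html_files subroutine is hypothetical, and the access control file is assumed to be named .htaccess; your server may be configured to use a different name.

```perl
use strict;
use warnings;
use File::Find;

# Collect the HTML files under $basedir, skipping any directory that
# contains an access control file (assumed here to be named .htaccess).
sub public_html_files {
    my ($basedir) = @_;
    my @found;
    find(sub {
        if (-d $_ && -e "$_/.htaccess") {
            # do not descend into a protected directory
            $File::Find::prune = 1;
            return;
        }
        push @found, $File::Find::name if -f $_ && /\.html?$/i;
    }, $basedir);
    return sort @found;
}
```

You might call it as my @files = public_html_files("/web/home/acn"); and hand the resulting list to the indexer. Because the whole subtree is pruned, directories beneath a protected directory are skipped as well.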
Some indexing search engines allow you to use the <META> tag to control how a page is indexed. Using this tag, you place a page description and key words in the heading of your documents. The search engine then gives this information special treatment when it builds its index. The following HTML fragment is an example of this usage.
<HEAD>
<META name="description" content="ACNielsen Consumer Information">
<META name="keywords" content="consumer panel,
consumption,
marketing research">
</HEAD>
After the search engine indexes this page, if a user searches for "marketing research," the engine will find this page even if the words "marketing research" do not appear anywhere in the text of the page. Some engines will even use the contents of the meta description tag to identify a found page.
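To suggest how an indexer might read these tags, here is a small Perl sketch. The meta_content subroutine is hypothetical, and it relies on a simple pattern match that assumes the name attribute comes before content and that both values are double-quoted, as in the fragment above. It is not a full HTML parser.

```perl
use strict;
use warnings;

# Pull the content of a named <META> tag out of a page. Assumes the
# attribute order and quoting used in the fragment above.
sub meta_content {
    my ($html, $name) = @_;
    if ($html =~ /<META\s+name="\Q$name\E"\s+content="([^"]*)"/is) {
        my $content = $1;
        $content =~ s/\s+/ /g;    # collapse line breaks inside the value
        return $content;
    }
    return undef;
}
```

An indexer could split the keywords value on commas and add each phrase to the index for that page, and could store the description value to display alongside the page's title in the results.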
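Two major parameters are used to judge the results of a search: precision, the fraction of the retrieved documents that are relevant to the query, and recall, the fraction of the relevant documents on the site that the query retrieves.
Each query can be graded on each measure as a fraction, with a perfect score being 1.00. In a perfect world, every search would score 1.00 on both measures because all of the relevant documents, and only those, would be retrieved. For example, let's say that you have a site containing 100 documents and of these 100, ten are about search engines. If a query is made for "Perl-based search engines," the query might retrieve six documents: four about search engines and two others. In this case, it would have a precision of 0.67 (four of the six documents retrieved are relevant) and a recall of 0.40 (four of the ten relevant documents were retrieved).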
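The arithmetic behind the two measures is simple enough to sketch in a few lines of Perl. The precision_recall subroutine is a hypothetical helper, and the numbers passed to it come from the example above.

```perl
use strict;
use warnings;

# precision = relevant documents retrieved / all documents retrieved
# recall    = relevant documents retrieved / all relevant documents on the site
sub precision_recall {
    my ($relevant_retrieved, $retrieved, $relevant_total) = @_;
    return ($relevant_retrieved / $retrieved,
            $relevant_retrieved / $relevant_total);
}

# The example above: 6 documents retrieved, 4 of them relevant,
# 10 relevant documents on the site.
my ($precision, $recall) = precision_recall(4, 6, 10);
printf "precision %.2f, recall %.2f\n", $precision, $recall;
```

Note that the two measures pull against each other. Returning every document on the site would give a recall of 1.00 but a precision of only 0.10 in this example.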
There are various search strategies that are used to increase recall and precision, and some of them are quite complex.
It is common for indexing search engines to assign confidence factors or weights to the documents returned from a search and to use these measures to order the list of documents. Common methods for establishing weights include evaluating adjacency, frequency, and relevance feedback.
Adjacency
Adjacency is a type of phrase-searching method that examines the relationship between words in the search phrase. The search engine increases the relevance score based on how closely the words in the search term occur in the target document. For example, if you search for the phrase "hearing aids," the search engine can use adjacency to determine that you aren't interested in documents containing the phrase "Senate hearing on medical research on AIDS."
Obviously, adjacency comes into play only when there is more than one search term. Yet, findings by WebCrawler (see http://info.webcrawler.com/bp/WWW94.html) indicate that the average search comprises only 1.5 words. If you can encourage your users to specify search phrases, however, a good indexing engine can employ adjacency to increase the effectiveness of the search.
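One simple way to put a number on adjacency is to measure how many words apart two search terms fall in a document. The Perl sketch below does this for a pair of words. The min_distance subroutine is hypothetical, and real engines may weigh adjacency quite differently.

```perl
use strict;
use warnings;

# Return the smallest distance, in words, between any occurrence of $first
# and any occurrence of $second in $text, or undef if either is missing.
sub min_distance {
    my ($text, $first, $second) = @_;
    my @words = (lc($text) =~ /\w+/g);
    my (@pos1, @pos2);
    for my $i (0 .. $#words) {
        push @pos1, $i if $words[$i] eq lc($first);
        push @pos2, $i if $words[$i] eq lc($second);
    }
    my $best;
    for my $p (@pos1) {
        for my $q (@pos2) {
            my $d = abs($p - $q);
            $best = $d if !defined($best) || $d < $best;
        }
    }
    return $best;
}
```

For the two phrases discussed above, "hearing aids" yields a distance of 1, while "Senate hearing on medical research on AIDS" yields 5. An engine could give the first document a higher relevance score on that basis.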
Frequency
Indexing search engines can use the frequency of hits on search terms within a page to increase the page's relevancy score. For example, if you're like me, it is far more likely that you are interested in a page that lists "Duke Blue Devils" seven times than in a page that contains only one mention of the phrase. The former page is much more likely to be an article about the subject, while the other could just be a listing of teams or a passing mention.
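A frequency score can be sketched in Perl by counting matches and sorting on the count. Both subroutines and the three sample pages here are hypothetical, and the raw count is only one of many possible scoring schemes.

```perl
use strict;
use warnings;

# Count how many times a phrase occurs in a page (literal match, any case).
sub count_hits {
    my ($text, $phrase) = @_;
    my $count = () = $text =~ /\Q$phrase\E/gi;
    return $count;
}

# Order page names by hit count, highest first; pages with no hits are dropped.
# %$pages maps a page name to its text.
sub rank_by_frequency {
    my ($pages, $phrase) = @_;
    my %score = map { $_ => count_hits($pages->{$_}, $phrase) } keys %$pages;
    return sort { $score{$b} <=> $score{$a} || $a cmp $b }
           grep { $score{$_} > 0 } keys %score;
}

my %pages = (
    'article.htm' => 'Duke Blue Devils win. The Duke Blue Devils led all game.',
    'teams.htm'   => 'Teams: Duke Blue Devils, North Carolina Tar Heels',
    'other.htm'   => 'Nothing about basketball here.',
);
print join("\n", rank_by_frequency(\%pages, 'Duke Blue Devils')), "\n";
```

A raw count favors long pages, so an engine might divide the count by the length of the page before ranking.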
Relevance Feedback
Relevance feedback is a form of query by example. Using this method, a user first performs a search using normal search terms. The user samples one or more of the found documents and determines if a particular document is close to what he or she wants. The user can inform the search engine to "find me more documents like this one." The search engine then parses the relevant document and uses its profile to perform another search.
Relevance feedback can be an especially powerful means of searching. Rather than using the one or two search terms the user originally provides, the search is done using all the keywords from the found document.
There are two main security concerns to think about when implementing a search facility on your site:
Anytime you add a piece of software to your site, you need to be concerned with its impact on site security. Can the software be overwhelmed by an attack and provide direct access to the site? Does it offer a way for users to execute programs on your server? Before releasing a search engine for production use, you may want to experiment with it, try to overwhelm it, or get it to produce unpredictable results.
If the search engine is implemented in Perl within the Windows NT environment, you should be aware of the recent security warnings concerning proper Perl interpreter installation on this platform. You can find information about this problem at http://www.perl.com/perl/news/latro-announce.html.
| CAUTION |
Be aware of security concerns regarding implementations of Perl on Windows NT. See http://www.perl.com/perl/news/latro-announce.html for more information. |
The potential for users to use your search engine to execute arbitrary code on your Web server is obviously a very serious security concern. If the search engine uses the Perl eval command to perform the search, you need to be sure to screen search terms to remove potentially harmful characters and code before passing them to the search engine. On UNIX systems, this means preventing the user from entering a search term containing the escape symbol (!) or any commands that could be used to invoke a command interpreter (!sh, for example).
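The Perl sketch below shows two defensive habits that fit this advice. The first matches the user's term literally, so that it can never be interpreted as pattern syntax or code. The second rejects terms containing unexpected characters before they reach the search code at all. Both subroutines are hypothetical, and the set of permitted characters is an assumption you should adjust for your own site.

```perl
use strict;
use warnings;

# Match a user-supplied term literally. \Q...\E escapes every regular
# expression metacharacter, so the term cannot change the pattern,
# and no eval or shell is involved.
sub safe_match {
    my ($text, $term) = @_;
    return $text =~ /\Q$term\E/i ? 1 : 0;
}

# Accept only terms made of letters, digits, spaces, and a few punctuation
# marks, up to 100 characters. Anything else returns undef.
# The allowed set is an illustrative assumption.
sub clean_term {
    my ($term) = @_;
    return $term =~ /^[\w .,'-]{1,100}\z/ ? $term : undef;
}
```

With these in place, a term such as !sh is refused by clean_term, and a term such as a period matches only a literal period rather than any character. Checking against a list of permitted characters is generally safer than trying to list every harmful one.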
Even if your search engine doesn't offer a security hole, you still need to be sure that users can't see information on your site that they ordinarily would be prevented from seeing. It is common on sites using the NCSA Web server, for example, to use access control files (typically .htaccess) to control access to sensitive directories. If the search engine ignores these access control files, it can return links to or summaries of the files contained in protected directories. At best, your users will be frustrated at seeing links that they are prevented from following. At worst, file summaries can compromise the confidentiality of protected information.
And finally, a security concern that is really a resource concern: You may want to limit the amount of resources any one user of your search engine can consume, or the number of simultaneous searches that can occur. A malicious user can bring your server to its knees by launching a large number of time-consuming searches. Most search engines do not provide a method of controlling access in this way, so you may need to use other system management tools to regulate search engine use.
Which search engine you select depends on whether you prefer the timely, but resource-hungry, grepping approach or the faster, CPU-friendly, indexing approach. Regardless of the approach you pick, there are several requirements you should evaluate before selecting your engine:
These are just some of the questions you should ask yourself as you plan to add a search capability to your site. In the discussions that follow, we'll see how well various approaches satisfy these requirements.
Grepping search engines share a common methodology: Start at an arbitrary point in the directory tree, open each HTML file in the tree, and search the file for the search term. Optionally, the engine might recursively follow each subsequent directory branch encountered and repeat the search process. This allows for unsophisticated searches, although it is possible to enable support for searches using regular expressions.
Building Your Own Grepping Search Engine
In building your own grepping search engine, you'll need to tackle two problems: finding files to search and searching those files for search terms.
Let's first examine the problem of finding files to search. Using a couple of key Perl capabilities, it is easy to build a recursive routine that will identify the types of files contained within a directory tree, perform an operation on them, and continue the process with underlying directories. Listing 57.1's Perl script demonstrates this approach.
Listing 57.1 Tfind.pl: Perl Script to Recursively Find Files in Subdirectories
#!/usr/local/bin/perl
# define the directory to start at
# you could prompt user for this
$BASEDIR = "/web/home/acn";
# print page preamble to STDOUT
print "Content-type: text/html\n\n";
print "<HTML><HEAD><TITLE>Test Find Capability</TITLE></HEAD>\n";
print "<BODY bgcolor=#FFFFFF>\n";
# call subroutine to find files
&finddir($BASEDIR);
# close the page
print "</BODY></HTML>\n";
sub finddir {
    # my() keeps each level of recursion from clobbering its caller's variables
    my ($dir) = @_;
    # open directory and load file names into array, skipping . and ..
    opendir(BASE, $dir) || die("Can't open directory $dir");
    my @files = grep(!/^\.\.?$/, readdir(BASE));
    closedir(BASE);
    # for every file in the array
    foreach my $file (@files) {
        # check to see if it's a directory
        if (-d "$dir/$file") {
            # if it is, recursively call the subroutine
            &finddir("$dir/$file");
        # if not a directory, you've got a hit
        } else {
            print "<P>Found a file called $dir/$file\n";
        }
    }
}
When you run this Perl program, you see a display similar to Figure 57.1.
Figure 57.1 : The basic file recursion script produces a listing line for each file found.
Note that files are found both in the base directory (/web/home/acn) and in its subdirectories (such as /web/home/acn/press).
You can roll your own directory walking code, as in this example, or you can use Find.pl, part of the Perl distribution library (available at http://www.perl.com). This Perl script steps through all files recursively and executes a subroutine that you define for each file found. Find returns the name of a file in the variable $name and executes a subroutine in your wrapper script called wanted. You can refer to the $name variable in the wanted subroutine to display the name of the file or grep for a search string. It is easy to use Find.pl to develop a slightly more sophisticated find routine (see Listing 57.2).
Listing 57.2 Tsfind.pl: Using Find.pl to Recursively Search Directories
#!/usr/local/bin/perl
# requires find.pl
require("/public/local/lib/perl5/find.pl");
$BASEDIR = "/web/home/acn";
print "Content-type: text/html\n\n";
print "<HTML><HEAD><TITLE>Test Find Capability Using Find.pl</TITLE></HEAD>\n";
print "<BODY bgcolor=#FFFFFF>\n";
&find("$BASEDIR");
# close the page
print "</BODY></HTML>\n";
sub wanted {
    # if it's an HTML file (name ends in .htm or .html)
    if ($name =~ /\.html?$/i) {
        # print its name; $name already holds the full path
        print "<P>Found a file called $name\n";
    }
}
This script merely prints the name of each HTML file it finds. You can easily insert a call to a grepping routine in place of the code that prints out the name of the file.
The grepping routine needs to open the file and read through it to search for instances of the search string. The normal Perl searching function works well. This approach is demonstrated in Listing 57.3.
Listing 57.3 Tsrch.pl: A Basic Search Script
#!/usr/local/bin/perl
# define the directory, file name, and search string
# you could prompt user for these
$BASEDIR = "/web/home/acn";
$file = "acn.htm";
$term = "ACNielsen";
# print page preamble to STDOUT
print "Content-type: text/html\n\n";
print "<HTML><HEAD><TITLE>Test Find Search Engine</TITLE></HEAD>\n";
print "<BODY bgcolor=#FFFFFF>\n";
# call subroutine to search the file
&findstr($BASEDIR);
print "</BODY></HTML>\n";
sub findstr {
    my ($dir) = @_;
    # open the file; skip it quietly if it can't be read
    open(FILE, "$dir/$file") || return;
    # read all lines into an array
    my @LINES = <FILE>;
    close(FILE);
    # create one huge string to search
    my $string = join(' ', @LINES);
    $string =~ s/\n//g;
    # if string is found
    if ($string =~ /$term/i) {
        # include the file name
        print "<P>Found string in $dir/$file\n";
    }
    # otherwise, don't include this file name
}
Now, if you combine these two scripts, you will have a rudimentary search engine whose output still looks similar to Figure 57.1.
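One possible way to combine the two steps is sketched below. It is not the combined script from this chapter: it uses the standard File::Find module in place of Find.pl, wraps the work in a hypothetical grep_tree subroutine, and matches the search term literally rather than as a pattern.

```perl
use strict;
use warnings;
use File::Find;

# Walk the tree under $basedir and return the HTML files whose contents
# contain $term (matched literally, ignoring case).
sub grep_tree {
    my ($basedir, $term) = @_;
    my @hits;
    find(sub {
        return unless -f $_ && /\.html?$/i;
        open(my $fh, '<', $_) or return;    # skip unreadable files
        local $/;                           # read the whole file at once
        my $text = <$fh>;
        close($fh);
        return unless defined $text;
        $text =~ s/\s+/ /g;                 # let phrases match across line breaks
        push @hits, $File::Find::name if $text =~ /\Q$term\E/i;
    }, $basedir);
    return sort @hits;
}
```

A CGI script could call grep_tree("/web/home/acn", $term) and print one line per returned file name, much as the listings above do. Every file is still opened on every search, which is the cost that defines the grepping approach.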
This script works; it finds instances of a search string in all files in a directory tree. But it ignores some problems and is definitely lacking in features. It would be nice, for example, to be able to specify the search to be case sensitive and whether multiple words should be treated as Boolean AND or OR. The display does not provide a link to the found files. Another missing feature is the context of the search hit. We know that the search terms are found in these files, but we've no idea if the use of them is trivial or important. We don't know how many times the search string was found and have no way to evaluate the relevance of a file.
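Two of the missing features, Boolean handling of multiple words and optional case sensitivity, take only a few lines of Perl. The matches subroutine below is a hypothetical sketch of the idea, not code from any of the engines discussed in this chapter.

```perl
use strict;
use warnings;

# Decide whether $text satisfies a multiple-word query.
#   $terms          - reference to an array of search words
#   $boolean        - 'AND' (every word must occur) or 'OR' (any word)
#   $case_sensitive - true to match case exactly
sub matches {
    my ($text, $terms, $boolean, $case_sensitive) = @_;
    return 0 unless @$terms;
    my $found = 0;
    foreach my $term (@$terms) {
        my $re = $case_sensitive ? qr/\Q$term\E/ : qr/\Q$term\E/i;
        $found++ if $text =~ $re;
    }
    return ($boolean eq 'AND' ? $found == @$terms : $found > 0) ? 1 : 0;
}
```

Because the subroutine already counts how many terms were found, the same count could be reported to the user as a crude measure of relevance.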
Rarely on a site is there a directory tree in which every HTML file and directory is available to the public. On my own site, there are many protected directories that require a user ID and password in order to access. There are also a number of experimental files, backup files, or other files that are not linked to the main site and are not for public consumption. This rudimentary script searches all files on the site whether they are protected or not.
There are two very popular grepping search engines available on the Web: Htgrep by Oscar Nierstrasz and Matt's Simple Search Engine by Matthew M. Wright, author of the famous Matt's Script Archive. Both are written in Perl and each has a little something to recommend it. Both solve many of these problems and provide added functionality.
Implementing a Grepping Search Engine with Matt's Simple Search Engine
Matt's Simple Search Engine can be found in Matt's Script Archive at http://www.worldwidemart.com/scripts/, one of the most popular Perl script archives on the Web.
Implementing Matt's search engine is fairly simple: Just get the distribution archive, install it on your site, configure it, and create a search form. To configure the script, you need to edit several lines at the top to point to the base directory and the base URL for the site. The base directory tells the script where to find the files to search, and the base URL is used to create links to the found pages. You also need to insert a title to put on the resulting page and furnish links for the home page and search page.
Because Matt's script does not do recursion, you also need to specify all the subdirectories you want searched. This can be tedious to maintain as your site changes, so you may want to modify the file-finding script from the previous example and combine it with calls to Matt's engine to perform the search.
Once you're finished configuring, you need to create a page that incorporates something similar to the following HTML fragment, which appears on the CD-ROMs as Mattform.txt:
<FORM method=POST
action="http://worldwidemart.com/scripts/cgi-bin/demos/search.cgi">
<CENTER><TABLE border>
<TR>
<TH>Text to Search For: </TH>
<TH><INPUT type=text name="terms" size=40><BR></TH>
</TR><TR>
<TH>Boolean: <SELECT name="boolean">
<OPTION>AND
<OPTION>OR
</SELECT> </TH><TH>Case <SELECT name="case">
<OPTION>Insensitive
<OPTION>Sensitive
</SELECT><BR></TH>
</TR><TR>
<TH colspan=2><INPUT type=submit value="Search!">
<INPUT type=reset><BR></TH>
</TR></TABLE></CENTER></FORM>
<HR size=7 width=75%><P>
This form produces a Web page similar to Figure 57.2.
You may want to design your own search interface. If so, your form needs to present three parameters to the search script: terms (the text to search for), boolean (AND or OR), and case (Insensitive or Sensitive).
| TIP |
Make sure you use the POST method to call Matt's Simple Search Engine. If you use GET, the script won't work since Matt's script reads form input from <STDIN>. |
The result of a search using Matt's Simple Search Engine interface will look similar to Figure 57.3.
Figure 57.3 : The results page from Matt's Simple Search Engine provides links to the found pages.
Notice that each found page is represented by a link to that page. The search terms are also provided, along with the Boolean and case sensitivity settings.
Matt's script works fine and is fairly fast. It took 3 CPU seconds and about 10 elapsed seconds to search about 250 files on my site.
There are some desirable features that are lacking, however. For example, only the titles of found files are displayed. There is no context to indicate whether the search term is merely mentioned in the file or whether significant information about the term is contained in it. When presented with a list of dozens of files, as the result of a search, with no way to distinguish between them, users may become weary of trying to find the information and visit a different site.
File titles are presented in no particular order, which is not very helpful in determining their relevance. It also does not indicate how many times a search term was found in a particular file; or, in the case of multiple-word search terms, whether the words were found in close proximity. The user has no control over partial matches such as finding "state" within "estate" and "intestate." Whatever the user types becomes the search string.
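Partial matches of this kind can be avoided with Perl's word-boundary anchor, \b. The whole_word_match subroutine below is a hypothetical sketch of the technique and is not part of Matt's script.

```perl
use strict;
use warnings;

# Match $word only where it stands alone as a whole word. \b matches the
# boundary between a word character and a non-word character.
sub whole_word_match {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/i ? 1 : 0;
}
```

With this test, a search for "state" matches "the state of Maine" but not "real estate" or "intestate." Offering the user a choice between whole-word and partial matching is then a matter of including or omitting the two \b anchors.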
In addition, there are various implementation problems with this simple search engine. Because it does not support recursion, control over which directories are searched rests entirely in the hands of the Webmaster, who must remember to add new directories to the variable in the script file. Files or directories also are not easily excluded from a search. There is also no limit to the number of files that can be returned, nor are stop words ignored. Given the way that directories must be explicitly specified, this may not seem to be a big drawback, but what if you have painstakingly added all directories on your site to the script and someone searches for the word "the"? A better way is definitely needed to control the directories that are searched.
Fortunately, Htgrep satisfies many of these objections.
Implementing a Grepping Search Engine Using Htgrep
Htgrep, written by Oscar Nierstrasz, can be obtained at http://iamwww.unibe.ch/~scg/Src/Doc/htgrep.html or in the Software Composition Group Software Archives at http://iamwww.unibe.ch/~scg/Src/. It used to be part of a package called PerlLib. PerlLib is no longer supported, although most of the scripts formerly in PerlLib can be found at CU Online: http://www.cu-online.com/pls.html.
A major difference between Htgrep and Matt's script is that Htgrep automatically recurses subdirectories. Once you have installed the Perl script Htgrep.pl and the associated scripts Find.pl, Html.pl, and Bib.pl, you configure the base directory by changing a variable at the beginning of Htgrep.pl. Other variables you configure include the path to users' public HTML directories and any pseudo-URLs (URLs that have been aliased) that you want included in the search.
Included in the package is a basic search form and a basic CGI wrapper script that can be used to control the behavior of Htgrep.pl. The CGI wrapper appears in Listing 57.4.
Listing 57.4 Htgrep.cgi: A CGI Wrapper to Call HTGREP
#! /usr/local/bin/perl
#
# htgrep - cgi-bin script to query a database of HTML paragraphs
#
# NB: this script may have to be installed as "htgrep.cgi"
# to run as a CGI script.
# Copyright (c) 1995 Oscar Nierstrasz
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or (at
# your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program (as the file COPYING in the main directory of
# the distribution); if not, write to the Free Software Foundation,
# Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
# This script and friends can be found at:
#
# http://iamwww.unibe.ch/~scg/Src/
#
# Author: Oscar Nierstrasz (oscar@iam.unibe.ch)
# include dir for htgrep
$PERLLIB_INC = "/home/scg/local/perl/lib";
unshift(@INC,$PERLLIB_INC);
require("htgrep.pl");
# Pick up tags from the environment:
&htgrep'settags($ENV{'PATH_INFO'});
&htgrep'settags($ENV{'QUERY_STRING'});
&htgrep'doit;
As you can see, you'll need to configure the location of your Perl library files. The CGI wrapper assumes that Find.pl, which was used in an earlier example, is located in the library. You can find Find.pl in the Htgrep distribution if you don't already have it.
| TIP |
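The Htgrep wrapper script picks up its settings from two CGI environment variables: first from any extra path information that follows the script name in the URL, using $ENV{'PATH_INFO'}, and then from the query string of a GET request, using $ENV{'QUERY_STRING'}. Note that data sent with the POST method arrives on STDIN rather than in either of these variables. |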
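If you have not worked with a CGI query string before, the following Perl sketch shows how one is typically decoded into name and value pairs. This is a generic illustration, not the code Htgrep uses internally. The parse_query subroutine is hypothetical, and the parameter names in the comments (isindex, case) are taken from the form in Listing 57.5.

```perl
use strict;
use warnings;

# Split a CGI query string such as "isindex=hearing+aids&case=yes"
# into a hash of parameter names and decoded values.
sub parse_query {
    my ($query) = @_;
    $query = '' unless defined $query;
    my %param;
    foreach my $pair (split /&/, $query) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        foreach ($name, $value) {
            tr/+/ /;                               # + stands for a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX stands for one byte
        }
        $param{$name} = $value;
    }
    return %param;
}

my %param = parse_query($ENV{'QUERY_STRING'});
```

In this sketch, a parameter that appears more than once keeps only its last value. A form that can send the same name several times would need the values collected into a list instead.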
Once you've configured the CGI wrapper, you need to build a form for your users to specify parameters. The form provided with the distribution appears in Listing 57.5.
Listing 57.5 Htform.txt: Example Form for Use with HTGREP
<H2>Generic Form</H2>
<FORM ACTION="/~scg/cgi-bin/htgrep.cgi">
<P>
<INPUT
NAME="file"
SIZE=30
VALUE="/~scg/Src/Doc/htgrep.html"
>
<!
VALUE="/~scg/Src/Doc/htgrep.html"
!>
<B>File to search</B> (relative to WWW home)
<BR>
<INPUT NAME="isindex" SIZE=30>
<B>Query</B>
<INPUT TYPE="submit" VALUE="Submit">
<INPUT TYPE="reset" VALUE="Reset">
<DL>
<DT><B>Query style:</B>
<DD>
<INPUT type="checkbox" name="case" value="yes">
Case Sensitive
<DD>
<INPUT type="radio" name="boolean" value="auto" checked="yes">
Automatic Keyword/Regex
<INPUT type="radio" name="boolean" value="yes">
Multiple Keywords
<INPUT type="radio" name="boolean" value="no">
Regular Expression
<DT><B>HTML Files:</B>
<DD>
<INPUT type="radio" name="style" value="none" checked="yes">
Ordinary Paragraphs
<INPUT type="radio" name="style" value="ol">
Numbered list
<INPUT type="radio" name="style" value="ul">
Bullet list
<INPUT type="radio" name="style" value="dl">
Description list
<DT><B>Plain Text:</B>
<INPUT type="radio" name="style" value="pre">
(preformatted)
<DD>
<INPUT type="checkbox" name="grab" value="yes">
Make URLs live (works with plain text only)
<DT><B>Refer Bibliography files:</B>
<INPUT type="checkbox" name="refer" value="yes">
<DD>
<INPUT type="checkbox" name="abstract" value="yes">
Show Abstract
<INPUT type="checkbox" name="ftpstyle" value="dir">
Link to directories, not files (for refer files)
<DD>
<INPUT type="radio" name="style" value="ul">
Bullet list (instead of numbered)
<DT><B>Max records to return:</B>
<INPUT NAME="max" VALUE="250" SIZE=10>
</DL>
</FORM>
This code produces a form similar to the one in Figure 57.4.
Figure 57.4 : You can use the generic form provided with Htgrep to allow user input.
A welcome feature of Htgrep is its support for regular expressions. Although most users are probably not well versed in the use of regular expressions, most at least can understand using the asterisk to fill out portions of words. Additionally, unless you use regular expressions, Htgrep searches on whole words, which is a nice feature. One search engine I use frequently automatically searches on words that are close to the word I specify. This can be maddening when you want to be specific, so I prefer the whole word matching method.
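The difference between the two matching styles can be sketched in a few lines of Perl. This is an illustration only, not Htgrep's code, and make_matcher is a name invented for the example. Note that in a Perl regular expression the wildcard for "the rest of the word" is written \w* rather than a bare asterisk.

```perl
use strict;
use warnings;

# Build a matcher for one search term. quotemeta() escapes regex
# metacharacters so a keyword is matched literally, and \b anchors
# the keyword at word boundaries.
sub make_matcher {
    my ($term, $is_regex, $case_sensitive) = @_;
    my $pattern = $is_regex ? $term : '\b' . quotemeta($term) . '\b';
    my $re = $case_sensitive ? qr/$pattern/ : qr/$pattern/i;
    return sub { $_[0] =~ $re };
}

my $word  = make_matcher('index', 0, 0);     # whole-word keyword
my $regex = make_matcher('index\w*', 1, 0);  # regular expression

for my $text ('Build an index first', 'Indexing takes time') {
    printf "%-22s word:%s regex:%s\n", $text,
        ($word->($text)  ? 'yes' : 'no'),
        ($regex->($text) ? 'yes' : 'no');
}
```

The whole-word matcher accepts the first sentence but not the second, because "Indexing" is a longer word; the regular expression accepts both.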
Using the default search form, you can also determine the format of the resulting hits page: either full paragraphs or various types of listings. The ability to return full paragraphs was the key in my decision to use Htgrep on my site. Because a high proportion of the words that users are likely to search on occur in many documents on the site, I felt it was important to provide this context to help guide users to relevant pages quickly and easily.
To enable the return of entire paragraphs from a search, Htgrep takes a different approach to finding text in files. Rather than assembling one huge string from all the lines in the files, Htgrep allows you to specify a record delimiter and then searches each record in a file. For example, you may decide that you want HTML paragraph tags (<P>) to be your record delimiter. It is the record orientation of the search that allows Htgrep to return the context for a search hit. Htgrep returns the entire record in which it found the search term. The user thus sees the entire paragraph and can better determine whether the page meets his or her needs. Htgrep does this using Perl's ability to define a record delimiter. This is demonstrated in the following code fragment:
# the default record separator is a blank line
#$separator = "";
$separator = "<P>";
[. . .]
# normally records are separated by blank lines
# if linemode is set, there is one record per line
if ($tags{'linemode'} =~ /yes/i) { $/ = "\n"; }
else { $/ = "$separator"; }
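To see the record separator at work outside Htgrep, consider the following self-contained sketch. It reads from an in-memory string instead of a file, and the matching_records routine is my own illustration rather than anything in the Htgrep distribution.

```perl
use strict;
use warnings;

# Return every record of $text (records end at $separator) that
# contains $term as a whole word, ignoring case.
sub matching_records {
    my ($text, $separator, $term) = @_;
    local $/ = $separator;            # record delimiter used by <$fh>
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my @hits;
    while (my $record = <$fh>) {
        chomp $record;                # chomp strips the trailing $/
        push @hits, $record if $record =~ /\b\Q$term\E\b/i;
    }
    close $fh;
    return @hits;
}

my $page = "Intro text.<P>Htgrep searches records.<P>Nothing here.<P>More on Htgrep.";
print "HIT: $_\n" for matching_records($page, "<P>", "htgrep");
```

Each hit is a whole record, so the caller gets the surrounding paragraph for free. The value of $/ is a literal string, so a page that writes its paragraph tags as <p> would be read as one long record.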
Unfortunately, a side effect of this context approach is that multiple paragraphs from each found page can be returned. While this may help further guide the user, many may find it an annoyance. You may want to modify the Htgrep code to cause it to proceed to the next file upon finding a search hit. Doing this, however, might cause the search to skip particularly relevant material. What's really needed is a more sophisticated approach that evaluates the fitness of a document based on other rules such as number of hits per document and the proximity of words found as a result of multiple-word searches. It is difficult to add this level of sophistication to a grepping search engine. As you'll see later in this chapter, such features can be found in some indexing search engines.
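As a rough idea of what the simplest such rule might look like, the sketch below orders documents by the number of whole-word hits they contain. It is not an Htgrep feature; count_hits and rank_documents are hypothetical names, and the documents are passed in as strings to keep the example self-contained.

```perl
use strict;
use warnings;

# Count whole-word, case-insensitive occurrences of each term in a text.
sub count_hits {
    my ($text, @terms) = @_;
    my $hits = 0;
    foreach my $term (@terms) {
        my @found = $text =~ /\b\Q$term\E\b/gi;
        $hits += @found;
    }
    return $hits;
}

# Return document names ordered by descending hit count, dropping
# documents with no hits. $docs is a hash reference: { name => text }.
sub rank_documents {
    my ($docs, @terms) = @_;
    my %score = map { $_ => count_hits($docs->{$_}, @terms) } keys %$docs;
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b }
                 grep { $score{$_} > 0 } keys %score;
    return @ranked;
}

my %docs = (
    'a.html' => 'A search engine makes search easy.',
    'b.html' => 'Pick an engine.',
    'c.html' => 'Nothing relevant.',
);
print join(' ', rank_documents(\%docs, 'search', 'engine')), "\n";
```

Even this crude score requires reading every document in full for every query, which is part of why ranking sits more naturally in an engine that works from an index.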
Htgrep also allows you to set the maximum number of records to return. This is an important feature because there is no provision in Htgrep to ignore stop words, so a search on a very common word can match a great many records. Unfortunately, there is also no way to prevent Htgrep from returning really long records. For example, let's say you define <P> as your record delimiter. If you add a new document that uses <p> for paragraphs (the delimiter is case sensitive), or if you have long material contained within <PRE> tags, the result can be huge amounts of text returned on the results page. To solve this problem, I modified the code to include a word counter that skips any record of 200 words or more. The modification is included in the following code fragment:
# this is where Htgrep actually searches the file
while (<FILE>) {
# call the subroutine that evaluates the search terms
$queryCommand
# optional filter definition
$filter
# remove all the nasty tags that can disturb paragraph display
s/\<table/\<p/g ;
s/\<hr/\<p/g ;
s/\<HR/\<p/g ;
s/\<IMG/\<p/g ;
s/\<img/\<p/g ;
# transform relative URLs in found pages to full URLs
if ((/\<A HREF/) && !(/http/) && !(/home/)) {
s/\<A HREF \= \"/\<A HREF \= \"\$dirname/g ;}
print \$url;
# count the number of words
\@words = split(' ', \$_);
\$wordcount = 0;
foreach \$word (\@words) {
\$wordcount++;
}
# if it's too large, don't print the record
if (\$wordcount >= 200) {
print "\<H4\>Excerpt would be greater than 200 \n";
print "words. Select link above to see entire \n";
print "page.\<\/H4\>\\n";
# skip to next record
next;
}
# otherwise print out the record
print;
# if you've printed up to the limit, stop
last if (++\$count == $maxcount);
}
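The word-count test can also be written as a small standalone routine, which is easier to try out than a fragment of Htgrep's search loop. The sketch below keeps the 200-word threshold from the fragment above; the name excerpt_or_notice is mine.

```perl
use strict;
use warnings;

my $MAX_WORDS = 200;   # same threshold as the fragment above

# Return the record unchanged, or a short notice when it is too long.
sub excerpt_or_notice {
    my ($record) = @_;
    # split ' ' breaks on runs of whitespace and ignores leading blanks
    my @words = split ' ', $record;
    my $wordcount = scalar @words;
    if ($wordcount >= $MAX_WORDS) {
        return "<H4>Excerpt would be $MAX_WORDS words or more. "
             . "Select link above to see entire page.</H4>\n";
    }
    return $record;
}

print excerpt_or_notice("A short paragraph about search engines."), "\n";
print excerpt_or_notice(join ' ', ('word') x 250);
```

Taking the size of the array replaces the counting loop, and keeping the limit in one variable means the notice can never disagree with the test.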
Another side effect of returning the whole paragraph concerns what else besides text is returned. Because Htgrep grabs the whole paragraph, it also grabs links to images, bits of Java or ActiveX code, and anything else contained in the paragraphs. This is probably not what the user wants when using a search engine. The resulting hits page can contain dozens of large GIFs and take a long time to download.
Because of this limitation, I modified the Htgrep script to remove all <IMG> tags. I must confess, I did this in a decidedly low-tech way by simply replacing all instances of <IMG with <P in all found paragraphs (see the previous example). It's crude but effective. The resulting hits page is devoid of image tags (see Figure 57.5).
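The same low-tech trick can be condensed into a single case-insensitive substitution, which also catches spellings such as <Img or <TABLE that separate upper- and lowercase patterns miss. This is a sketch of the idea rather than the code running on my site, and defuse_tags is a name made up for the example.

```perl
use strict;
use warnings;

# Turn the opening of any TABLE, HR, or IMG tag into a paragraph tag.
# The rest of the tag is left in place as harmless unknown attributes.
sub defuse_tags {
    my ($record) = @_;
    $record =~ s/<(?:table|hr|img)\b/<p/gi;
    return $record;
}

print defuse_tags('<Img SRC="big.gif"> Some text <HR> more text'), "\n";
```

The \b keeps the pattern from touching longer tag names that merely begin with the same letters. Like the original, this is a display hack and not a real HTML filter.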
You'll notice that another script modification produces a hyperlink to the found page, something that the base Htgrep script only provides if you elect plain text formatting.
Htgrep has a security problem that you need to take care of in the wrapper script. Because the search string can be a Perl regular expression, Htgrep evaluates it using Perl's eval function, and this can allow your users to execute arbitrary commands on your Web server. To prevent this from happening, be sure to prescreen search terms for dangerous characters or expressions, especially !sh, in the CGI wrapper that you use to call Htgrep.
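One conservative way to prescreen is to accept only queries built from a short list of known-safe characters and to reject everything else. The sketch below takes that approach. The screen_query routine, the allowed character set, and the 100-character cap are all assumptions of mine, not part of Htgrep, and the price of this strictness is that users can no longer enter regular expressions.

```perl
use strict;
use warnings;

# Hypothetical prescreen for a wrapper script. Returns the query if it
# looks harmless and undef otherwise.
sub screen_query {
    my ($query) = @_;
    return unless defined $query;
    return if length($query) > 100;          # arbitrary length cap
    # Accept only letters, digits, spaces, periods, underscores, hyphens.
    return unless $query =~ /\A[A-Za-z0-9 ._-]+\z/;
    return $query;
}

for my $q ('search engine', 'foo; !sh', '@{[ system "ls" ]}') {
    my $verdict = defined screen_query($q) ? 'accepted' : 'rejected';
    print "$verdict: $q\n";
}
```

A wrapper would call this on the decoded query before handing anything to Htgrep and return an error page when the result is undefined. If you want to keep regular expression searches, you need a more careful screen than this one.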
Another nice feature of Htgrep is that, on NCSA servers, it ignores any directories that contain an access control file (.htaccess). Chances are, you don't want users searching these directories anyway. If you want finer control over which directories are searched, you can put a .htaccess file in your backup, administration, or internal directories. Other search engines require you to explicitly exclude such directories from the search, which leads to administrative overhead for the poor Webmaster.
Implementing an Indexing Search Engine
As seen in the previous discussion, implementing a grepping search engine can be quite easy. I've discussed two popular Perl-based grepping engines, but there are many more with various features. The grepping approach trades minimal disk usage and up-to-the-minute timeliness for high CPU usage and long elapsed times. You certainly can't beat the price (free) or the ease of setup and maintenance.
However, more sophisticated searching is hard to implement using the grepping approach. For larger, more complex sites, an indexing search engine can be the best choice.
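The core idea behind every indexing engine is the inverted index: a table that maps each word to the documents containing it, built once and consulted at query time. The following sketch shows the idea at toy scale. It is not how any of the engines discussed here store their indexes; build_index and lookup are illustrative names, and the documents are passed in as strings to keep the example self-contained.

```perl
use strict;
use warnings;

# Build an inverted index: word => { document name => occurrence count }.
# $docs is a hash reference: { name => text }.
sub build_index {
    my ($docs) = @_;
    my %index;
    foreach my $name (keys %$docs) {
        my $text = $docs->{$name};
        $text =~ s/<[^>]*>/ /g;           # crude removal of HTML tags
        while ($text =~ /(\w+)/g) {
            $index{lc $1}{$name}++;
        }
    }
    return \%index;
}

# Look up one word without rescanning any document text.
sub lookup {
    my ($index, $word) = @_;
    my $entry = $index->{lc $word};
    return () unless $entry;
    my @names = sort keys %$entry;
    return @names;
}

my $index = build_index({
    'a.html' => '<P>Search engines index pages',
    'b.html' => 'An index of pages',
});
print join(' ', lookup($index, 'index')), "\n";
```

All the expensive work happens in build_index, and a query becomes a single hash lookup. The stored counts are the raw material for the kind of ranking that a grepping engine finds hard to provide.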
There are several indexing search engines available for use on
your Web site. In addition to an array of shareware or free engines,
several of the large commercial search sites make their technology
available for use on a local site. Commercial indexing search
engines include those listed in Table 57.2.
| Company | Tool Name | URL | Free? |
| Verity | Topic Internet Server | http://www.verity.com/products/tis_data.html | No |
| Thunderstone | The Webinator Web Index & Retrieval System | http://www.thunderstone.com/webinator/ | Yes (shareware) |
| AltaVista | AltaVista Directory | http://altavista.software.digital.com/products/directory/nfintro.htm | No |
| Inmagic/Lycos | DB/Text Navigation Server | http://www.inmagic.com/pr_dbnav.html | No |
| Excite | Excite for Web Servers | http://www.excite.com/navigate/home.html | Yes |
| Netcreations | Pinpoint | http://www.Netcreations.com/pinpoint/ | No (free trial) |
In the sections that follow, I discuss implementing two indexing search engines: WebGlimpse, developed at the University of Arizona, and Excite for Web Servers, free from Excite.
Implementing an Indexing Search Engine with WebGlimpse
WebGlimpse is a good example of an almost-freeware indexing search engine. Created by the University of Arizona computer science department, WebGlimpse is available for free for nonprofit use. A small licensing fee is charged to commercial users. The WebGlimpse site indexing system is based on the high-performance grepping tool Glimpse (which stands for global implicit search). A recent search of the Web turned up hundreds of sites that are using this popular tool or its precursor, GlimpseHTTP. A partial list of sites is available at http://glimpse.cs.arizona.edu/ghttp/sites.html. Glimpse is also used as a basis for the Harvest Information Discovery and Access System (http://harvest.cs.colorado.edu/).
WebGlimpse can be obtained at http://glimpse.cs.arizona.edu/webglimpse/. It comprises Glimpse, a C-based enhanced grepping engine; Glimpseindex, another C program that creates the index; the WebGlimpse script itself, written in Perl; and an assortment of Perl utilities that you use to create and manage your indexes.
Installation is mostly automated but definitely not foolproof. Once installed, you need to run a Perl script that creates the WebGlimpse index by using Glimpseindex. One of WebGlimpse's claims to fame is that its space requirements for the index are minimal (less than 10% of the source). Other welcome features include the ability to index only pages that have been added since the last index, a facility to index off-site links, the ability to set a tolerance for spelling errors, and the ability to establish neighborhoods. Neighborhoods are defined as all links within an arbitrary number of hops from a page or all pages within a directory.
Building the index can consume quite a lot of time. With WebGlimpse's option for indexing external links as well as local pages turned on, it took 45 minutes to index almost 600 files on my site. Once that index was done, however, a re-index without the external option took only a few minutes.
Once the index has been established, you can use a cron job to rerun the indexing script periodically and keep the index current. The installation routine even creates the job for you.
Using the WebGlimpse Perl script (created by the install) to perform searches is easy. After aliasing to the proper directory, you call the script with a parameter that indicates where the index resides. The user sees a menu similar to the one in Figure 57.6 if the script is called directly.
Figure 57.6 : Calling WebGlimpse directly produces a default search form.
Alternatively, you can include either of two code fragments in your Web pages to provide a nicer-looking interface. The two interface styles are created using the HTML code fragments in Listing 57.6.
Listing 57.6 Glimform.txt-Two Forms for Calling WebGlimpse
<H2>Basic WebGlimpse Interface</H2>
<CENTER>
<TABLE border=5><TR border=0>
<TD align=center valign=middle>
<A HREF=http://glimpse.cs.arizona.edu/webglimpse>
<IMG src=/images/glimpse-eye.jpg alt="WG" align=middle width=50><BR>
<FONT size=-3>WebGlimpse</FONT></A></TD>
<TD>
<FORM method=get ACTION=/$CGIBIN/webglimpse$ARCHIVEPWD>
<INPUT NAME=query size=20>
<INPUT TYPE=submit VALUE="Search">
<INPUT name=file type=hidden value="$FILE">
<A HREF=/$CGIBIN/webglimpse-fullsearch$ARCHIVEPWD?file=$FILE>
Search Options</A></TD></TR>
<TR><TD colspan=2>
Search: <INPUT TYPE=radio NAME=scope VALUE=neighbor CHECKED>
The neighborhood of this page
<INPUT TYPE=radio NAME=scope VALUE=full>The full archive
</TD></TR></FORM></TABLE></CENTER><HR>
<H2>Full-Featured WebGlimpse Interface</H2>
<TABLE border=5>
<TR><TD align=center valign=middle>
<A HREF=http://glimpse.cs.arizona.edu/webglimpse>
<IMG src="/images/glimpse-eye.jpg" align=middle></TD>
<TD align=center valign=middle>
<A HREF=http://glimpse.cs.arizona.edu/webglimpse>
<FONT size=+3>WebGlimpse </A> Search<BR></FONT></TD>
</TR>
<TR><TD colspan=2>
<FORM method=get ACTION=>
<INPUT name=file type=hidden value=/home/msmith/public_html/big/index.html>
Search: <INPUT TYPE=radio NAME=scope VALUE=neighbor>
The neighborhood of <a href="">the ACNielsen Web Site </A>
<INPUT TYPE=radio NAME=scope VALUE=full CHECKED>The full archive:
<A HREF="">the ACNielsen Site including links offsite</a>
</TD></TR>
<TR><TD colspan=2>
String to search for: <INPUT NAME=query size=30>
<INPUT TYPE=submit VALUE=Submit>
<BR>
<CENTER>
<INPUT NAME=case TYPE=checkbox>Case sensitive
<!SPACES>    <INPUT NAME=whole TYPE=checkbox>Partial match
<!SPACES>    <INPUT NAME=lines TYPE=checkbox>Jump to line
<!SPACES>    <SELECT NAME=errors align=right>
<OPTION>0
<OPTION>1
<OPTION>2
</SELECT>
misspellings allowed
<BR>
</CENTER>
Return only files modified within the last <INPUT NAME=age size=5> days.
<BR>
Maximum number of files returned:
<SELECT NAME=maxfiles>
<OPTION>10
<OPTION selected>50
<OPTION>100
<OPTION>1000
</SELECT>
<BR>Maximum number of matches per file returned:
<SELECT NAME=maxlines>
<OPTION>10
<OPTION selected>30
<OPTION>50
<OPTION>500
</SELECT>
<BR>
</FORM>
</TD></TR>
<TR><TD colspan=2>
<CENTER>
<FONT size=-2><A HREF=http://glimpse.cs.arizona.edu>
Glimpse</A> and
<A HREF=http://glimpse.cs.arizona.edu/webglimpse>
WebGlimpse</A>, Copyright © 1996, Arizona Board of Regents.
</CENTER>
</FONT></TD></TR>
</TABLE></CENTER>
</CENTER>
The sample page in Figure 57.7 demonstrates one of the two available user interfaces for WebGlimpse.
The first interface is short, sweet, and perfect for an unobtrusive search facility. The second interface enables the user to select a neighborhood search or a full archive search, choose case sensitivity, partial match, and spelling error settings, optionally jump to the line in a found document, and control the date and number of documents returned.
One annoying aspect of the WebGlimpse indexing routine is that it automatically appends the user interface code at the bottom of each page it indexes, unless you comment out the appropriate line. While this feature is a nice service for those who want it, being able to turn it off is a must. My personal preference is to add a link to the search facility rather than the entire user interface. However, due to WebGlimpse's concept of page neighborhood, putting this code on every page can make sense.
A page neighborhood is obviously context sensitive. For example, you can define a page's neighborhood as every other page that is within two jumps (a link to a page that links to one other page). If page A has a link to page B and page C, and each of those pages links to one other page, pages BA and CA, then page A's neighborhood is pages B, C, BA, and CA. However, if you follow the links to page BA, for example, you may find it links to pages D and E, making its neighborhood much different. Because the context determines the neighborhood, you need a unique call to WebGlimpse on each page rather than a generic (search the whole site) search page. By the same token, if you define a neighborhood as all files in the same directory, the context of the WebGlimpse search changes depending on the starting page.
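The hop-based definition amounts to a breadth-first walk of the site's link graph that stops after a fixed number of hops. The sketch below works through the example from the preceding paragraph. It is my own illustration of the concept, not WebGlimpse's code, and the link table is typed in by hand instead of being extracted from real pages.

```perl
use strict;
use warnings;

# Return the pages reachable from $start within $max_hops links,
# excluding $start itself. $links maps a page to the pages it links to.
sub neighborhood {
    my ($links, $start, $max_hops) = @_;
    my %hops  = ($start => 0);
    my @queue = ($start);
    while (@queue) {
        my $page = shift @queue;
        next if $hops{$page} >= $max_hops;     # do not expand any further
        foreach my $target (@{ $links->{$page} || [] }) {
            next if exists $hops{$target};     # already reached
            $hops{$target} = $hops{$page} + 1;
            push @queue, $target;
        }
    }
    delete $hops{$start};
    my @pages = sort keys %hops;
    return @pages;
}

my %links = (
    A  => [qw(B C)],
    B  => ['BA'],
    C  => ['CA'],
    BA => [qw(D E)],
);
print "A:  ", join(', ', neighborhood(\%links, 'A', 2)), "\n";
print "BA: ", join(', ', neighborhood(\%links, 'BA', 2)), "\n";
```

With a limit of two hops, page A's neighborhood comes out as B, BA, C, and CA, while page BA's is just D and E. The same site yields a different search scope for each starting page, which is the reason each page needs its own call.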
If your site is massive, or if you want to allow for more context-sensitive searching, you may prefer to have unique calls to WebGlimpse embedded on each page of your site. For example, you might have a site that offers a number of Web utilities. Each utility is available in a variety of languages and for a variety of operating systems. If a user is reading about one of the programs and wants to know more about its implementation in Perl, he or she doesn't want to search the entire site and then have to wade through scads of listings for irrelevant utilities. In this instance, a neighborhood search is appropriate. If the site is organized properly, the information should be available either within a few hops or within the same directory.
The output of WebGlimpse looks similar to Figure 57.8.
This output from WebGlimpse shows that a link is provided to the found document. In addition, context is provided by including all lines in which the search terms are found. WebGlimpse automatically limits the number of found files as well.
An interesting feature of WebGlimpse is its setting for spelling errors. The example given in the documentation is a search for the name Schwarzkopf. Many people do not know how to spell this name. Therefore, there may be spelling errors in the user's search terms, in the documents on the site, or in both. Because WebGlimpse uses Glimpse, which in turn builds on the powerful agrep, it supports approximate matching (that is, it allows for spelling errors). So if the material on your site comes from a variety of sources, varies in grammatical quality, or your users can't spell, the ability to be forgiving of spelling errors is a definite plus.
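"Misspellings allowed" can be read as a limit on edit distance: the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below computes that distance with the textbook dynamic-programming method. It only illustrates the idea; agrep's own algorithm is different and far faster, and the routine names here are mine.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Edit distance between two strings, one row of the table at a time.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,           # deletion
                           $cur[$j - 1] + 1,        # insertion
                           $prev[$j - 1] + $cost);  # substitution
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# True if $word is within $errors edits of $term, ignoring case.
sub approx_match {
    my ($term, $word, $errors) = @_;
    return edit_distance(lc($term), lc($word)) <= $errors;
}

for my $word (qw(Schwartzkopf shwarzkopf Glimpse)) {
    printf "%-14s %s\n", $word,
        approx_match('Schwarzkopf', $word, 1) ? 'match' : 'no match';
}
```

With one error allowed, both misspellings of Schwarzkopf are accepted and an unrelated word is not. Raising the limit makes the search more forgiving but also lets in more false hits.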
WebGlimpse basically uses a modified grepping approach but applies the grepping to an index. Although there is some flexibility offered in the spelling error tolerance feature, complex searches are not offered and there is no ranking of results by confidence level.
WebGlimpse takes the grepping approach just about as far as it can go. To achieve better results, a more complicated search methodology is needed.
Implementing an Indexing Search Engine with Excite for Web Servers
Excite for Web Servers (EWS), available from Excite at http://www.excite.com/navigate/home.html, is a full-featured and fast indexing search tool based on the same technology as the Excite search service. Despite being a commercial search engine, it is available for use on your Web site for free. The only restriction in the user license is that you cannot use it to provide services for a third party (by establishing a service to compete with Excite, for example).
EWS is not strictly a keyword search engine. Excite claims that EWS understands plain English queries such as, "How to stay healthy by eating well" or "Learn to speak Tagalog." Queries using concepts are more likely to produce effective results than simple keyword searches, according to the company.
When you run a search, EWS lists search results in decreasing order of confidence. Each result consists of a title, a URL, a confidence rating, and an automatically generated summary of what the page is about. Excite also supports relevance feedback, or query-by-example searching. Using this technique, if you visit a found page and find it is pretty much what you're looking for, you can return to the search results and click the icon next to the listing to initiate another search. The subsequent search uses the found page as a parameter and returns similar pages.
Installing Excite is described as Plug and Play, and it couldn't be easier. Download the distribution archive (along with the C++ libraries if you need them), run a shell script that asks a few questions, and you're just about ready to go. You need to run an administrative script that creates the index, and another script that creates the search page. Both scripts are run from Web forms.
EWS took 16 minutes, 40 seconds of CPU time to index my site; elapsed time was 23 minutes. It thoughtfully provided status pages that allowed me to keep tabs on the progress of the indexing. EWS created an index that was around 7M in size on a collection of 4,490 files comprising slightly more than 90M. It even e-mailed me when it was done.
After generating the index, you then generate the search page using an HTML form. EWS creates a page that includes a search form and a link to the custom-generated search script for this collection. The resulting search page looks similar to Figure 57.9.
Notice that the form does not provide options for case sensitivity or boolean searches. This is because Excite employs concept matching to do its searching. The company suggests creating queries that are descriptions of information rather than lists of keywords:
"Excite for Web Servers will search for documents that are a best match for the words in your query. Excite for Web Servers will also search for documents that are about the same concepts that your query describes, so sometimes Excite for Web Servers will bring back articles that don't mention any of the words in your original query."
The more search words, the better the query. Unfortunately, because the search algorithm is proprietary, you just have to trust that EWS will perform.
Excite for Web Servers uses Excite's proprietary Intelligent Concept Extraction (ICE) search method. An excellent discussion of search strategies can be found on Excite's site at http://www.excite.com/ice/tech.html. Although Excite does not provide a lot of detail about their patent-pending proprietary search techniques, ICE is described as a means to find and score documents based on a correlation of their concepts, as well as actual keywords. Excite states that this ability to go beyond simple boolean searches of keywords is the key to their technology.
Using techniques similar to Latent Semantic Indexing, Excite claims to perform rapid searches without significant resources and to maintain performance as the size of the index is scaled up. According to Excite, "Unlike other systems which need more time to perform a query as the size of the database increases, the Excite search engine can perform most queries in a constant amount of time."
A typical results page resembles Figure 57.10.
Producing this results page took a little more than a second of CPU time and five or six seconds of elapsed time. You'll notice that, although there are links to the found pages, at first glance there doesn't appear to be any context provided. However, if you click the summary link, you see an automatically generated summary of the page contents. EWS ignores stop words. These words are maintained in a table, but there appears to be no way to edit or add to them.
Excite for Web Servers is an impressive search tool that is easy to install and easy to implement. It creates a small index file, and its searches are rapid and consume few system resources. The inability to maintain the stop word table and the lack of significant documentation on the operation of the system are its only drawbacks. But given its ease of use and strong features, such complaints are minor.