How it works

ht://Dig © 1995-1998 Andrew Scherpbier
Please see the file COPYING for license information.

The system performs three major tasks, which must be carried out in the following order:

Digging

Before you can search, a database of all the documents that need to be searched has to be created.

Merging

Once the document database has been created, it has to be converted into something that can be searched quickly. Also, if you want to update only the documents that have changed, these changes have to be merged into the searchable database.
Even though this task could be performed at the same time as the digging, it is a separate process for efficiency reasons. Keeping it separate also allows more flexibility in what actually happens at merge time.

Searching

Finally, the databases that were created in the previous steps can be used to perform actual searches. Normally, searches will be invoked by a CGI program which gets its input from the user through an HTML form.

Digging

Digging is the first step in creating a search database. This system uses the word digging where other systems call it harvesting or gathering. In the ht://Dig system, the program htdig performs the information gathering stage. In this process, the program acts as a regular web user, except that it follows the hyperlinks it comes across. (Actually, it does not follow all of them, only those within the domain it needs to gather information on.)
Each document it visits is examined, and all the unique words in the document are extracted and stored.

The digging process will create at least two files. The first one is the list of all the words and the second one is a database of URLs and information about the URLs.
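
As a rough illustration of the digging stage, the following Python sketch crawls from a start URL, stays on the same host, and builds a word list plus a URL database. It is not how htdig is implemented: the names `PageParser`, `dig`, `fetch`, and `PAGES`, the in-memory data layout, and the same-host rule are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import re


class PageParser(HTMLParser):
    """Collects hyperlinks and text from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def dig(start_url, fetch):
    """Crawl from start_url, following only links on the same host.

    `fetch` is a hypothetical callable mapping a URL to HTML text (or None).
    Returns (word_list, url_db): word -> set of URLs, and URL -> metadata.
    """
    host = urlparse(start_url).netloc
    queue, seen = [start_url], {start_url}
    word_list, url_db = {}, {}
    while queue:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        # Extract the unique words of this document.
        words = set(re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()))
        url_db[url] = {"word_count": len(words)}
        for word in words:
            word_list.setdefault(word, set()).add(url)
        # Queue links that stay within the start host and are not yet seen.
        for href in parser.links:
            target = urljoin(url, href)
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return word_list, url_db


# An in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="http://other.org/">X</a> hello world',
    "http://example.com/a.html": "hello again",
}
word_list, url_db = dig("http://example.com/", PAGES.get)
```

The link to other.org is skipped because it lies outside the start host, which mirrors the domain restriction described above.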

Merging

Once the digging process has completed, its output needs to be converted into something the search engine can actually use. The htmerge program uses the information from previous digs to create such a database. The term 'merge' is used because the program takes data from several databases and merges them into several other databases. The source databases include those created by the digging process, but also any previously merged databases. These old databases are used if the digging process produced information only for documents that have changed.
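
The idea of merging an incremental dig into an older merged database can be sketched in a few lines of Python. This is only an illustration under assumed data structures (a dictionary mapping each URL to its set of words); the function name `merge` and the layout are hypothetical and say nothing about the file formats htmerge really uses.

```python
def merge(previous, update):
    """Merge an incremental dig into a previously merged database.

    Both arguments map URL -> set of words (a hypothetical in-memory layout).
    Documents present in `update` replace their old entries; all other
    documents are carried over unchanged.
    """
    docs = dict(previous)
    docs.update(update)
    # Build the inverted index (word -> URLs) that a search can consult quickly.
    index = {}
    for url, words in docs.items():
        for word in words:
            index.setdefault(word, set()).add(url)
    return docs, index


previous = {"/a.html": {"old", "text"}, "/b.html": {"stable"}}
update = {"/a.html": {"new", "text"}}  # only /a.html changed since the last dig
docs, index = merge(previous, update)
```

Rebuilding the index from the merged documents means words that disappeared from a changed document (here 'old') no longer match it.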

There are several optional tasks which also fit into the merge phase:

Expiration notification:: The ht://Dig system includes a handy reminder service which allows HTML authors to add some ht://Dig specific meta information to HTML documents. This meta information is used to email authors after a specified date. This is very useful for maintaining lists that mark new items with those annoying 'new' graphics. (Hint: things really aren't all that 'new' anymore after 6 months!)
The htnotify program performs this task.
Fuzzy word index creation:: To allow the searches to use 'fuzzy' algorithms to match words, the htfuzzy program can create indexes for several different algorithms.
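
To give a feel for what a fuzzy word index might look like, here is a minimal Python sketch that groups words by their Soundex key. Soundex is chosen purely as a well-known example of a phonetic algorithm; the text above does not say which algorithms htfuzzy supports, and the names `soundex` and `build_fuzzy_index` are hypothetical.

```python
# Letter groups of classic American Soundex, mapped to digits 1..6.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Return the four-character Soundex key for a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            key += code
        if c not in "hw":  # h and w do not separate letters with the same code
            previous = code
    return (key + "000")[:4]


def build_fuzzy_index(words):
    """Map each Soundex key to the set of words that share it."""
    index = {}
    for word in words:
        index.setdefault(soundex(word), set()).add(word)
    return index


index = build_fuzzy_index(["robert", "rupert", "rubin", "search"])
```

A search for 'robert' could then look up the key of the query word and also return documents containing 'rupert', since both words share one key.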

Searching

Searching is where the users actually get to use all the information that was gathered during the dig and merge stages. The htsearch program performs the actual searches. It produces HTML output which will be seen by the users.
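
The search step can be sketched as a function that takes the query string of a form submission, looks the words up in an index, and returns HTML. This is an illustrative assumption and not htsearch itself: the `words` field name, the rule that every term must match, and the output markup are all invented for the example.

```python
from html import escape
from urllib.parse import parse_qs


def search(query_string, index):
    """Answer a form submission such as 'words=web+search' with an HTML page.

    `index` maps word -> set of URLs; the `words` field name is hypothetical.
    """
    terms = parse_qs(query_string).get("words", [""])[0].lower().split()
    # Require every term to match (an assumption; other policies are possible).
    if terms:
        matches = set.intersection(*(index.get(t, set()) for t in terms))
    else:
        matches = set()
    items = "".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in sorted(matches)
    )
    return "<html><body><ul>{}</ul></body></html>".format(items)


index = {"web": {"/a.html", "/b.html"}, "search": {"/b.html"}}
page = search("words=web+search", index)
```

Here only /b.html contains both query words, so it is the only link in the generated page.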

Andrew Scherpbier <andrew@contigo.com>

Last modified: Wed Jan 1 20:46:33 PST