/xenix: Filing and Indexing
©1987 - Richard A. Bilancia - All Rights Reserved
to be published first in the May 1987 issue of UNIX®/World

At Uniforum in Washington, D.C., this past January, Dave Flack, the Executive Editor of UNIX®/World, was gracious enough to host a UNIX®/World editorial staff dinner meeting. As you might guess, a lot of UNIX® was discussed at that meeting.

At one point in the evening the topic turned to how we all manage massive amounts of written information so that we can retrieve it quickly and easily when necessary. Accordingly, this month’s column describes the two procedures that I shared with the other editors and that I use on my XENIX® system: one to handle the filing of magazine articles, product literature, and incoming correspondence, and a second to handle the automatic retrieval of machine-readable copies of my outgoing correspondence, notes, and articles.

Indexing pieces of paper.
Each month I must see thousands of pieces of paper. Most of them come via the U.S. Postal Service: some in the form of vendor advertising, some in the form of magazines and books, and some in the form of letters from clients and associates. Fortunately most of these pieces of paper can be thrown away, but some I like to save for future reference.

I’ve developed a very simple and easy to implement procedure to automate the filing and retrieval of documents on my XENIX® system. It requires a single file to store the information about the documents and a single command to look up where a document is filed.

The procedure works like this: after I’ve read something and decided that it needs to be filed, I place it in a single pile in the corner of my desk (this includes articles that I’ve torn out of magazines and stapled together). Once a week I go through the pile and assign each document a sequential number, which I also write in its upper right-hand corner. I then add one line per document to a data file on my XENIX® system that I call ’paper.index’: the sequence number followed by a description of the item, along with other information such as the author, article title, and magazine title for magazine articles. Finally, I file the documents, in simple order by sequence number, in my semi-portable filing cabinet (a brown cardboard filing box).
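The weekly data-entry step could be partly automated. What follows is only a sketch and not part of the procedure described above: the function name ’add_paper’ and the PAPER_INDEX variable are made up for illustration, it assumes that every line of the index begins with its sequence number, and it is written for a current POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper, not from the column: append one entry to the index,
# numbering it one higher than the number that starts the last line.
# PAPER_INDEX is an assumed variable so the sketch can be tried on a scratch file.
PAPER_INDEX=${PAPER_INDEX:-$HOME/paper.index}

add_paper() {
    # Read the number from the last line; nothing is printed if the
    # file is empty or does not exist yet.
    last=$(tail -n 1 "$PAPER_INDEX" 2>/dev/null | awk '{ print $1 + 0 }')
    next=$(( ${last:-0} + 1 ))
    # One line per document: sequence number, a tab, then the description.
    printf '%d\t%s\n' "$next" "$*" >> "$PAPER_INDEX"
    # Report the number to write in the upper right-hand corner.
    echo "$next"
}
```

Typing add_paper followed by a quoted description appends the next numbered line to the index and prints the number to write on the document.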

When I need to look up a topic, author, keyword, etc., I simply type:

grep -y keyword ~/paper.index

This command often returns several lines of data. I visually examine them for appropriateness, select one or more items for retrieval by jotting the sequence numbers on a small piece of paper, and then go to the filing cabinet to pull the indicated items. When I need to refile a document that I’ve pulled, it already has a sequence number assigned, so I just return it to the correct place in the cabinet. Numerical filing really is easier than alphabetic anyway!
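On current systems the -y flag of ’grep’ is usually spelled -i. As a sketch only, a small wrapper can narrow a search by requiring that every keyword given appears on the same line. The function name ’lookup’ and the PAPER_INDEX variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper, not from the column: print the index lines that
# contain every keyword given, ignoring case.
# PAPER_INDEX is an assumed variable; it defaults to the file named above.
PAPER_INDEX=${PAPER_INDEX:-$HOME/paper.index}

lookup() {
    result=$(cat "$PAPER_INDEX")
    for word in "$@"; do
        # Keep only the lines that also match this keyword. grep exits
        # non-zero when nothing matches, which is not an error here.
        result=$(printf '%s\n' "$result" | grep -i -e "$word" || true)
    done
    # Print nothing at all when no line matched every keyword.
    if [ -n "$result" ]; then
        printf '%s\n' "$result"
    fi
}
```

For example, lookup lesk article would list only the entries that mention both words, in any mix of upper and lower case.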

Indexing machine readable files.
Back in the UNIX® Version 7 days, a package of utilities to handle the indexing of bibliographic information was included with the standard UNIX® distribution. These tools, often called the "refer" tools, may now be included with AT&T’s distributions of the Documenter’s Workbench (DWB). Until recently they were also included with XENIX® development system distributions, such as TRS-XENIX® version 01.02.00 for the Tandy 68000-based computers.

Three of these tools are utilities that read a file or files, build an index of the key words, invert that index, and then provide an extremely fast method of searching the index and retrieving the record or document from which the index was originally created. The official discussion of these tools can be found in an article by M.E. Lesk entitled "Some Applications of Inverted Indexes on the UNIX® System." These tools have far greater applicability than Lesk’s article illustrates, as I found out after considerable time experimenting and constructing the two applications that follow. These applications illustrate a simple search method that can be used to query a phone number database and to build and query an index of all text files on a system.

A telephone index.

This example, using a telephone list, shows how to build an index to the individual records in a single file. You will likely note that, since this is a very simple retrieval procedure, it could be accomplished more quickly with ’grep’. However, the purpose of the example is to provide a basis for more complicated applications, especially those where the data file is very large.

To install the phone indexing application, follow these steps:

1. Using your favorite editor, enter and save the two shell programs ’make1’ and ’search’ exactly as shown in Figures A and B. After saving the files and exiting from the editor, type the following shell command to make the files executable: chmod +x make1 search

Figure A - "make1"

# a program that makes an inverted index of the telephone
# numbers in a file "telnos"
rm -f Index*
sed -e 's/$/\
/' < telnos > tmp.$$
/usr/lib/refer/mkey tmp.$$ | /usr/lib/refer/inv -v -h997
rm tmp.$$

Figure B - "search"

# a program that reads an inverted index of the telephone
# numbers in a file "telnos"
exec /usr/lib/refer/hunt Index

2. Also using your favorite editor, create a file called "telnos". This file is a free form file with each line (record) containing a phone number (with or without an area code) and a name, address, and/or personal note. I recommend that the phone number be entered after the text and aligned with an appropriate number of tab characters, but that isn’t necessary. A hypothetical sample file appears in Figure C.

Figure C - "telnos"

AT&T Computers (800) 247-1212
AT&T UNIX Toolchest on-line number (201) 522-6900
American Express (Be My Guest) (800) 528-8000
Budget Rent-A-Car (800) 527-0700
Burroughs Corp (Detroit) (313) 972-7000
Computer Knowledge Center (books) (800) LIBRARY
Cucumber Book Shop (301) 881-2722
Heathkit Catalog (800) 253-0570
Hertz Rent-A-Car (800) 654-3131
IBM DIRECT (cust# 097-9999) (800) 631-5582
MCI Mail (800) 323-7751
MCI Mail - HELP! (800) 424-6677
McGraw Hill Book Division (314) 227-1600

To create a file of keys, and invert the file of keys to form the index, type: make1

In a few seconds, if the file is not too large, you’ll see a message similar to:

39 key occurrences, 33 hashes, 13 docs

and the following three files will have been added to your current working directory: "Index.ia" - the entry file; "Index.ib" - the posting file; and "Index.ic" - the tag file.

To search the index and file interactively, simply type: search

At this point the program will be waiting for your input, so you can type in the first six letters of any word in the telnos file. (The key must be in lower case and no more than six characters long. See the Lesk article for more details.) For example, if you typed: comput
using the "telnos" file in Figure C, you would see:

AT&T Computers (800) 247-1212

Computer Knowledge Center (books) (800) LIBRARY

Similarly, if you typed two keys on a line the search would return only those lines (records) that contain both keys. For example, if you typed: comput books
using the same "telnos" file in Figure C, you would see:

Computer Knowledge Center (books) (800) LIBRARY

You can continue to request as many searches as you like, and when you’re finished, type a Control-D to return to a shell prompt.

The ’sed’ command and the ’mkey’ and ’inv’ pipeline in ’make1’ deserve some explanation. The stream editor ’sed’ is used to insert a blank line after every line in the telephone numbers file, placing the output in a temporary file. This temporary file is then processed through two of the ’refer’ tools, ’mkey’ and ’inv’. The -v option to ’inv’ selects verbose mode, and -h997 sets the hash table size.
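The effect of the ’sed’ step can be seen on a couple of sample lines without the ’refer’ tools installed. This is a sketch only: the function name ’double_space’ is made up, and the blank lines are presumably there so that each line of the file is treated as a separate record.

```shell
#!/bin/sh
# Hypothetical helper wrapping the same sed command used in make1:
# append an extra newline at the end of each line, so that a blank
# line follows every input line.
double_space() {
    sed -e 's/$/\
/'
}

# Two records taken from Figure C. The output is four lines long:
# each record is followed by an empty line.
printf '%s\n' 'Hertz Rent-A-Car (800) 654-3131' 'MCI Mail (800) 323-7751' | double_space
```

Note that the backslash must be the very last character on its line. A stray blank line or trailing space inside the quoted command stops ’sed’ from accepting it.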

A document index.
The second and last example of how to use the ’refer’ tools shows how to build an index to several different files or documents. Once again, using your favorite editor, enter and save the ’make2’ program exactly as shown in Figure D.

Figure D - "make2"

# a program that makes an inverted index of all the
# text files in a system
rm -f Index*
find $HOME -print > tmp.$$
/usr/lib/refer/mkey -w -k100 -n400 -f tmp.$$ > Index.id
/usr/lib/refer/inv -v -h4999 < Index.id
rm tmp.$$

Next, as before, make the file executable with: chmod +x make2

To create a file of keys, and then invert the file of keys to form the index, type: nice make2 &
This slightly different syntax will execute the program in low priority and in the background, allowing you to do other tasks in the meantime. Depending upon the size and number of files in your home directory (or on the entire system if you run this program as the ’root’ user) the entire job may take an hour or longer. The job will be completed when you see a message similar to:

44582 key occurrences, 3821 hashes, 722 docs

The following four files will have been added to your current working directory: "Index.ia" - the entry file; "Index.ib" - the posting file; "Index.ic" - the tag file; and "Index.id" - the keyword references file.

In order to search the index and file interactively, type: search

At this point the program will be waiting for your input, so you can type in the first six letters of any word or words you wish to select, as described in the first example above. Note that only the first 1000 characters are retrieved. This limit can be controlled with the "-l" option of ’hunt’.

Sometimes it may be more appropriate to obtain the file name[s] of the file[s] containing a reference to a key, rather than the actual text of the file[s]. When this is the case, the following command can be used:

grep keyword Index.id

A listing of the full path name[s] of the file[s] containing the keyword, followed by all of the keywords identified for that file, will be printed on the standard output.

Conclusion.
For a full understanding of the options that I’ve used in ’make2’ for the ’mkey’ and ’inv’ programs, you might want to read Lesk’s article in detail. Also, if you like this approach to indexing documents, you might consider using ’cron’ to run ’make2’ in the middle of the night on some periodic basis. If you have similar procedures that you use on your XENIX® system, let me know, as I’d enjoy hearing from you. In any event, I hope you enjoy these tools and "Happy indexing!"
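As a sketch of the ’cron’ suggestion, a small wrapper can run ’make2’ unattended and keep its messages in a log. The script name ’nightly-index’, the function name, the log file name, and the schedule are all made up for illustration, and the crontab line in the comment assumes the five-field format of current ’cron’ implementations.

```shell
#!/bin/sh
# nightly-index: hypothetical wrapper that lets cron run make2 unattended.
# Example crontab entry, 2:30 a.m. every Sunday (an assumed schedule):
#   30 2 * * 0 /bin/sh $HOME/nightly-index $HOME

run_nightly_index() {
    # $1 is the directory that holds make2. The Index.* files and the
    # log are written there. The subshell leaves the caller's working
    # directory unchanged.
    (
        cd "$1" || exit 1
        nice ./make2 > make2.log 2>&1
    )
}

# Rebuild the index only when a directory is given on the command line.
if [ "$#" -gt 0 ]; then
    run_nightly_index "$1"
fi
```

The "key occurrences" summary that ’make2’ normally prints on the terminal ends up in make2.log, where it can be checked the next morning.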

Rich Bilancia is the owner and founder of Computer Guidance & Support, Littleton, Colorado, a consulting firm that specializes in UNIX® based computerized accounting and database applications. He can be sent e-mail at...