/xenix: Filing and Indexing
©1987 - Richard A. Bilancia - All Rights Reserved
To be published first in the May 1987 issue of UNIX®/World
At Uniforum in Washington, D.C. this past January Dave Flack, the Executive Editor of UNIX®/World, was gracious enough to host a UNIX®/World editorial staff dinner meeting. As you might guess, a lot of UNIX® was discussed at that meeting.
At one point in the evening the topic turned to how we all manage the massive amounts of written information, so that we can quickly and easily retrieve that information when necessary. Accordingly, what follows in this month’s column is a description of the two procedures that I shared with the other editors and that I use on my XENIX® system: one to handle the filing of magazine articles, product literature, and incoming correspondence, and the second to handle the automatic retrieval of machine readable copies of my outgoing correspondence, notes, and articles.
Indexing pieces of paper.
Each month I must see thousands of pieces of paper. Most of them come via the U.S. Postal Service: some in the form of vendor advertising, some in the form of magazines and books, and some in the form of letters from clients and associates. Fortunately most of these pieces of paper can be thrown away, but some I like to save for future reference.
I’ve developed a very simple and easy to implement procedure to automate the filing and retrieval of documents on my XENIX® system that requires a single file to store the information about the documents and a single command to look up the place where the document is filed.
The procedure works like this: after I’ve read something and decided that it needs to be filed, I place it in a single pile in the corner of my desk (this includes articles that I’ve torn out of magazines and stapled together). Once a week I go through the pile. I assign each document the next sequential number, which I also write in its upper right hand corner, and add one line for it to a data file on my XENIX® system that I call ’paper.index’. The line holds the sequence number followed by a description of the item, along with other information such as the author, article title, and magazine title for a magazine article. I then file the documents, in simple order by sequence number, in my semi-portable filing cabinet (a brown cardboard filing box).
When I need to look up a topic, author, keyword, etc., I simply type:
grep -y keyword ~/paper.index
This command often produces several lines of data, which I visually examine for appropriateness. I select one or more items for retrieval by jotting their sequence numbers on a small piece of paper, and then go to the filing cabinet to pull the indicated items. When I need to refile a document that I’ve pulled, it already has a sequence number assigned, so I just return it to the correct place in the cabinet. Numerical filing really is easier than alphabetic anyway!
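The weekly filing step and the lookup can be sketched as two small shell functions. This is only an illustration under stated assumptions: the function names 'add_paper' and 'find_paper', the PAPER_INDEX variable, and the tab between the sequence number and the description are all hypothetical, and 'grep -i' is used as the present-day spelling of the 'grep -y' shown above.

```shell
# Hypothetical helpers. Only the file name 'paper.index' comes from the
# article; the function names and the "number<TAB>description" layout
# are assumptions made for this sketch.
PAPER_INDEX="${PAPER_INDEX:-$HOME/paper.index}"

# add_paper DESCRIPTION...: append one line, prefixed by the next
# sequence number, and print that number so it can be written on the paper.
add_paper() {
    last=0
    if [ -s "$PAPER_INDEX" ]; then
        # The number is the first tab-separated field of the last line.
        last=$(tail -n 1 "$PAPER_INDEX" | cut -f1)
    fi
    next=$((last + 1))
    printf '%s\t%s\n' "$next" "$*" >> "$PAPER_INDEX"
    echo "$next"
}

# find_paper KEYWORD: case-insensitive lookup in the index file.
find_paper() {
    grep -i -- "$1" "$PAPER_INDEX"
}
```

A session might then consist of 'add_paper "Lesk, Some Applications of Inverted Indexes"' followed later by 'find_paper lesk'. The sketch assumes the last line of the file always carries the highest number, which holds as long as lines are only ever appended.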
Indexing machine readable files.
Back in the UNIX® Version 7 days, a package of utilities to handle the indexing of bibliographic information was included with the standard UNIX® distribution. These tools, often called the "refer" tools, may now be included with AT&T’s distributions of the Documenter’s Workbench (DWB). Until recently they were included with the XENIX® development system distribution, such as TRS-XENIX® version 01.02.00 for the Tandy 68000 based computers.
Three of these tools are utilities that read a file or files, build an index of the key words, invert that index, and then provide an extremely fast method of searching the index and retrieving the record or document from which the index was originally created. The official discussion of these tools can be found in an article by M.E. Lesk entitled "Some Applications of Inverted Indexes on the UNIX® System." These tools have far greater applicability than Lesk’s article illustrates, as I found out after considerable time experimenting and constructing the two applications that follow. These applications illustrate a simple search method that can be used to query a phone number database and to build and query an index of all text files on a system.
A telephone index.
This example, using a telephone list, shows how to build an index to the individual records in a single file. You will likely note that such a simple retrieval could be accomplished more quickly with ’grep’. However, the purpose of the example is to provide a basis for more complicated applications, especially those where the data file is very large.
To install the phone indexing application, follow these steps:
1. Using your favorite editor, enter and save the two shell programs ’make1’ and ’search’ exactly as shown in Figures A and B. After saving the files and exiting from the editor, type the following shell command to make the files executable: chmod +x make1 search
Figure A - "make1"
# a program that makes an inverted index of the telephone
# numbers in a file "telnos"
# Copyright (c)1984, 1987 - Richard A. Bilancia
rm -f Index*
sed -e 's/$/\
/' < telnos > tmp.$$
/usr/lib/refer/mkey tmp.$$ | /usr/lib/refer/inv -v -h997
rm tmp.$$
Figure B - "search"
# a program that reads an inverted index of the telephone
# numbers in a file "telnos"
# Copyright (c)1984, 1987 - Richard A. Bilancia
exec /usr/lib/refer/hunt Index
2. Also using your favorite editor, create a file called "telnos". This file is a free form file with each line (record) containing a phone number (with or without an area code) and a name, address, and/or personal note. I recommend that the phone number be entered after the text and aligned with an appropriate number of tab characters, but that isn’t necessary. A hypothetical sample file appears in Figure C.
Figure C - "telnos"
AT&T Computers (800) 247-1212
AT&T UNIX Toolchest on-line number (201) 522-6900
American Express (Be My Guest) (800) 528-8000
Budget Rent-A-Car (800) 527-0700
Burroughs Corp (Detroit) (313) 972-7000
Computer Knowledge Center (books) (800) LIBRARY
Cucumber Book Shop (301) 881-2722
Heathkit Catalog (800) 253-0570
Hertz Rent-A-Car (800) 654-3131
IBM DIRECT (cust# 097-9999) (800) 631-5582
MCI Mail (800) 323-7751
MCI Mail - HELP! (800) 424-6677
McGraw Hill Book Division (314) 227-1600
To create a file of keys, and invert the file of keys to form the index, type: make1
In a few seconds, if the file is not too large, you’ll see a message similar to:
39 key occurrences, 33 hashes, 13 docs
and the following three files will have been added to your current working directory: "Index.ia" - the entry file; "Index.ib" - the posting file; and "Index.ic" - the tag file.
To search the index and file interactively, simply type: search
At this point the program will be waiting for your
input, so you can type in the first six letters of any word
in the telnos file. (The key must be in lower case and no
more than six characters long. See the Lesk article for more
details.) For example, if you typed: comput
using the "telnos" file in Figure C, you would
see:
AT&T Computers (800) 247-1212
Computer Knowledge Center (books) (800) LIBRARY
Similarly, if you typed two keys on a line, the search would return only those lines (records) that contain both keys. For example, if you typed: comput books
using the same "telnos" file in Figure C, you would see:
Computer Knowledge Center (books) (800) LIBRARY
You can continue to request as many searches as you like, and when you’re finished, type a Control-D to return to a shell prompt.
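For comparison, the same "all keys must match" behavior can be approximated with ’grep’ alone, as mentioned at the start of this example. The sketch below is a stand-in and not the ’hunt’ program: the function name 'search_all' is hypothetical, and it matches each key anywhere in the line, whereas the index works on the first six letters of each word.

```shell
# search_all FILE KEY...: print the lines of FILE that contain every KEY,
# ignoring case. A grep-based approximation, not the 'hunt' program.
search_all() {
    file=$1
    shift
    result=$(cat "$file")
    # Narrow the candidate lines once per key, so only lines that
    # contain all of the keys survive.
    for key in "$@"; do
        result=$(printf '%s\n' "$result" | grep -i -- "$key")
    done
    if [ -n "$result" ]; then
        printf '%s\n' "$result"
    fi
}
```

With the "telnos" file of Figure C, 'search_all telnos comput books' selects the same single record as the two-key search above. Each key costs another pass over the candidate lines, which is the work the inverted index avoids on a large file.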
Lines 5 through 7 of ’make1’ deserve some explanation. The stream editor ’sed’ is used to insert a blank line between every line in the telephone numbers file, placing the output in a temporary file. This temporary file is then processed through two of the ’refer’ tools, ’mkey’ and ’inv’. The -v option is for verbose mode, and the -h997 indicates the hash table size.
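The effect of the ’sed’ command is easy to check on its own. The sketch below wraps it in a function and shows an ’awk’ equivalent; the function names are hypothetical. The blank lines are presumably there so that each telephone line is treated as a separate record, in the way that entries in a ’refer’ bibliography file are separated by blank lines.

```shell
# blank_sed: copy standard input, adding an empty line after every line.
# This is the same substitution that 'make1' uses.
blank_sed() {
    sed -e 's/$/\
/'
}

# blank_awk: the same result, written with awk instead of sed.
blank_awk() {
    awk '{ print; print "" }'
}
```

Running 'blank_sed < telnos' turns the 13 lines of Figure C into 26 lines, every second one empty.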
A document index.
The second and last example of how to use the ’refer’ tools shows how to build an index to several different files or documents. Once again, using your favorite editor, enter and save the ’make2’ program exactly as shown in Figure D.
Figure D - "make2"
# a program that makes an inverted index of all the
# text files in a system
# Copyright (c)1984, 1987 - Richard A. Bilancia
rm -f Index*
find $HOME -print > tmp.$$
/usr/lib/refer/mkey -w -k100 -n400 -f tmp.$$ > Index.id
/usr/lib/refer/inv -v -h4999 < Index.id
rm tmp.$$
Next, as before, make the file executable with: chmod +x make2
To create a file of keys, and then invert the file of
keys to form the index, type: nice make2 &
This slightly different syntax will execute the program in
low priority and in the background, allowing you to do other
tasks in the meantime. Depending upon the size and number of
files in your home directory (or on the entire system if you
run this program as the ’root’ user) the entire
job may take an hour or longer. The job will be completed
when you see a message similar to:
44582 key occurrences, 3821 hashes, 722 docs
The following four files will have been added to your current working directory: "Index.ia" - the entry file; "Index.ib" - the posting file; "Index.ic" - the tag file; and "Index.id" - the keyword references file.
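The file list that ’make2’ builds with ’find’ names everything under the home directory, including directories and, when the program is run from there, its own temporary list and index files. A tighter list can be sketched as follows. The function name 'list_candidates' is hypothetical, and whether trimming the list matters to ’mkey’ is an assumption that I have not verified.

```shell
# list_candidates DIR: print the regular files under DIR, skipping the
# index files and temporary lists that 'make2' itself creates.
list_candidates() {
    find "$1" -type f ! -name 'Index.i[a-d]' ! -name 'tmp.*' -print
}
```

To try it, the ’find’ line of ’make2’ could be replaced with 'list_candidates "$HOME" > tmp.$$'. Note that the 'tmp.*' pattern also skips any unrelated files that happen to have that kind of name.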
In order to search the index and file interactively, type: search
At this point the program will be waiting for your input, so you can type in the first six letters of any word or words you wish to select, as in the first example above. Note that only the first 1000 characters are retrieved. This limit can be controlled with the "-l" option of ’hunt’.
Sometimes it may be more appropriate to obtain the file name[s] of the file[s] containing a reference to a key, rather than the actual text of the file[s]. When this is the case, the following command can be used:
grep keyword Index.id
A listing of the full path name[s] of the file[s] containing the keyword, followed by all of the keywords identified for that file, will be printed on the standard output.
Conclusion.
For a full understanding of the options that I’ve used in ’make2’ for the ’mkey’ and ’inv’ programs, you might want to read Lesk’s article in detail. Also, if you like the facility of this approach to indexing documents, you might want to consider using ’cron’ to run ’make2’ in the middle of the night on some periodic basis. If you have similar procedures that you use on your XENIX® system, let me know, as I’d enjoy hearing from you. In any event, I hope you enjoy these tools and "Happy indexing!"
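The nightly run suggested above can be sketched as a function that prints a crontab entry for review rather than installing it. Everything specific here is an assumption: the function name 'nightly_entry', the 02:30 run time, and a ’cron’ that accepts the common five-field time format followed by a command. The way entries are installed differs from system to system.

```shell
# nightly_entry DIR: print a crontab line that runs make2 in DIR at
# 02:30 every night. The time and the function name are arbitrary choices.
nightly_entry() {
    printf '30 2 * * * cd %s && nice ./make2\n' "$1"
}
```

Typing 'nightly_entry "$HOME"' prints the line, which can then be added to the crontab by whatever method your system provides. The entry assumes that ’make2’ lives in the directory named and that the index files should be rebuilt there.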
Rich Bilancia is the owner and founder of Computer Guidance & Support, Littleton, Colorado, a consulting firm that specializes in UNIX® based computerized accounting and database applications. He can be sent e-mail at...