Some people eat, sleep and chew gum, I do genealogy and write...

Wednesday, February 28, 2018

The Indexing Challenge


Indexes and catalogs are finding devices used to provide entry points into large information sources such as libraries and online databases. Because genealogists use historical records, they come in contact with a variety of indexes and catalogs. Large online genealogical database companies, such as FamilySearch.org and others, use indexing to assist in computer-aided searches of their historical records. FamilySearch relies on a vast network of volunteers to provide indexes to its records. Other companies hire indexers from countries around the world. There seems to be a general consensus that indexing is a valuable addition to the genealogical researcher's tools.

There are, however, some major limitations to both the process and the product of an indexing effort. Before making any comments about those limitations, I need to emphasize that indexing is, in fact, a very valuable tool for any researcher, genealogical or otherwise. But a researcher who does not know indexing's limitations may be essentially shut out of valuable records.

To start, I need to briefly review the difference between a catalog and an index. Cataloging, as a tool used in libraries, has been around since antiquity. Catalogs are most useful when there is a one-to-one relationship between the catalog entries and the position of the physical books and records on a library shelf. However, catalogs depend on the vagaries of the system used to catalog the entries. The catalog entries are subject-based and do not necessarily reflect any specific information about any one book or record. As paper catalogs have been migrated to computers and made available on the internet, the entries have been expanded to include listings by geographic location, record titles, authors, and keywords. An example of an online catalog is the FamilySearch Catalog.


Without knowing the title and/or the author of a specific book, record or document, it is difficult to find the item using just the catalog.

An index links the user to a specific entry in a book, document or record. Indexes have been used for a very long time as finding aids within books or other documents. But again, the items included in an index are a reflection of the choices made by the author, creator or indexer. Some books, such as the Bible, have complete indexes called concordances. Exhaustive concordances contain a reference to almost every word in a particular document. A Bible concordance can be almost as long as the Bible itself.
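To make the idea of an exhaustive concordance concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular published concordance was compiled: the function name `build_concordance` and the two sample sentences are made up for this example, and the "location" of each word is simply a line number.

```python
import re
from collections import defaultdict


def build_concordance(lines):
    """Map every word to the line numbers where it appears (hypothetical helper)."""
    concordance = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # Lowercase the line and pull out runs of letters, digits and apostrophes.
        for word in re.findall(r"[a-z0-9']+", line.lower()):
            # Record each line only once per word, even if the word repeats.
            if line_number not in concordance[word]:
                concordance[word].append(line_number)
    return dict(concordance)


# Invented sample text, used only for illustration.
sample_text = [
    "Mary Smith married in 1850.",
    "Her brother John Smith witnessed the deed.",
]

concordance = build_concordance(sample_text)
print(concordance["smith"])  # every line that mentions "smith"
```

Because every word gets an entry, the finished concordance has roughly as many references as the text has words, which is why a complete concordance can rival the length of the book it indexes.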

Computer technology gave indexers the ability to search every word of a document as long as the document was in a compatible format. Optical character recognition (OCR) programs can recognize printed characters and translate them into "text files." A text file contains "plain text." This post is in plain text with some minimal formatting. The obvious limitation of OCR is that it is confined to text documents, i.e., documents that are typeset or typed. The frontier of computer programming for documents is handwriting recognition software, which is still in the preliminary development stage.

For the time being, genealogists have to rely on human indexers for nearly all of their searchable documents. Now, let's get down to the issues. Since humans are fallible, indexes compiled by humans or by human-designed programs will never be 100% accurate. In addition, most indexing schemes index only a specific selection of words in an entire document. If your ancestor was mentioned only in the parts that are not indexed, you are out of luck in finding him or her through the index.

Now, here is an example of a full-text searchable document from Archive.org.


Rather than having only some of its entries indexed, this book has been completely indexed. That is, through OCR, every word is searchable.

Now, what is the challenge of indexing? I am not referring to the difficulty of deciphering old handwriting or the like. I am referring to the fact that so many researchers believe that they are "searching" for their ancestors' names in indexed documents when the names they are looking for have not been included in the indexing. As I mentioned, when a document is indexed, only certain fields are included in the index. If your ancestor shows up in an unindexed field, you will never know unless you do a comprehensive word-by-word search of the document.
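The gap between a field-based index and a word-by-word search can be shown with a small sketch in Python. Everything in it is an assumption made for illustration: the record, the field names, the choice of which fields are "indexed," and the two helper functions are all invented, and no real database is claimed to work exactly this way.

```python
# One invented record; only two of its three fields are treated as indexed.
records = [
    {
        "head_of_household": "Thomas Baker",
        "spouse": "Ann Baker",
        "remarks": "Boarder Eliza Carter, niece of the head, also present.",
    },
]

# Hypothetical choice made by the indexing project.
INDEXED_FIELDS = ("head_of_household", "spouse")


def search_index(records, name, indexed_fields=INDEXED_FIELDS):
    """Return positions of records whose indexed fields contain the name."""
    name = name.lower()
    return [
        position
        for position, record in enumerate(records)
        if any(name in record.get(field, "").lower() for field in indexed_fields)
    ]


def search_full_text(records, name):
    """Return positions of records where the name appears in any field."""
    name = name.lower()
    return [
        position
        for position, record in enumerate(records)
        if any(name in str(value).lower() for value in record.values())
    ]


print(search_index(records, "Eliza Carter"))      # index search: no match
print(search_full_text(records, "Eliza Carter"))  # full-text search: record 0
```

The index search comes back empty even though the person is in the record, because her name appears only in the unindexed remarks field. An empty result from an index means only that the name is not in the indexed fields, not that it is absent from the document.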

Even if you search a record page by page, in a microfilm copy, for example, you can never be perfectly certain that you haven't missed something important. Careful researchers may end up searching the same records over and over to make sure they haven't missed anything. Full-text searches, such as the Archive.org example, are part of the answer. Accurate handwriting recognition will help, but for the foreseeable future, accurate and complete searches of most documents will still rely on careful page-by-page searching by researchers.

