Simple Science

Cutting edge science explained simply

# Computer Science # Digital Libraries # Information Retrieval

Improving Search for Physical Records

A new method enhances the search for physical archival materials.




Even with so much digital content available today, many important records still exist only on paper or microfilm. Traditional methods for organizing and finding these physical records rely on manual descriptions, such as a summary of what is in a folder or a box. As a result, someone looking for a specific item often has to go through many physical items to locate it. The paper summarized here describes a new way to index these materials using selective digitization and proximity-based indexing.

A New Approach

Instead of relying solely on conventional indexing methods, this new approach involves digitizing a small part of the physical records. By looking at a few digitized documents, we can create a system that helps users find specific content more easily. Tests with boxes of records show that this method can help people search more effectively.

Links Between Records

Homophily is the idea that people tend to connect with others who share similar characteristics. The same concept can apply to content in archives: records that sit near each other are likely to be related. When archivists organize records, they keep the original order of the documents. Preserving that order maintains context and proves useful when researchers look for specific information. If an archivist respects the original order, users who understand that order are likely to find value in it too.

The Importance of Original Order

When archivists arrange materials, they place them into folders, boxes, and series. This original order can serve as a guide for future researchers. By keeping the original order intact, archivists make it easier to open collections for research without much additional work. This is because the order already served a purpose for the people or organizations that created the records.

The idea is that if we know about the content of a few records, we can guess where related records might be. However, while this claim makes logical sense, proving its accuracy is a different task. This paper presents evidence that supports this idea in one case, indicating it might hold true more generally, though more exploration is needed.

Growth in Digital Records

Over the last 50 years, there has been a significant increase in digital records. As the number of digital documents grows, archival repositories are beginning to fill up with them. There are many tools available to search through digital records, and new methods for managing these tools are being developed.

Many of these tools could also be used to search physical records once those records are digitized, but costs and logistical limits make digitizing everything a challenge. To illustrate, in a span of five months in 2003, the National Archives processed 13 million pages. At that pace, it would take hundreds of years to finish digitizing all paper records in their collection.

Finding Records on Paper

The first step to finding physical records is knowing where to search. Scholarly citations play a big role in this process. Research suggests that many historians and anthropologists often follow leads found in published literature. It is common for researchers to reach out to archivists before visiting archives. Researchers also rely on finding aids created by archivists to help them understand the content and layout of collections.

One problem with finding aids is that they depend only on information provided by archivists. The same constraints that limit the digitization of materials also limit the quality of the descriptions archivists can write. As a result, only a fraction of collections may have comprehensive online descriptions.

The Subject-Numeric Files

In the U.S., the Department of State keeps records regarding foreign relations. Between 1963 and 1973, it used a “Subject-Numeric Files” system for these records. In this system, the first level is a three-letter code for the subject, the second level specifies the country, and the third level is a specific numeric code. The files contain millions of pages, all currently held by the National Archives.

Recently, Brown University started a large-scale project to digitize records related to Brazilian politics. As a part of this project, they arranged for around 14,000 items from the State Department's Subject-Numeric Files to be digitized. These records, representing parts of 52 boxes, were made available online.

The BoxFinder System

The PDF files created from digitized records are searchable, meaning their contents can be found easily using text search tools. For undigitized content, however, users have only the folder titles to rely on. They must request boxes based on folder labels, which can be a slow and tedious process.

The goal of the BoxFinder system is to speed up the search process by suggesting which box to examine. Users can input a query, and the system will recommend a relevant box. Alternatively, if users are examining documents, the system will point to boxes that may contain similar records.

To build the index for the boxes, the system takes some digitized documents and uses their text to create searchable terms. By analyzing the text from a few pages of documents within each box, the system creates an index that can help identify where to find similar content.
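The indexing idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the article does not say how BoxFinder scores boxes, so the TF-IDF style scoring, the function names, and the sample data below are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_box_index(sampled_docs):
    """Merge the text of the sampled documents in each box into one term-count vector.

    sampled_docs: iterable of (box_id, ocr_text) pairs for the digitized sample.
    """
    index = defaultdict(Counter)
    for box_id, text in sampled_docs:
        index[box_id].update(tokenize(text))
    return dict(index)


def rank_boxes(index, query):
    """Return box ids, best first, using a simple TF-IDF score (an assumed choice)."""
    n_boxes = len(index)
    # Number of boxes in which each term appears at least once.
    box_freq = Counter()
    for counts in index.values():
        box_freq.update(counts.keys())
    query_terms = Counter(tokenize(query))
    scores = {}
    for box_id, counts in index.items():
        length = sum(counts.values()) or 1
        score = 0.0
        for term, q_count in query_terms.items():
            if term in counts:
                idf = math.log((n_boxes + 1) / (box_freq[term] + 0.5))
                score += q_count * (counts[term] / length) * idf
        scores[box_id] = score
    # Sort by descending score; break ties by box id for a stable result.
    return sorted(scores, key=lambda box_id: (-scores[box_id], box_id))


# Made-up sample: two documents digitized from one box, one from another.
sample = [
    ("box-01", "telegram on coffee export quotas and trade negotiations"),
    ("box-01", "memo about coffee prices"),
    ("box-02", "report on student protests and political arrests"),
]
index = build_box_index(sample)
print(rank_boxes(index, "coffee trade")[0])
```

A query about coffee trade is sent to the box whose sampled pages mention those words, even though most of that box has never been digitized.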

Testing the System

The testing phase involved simulating two types of searches. In the first type, users input a query based on the title of a document from the metadata provided by Brown University. The system's success was measured by how often it identified the correct box.

For the second type of search, the system used the text from selected documents to guess their associated box. Researchers calculated how often the system was able to find the correct box using this method.

Results showed that the BoxFinder system performed significantly better than random guessing. Although its accuracy was far from perfect, the system was much more likely than chance to identify the correct box.
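Both simulated searches can be scored the same way: hand the system a query (a title or a document's text) and check whether the first box it suggests is the box the document actually came from. The sketch below assumes a ranking function is supplied by the caller; `toy_rank` and the test items are hypothetical stand-ins, not data from the study.

```python
def top1_accuracy(rank, test_items):
    """Fraction of test items whose true box is ranked first.

    rank: callable mapping a query string to a list of box ids, best first.
    test_items: list of (query_text, true_box_id) pairs.
    """
    if not test_items:
        return 0.0
    hits = sum(1 for query, true_box in test_items if rank(query)[:1] == [true_box])
    return hits / len(test_items)


def chance_accuracy(n_boxes):
    """Expected top-1 accuracy of picking one box uniformly at random."""
    return 1.0 / n_boxes


def toy_rank(query):
    # Hypothetical stand-in ranker: prefers box "A" when the query mentions coffee.
    return ["A", "B"] if "coffee" in query.lower() else ["B", "A"]


items = [("Coffee quotas", "A"), ("Student protests", "B"), ("Coffee prices", "B")]
print(top1_accuracy(toy_rank, items))  # 2 of the 3 items are placed correctly
print(chance_accuracy(2))              # baseline for two boxes
```

Comparing the first number with the second is the same comparison the study makes between the system and random guessing.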

Random Guessing vs. System Performance

The findings indicated that BoxFinder was able to locate the correct box with about 27.9% accuracy using specific queries. That is nearly ten times better than random guessing, which would succeed only around 2.9% of the time.

While 27.9% might not seem high, it’s still a meaningful result given the challenge of searching through many physical items. Even when the system didn’t guess the exact right box, it often suggested nearby options, increasing the chances of a successful find for users.
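One way to quantify "nearby" suggestions is to count a guess as useful when it lands within a few boxes of the right one. The sketch below is an illustration only: it assumes boxes carry integer numbers that follow their original order, and the tolerance `k` and the example pairs are made up.

```python
def within_k_accuracy(predictions, k):
    """Fraction of predictions that land within k positions of the true box.

    predictions: list of (predicted_box_number, true_box_number) integer pairs,
    assuming boxes are numbered in their original order.
    """
    if not predictions:
        return 0.0
    hits = sum(1 for predicted, true in predictions if abs(predicted - true) <= k)
    return hits / len(predictions)


# Made-up predictions: one exact hit, two near misses, one far miss.
preds = [(10, 10), (11, 10), (30, 12), (7, 9)]
print(within_k_accuracy(preds, 0))  # exact matches only: 0.25
print(within_k_accuracy(preds, 2))  # within two boxes: 0.75
```

With a tolerance of zero this reduces to exact-match accuracy, so the two figures can be reported side by side.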

Using Folder Labels for Indexing

An alternative method to improve search results involved generating terms from the folder labels where documents were housed. By consulting the subject-numeric codes, the system could replace these codes with descriptive titles provided in a classification guide. This change allowed for a more effective search when users entered queries.

The folder labels also contain dates, which can provide temporal context for the content. Including this information allowed for more robust indexing, aiding in the search process.
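A rough sketch of this label expansion follows. The `FolderLabel` fields mirror the three levels described earlier plus the date, but the field names, the guide's lookup keys, and the placeholder code "XYZ" are hypothetical; real codes and titles would come from the classification guide itself.

```python
from dataclasses import dataclass


@dataclass
class FolderLabel:
    subject_code: str  # three-letter subject code
    country: str       # country level of the label
    numeric_code: str  # numeric code at the third level
    date_text: str     # date as written on the label


def label_to_terms(label, guide):
    """Turn a folder label into index terms, swapping codes for guide titles when known."""
    parts = []
    # Fall back to the raw code when the guide has no title for it.
    parts.append(guide.get(label.subject_code, label.subject_code))
    specific_key = f"{label.subject_code} {label.numeric_code}"
    parts.append(guide.get(specific_key, label.numeric_code))
    parts.append(label.country)
    parts.append(label.date_text)
    return " ".join(parts).lower().split()


# Placeholder guide entries, invented for this example.
guide = {"XYZ": "example subject title", "XYZ 12": "example subtopic title"}
label = FolderLabel("XYZ", "Brazil", "12", "1964")
print(label_to_terms(label, guide))
```

The resulting terms can be indexed like any other text, so a query in ordinary words can match a folder whose label contains only codes.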

Results from Testing

The tests showed that indexing with titles derived from folder labels also helped. When searching against these titles, the system found the correct box more often than random guessing would.

The performance was even better when searching using the full OCR text compared to using short title metadata. This improvement suggests that longer queries provide more context and details for the search.

Future Directions for Research

The findings support the idea that there is a relationship between digitized and undigitized materials, which can facilitate more effective search processes. Several avenues for further exploration are evident. One idea could be to focus on specific parts of documents that lend themselves to indexing.

There’s potential to capture more details from document layouts as well, such as who sent or received documents and the document’s date. Recognizing types of documents could also lead to better searches tailored to specific formats.

Additionally, instead of combining content from multiple documents into a single representation, the system could maintain individual details for each document. Approaching it this way allows each document to contribute to the decision, potentially leading to better overall results.
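The per-document alternative might look like the sketch below, where every sampled document is scored on its own and the best matches vote for their boxes. This is one possible reading of the idea, not a method described in the study; the Jaccard similarity, the value of `k`, and the sample data are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(query_tokens, doc_tokens):
    """Jaccard overlap between two token sets (an assumed similarity measure)."""
    q, d = set(query_tokens), set(doc_tokens)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def vote_for_box(sampled_docs, query, k=3):
    """Score each sampled document separately, then let the k best vote for their boxes."""
    q = tokenize(query)
    scored = [(overlap(q, tokenize(text)), box_id) for box_id, text in sampled_docs]
    # Drop documents with no overlap, then keep the k highest-scoring ones.
    scored = [(score, box_id) for score, box_id in scored if score > 0]
    scored.sort(key=lambda pair: -pair[0])
    votes = Counter()
    for score, box_id in scored[:k]:
        votes[box_id] += score  # each vote is weighted by its similarity
    return votes.most_common(1)[0][0] if votes else None


# Made-up sample of digitized documents and the boxes they came from.
docs = [
    ("box-01", "coffee export quotas"),
    ("box-01", "coffee prices memo"),
    ("box-02", "student protests"),
    ("box-02", "coffee shop protests"),
]
print(vote_for_box(docs, "coffee prices"))
```

Because each document keeps its own score, one strongly matching document is not diluted by unrelated pages from the same box.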

Generalizing the Approach

This indexing method doesn’t have to be limited to boxes; it may work for folders, series, collections, or entire repositories as well. For example, if one archive holds a letter from a notable person, that archive might hold other related documents.

How materials are indexed will vary with the scale of the collection being searched. The next step is to gather a variety of collections for training and testing. By doing so, researchers can refine and tune the system to work better across different contexts.

Conclusion

The research shows that combining selective digitization with proximity-based indexing can support searches for physical records that have yet to be digitized. By taking the context of neighboring records into account, this method adds a layer of understanding that enhances traditional search systems.

In the future, combining insights from document layouts, metadata, and indexing methods could lead to even better outcomes. Engaging in further experiments with a diverse array of collections will be key to refining this system to benefit researchers and users looking to access archival content more efficiently.
