A Green Form-Based Information Extraction System for Historical Documents

Tae Woo Kim's MS Thesis Defense

Tuesday, May 30th, 9:00 am

3350 TMCB, Conference Room

Advisor: David W. Embley


Many historical documents are rich in genealogical facts. Extracting these facts by hand is tedious, and it is practically infeasible given the hundreds of thousands of genealogically rich family-history books that have already been scanned and placed online. As one approach to making extraction feasible, we propose GreenFIE, a “Green” Form-based Information-Extraction tool. It is “green” in the sense that it improves with use, with the goal of minimizing the cost of human labor while maintaining high extraction accuracy. Given a page in a historical document, the user’s task is to fill out the given forms with every fact on the page that the forms call for (e.g., birth and death information, marriage information, and parent-child relationships for each person on the page). GreenFIE has a repository of extraction patterns that it applies to fill in the forms. The user checks the correctness of GreenFIE’s form filling, adds any missed facts, and fixes any mistakes. GreenFIE learns from this feedback, adding new extraction rules to its repository. Ideally, GreenFIE improves as it proceeds so that it does most of the work, leaving little for the user to do other than confirm that its extraction is correct. We evaluate how well GreenFIE performs on family-history books in terms of “greenness,” that is, how much human labor diminishes during form filling while high accuracy is maintained.
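
The feedback loop described in the abstract can be illustrated with a toy sketch. This is not GreenFIE's actual implementation: the names `RuleRepository`, `extract`, `learn_from_feedback`, and `process_page` are hypothetical, the rules are plain regular expressions, and the "learning" step is a deliberately naive heuristic that assumes every fact is a four-digit year preceded by a cue word. The sketch only shows how user-entered facts might shrink from one page to the next as rules accumulate; it does not model fixing wrong extractions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RuleRepository:
    """Hypothetical rule store: (form_field, regex_string) pairs."""
    rules: list = field(default_factory=list)

    def add_rule(self, form_field, pattern):
        rule = (form_field, pattern)
        if rule not in self.rules:  # avoid storing duplicate rules
            self.rules.append(rule)


def extract(repo, page_text):
    """Apply every stored rule and return the set of (form_field, value) facts."""
    facts = set()
    for form_field, pattern in repo.rules:
        for match in re.finditer(pattern, page_text):
            facts.add((form_field, match.group(1)))
    return facts


def learn_from_feedback(repo, page_text, missed_facts):
    """Naive learner (assumption): the word just before a missed year becomes
    a literal cue, and the year itself is generalized to four digits."""
    for form_field, value in missed_facts:
        idx = page_text.find(value)
        if idx == -1:
            continue  # value not found verbatim on the page
        words_before = page_text[:idx].split()
        if not words_before:
            continue  # no left context to use as a cue
        cue = words_before[-1]
        repo.add_rule(form_field, re.escape(cue) + r"\s+(\d{4})")


def process_page(repo, page_text, correct_facts):
    """One round: the tool proposes facts, the user supplies what was missed,
    and the repository learns from those additions.
    `correct_facts` stands in for the user's corrected form."""
    proposed = extract(repo, page_text)
    missed = correct_facts - proposed  # facts the user must enter by hand
    learn_from_feedback(repo, page_text, missed)
    return proposed, len(missed)


if __name__ == "__main__":
    repo = RuleRepository()
    pages = [
        ("John Smith, born 1801, died 1870.",
         {("birth_year", "1801"), ("death_year", "1870")}),
        ("Mary Jones, born 1823, died 1899.",
         {("birth_year", "1823"), ("death_year", "1899")}),
    ]
    for text, truth in pages:
        _, user_entries = process_page(repo, text, truth)
        print(f"user had to enter {user_entries} fact(s)")
```

In this made-up example the user enters both facts on the first page and none on the second, which is the kind of decline in human labor that the “greenness” evaluation is meant to measure.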