In Prague, the traditional and contemporary collide. In this “city of a hundred spires,” telephone lines interlace before the facades of centuries-old Gothic cathedrals, while in the streets businessmen in suits pass Czech babickas on their way to market. This city, at once ancient and modern, was an ideal setting for the Conference on Electronic Corpora of Ancient Languages, held on November 16 and 17, 2007. Two BYU Computer Science students, James Carroll and Robbie Haertel, traveled to the conference and presented papers.
James, a member of Dr. Kevin Seppi’s Applied Machine Learning Laboratory, is also a part-time faculty member in BYU’s Department of Ancient Religion. Robbie, who researches with Dr. Eric Ringger in the Natural Language Processing Laboratory, is a recent graduate of the master’s program in BYU’s Linguistics Department. While in Prague, Robbie presented his master’s thesis, a paper on a Wikipedia-like interface for a database of Mayan hieroglyphs. Both James and Robbie are seeking PhDs in computer science as a means of combining their divergent interests in ancient languages and modern technology.
James was the primary author of a paper entitled “Utility Issues with Active Learning for Annotation of Ancient Language Corpora.” Robbie, fellow student Peter McClanahan, and Drs. Ringger and Seppi are listed as co-authors. As the title indicates, the paper concerns corpora (singular: “corpus”), which are large bodies of text. Corpora are often tagged with linguistic information, such as parts of speech. Tags on words allow researchers to search the text quickly for particular word usage in specific contexts, which makes it easier to analyze the writings and draw information from them. In recent years, scholars in BYU’s Center for the Preservation of Ancient Religious Texts (CPART) completed such a project involving the Dead Sea Scrolls. Current research is focused on bodies of writing in the Syriac language, which is a dialect of Aramaic, one of the languages spoken by Christ and His disciples in the meridian of time. A number of early Christian documents, including a huge body of work by a man known as Ephrem the Syrian, were written in Syriac. Creating a corpus of these writings should unlock new insights into the life of Christ and His early followers as well as into the nature of early Christian churches in the Near East.
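To make the idea of searching a tagged corpus concrete, here is a minimal sketch in Python. The sentences, the tag names (`DET`, `NOUN`, `VERB`, `ADJ`), and the function `find_usages` are all invented for illustration; they are not taken from the CPART projects or from the paper.

```python
from typing import List, Tuple

# A toy tagged corpus: each token is a (word, part-of-speech tag) pair.
# The sentences and tag names are hypothetical examples.
TaggedSentence = List[Tuple[str, str]]

corpus: List[TaggedSentence] = [
    [("the", "DET"), ("scribe", "NOUN"), ("copies", "VERB"),
     ("the", "DET"), ("scroll", "NOUN")],
    [("old", "ADJ"), ("copies", "NOUN"), ("survive", "VERB")],
]


def find_usages(corpus: List[TaggedSentence], word: str, tag: str) -> List[Tuple[int, int]]:
    """Return (sentence index, token index) for each token matching word and tag."""
    hits = []
    for s_idx, sentence in enumerate(corpus):
        for t_idx, (w, t) in enumerate(sentence):
            if w == word and t == tag:
                hits.append((s_idx, t_idx))
    return hits


# Find "copies" only where it is used as a verb, skipping the noun usage.
verb_hits = find_usages(corpus, "copies", "VERB")
```

Without the tags, a plain text search for “copies” would return both sentences; with them, a researcher can isolate one usage of the word.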
However, tagging, the process of actually annotating a corpus, is arduous and costly when performed by humans. Machine learning techniques can reduce the cost of human annotation; however, computerized annotation can be less accurate. As a result, it is difficult for researchers to decide when to let computers do the work of annotation and when humans need to take over. James and Robbie’s paper analyzes the interactions between human and machine and creates a statistical model of the annotation process using a Bayesian approach. Their model shows when computers should shoulder the task of annotation and when it is best for humans to step in. It also shows researchers where humans should tag the text in order to teach computers most effectively to do the work in the future. With this Bayesian approach, James and Robbie are able to minimize the cost of human annotation and make better use of machine learning techniques. Their discoveries may act as a guide for researchers as they plan future corpus creation projects.
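The trade-off described above can be illustrated with a small Python sketch. This is not the model from the paper; it is a simplified expected-cost rule under stated assumptions. The function names, the cost figures, and the tag probabilities are all hypothetical, and the ranking step uses plain uncertainty sampling, a common active learning heuristic, as a stand-in.

```python
from typing import Dict, List, Tuple

TagProbs = Dict[str, float]  # the machine tagger's probability for each candidate tag


def expected_error_cost(tag_probs: TagProbs, error_cost: float) -> float:
    """Expected cost of accepting the machine's most probable tag."""
    return (1.0 - max(tag_probs.values())) * error_cost


def route(tag_probs: TagProbs, human_cost: float, error_cost: float) -> str:
    """Send a word to a human only if the expected error cost exceeds the human's cost."""
    if expected_error_cost(tag_probs, error_cost) > human_cost:
        return "human"
    return "machine"


def rank_for_human(items: List[Tuple[str, TagProbs]]) -> List[Tuple[str, TagProbs]]:
    """Order words so the ones the machine is least sure about come first."""
    return sorted(items, key=lambda item: max(item[1].values()))


# Hypothetical costs: a human tag costs 1 unit, an uncorrected error costs 5 units.
confident = {"NOUN": 0.95, "VERB": 0.05}
uncertain = {"NOUN": 0.55, "VERB": 0.45}

decision_a = route(confident, human_cost=1.0, error_cost=5.0)
decision_b = route(uncertain, human_cost=1.0, error_cost=5.0)
queue = rank_for_human([("word_a", confident), ("word_b", uncertain)])
```

Under these assumed costs, the confidently tagged word stays with the machine, while the uncertain one is routed to a human and placed first in the annotation queue.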

