Brigham Young University
Computer Science

Data Extraction

Dr. David Embley

Research in the Data Extraction Laboratory focuses on information extraction from web pages.  Students working in the lab with Dr. David Embley envision a future in which personal computers act like personal assistants: answering specific questions typed as search phrases instead of simply returning lists of web pages and references, automatically scheduling doctors' appointments and airline flights, even taking care of a person's online shopping list.  This technology has the potential to dramatically alter how people do everything from surfing the web to researching family history.  A popular example in the lab involves finding a used car.  Currently, potential buyers can't go on the web and ask for a used red car for sale in Utah Valley priced under $1,000.  The research in the Data Extraction Laboratory, however, makes it possible for users to ask for such a car and receive an actual answer: a used red car for sale in Utah Valley priced under $1,000 is available at Bob's Used Cars on State Street in Orem.

One of the most exciting areas influenced by Dr. Embley's research is family history.  The developing technology makes it possible to automatically extract data and structure it into something like a database, searchable and far more accessible than pages of text.  Students in the lab are teaching machines to extract, for example, birth and death dates from obituaries, which involves teaching the computer to pick out dates from text and then use keywords and other indicators to tell what kind of date each one is.
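The date-classification idea described above can be illustrated with a minimal Python sketch. It is not the lab's actual method: the keyword lists, the single date format, the 40-character context window, and the function name `classify_dates` are all assumptions made for illustration. It finds dates with a regular expression and labels each one by the nearest cue word that precedes it.

```python
import re

# Hypothetical cue words; a real system would use far richer indicators.
BIRTH_CUES = ("born", "birth")
DEATH_CUES = ("died", "passed away", "death")

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Assumption: dates are written like "March 3, 1921".
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")


def classify_dates(text, window=40):
    """Return (date_string, label) pairs; label is 'birth', 'death', or 'unknown'."""
    results = []
    for match in DATE_PATTERN.finditer(text):
        # Inspect only the text just before the date.
        context = text[max(0, match.start() - window):match.start()].lower()
        best_label, best_pos = "unknown", -1
        for label, cues in (("birth", BIRTH_CUES), ("death", DEATH_CUES)):
            for cue in cues:
                pos = context.rfind(cue)
                # The cue closest to the date (largest position) wins.
                if pos > best_pos:
                    best_label, best_pos = label, pos
        results.append((match.group(), best_label))
    return results


if __name__ == "__main__":
    obituary = "Jane Doe, born March 3, 1921 in Provo, passed away on June 12, 2004."
    for date, label in classify_dates(obituary):
        print(label, date)
```

Keyword proximity is only one possible indicator; it is used here because it is the simplest way to show how a date found in free text can be given a meaning.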

One project in particular is focused on family history.  The Distributed Family Tree project, housed in the Data Extraction Laboratory, is an attempt to create an open network of genealogical data, combining all available genealogical information on the Internet into a single distributed network: a single family tree.

Currently, the focus is on an element of the Distributed Family Tree known as Genesis, a system that creates hyperlinks between records in online genealogical databases that refer to the same individual.  This technology connects two otherwise independent family trees into one, bridging gaps between the works of genealogists throughout the world.  Genesis uses a three-step process to create links.  First, it allows users to import their genealogical data, using either web services or techniques developed in the lab.

Next, the system analyzes the imported data and links individuals using machine learning techniques previously developed by BYU Computer Science students. Users can accept or reject the proposed connections.

Finally, the system publishes the results back to the Internet on a server that supports the linking model.  Each step can be extended with plug-ins to support other systems and perform other types of analysis, bringing users a step closer to realizing one cohesive family tree. 
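The three-step flow (import, link with user review, publish) can be sketched roughly in Python. This is a sketch under stated assumptions, not the Genesis implementation: the `Person` record, the function names, and the output format are hypothetical, and plain string similarity from the standard library stands in for the machine learning techniques the system actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Person:
    """Hypothetical imported record (the result of step 1)."""
    source: str       # which database the record came from
    record_id: str
    name: str
    birth_year: int


def propose_links(tree_a, tree_b, threshold=0.85):
    """Step 2 stand-in: pair records that probably describe the same person.

    Assumption: same birth year plus similar name is enough to propose a link.
    """
    links = []
    for a in tree_a:
        for b in tree_b:
            if a.birth_year != b.birth_year:
                continue
            score = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
            if score >= threshold:
                links.append((a, b, score))
    return links


def review_links(links, accept):
    """Keep only the links the user accepts; `accept` is a user-supplied callback."""
    return [link for link in links if accept(link)]


def publish(links):
    """Step 3 stand-in: render accepted links as (source:id, source:id) pairs."""
    return [
        (f"{a.source}:{a.record_id}", f"{b.source}:{b.record_id}")
        for a, b, _ in links
    ]


if __name__ == "__main__":
    tree_a = [Person("treeA", "1", "John A. Smith", 1850)]
    tree_b = [Person("treeB", "x", "John A Smith", 1850)]
    proposed = propose_links(tree_a, tree_b)
    print(publish(review_links(proposed, lambda link: True)))
```

Keeping each step a separate function mirrors the plug-in idea in the text: a different importer, matcher, or publisher could be swapped in without touching the others.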
