Friday, February 13, 12-2 p.m.
Mining the Bibliome: Information Extraction from the Biomedical Literature

This talk is a progress report on a Penn-based research project that aims to find qualitatively better methods for automatically extracting information from the biomedical literature. Participants include many faculty, students and staff at Penn, at Children's Hospital of Philadelphia, and at GlaxoSmithKline R&D. The general approach is to use human text annotation to train computer programs that can initially increase the productivity of human annotators, and can later replace them.

The first part of the talk will survey the project's concrete results so far: annotation of entities and relations in two specific areas, enzyme inhibition and cancer genomics; additions to existing "treebanking" practice to cover the syntax of nominal structures, and to integrate syntactic analysis with shallow semantic analysis such as "entity tagging"; new or improved interactive tools for various types and stages of biomedical text annotation; application of machine learning methods to automate such annotation; and progress towards publication of a first set of annotated texts for use in algorithm development by others.

The second part of the talk will highlight aspects of this topic that may be of more general interest to cognitive scientists: the nature of linguistic structure; the relations between language and models of the world; and the development of intersubjectively stable representations for those relations. It will be suggested that the engineering problem of textual information extraction (IE) provides a helpful domain in which to explore some traditional problems in the philosophy and psychology of language. Somewhat more cautiously, it will also be suggested that results from the cognitive sciences may be helpful in IE engineering.

Finally, the exciting prospects for near-term practical (and perhaps theoretical) progress in this area will be sketched, with the goal of enticing others to participate.