LANGUAGE MODELING OF BIOLOGICAL DATA
 
When available, presenter slides are included with the presenter contact information on the List of Participants webpage. Additional information is also available in BioInforMar2001.ppt and BioInforMar2001.htm (Gary Strong presentation, San Diego, March 2001).

The Institute for Research in Cognitive Science (IRCS) and the Center for Bioinformatics are pleased to announce the Language Modeling of Biological Data Workshop. This exciting workshop will be held at the University of Pennsylvania from February 25, 2001 through February 27, 2001. At this workshop we plan to bring together a group of distinguished, active computational biologists and computational linguists.

This workshop will provide a unique opportunity to explore new areas of research that involve a linguistic/language processing view of biological data from the perspective of bioinformatics. We expect the interactions to be very stimulating.

 
The workshop will be structured to provide computational biologists with a forum to describe their work and its contribution to the field, as well as to lay out new areas that may require stronger linguistic models. In turn, the computational linguists will be provided with an opportunity to comment on these presentations and to describe recent work on language modeling, with an emphasis on possible applications to biological modeling. The computational biologists will then be given an opportunity to comment on these presentations. Thus, the workshop will be highly participatory and interactive.

The workshop is co-sponsored by DARPA, IRCS and the Center for Bioinformatics.


Organizing Committee
Sean Eddy
Alvin Goldfarb Professor of Computational Biology
Howard Hughes Medical Institute
Department of Genetics
Washington University School of Medicine
Saint Louis, Missouri
 
Aravind K. Joshi
Henry Salvatori Professor of Computer and Cognitive Science
Co-Director, Institute for Research in Cognitive Science
University of Pennsylvania
Philadelphia, Pennsylvania
 
David Searls
Vice President and Director
Bioinformatics
SmithKline Beecham
Radnor, Pennsylvania

Primary Goal of the Workshop

    * To assess the status of the application of language modeling in description and recognition of the structure of biological sequences, including the so-called secondary and higher structures. Recognition of these higher structures is a major unsolved problem.
    * The assessment in the first item above will also serve to inform computational linguists, who might become interested in applying or adapting their techniques to the study of biological sequence structures.
    * What applications may recent advances in natural language parsing techniques have to building tools for search and analysis of biological sequences, going beyond conventional regular expression search?
    * What are the new opportunities in exploring stronger linguistic models suggested, for example, by the so-called mildly context sensitive formalisms and other stronger models and their stochastic counterparts? The models get at the structural properties directly and therefore have relevance to the topological structures in biological sequences.
    * Are there any useful correspondences to be drawn between the evolution of languages and the evolution of biological sequences? Can some joint cooperative efforts be set up to investigate these issues further and come up with specific computational models and evaluate them on real data?

Workshop Rationale
What may be called a linguistic view of biological data has proven to be a very useful technical approach in bioinformatics. While the field was dominated in its early years by string-matching algorithms and database development, the need for more sophisticated pattern-matching search has led to the introduction of syntactic methods, and in general a more structured, model-based view of sequence data. We feel that the time is ripe for a more concerted effort to consolidate the various such approaches that are being explored, and to bring together in a workshop atmosphere the leading practitioners.

The introduction of such a linguistic view in this field is perhaps typified by the problem of gene-finding. Ab initio gene-finding methods, which are based upon general properties and characteristics of protein-encoding genes, began some two decades ago with simple statistical measures of the "coding potential" of exonic versus intronic and intergenic sequence, using a wide variety of word frequency and abstruse signal processing metrics, tallied in moving windows across putative open reading frames. This approach reached its zenith with the first release of the famous GRAIL program, which combined many such lines of evidence as input to a neural net. The limitations of purely statistical measures of coding potential, which ignored useful biological knowledge of gene structure, were addressed by syntactic or model-based methods that also took account of signals such as those at splice junctions, as well as constraints associated with reading frame. At the same time, accounting for intron/exon structure led to the combinatorial problem of assembling the optimal consistent gene structure from many potential exons, which was largely solved by dynamic programming approaches.
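The exon assembly problem mentioned above can be illustrated with a minimal sketch. It assumes each candidate exon is a hypothetical (start, end, score) triple with an exclusive end coordinate, and it treats "consistent" as simply "non-overlapping"; reading-frame and splice-signal constraints, which real gene finders must honor, are deliberately left out. Under those assumptions the problem reduces to weighted interval scheduling, which dynamic programming solves directly. The function name and scores are illustrative only.

```python
from bisect import bisect_right

def best_exon_chain(candidates):
    """Pick the highest-scoring set of non-overlapping candidate exons.

    candidates: list of (start, end, score) with end exclusive (hypothetical format).
    Returns (total_score, chain) where chain is ordered by position.
    """
    exons = sorted(candidates, key=lambda e: e[1])  # order by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # p = how many of the first i exons end at or before this exon's start.
        p = bisect_right(ends, start, 0, i)
        take = (best[p][0] + score, best[p][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

if __name__ == "__main__":
    demo = [(0, 100, 5.0), (50, 150, 8.0), (160, 250, 4.0), (120, 200, 3.0)]
    print(best_exon_chain(demo))
```

Copying the chain at every step keeps the sketch short; a production implementation would store back-pointers instead.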

The model-based approach to ab initio gene-finding has culminated with the application of hidden Markov models (HMMs) from the field of speech processing, which because of their state-based architecture are well suited to representing both cyclic transitions between exons and introns, and the statistical and periodic properties within each such state, as well as boundary states and the profiles characteristic of biological signals. Moreover, HMMs have associated with them well-known dynamic programming algorithms not only for recognition of patterns but also for learning those patterns, within a well-founded Bayesian framework. These advantages have led to a proliferation of HMM-based gene finders including Genie and GeneMark.hmm. Indeed, the highly model-based GENSCAN software package is currently considered to have an overall edge in this crowded field.
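As a rough illustration of the recognition step, the sketch below runs the Viterbi dynamic programming algorithm on a toy two-state HMM that alternates between "exon" and "intron" states. All probabilities are hypothetical values chosen for the example (a GC-rich exon state and an AT-rich intron state); they are not taken from Genie, GeneMark.hmm, GENSCAN, or any other gene finder, and a realistic model would add many more states for codon position, splice signals, and boundaries.

```python
import math

# Hypothetical toy parameters, for illustration only.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},    # GC-rich
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
}

def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

if __name__ == "__main__":
    print(viterbi("GCGCGCGCGCATATATATAT"))
```

The same trellis, with sums in place of maxima, gives the forward algorithm that underlies parameter learning.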

HMMs, indeed, have proven to be very versatile computational constructs in the bioinformatics domain, and are representative of a marked trend in the field in recent years toward Bayesian statistical methods. Eddy's HMMER is a suite of programs that builds HMM profiles describing families of proteins and searches with them, similar to Haussler's SAM system which, on at least one test set of remote homologs, demonstrated performance that was somewhat better than iterative methods such as PSI-BLAST (and, of course, far superior to conventional pairwise search). HMMs can also be profitably used to build more elaborate models for other bioinformatics problems, as has been noted; profile HMMs have even been extended to deal with detection of genomic sequences coding for folded RNA structures, such as tRNA genes and methylation guide small nucleolar RNAs in yeast, by modeling the covariation at base-paired positions. These, in fact, constitute stochastic context-free grammars, and are formal generalizations of a view of HMMs as stochastic regular grammars.
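To see why base-paired positions call for context-free rather than regular models, it may help to look at a simple non-probabilistic analogue. The sketch below is a Nussinov-style dynamic program that maximizes the number of nested base pairs in an RNA sequence; it is not the covariance-model method described above, only an illustration of the span-based recursion that parsing with stochastic context-free grammars shares. The pairing rules and the minimum hairpin loop length of 3 are assumptions made for the example.

```python
# Watson-Crick pairs plus the G-U wobble pair (an assumption for this example).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_nested_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by any pair.
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum nested pairs within seq[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: position j is unpaired
            # case: j pairs with some k, splitting the span into two nested parts
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

if __name__ == "__main__":
    print(max_nested_pairs("GGGAAAUCCC"))
```

Each pair links two positions that may be arbitrarily far apart, with further pairs nested inside; a regular grammar, and hence a plain HMM, cannot track such dependencies, whereas the cubic-time recursion over spans handles them naturally.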

The realization that probabilistic methods could be profitably combined with syntactic constructs mirrors a trend in natural language processing toward statistical parsing methods. Another interesting recent development has been efforts to apply certain mildly context-sensitive formalisms to modeling of phenomena observed in both nucleic acid and protein structures, following on the early work of Searls. These techniques are either identical or directly related to some of the techniques developed at Penn based on a topological specification of a set of primitive structures and the associated composition operations. These approaches so far tend to mimic the linguistic formulations directly instead of taking the insights from these linguistic approaches and then building on them, incorporating the topological properties directly into the system. There are at least two reasons for taking this different approach: (1) In the context of biological sequences the dependencies are 'real' in a sense, i.e., they are the direct consequence of the spatial proximity of residues which share bonds (thus leading to folding) and also of minimum energy considerations. It is therefore very attractive to incorporate these dependencies (foldings) directly into the initial structures of the 'linguistic' modeling. (2) Since the topological structures are functionally important, the proposed perspective has the potential of directly relating the structural description provided by the 'linguistic' model to the functional aspects related to the topological structure.


Language Modeling of Biological Data
Institute for Research in Cognitive Science
University of Pennsylvania
400A 3401 Walnut Street
Philadelphia, PA 19104-6228
phone: +1-215-898-0357
fax: +1-215-573-9247

last updated: 27 March 2001
If you have any questions or suggestions please send mail to trisha@ircs.upenn.edu

