
This workshop will provide a unique opportunity to explore new areas of research that involve a linguistic/language processing view of biological data from the perspective of bioinformatics. We expect the interactions to be very stimulating.
The workshop is co-sponsored by DARPA, IRCS and the Center for Bioinformatics.
Primary Goal of the Workshop
What may be called a linguistic view of biological data has proven to be a very useful technical approach in bioinformatics. While the field was dominated in its early years by string-matching algorithms and database development, the need for more sophisticated pattern-matching search has led to the introduction of syntactic methods, and in general a more structured, model-based view of sequence data. We feel that the time is ripe for a more concerted effort to consolidate the various such approaches that are being explored, and to bring together in a workshop atmosphere the leading practitioners.
The introduction of such a linguistic view in this field is perhaps typified by the problem of gene-finding. Ab initio gene-finding methods, which are based upon general properties and characteristics of protein-encoding genes, began some two decades ago with simple statistical measures of the "coding potential" of exonic versus intronic and intergenic sequence, using a wide variety of word frequency and abstruse signal processing metrics, tallied in moving windows across putative open reading frames. This approach reached its zenith with the first release of the famous GRAIL program, which combined many such lines of evidence as input to a neural net. The limitations of purely statistical measures of coding potential, which ignored useful biological knowledge of gene structure, were addressed by syntactic or model-based methods that also took account of signals such as those at splice junctions, as well as constraints associated with reading frame. At the same time, accounting for intron/exon structure led to the combinatorial problem of assembling the optimal consistent gene structure from many potential exons, which was largely solved by dynamic programming approaches.
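The exon assembly problem mentioned above can be illustrated with a minimal sketch. It assumes that each candidate exon is given as a hypothetical (start, end, score) triple with integer, inclusive coordinates, and that "consistent" means only that the chosen exons do not overlap. Reading-frame and splice-signal compatibility, which real gene finders must also enforce, are deliberately left out, and the function name is illustrative rather than taken from any published program.

```python
from bisect import bisect_right

def best_exon_chain(exons):
    """Pick the non-overlapping set of candidate exons with the highest total score.

    exons: list of (start, end, score) with integer, inclusive coordinates.
    Returns (total_score, chain), where chain is ordered by position.
    Assumption: overlap is the only consistency constraint checked here.
    """
    exons = sorted(exons, key=lambda e: e[1])   # sort by end coordinate
    ends = [e[1] for e in exons]
    n = len(exons)
    best = [0.0] * (n + 1)      # best[i]: best total using the first i exons
    choice = [None] * (n + 1)   # choice[i]: prefix length kept if exon i is taken

    for i, (start, end, score) in enumerate(exons, start=1):
        # p = number of earlier exons that end strictly before this one starts
        p = bisect_right(ends, start - 1, 0, i - 1)
        take = best[p] + score
        if take > best[i - 1]:
            best[i] = take
            choice[i] = p
        else:
            best[i] = best[i - 1]

    # Trace back through the recorded choices to recover the chain
    chain = []
    i = n
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            chain.append(exons[i - 1])
            i = choice[i]
    chain.reverse()
    return best[n], chain

# Made-up candidates, for illustration only
candidates = [(1, 10, 3.0), (5, 20, 4.0), (12, 30, 2.5), (25, 40, 3.0)]
total, chain = best_exon_chain(candidates)
```

Sorting by end coordinate lets each exon look up its best compatible predecessor prefix with a binary search, so the whole assembly runs in O(n log n) time instead of enumerating every subset of candidates.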
The model-based approach to ab initio gene-finding has culminated in the application of hidden Markov models (HMMs) from the field of speech processing, which because of their state-based architecture are well suited to representing both cyclic transitions between exons and introns, and the statistical and periodic properties within each such state, as well as boundary states and the profiles characteristic of biological signals. Moreover, HMMs have associated with them well-known dynamic programming algorithms not only for recognizing patterns but also for learning those patterns, within a well-founded Bayesian framework. These advantages have led to a proliferation of HMM-based gene finders, including Genie and GeneMark.hmm. Indeed, the highly model-based GENSCAN software package is currently considered to have an overall edge in this crowded field.
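As a rough illustration of the recognition side of these dynamic programming algorithms, the following sketch runs Viterbi decoding on a toy two-state HMM. The states ("exon" and "intron"), the GC-rich versus AT-rich emission probabilities, and the transition probabilities are all invented for this example. They are assumptions, not parameters from any actual gene finder, which would use far richer state structures.

```python
import math

# Toy model: every number below is an illustrative assumption.
STATES = ("exon", "intron")
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    "exon":   {"exon": math.log(0.9), "intron": math.log(0.1)},
    "intron": {"exon": math.log(0.1), "intron": math.log(0.9)},
}
LOG_EMIT = {
    "exon":   {b: math.log(p) for b, p in zip("ACGT", (0.1, 0.4, 0.4, 0.1))},
    "intron": {b: math.log(p) for b, p in zip("ACGT", (0.4, 0.1, 0.1, 0.4))},
}

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (most probable state path, its log probability) for a non-empty obs."""
    # V[t][s]: best log probability of any state path ending in s at position t
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, V[t - 1][r] + log_trans[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            V[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Follow the back-pointers from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]

path, log_prob = viterbi("GGGGAAAA", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

Working in log space avoids numerical underflow on long sequences. The learning side mentioned above (for example, Baum-Welch re-estimation) is not shown here.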
HMMs, indeed, have proven to be very versatile computational constructs in the bioinformatics domain, and are representative of a marked trend in the field in recent years toward Bayesian statistical methods. Eddy's HMMER is a suite of programs that builds HMM profiles describing families of proteins and searches with them, similar to Haussler's SAM system, which, on at least one test set of remote homologs, demonstrated performance that was somewhat better than iterative methods such as PSI-BLAST (and, of course, far superior to conventional pairwise search). HMMs can also be profitably used to build more elaborate models for other bioinformatics problems, as has been noted; profile HMMs have even been extended to deal with detection of genomic sequences coding for folded RNA structures, such as tRNA genes and methylation guide small nucleolar RNAs in yeast, by modeling the covariation at base-paired positions. These, in fact, constitute stochastic context-free grammars, and are formal generalizations of a view of HMMs as stochastic regular grammars.
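The nested dependencies between base-paired positions are what push these models beyond regular grammars. The sketch below is not a stochastic context-free grammar. It is the simpler, deterministic Nussinov-style recursion that maximizes the number of nested base pairs, shown only because it fills a table over subsequence intervals in the same way that parsing algorithms for context-free grammars do. The pairing rules (Watson-Crick plus G-U wobble) and the minimum hairpin loop length of 3 are assumptions chosen for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence (Nussinov-style).

    Assumptions: A-U, G-C and G-U pairs are allowed, and at least
    min_loop unpaired bases must separate the two partners of a pair.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j]: maximum number of pairs within seq[i..j] (inclusive);
    # entries for intervals too short to hold a pair stay at 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: position j is unpaired
            for k in range(i, j - min_loop):    # case 2: j pairs with some k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

hairpin_pairs = max_base_pairs("GGGAAAUCCC")
```

Each pair (k, j) splits the interval into an independent left part and an enclosed inner part, which is exactly the kind of nesting a context-free rule can express and a regular grammar cannot. A stochastic grammar would replace the pair count with rule probabilities.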
The realization that probabilistic methods could be profitably combined with syntactic constructs mirrors a trend in natural language processing toward statistical parsing methods. Another interesting recent development has been efforts to apply certain mildly context-sensitive formalisms to modeling of phenomena observed in both nucleic acid and protein structures, following on the early work of Searls. These techniques are either identical or directly related to some of the techniques developed at Penn based on a topological specification of a set of primitive structures and the associated composition operations. These approaches so far tend to mimic the linguistic formulations directly, instead of taking the insights from these linguistic approaches and then building on them by incorporating the topological properties directly into the system. There are at least two reasons for taking this different approach:
1. In the context of biological sequences the dependencies are 'real' in a sense, i.e., they are the direct consequence of the spatial proximity of residues which share bonds (thus leading to folding) and also of minimum energy considerations. It is therefore very attractive to incorporate these dependencies (foldings) directly into the initial structures of the 'linguistic' modeling.
2. Since the topological structures are functionally important, the proposed perspective has the potential of directly relating the structural description provided by the 'linguistic' model to the functional aspects related to the topological structure.
last updated: 27 March 2001
If you have any questions or suggestions, please send mail to trisha@ircs.upenn.edu