WORKSHOP ON LANGUAGE MODELING OF BIOLOGICAL DATA
February 25, 26 and 27, 2001
IRCS and Center for
Bioinformatics
University of Pennsylvania
Philadelphia, PA 19104
General comment: All the participants were very pleased with the
workshop, in terms of both content and format. The two groups
(computational biologists and computational linguists) genuinely felt that
they were learning from each other and could clearly see potential areas of
close interaction. The format of the workshop greatly helped in this
process. Where available, the presenters' slides are included with their
contact information on the List of Participants Webpage.
Summary of recommendations:
(1) The workshop was built around the major themes listed below,
which clearly constitute major areas of collaboration.
(a) STRUCTURE OF GENES AND GENOMES. KEY ISSUES: How much do
strongly model-based, "syntactic" algorithms enhance gene
identification and genome characterization? Can general-purpose or
domain-specific parsing methods find application in genome analysis?
(b) STRUCTURE OF MACROMOLECULES. KEY ISSUES: What is the
practical utility of non-regular stochastic grammars in recognizing RNA
secondary structures? Can stronger (as compared to CFG) formal systems be
useful in protein structural studies?
(c) PATTERN SEARCH AND ANALYSIS. KEY ISSUES: What do
stochastic methods add to sequence search and analysis? Are there uses for
recent statistical linguistic methods? Do linguistic methods apply to the
analysis of regulatory regions?
(d) INFERENCES FROM GENOMES. KEY ISSUES: Are there lessons
from comparative linguistics for comparative genomics? How does phylogenetic
reconstruction resemble classical linguistics? How can multiple genomes be used
to infer phylogenetic relationships, protein interactions, etc.? Are there
techniques from linguistics and/or machine learning that might bear on the
analysis of gene expression?
(e) DEVELOPMENTS IN PARSING. KEY ISSUES: What are some of the
recent developments in statistical parsing in computational linguistics that
may be of relevance to computational biology (CB)?
(f) STRONG MODELS OF GRAMMARS (Grammars with structured
primitives). KEY ISSUES: What are some of the structural linguistic aspects
motivating the strong models? What are their computational and stochastic
implications, and their relevance to topological structures in CB? Are there
technologies that could be successfully adopted, patterned after the success
of HMMs in biology?
(2) From the computational linguistics (CL) side, many techniques in
mathematical/computational linguistics and machine learning appear to be
extensible to CB problems, no doubt modified in appropriate ways. These results
may possibly feed back into CL work, although this is difficult to assess at
this stage.
(3) A possible title for collaborative work between CB and CL
researchers was suggested, namely, computational biolinguistics.
(4) The following 7 items were mentioned with special
emphasis. In a way they are covered by the themes in (1) above.
(a) Overlap of search strategies and global optimization.
(b) Evolutionary models.
(c) Common techniques: singular value decomposition,
dimensionality reduction, clustering algorithms, etc.
(d) Discriminative vs. generative models and their
combinations.
(e) Formal language theory of macromolecules.
(f) Role of lexicalization.
(g) Protein fold recognition, comparative modeling, and
structural predictions, in general.
(5) Although the title of the workshop was Language Modeling
of Biological Data, the focus of the workshop was on modeling of biological
sequences. This choice was deliberate. It provided a strong focus and coherence
to a two-and-a-half-day workshop. The following 4 items are topics that were not the
focus of the workshop. However, they could be considered as falling under the
general title of the workshop.
(a) Extracting information from biological texts, for
example, chemical pathways and possibly their integration across documents.
(b) Animal communication.
(c) Process modeling (e.g., RNA polymerization).
(d) Context-dependent signaling and communication.
(6) After the workshop, several participants expressed the
desire to have a follow-up workshop in about a year, but no later than two years,
in order to track the progress of the collaborations.
With respect to sharing resources (possibly specially
prepared for the collaborative effort), the following were mentioned.
1) PFAM -- protein domain database -- Richard Durbin/Sean Eddy (www.sanger.ac.uk/Pfam, pfam.wustl.edu)
2) GO Ontology -- www.geneontology.org
3) NLP tools:
   a) ACL Natural Language Software Registry (hosted at DFKI, registry.dfki.de)
   b) the LDC, ELRA/ELDA, TELRI and Elsnet resources catalogues and repositories (www.ldc.upenn.edu, www.icp.inpg.fr/ELRA, www.telri.de and www.elsnet.org/resources.html)
4) Ontologies --> Lynette Hirschman, who also agreed to serve as the point of contact for possible applications of CL work in information extraction for the biology domain.
Also check out http://www.ccs.neu.edu/home/futrelle/bionlp/
last updated: 23 February 2001
Institute for Research in Cognitive Science
University of Pennsylvania
400A 3401 Walnut Street
Philadelphia, PA 19104-6228
phone: +1-215-898-0357
fax: +1-215-573-9247
If you have any questions or suggestions please send mail to language-modeling@cis.upenn.edu.