LANGUAGE MODELING OF BIOLOGICAL DATA

 

WORKSHOP ON LANGUAGE MODELING OF BIOLOGICAL DATA

Feb 25, 26 and 27 2001

IRCS and Center for Bioinformatics

University of Pennsylvania

Philadelphia, PA 19104

 

General comment: All the participants were very pleased with the workshop, in terms of both content and format. Both groups (computational biologists and computational linguists) genuinely felt that they were learning from each other and could clearly see potential areas of close interaction. The format of the workshop greatly helped in this process. When available, the presenter slides are included with the presenter contact information on the List of Participants Webpage.

 

Summary of recommendations:

(1)    The workshop was built around the major themes listed below, and they clearly constitute major areas of collaboration.

(a)    STRUCTURE OF GENES AND GENOMES. KEY ISSUES: How much do strongly model-based, "syntactic" algorithms enhance gene identification and genome characterization? Can general-purpose or domain-specific parsing methods find application in genome analysis?

(b)    STRUCTURE OF MACROMOLECULES. KEY ISSUES: What is the practical utility of non-regular stochastic grammars in recognizing RNA secondary structures? Can stronger (as compared to CFG) formal systems be useful in protein structural studies?

(c)    PATTERN SEARCH AND ANALYSIS. KEY ISSUES: What do stochastic methods add to sequence search and analysis? Are there uses for recent statistical linguistic methods? Do linguistic methods apply to the analysis of regulatory regions?

(d)    INFERENCES FROM GENOMES. KEY ISSUES: Are there lessons from comparative linguistics for comparative genomics? How does phylogenetic reconstruction resemble classical linguistics? How can multiple genomes be used to infer phylogenetic relationships, protein interactions, etc.? Are there techniques from linguistics and/or machine learning that might bear on the analysis of gene expression?

(e)    DEVELOPMENTS IN PARSING. KEY ISSUES: What are some of the recent developments in statistical parsing in computational linguistics (CL) that may be of relevance to computational biology (CB)?

(f)     STRONG MODELS OF GRAMMARS (grammars with structured primitives). KEY ISSUES: What are some of the structural linguistic aspects motivating the strong models? What are their computational and stochastic implications? How relevant are they to topological structures in CB? Are there technologies that could be successfully adopted, patterned after the success of HMMs in biology?
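
As an illustration of theme (b), the sketch below shows why RNA secondary structure calls for non-regular formalisms: base pairs nest, so recognizing them amounts to context-free parsing. It is a Nussinov-style dynamic program that maximizes the number of nested base pairs, which is a simple, non-stochastic stand-in for the stochastic grammars discussed at the workshop and was not itself presented there. The function name `max_nested_pairs`, the `min_loop` parameter, and the allowed pairings (Watson-Crick plus G-U wobble) are assumptions made for illustration.

```python
# Assumed pairing rules for illustration: Watson-Crick pairs plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_nested_pairs(seq, min_loop=0):
    """Return the maximum number of nested (non-crossing) base pairs in seq.

    min_loop is the minimum number of unpaired bases required between
    the two partners of a pair (a hypothetical parameter for this sketch).
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j] inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, splitting the problem
            # into an independent left part and a nested inner part.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + inner + 1)
            best[i][j] = score
    return best[0][n - 1]


# A small hairpin: three G-C pairs stacked around an AAA loop.
print(max_nested_pairs("GGGAAACCC", min_loop=3))  # 3
```

The two cases of the recurrence mirror the productions of a small context-free grammar (extend with an unpaired base, or wrap a nested region in a pair and concatenate); attaching probabilities to those productions is what would turn this into a stochastic grammar.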

 

(2)    On the CL side, many techniques in mathematical/computational linguistics and machine learning appear to be extensible to CB problems, no doubt modified in appropriate ways. These results may possibly feed back into CL work, although this is difficult at this stage.

(3)    A possible title for collaborative work between CB and CL researchers was suggested, namely, computational biolinguistics.

 

(4)    The following 7 items were mentioned with special emphasis. In a way they are covered by the themes in (1) above.

(a)    Overlap of search strategies and global optimization.

(b)    Evolutionary models.

(c)    Common techniques: singular value decomposition, dimensionality reduction, clustering algorithms, etc.

(d)    Discriminative vs. generative models and their combinations.

(e)    Formal language theory of macromolecules.

(f)     Role of lexicalization.

(g)    Protein fold recognition, comparative modeling, and structural predictions, in general.
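
To make item (c) concrete, here is a minimal sketch of dimensionality reduction via the leading singular triplet of a matrix, computed by power iteration in plain Python. It is not a method presented at the workshop; the function name `top_singular_triplet` and the toy count matrix are hypothetical, and in practice a numerical library routine would be used instead. The same computation applies whether the rows are documents described by word counts or sequences described by k-mer counts.

```python
import math


def top_singular_triplet(rows, iterations=200):
    """Approximate the largest singular value sigma and its vectors u, v.

    Uses power iteration on A^T A, starting from a uniform vector. This
    assumes the start vector is not orthogonal to the leading right
    singular vector, which holds for the non-negative count data used here.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # Multiply by A, then by A^T.
        av = [sum(a * x for a, x in zip(row, v)) for row in rows]
        atav = [sum(rows[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        if norm == 0.0:
            break
        v = [x / norm for x in atav]
    av = [sum(a * x for a, x in zip(row, v)) for row in rows]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma > 0 else av
    return sigma, u, v


# Hypothetical toy data: each row is one item (a sequence or a document),
# each column a feature count (a k-mer or a word).
counts = [[4.0, 0.0, 1.0], [3.0, 1.0, 0.0], [0.0, 5.0, 4.0]]
sigma, u, v = top_singular_triplet(counts)
# One-dimensional coordinates of the three items along the leading component.
coords = [sigma * x for x in u]
print(coords)
```

Keeping only the first few such components gives the low-dimensional representation on which clustering algorithms are then typically run.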

 

(5)    Although the title of the workshop was Language Modeling of Biological Data, its focus was on the modeling of biological sequences. This choice was deliberate: it gave a strong focus and coherence to a two-and-a-half-day workshop. The following four topics were not the focus of the workshop, although they could be considered as falling under its general title.

(a)    Extracting information from biological texts, for example, chemical pathways and possibly their integration across documents.

(b)    Animal communication.

(c)    Process modeling (e.g., RNA polymerization).

(d)    Context-dependent signaling and communication.

(6)    After the workshop, several participants expressed the desire to have a follow-up workshop in about a year, and no later than two years, in order to track the progress of the collaborations.

 

With respect to sharing resources (possibly specially prepared for the collaborative effort), the following resources were mentioned:

1)      PFAM -- protein domain database -- Richard Durbin/Sean Eddy www.sanger.ac.uk/Pfam, pfam.wustl.edu

2)      Gene Ontology (GO) -- www.geneontology.org

3)      NLP tools:

a)      ACL Natural Language Software Registry (hosted at DFKI, registry.dfki.de)

b)      the LDC, ELRA/ELDA, TELRI and Elsnet resource catalogues and repositories (www.ldc.upenn.edu, www.icp.inpg.fr/ELRA, www.telri.de and www.elsnet.org/resources.html)

4)      Ontologies --> Lynette Hirschman, who also agreed to serve as the point of contact for possible applications of CL work in information extraction for the biology domain.

 Also check out http://www.ccs.neu.edu/home/futrelle/bionlp/


Language Modeling of Biological Data
Institute for Research in Cognitive Science
University of Pennsylvania
400A 3401 Walnut Street
Philadelphia, PA 19104-6228
phone: +1-215-898-0357
fax: +1-215-573-9247

last updated: 23 February 2001
If you have any questions or suggestions please send mail to language-modeling@cis.upenn.edu.

