IRCS Conference Room
Department of Computer and Information Sciences
University of Pennsylvania
Large-scale paraphrasing for natural language understanding and generation
I will present my method for learning paraphrases - pairs of English expressions with equivalent meaning - from bilingual parallel corpora, which are more commonly used to train statistical machine translation systems. My method pairs English phrases like <thrown into jail, imprisoned> when they share an aligned foreign phrase like festgenommen. Because bitexts are large and because a phrase can be aligned to many different foreign phrases (including phrases in multiple foreign languages), the method extracts a diverse set of paraphrases. For thrown into jail, we learn not only imprisoned, but also arrested, detained, incarcerated, jailed, locked up, taken into custody, and thrown into prison, along with a set of incorrect or noisy paraphrases. I'll show two ways of filtering out the poor paraphrases: defining a paraphrase probability calculated from translation model probabilities, and re-ranking the candidate paraphrases using monolingual distributional similarity measures.
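As an illustration only, and not the speaker's implementation, one common way to turn translation model probabilities into a paraphrase probability is to marginalize over the shared foreign phrases: p(e2 | e1) = sum over f of p(f | e1) * p(e2 | f). The Python sketch below assumes that formulation. The function name `paraphrase_probs`, the second German phrase, and every probability in the toy tables are made up for the example.

```python
from collections import defaultdict


def paraphrase_probs(p_f_given_e, p_e_given_f, phrase):
    """Score paraphrase candidates for `phrase` by pivoting through foreign phrases.

    Assumed formulation: p(e2 | e1) = sum_f p(f | e1) * p(e2 | f).
    Both tables are nested dicts of the form {conditioning phrase: {phrase: prob}}.
    """
    scores = defaultdict(float)
    for foreign, p_f in p_f_given_e.get(phrase, {}).items():
        for candidate, p_e2 in p_e_given_f.get(foreign, {}).items():
            if candidate != phrase:  # a phrase is not its own paraphrase
                scores[candidate] += p_f * p_e2
    return dict(scores)


# Toy tables with invented numbers; a real system would read these
# from a phrase table extracted from word-aligned bitexts.
P_F_GIVEN_E = {
    "thrown into jail": {"festgenommen": 0.5, "inhaftiert": 0.5},
}
P_E_GIVEN_F = {
    "festgenommen": {"thrown into jail": 0.2, "arrested": 0.5, "imprisoned": 0.3},
    "inhaftiert": {"thrown into jail": 0.2, "imprisoned": 0.6, "jailed": 0.2},
}

if __name__ == "__main__":
    scores = paraphrase_probs(P_F_GIVEN_E, P_E_GIVEN_F, "thrown into jail")
    for candidate, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{candidate}\t{score:.2f}")
```

A candidate reachable through several foreign phrases accumulates probability from each pivot, which is why low-scoring candidates are natural targets for filtering.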
In addition to lexical and phrasal paraphrases, I'll show how the bilingual pivoting method can be extended to learn meaning-preserving syntactic transformations like the English possessive rule or dative shift. I'll describe a way of using synchronous context-free grammars (SCFGs) to represent these rules. This formalism allows us to re-use much of the machinery from statistical machine translation to perform sentential paraphrasing. We can adapt our "paraphrase grammars" to do monolingual text-to-text generation tasks like sentence compression or simplification.
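To make the idea of a synchronous rule concrete, here is a minimal Python sketch of the English possessive rule written as a paired rewrite, NP -> <NP_1 's NP_2, the NP_2 of NP_1>. The rule encoding, the `realize` helper, and the example bindings are assumptions made for illustration; they are not the grammar format used in the talk.

```python
# A synchronous rule has one left-hand side and two right-hand sides.
# Nonterminals that share an index (NP_1, NP_2) are linked, so they
# must expand to the same material on both sides.
POSSESSIVE_RULE = {
    "lhs": "NP",
    "source": ["NP_1", "'s", "NP_2"],
    "target": ["the", "NP_2", "of", "NP_1"],
}


def realize(side, bindings):
    """Fill one side of a rule: linked nonterminals are replaced by their
    bound token lists, and terminal symbols are copied through unchanged."""
    tokens = []
    for symbol in side:
        tokens.extend(bindings.get(symbol, [symbol]))
    return " ".join(tokens)


if __name__ == "__main__":
    # Hypothetical bindings for the two linked noun phrases.
    bindings = {"NP_1": ["the", "committee"], "NP_2": ["decision"]}
    print(realize(POSSESSIVE_RULE["source"], bindings))
    print(realize(POSSESSIVE_RULE["target"], bindings))
```

Because both sides are generated from the same bindings, the pair of outputs is a sentential paraphrase by construction; a decoder built for machine translation can search over such rules in the same way it searches over translation rules.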
I'll also briefly sketch future directions for adding a semantics to the paraphrases, which my lab has begun exploring for the new DARPA DEFT program.