Friday, November 16, 2007, 12-2 p.m.

Ani Nenkova
Department of Computer and Information Science, University of Pennsylvania

Multi-document summarization by people and machines

For people, summarizing a set of related articles is a non-deterministic process: different people choose somewhat different content for inclusion in their summaries, as does the same person summarizing the articles at different points in time. This fact poses a big challenge for the much-needed automation of summarization and for the evaluation of summarization output. Moreover, current automatic models, unlike people, are deterministic in their operation.

In this talk I will give an overview of some relevant trends in automatic multi-document summarization and its evaluation. Specifically, I will present an analysis of content overlap among multiple human summaries of the same text, and an evaluation method based on that analysis that is diagnostic and predicts the possibility of having different but equally good summaries. I will also discuss how frequencies in the input and measures of topicality give a robust way of estimating information importance for content selection. Finally, I will discuss how these findings from automatic summarization can be helpful in developing a more faithful model of the human summarization process. Such a model will not only give insight into how people operate, but also has the potential to improve content selection and text flow decisions in automatic methods.