IRCS
Technical Reports



08-01
The PDTB Research Group
The Penn Discourse Treebank 2.0 Annotation Manual

An important aspect of discourse understanding and generation involves the recognition and processing of discourse relations. Building on some early work on discourse structure in Webber and Joshi (1998), where discourse connectives are treated as discourse-level predicates that take two abstract objects such as events, states, and propositions (Asher, 1993) as their arguments, the Penn Discourse Treebank (PDTB) has annotated the argument structure, senses and attribution of discourse connectives and their arguments.1

This report documents the annotation guidelines and annotation styles for the second release of the PDTB (PDTB-2.0).2 The PDTB-2.0 distribution is available through the Linguistic Data Consortium (LDC)3 and contains the corpus, annotation manuals, relevant publications, and software to enable some simple and fast processing of the corpus data. PDTB-2.0 contains extensions and revisions of some aspects of the annotation since the first release, primarily with respect to the senses of connectives (Section 4) and the attribution of connectives and their arguments (Section 5).

Discourse connectives in the PDTB include: Explicit discourse connectives, which are drawn primarily from well-defined syntactic classes, and Implicit discourse connectives, which are inserted between paragraph-internal adjacent sentence pairs not related explicitly by any of the syntactically defined set of Explicit connectives. In the latter case, the reader must attempt to infer a discourse relation between the adjacent sentences, and “annotation” consists of inserting a connective expression that best conveys the inferred relation. Connectives inserted in this way to express inferred relations are called Implicit connectives. Multiple discourse relations (Webber et al., 1999) can also be inferred, and are annotated by inserting multiple Implicit connectives.

Adjacent sentence-pairs between which annotators found no Implicit connective to be appropriate are further distinguished as: (a) AltLex, where a discourse relation is inferred, but insertion of an Implicit connective leads to redundancy in its expression due to the relation being alternatively lexicalized by some other expression; (b) EntRel, where no discourse relation can be inferred and where the second sentence only serves to provide some further description of an entity in the first sentence (akin to entity-based coherence (Knott et al., 2001)); and (c) NoRel, where neither a discourse relation nor entity-based coherence can be inferred between the adjacent sentences.
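The distinctions above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading: the function name, the parameter names, and the idea of encoding annotator judgments as booleans are assumptions made here, not part of the PDTB distribution. The judgments themselves are made by human annotators and are not computed.

```python
from enum import Enum


class RelationType(Enum):
    EXPLICIT = "Explicit"
    IMPLICIT = "Implicit"
    ALTLEX = "AltLex"
    ENTREL = "EntRel"
    NOREL = "NoRel"


def classify_adjacent_pair(has_explicit_connective: bool,
                           relation_inferred: bool,
                           alternatively_lexicalized: bool,
                           entity_coherence: bool) -> RelationType:
    """Map (hypothetical) annotator judgments to a PDTB relation type."""
    if has_explicit_connective:
        # The sentences are already related by an Explicit connective.
        return RelationType.EXPLICIT
    if relation_inferred:
        # Inserting a connective would be redundant if some other
        # expression already lexicalizes the relation.
        if alternatively_lexicalized:
            return RelationType.ALTLEX
        return RelationType.IMPLICIT
    if entity_coherence:
        # No discourse relation, but the second sentence further
        # describes an entity in the first.
        return RelationType.ENTREL
    return RelationType.NOREL
```

For example, a pair with an inferred relation that is already expressed by another phrase maps to `RelationType.ALTLEX`, while a pair with neither a relation nor entity-based coherence maps to `RelationType.NOREL`.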

Because there are no generally accepted abstract semantic categories for classifying the arguments to discourse connectives as have been suggested for verbs (e.g., agent, patient, theme, etc.), the two arguments to a discourse connective are simply labelled Arg2, for the argument that appears in the clause that is syntactically bound to the connective, and Arg1, for the other argument.

1 The Penn Discourse Treebank Project (http://www.seas.upenn.edu/~pdtb) was partially supported by NSF Grant: Research Resources, EIA 02-24417 to the University of Pennsylvania (PI: Aravind Joshi).
2 In April 2006, a preliminary version of PDTB (PDTB-1.0) was released in order to get some feedback. This version is no longer available.
3 http://www.ldc.upenn.edu/
4 All connectives annotated in the PDTB have two and only two arguments. PDTB discourse-level predicate-argument structures are therefore unlike the predicate-argument structures of verbs at the sentence level (PropBank (Palmer et al., 2005)), where verbs can take any number of arguments. At the same time, however, we note that certain types of constructions could possibly be viewed as structures with more than two arguments, such as “Lists”

Supplements to Arg1 and Arg2, called Sup1 for material supplementary to Arg1, and Sup2, for material supplementary to Arg2, are annotated to mark material that is relevant but not “minimally necessary” for interpreting the relation.

Annotation of Explicit connectives and their arguments consists of selecting the corresponding span of text in the source text files. Supplementary material is annotated in the same way. Implicit connectives are annotated by first selecting the first character of Arg2 as the textual span for the Implicit connective, then selecting the text spans for Arg1 and Arg2 of the relation, and finally providing a word or phrase to express the relation. In the case of AltLex, instead of providing a word/phrase, the text span in Arg2 expressing the relation is selected and marked. EntRel and NoRel annotations only involve selection of the first character of Arg2 as the placeholder for the relation and then selection of the adjacent sentences as Arg1 and Arg2.
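The span-selection steps for the non-Explicit cases can be sketched as a small record type. This is a minimal illustration under stated assumptions, not the corpus file format (which is described in Section 6): the `Relation` class, the `annotate_non_explicit` helper, the representation of a span as a single `(start, end)` pair of character offsets with an exclusive end, and the use of the same first-character placeholder for AltLex are all hypothetical choices made here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical convention: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]

NON_EXPLICIT = {"Implicit", "AltLex", "EntRel", "NoRel"}


@dataclass
class Relation:
    rel_type: str                     # one of NON_EXPLICIT
    anchor: Span                      # first character of Arg2 (placeholder)
    arg1: Span
    arg2: Span
    connective: Optional[str] = None  # word or phrase supplied for Implicit
    altlex: Optional[Span] = None     # span in Arg2 expressing the relation


def first_char(span: Span) -> Span:
    """Return the one-character span at the start of `span`."""
    start, _ = span
    return (start, start + 1)


def annotate_non_explicit(rel_type: str, arg1: Span, arg2: Span,
                          connective: Optional[str] = None,
                          altlex: Optional[Span] = None) -> Relation:
    if rel_type not in NON_EXPLICIT:
        raise ValueError(f"unexpected relation type: {rel_type}")
    if rel_type == "Implicit" and not connective:
        raise ValueError("Implicit relations need an inserted connective")
    if rel_type == "AltLex":
        # The marked expression must lie inside Arg2.
        if altlex is None or not (arg2[0] <= altlex[0] and altlex[1] <= arg2[1]):
            raise ValueError("AltLex needs a span inside Arg2")
    if rel_type in ("EntRel", "NoRel") and (connective or altlex):
        raise ValueError("EntRel and NoRel carry no connective or AltLex span")
    return Relation(rel_type, first_char(arg2), arg1, arg2, connective, altlex)
```

With these assumptions, `annotate_non_explicit("Implicit", (0, 40), (41, 90), connective="because")` yields a relation whose `anchor` is `(41, 42)`, the first character of Arg2.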

Senses of connectives are annotated for Explicit connectives, Implicit connectives and AltLex relations. No senses are provided for EntRel and NoRel since no discourse relations are inferred for these. Sense labels are drawn from a hierarchical classification - a three-level hierarchy grouping connectives into classes, types and subtypes - and are annotated as features on connectives.
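As an illustration of how a three-level sense label might be handled in code, the sketch below parses a label into class, type and subtype. The dotted string notation, the placeholder label names, and the assumption that a label may stop at the class or type level are all hypothetical; the actual sense inventory and its representation are given in Sections 4 and 6.

```python
from typing import NamedTuple, Optional


class Sense(NamedTuple):
    sense_class: str
    sense_type: Optional[str] = None
    subtype: Optional[str] = None


def parse_sense(label: str, sep: str = ".") -> Sense:
    """Split a label such as 'ClassA.TypeB.subtypeC' into its levels."""
    parts = label.split(sep)
    if not 1 <= len(parts) <= 3 or any(not p for p in parts):
        raise ValueError(f"not a one- to three-level sense label: {label!r}")
    return Sense(*parts)


def subsumes(general: Sense, specific: Sense) -> bool:
    """True if `general` equals `specific` or is one of its ancestors."""
    for g, s in zip(general, specific):
        if g is None:      # `general` stops at a higher level
            return True
        if g != s:
            return False
    return True
```

Under these assumptions, `parse_sense("ClassA.TypeB")` subsumes `parse_sense("ClassA.TypeB.subtypeC")` but not the other way round, which is one way to compare annotations made at different depths of the hierarchy.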

Attribution, which is a relation of “ownership” between individuals and abstract objects, is annotated for Explicit connectives, Implicit connectives and AltLex relations, as well as their arguments. The annotation scheme aims to capture both the source and degrees of factuality of the abstract objects through the annotation of text spans signalling the attribution, and of features recording the source, type, scopal polarity, and determinacy of attribution.
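One possible way to picture the attribution annotation is as a small feature record attached separately to the relation and to each of its two arguments. The sketch below assumes exactly that layout; the class name, field names, dictionary keys, and the optional span are illustrative choices made here, and the feature values are left as free-form strings because the permitted values are defined in Section 5, not in this overview.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Relation types for which attribution is annotated.
ATTRIBUTED_TYPES = {"Explicit", "Implicit", "AltLex"}


@dataclass
class Attribution:
    source: str
    attr_type: str
    scopal_polarity: str
    determinacy: str
    # Text span signalling the attribution; treating it as optional
    # is an assumption of this sketch.
    span: Optional[Tuple[int, int]] = None


def attach_attribution(rel_type: str, rel: Attribution,
                       arg1: Attribution,
                       arg2: Attribution) -> Dict[str, Attribution]:
    """Bundle the attribution of a relation and of its two arguments."""
    if rel_type not in ATTRIBUTED_TYPES:
        raise ValueError(f"no attribution is annotated for {rel_type}")
    return {"Rel": rel, "Arg1": arg1, "Arg2": arg2}
```

Keeping three separate records reflects the statement that attribution is annotated for the relation as well as for its arguments, so each can in principle carry different feature values.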

The annotation guidelines described in this document draw and expand on earlier reports presented in annotation tutorials and papers, notably Miltsakaki et al. (2004a,b); Prasad et al. (2004); Dinesh et al. (2005); Prasad et al. (2005); Webber et al. (2005); Miltsakaki et al. (2005); Prasad et al. (2006, 2007). The rest of this section discusses the source corpus and annotation style of PDTB-2.0, and presents an overview of the annotation contained in the corpus, including an overview of the extensions from PDTB-1.0. Section 2 presents the annotation guidelines for the argument structure of Explicit connectives. Annotation guidelines for Implicit relations and their arguments are presented in Section 3. Section 4 presents the guidelines for sense annotation. Section 5 describes the guidelines for attribution annotation. File structures and representation formats of the corpus are described in Section 6. Finally, Appendices A-H provide distributions of some aspects of the annotations.
