OntoNotes Release 5.0
1 Introduction
This document describes the final release (v5.0) of OntoNotes, an annotated corpus
whose development was supported under the GALE program of the Defense Advanced
Research Projects Agency, Contract No. HR0011-06-C-0022. The annotation is provided
both in separate text files for each annotation layer (Treebank, PropBank, word sense,
etc.) and in the form of an integrated relational database with a Python API to provide
convenient cross-layer access. More detailed documents (referred to at various points
below) that describe the annotation guidelines and document the routines for deriving
various views of the data from the database are included in the documentation directory
of the distribution.
1.1 Summary Description of the OntoNotes Project
Natural language applications like machine translation, question answering, and
summarization are currently forced to depend on impoverished text models such as bags of
words or n-grams, while the decisions that they are making ought to be based on the
meanings of those words in context. That lack of semantics causes problems throughout
the applications. Misinterpreting the meaning of an ambiguous word results in failing to
extract data, incorrect alignments for translation, and ambiguous language models.
Incorrect coreference resolution results in missed information (because a connection is
not made) or incorrectly conflated information (due to false connections). Some richer
semantic representation is badly needed.
The OntoNotes project was a collaborative effort between BBN Technologies, Brandeis
University, the University of Colorado, the University of Pennsylvania, and the
University of Southern California's Information Sciences Institute. The goal was to annotate a
large corpus comprising various genres (news, broadcast, talk shows, weblogs, usenet
newsgroups, and conversational telephone speech) in three languages (English, Chinese,
and Arabic) with structural information (syntax and predicate argument structure) and
shallow semantics (word sense linked to an ontology and coreference). OntoNotes builds
on two time-tested resources, following the Penn Treebank for syntax and the Penn
PropBank for predicate-argument structure. Its semantic representation adds coreference
to PropBank, and includes partial word sense disambiguation for some nouns and verbs,
with the word senses connected to an ontology. OntoNotes includes roughly 1.5 million
words of English, 800 K of Chinese, and 300 K of Arabic. More details are provided in
Weischedel et al. (2011).
This resource is being made available to the natural language research community so that
decoders for these phenomena can be trained to generate the same structure in new
documents. Lessons learned over the years have shown that the quality of annotation is
crucial if it is going to be used for training machine learning algorithms. Taking this cue,
we strove to ensure that each layer of annotation in OntoNotes has at least 90%
inter-annotator agreement.
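This section does not say how agreement was measured for each layer. As an illustration only, the sketch below computes simple pairwise percent agreement, one common way to check a threshold such as 90%. The function name, the sense labels, and the sample data are hypothetical and are not taken from the OntoNotes distribution or its Python API.

```python
def percent_agreement(labels_a, labels_b):
    """Return the fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical sense tags assigned by two annotators to ten instances of one word.
annotator_1 = ["s1", "s1", "s2", "s1", "s3", "s2", "s1", "s1", "s2", "s1"]
annotator_2 = ["s1", "s1", "s2", "s1", "s3", "s2", "s1", "s2", "s2", "s1"]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"{agreement:.0%}")  # prints "90%": the annotators differ on one item of ten
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic would be a different measure from the one sketched here.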
This level of semantic representation goes far beyond the entity and relation types
targeted in the ACE program, since every concept in the text is indexed, not just 100 pre-
specified types. For example, consider this sentence: “The founder of Pakistan's nuclear
program, Abdul Qadeer Khan, has admitted that he transferred nuclear technology to