IRCS Conference Room
School of Informatics
University of Edinburgh
A Distributional Theory of Content for NLP
Linguists and computational linguists have developed quite useful theories of the semantics of function words and of the corresponding logical operators, such as generalized quantifiers and negation (Woods, 1968; Montague, 1973; as adapted by Steedman, 2012). There has been much less progress in defining a usable semantics for content words. The consequences of this deficiency are serious. Linguists find themselves in the embarrassing position of saying that the meaning of "seek" is seek. Computationalists find that their wide-coverage parsers, which are now fast and robust enough to parse billions of words of web text, have very low recall as question answerers. The answers to questions like "Who wrote 'What Makes Sammy Run?'" are out there on the web, but they are not stated in the form suggested by the question ("Budd Schulberg wrote 'What Makes Sammy Run?'"). Instead, they appear in some other form that paraphrases or entails the answer, such as "Budd Schulberg's 'What Makes Sammy Run?'". Semantics as we know it is not provided in a form that supports practical inference over the variety of expression we see in real text. I'll discuss recent work with Mike Lewis that seeks to define a novel form of semantics for content words using semi-supervised machine learning methods over unlabeled text. True paraphrases are represented by the same semantic constant. Common-sense entailment is represented directly in the lexicon, rather than delegated to meaning postulates and theorem proving. The method can be applied cross-linguistically, in support of machine translation. I'll discuss extensions of the method to extract an aspect-based semantics for temporal entailment, and speculate about the relation of this representation of content to the hidden prelinguistic language of mind that must underlie all natural language semantics, but which has so far proved resistant to discovery.