The idea that the observed semantic structure of human language results from an adaptive competition between accuracy of expression and efficiency of communication is not new. It has been suggested in various forms by Zipf, Shannon, and Mandelbrot, among many others. In this talk I will discuss a novel technique for studying this competition between accuracy and efficiency of communication solely from the statistics of large linguistic corpora. By exploiting the deep and intriguing duality between source and channel coding in Shannon's information theory, we can directly explore the relationship between semantic accuracy and the complexity of the representation in a large corpus of English documents. We do this by evaluating the accuracy of identifying the topic of a document as a function of the complexity of the semantic representation, as captured by relevant hierarchical clustering of words via the information bottleneck method. What we obtain is a scaling relation (a power law) that, unlike the famous Zipf's law, directly quantifies the statistical way in which words are semantically refined in human language. It may therefore reveal some quantitative properties of human cognition that can now be explored experimentally in other languages or in other complex cognitive modalities such as music and mathematics. This talk is based in part on joint work with Noam Slonim. See also: http://www.cs.huji.ac.il/labs/learning/Theses/Noam_phd1.ps.gz
Dr. Naftali Tishby is currently on sabbatical at the
CIS department at U Penn. Until last summer he served as the founding chair of
the new computer engineering program at the School of Computer Science and
Engineering at the Hebrew University. He is a founding member of the
Interdisciplinary Center for Neural Computation (ICNC) and one of the key
teachers of the well-known computational neuroscience graduate program of the
ICNC. He received his PhD in theoretical physics from the Hebrew University in
1985 and has been a research staff member at MIT, Bell Labs, AT&T, and
NECI since then. His current research is at the interface between computer
science, statistical physics, and computational biology. He introduced various
methods from statistical mechanics into computational learning theory and
machine learning and is interested in particular in the role of phase
transitions in learning and cognitive phenomena. More recently he has been
working on the foundations of biological information processing and has developed
novel conceptual frameworks for relevant data representation and learning
algorithms based on information theory, such as the Information Bottleneck
method and Sufficient Dimensionality
Reduction.
http://www.cs.huji.ac.il/~tishby