# Evolution of Language: From Animal Communication to Universal Grammar

## Lecture :: Evolution of Language

**Martin Nowak**

It’s a great pleasure for me to be here. My background is, as Scott Weinstein said, in theoretical biology, so I apologize to all those who understand much more about language and psychology than I do. I think everybody in the audience, actually, will understand more. But what I would like to bring into the subject is evolutionary biology, so I would like to link thinking about language to models of evolutionary biology.

Evolutionary biology is actually a fairly mathematical discipline, because the basic ingredients of evolution, mutation and selection, are very well described by mathematical models. And this already has a long-standing tradition starting with the work of R.A. Fisher, J.B.S. Haldane, and Sewall Wright in the 1930s, linking Darwin’s ideas of evolution to Mendel’s observations of genetics.

So then I would like to talk about evolution of language today; I would like to talk about mathematical models of evolution of language. And, before I continue, I would like to tell you a story about a mathematical biologist. There’s a man and a flock of sheep, and another man comes by and says, “If I can guess the correct number of sheep in your flock, can I have one?” And the shepherd says, “Sure, try.” And the man looks at the sheep and says “Eighty-three.” And the shepherd is completely amazed, and the man picks up a sheep and starts to walk away, and the shepherd says “Hang on. If I guess your profession, can I have my sheep back?” And the man says, “Sure, try.” “You must be a mathematical biologist.” “How did you know?” “Because you picked up my dog.”

So, you will also have realized by now that my eight years at Oxford did not give me an Oxford accent, but maybe it’s some comfort to you that my students always say I speak exactly like Arnold Schwarzenegger.

So I’ve worked with a number of people on the evolution of language, and I would like to mention all of them: Natalia Komarova, David Krakauer, Joshua Plotkin (at Princeton right now), Peter ??? (a mathematician who moved from Princeton to Harvard recently), Andrea xxxxxxxxx, and xxx (who is at the University of Chicago and is a computational linguist). David Krakauer, in fact, was the person who one evening came to my house in Oxford and said, “Let’s work on the evolution of language.” And I said, “Okay.”

Why would one want to work on the evolution of language? One reason is that the Linguistic Society of Paris officially banned any work on language evolution at their meeting in 1866. This was only a few years after Darwin published “On the Origin of Species.” By the way, Darwin made a number of interesting comments about language evolution. For him, it was totally clear that human language emerged gradually from animal communication. He also compared languages to species in the sense that languages compete with each other and some languages go extinct, and once languages are extinct, they never reappear, like species. Another reason I would like to work on language evolution is that Chomsky suggested that language might not have arisen by Darwinian evolution, which is a surprising statement for an evolutionary biologist. Or we could say we would like to work on language evolution because one view is that language came as a by-product of a big brain, which Steve Pinker compares to the idea that we have all the parts of a jetliner assembled in some backyard and then a hurricane sweeps through and randomly puts the jetliner into place.

I would like to convince you that language is a very complex trait, and it is extremely unlikely that such a complex trait can arise as a random by-product of some other process or as one gigantic mutation or something like that. I would like to say I work on language evolution because language is the most interesting thing to evolve in the last several hundred million years. Maybe ever since the evolution of the nervous system, actually. Why is it the most interesting thing? In my opinion, because it is really the last of a series of major events that changed the rules of evolution itself. So if you ask, among all the things that are covered by evolution, what actually affected the *rules* of evolution? First, you would mention the origin of life, because otherwise there wouldn’t be anything; maybe four billion years ago, who knows? Maybe on this earth, maybe somewhere else. By 3.5 billion years ago there were prokaryotes, there were already bacterial cells, so it’s a very short time, actually, if you really believe that it started four billion years ago. Because it then took two billion years to go from bacterial cells to higher cells. Approximately 600 million years ago there were multicellular organisms. And sometime, who knows when, complicated language. Language changes the rules of evolution because it creates a new mode of evolution. It is no longer the case that information transfer is limited to *genetic* information transfer, as it was for most of the time of life on earth. Instead we have the ability to use language for unlimited cultural evolution; certainly animals have cultural evolution, but language allows us to bring this to a qualitatively new state, and we can transmit information to other individuals and to the next generation on a non-genetic carrier, a purely linguistic carrier.

Now I would like to give a very personal list of what I think is very remarkable about language, and this should serve again to make the point that language is indeed a complicated trait, and a complicated trait can *only* arise gradually and by natural selection. Language gives us unlimited expressibility. In the beautiful words of von Humboldt, it “makes infinite use of finite means.” Worldwide, there are approximately 6,000 languages; many, by the way, are threatened by extinction. If you look at all these natural languages, there is no *simple* language. So all of these naturally occurring languages have unlimited expressibility. The only exception, maybe, is pidgin, and the very interesting observation is that pidgin, which is a very impoverished language, arises whenever you bring together two groups that do not have a language in common. Nevertheless, in one generation, children who receive the input from this pidgin language turn it into a more-or-less full-blown language, which is called a creole. This is a very interesting process that underlines our innate ability to interpret linguistic input as referring to a full-blown language. I think it is made complicated by the fact that most of these children also receive input from another full-blown language, and I think this is something that is more and more appreciated at the moment. Though this objection may not hold for the observation of sign language development, which was recently reported in Nicaragua, as I understand it. Children who had grown up in their own families with very impoverished sign languages were brought together for the first time, and then, because they used signing as their primary communication, it developed very rapidly into a complicated language.

What is remarkable about language? All of you know about 60,000 words, and maybe many more. This is the average for a 17-year-old American high school student. If you take this figure, 60,000 words, then you learned about one new word per hour for sixteen years. Our language instinct is associated with an enormous memory capacity. It would be unimaginable to memorize 60,000 telephone numbers with the people they refer to. But in some sense, in our lexicon, we have an arbitrary meaning memorized with each word. And of course it is an extremely hard question to ask, and people here have worked on this, what it means for a child to learn the *meaning* of a word. A three-year-old child gets more than 90% of grammatical rules right. This is something that I read in the books and do not observe with my own children. Apparently, if you count all the mistakes they *could* make, 90% of the mistakes they *don’t* make. So very early on, they have some instinct for grammatical language.

Speech production is the most complicated mechanical motion we perform. As you are all aware, in order to produce the sounds of speech, various organs have to be coordinated with each other, movements have to be regulated within fractions of millimeters, and the timing has to be right within tenths, maybe several hundredths, of a second. Likewise, speech comprehension occurs with an impressive speed. Artificially accelerated speech can allow you to recognize up to fifty phonemes per second, which is above the theoretical resolution of auditory input of about twenty separate sounds per second. The reason is that several phonemes are packed into each moment of speech sound. So we can make the argument that language is a complicated trait and that many different parts of our anatomy are really geared to deal with it in an almost perfect way. Talking is totally effortless: we can speak without thinking, and usually we do. Again, I am fascinated by the comparison with how hard it is to multiply two numbers; it requires enormous concentration. At the same time, the computations involved in producing and interpreting language are arguably much more complicated. So I’ve made all these points just to say that language is a complex trait that can only arise gradually by natural selection.

When did language evolve? This is something that people ask over and over again. The question has to be refined a bit, but first let’s take some facts. The first fact is that all humans have complicated language. The other fact is that the most recent common ancestor of all humans lived about 100,000 years ago. This is actually not a well-confirmed figure at the moment, because it was calculated for the male lines of ancestors using the Y chromosome. So this is Adam; Adam lived 100,000 years ago. If you take the female lines of ancestors, using mitochondrial DNA, which we only get from our mothers, you find that Eve lived about 200,000 years ago. So it’s not totally in agreement. You can actually get this disagreement if you assume that there was a higher male mortality. But sometime around 100,000 years ago, all of Africa was populated with anatomically modern people. By 60,000 years ago, there was the spread from East Africa to the whole world.

Since everybody now has complicated language, and language must have a biological foundation, as I wanted to convince you here, we must assume that these people had language. So the most recent date that you could argue for is between 60,000 and 100,000 years ago. If you would like to say 60,000 years, then you have to assume that the people who went from East Africa to the rest of the world 60,000 years ago also replaced the people who had already populated Africa 100,000 years ago. And the other bound on the timing is that chimpanzees do not have language’s unlimited expressibility, and humans and chimps separated about five million years ago. We all know that there are experiments on training chimpanzees, and Nim Chimpsky was one of the individuals that was taught to sign. They are certainly very clever, but at the same time, nobody would argue that they have unlimited expressibility in their natural communication system.

At the same time, we have to be aware that monkeys have brain areas that are very similar to the human language centers. But they don’t seem to use them for vocalizations. Wernicke’s area and Broca’s area, as I understand it, are not used by these monkeys for vocalization. But again, experts here in the audience may know much more about it than I do. What I want to say is that it is clear that language uses cognitive abilities that evolved long ago. So language is not just a great moment of evolution’s blind watchmaker some 100,000 to 200,000 years ago, but is the consequence of playing with animal cognition for 500 million years. The question is therefore not *when* did language evolve, but when in evolutionary time can we find which aspect of language.

Evolution always uses the trick of taking certain structures that have evolved for other purposes, re-wiring them, and using them for unexpected new purposes. The same must have happened for the human language faculty. Another question which is asked very, very often is: why do only humans have language? And again, more precisely one should ask: why did only one family of the animal kingdom evolve communication with unlimited expressibility? Partly, this is what my research is about, in the sense that I would like to show you what the steps in language evolution are, what the complication is in making these steps, and where animal communication may *not* have taken the next step. But partly, this question remains unanswered, because in one sense it is almost a historical question, and science doesn’t really allow you to make statements about one-time events.

So the goal of this research program is to formulate an evolutionary theory of language, and to study how natural selection guides the gradual emergence of the basic design features of human language, such as arbitrary signs, syntactic signals, and grammar. And again, evolution is a mathematical theory, and therefore I would like to formulate everything I say in terms of mathematical concepts: at the very least, to define whatever we are talking about.

So first, let us try to imagine how evolution leads to arbitrary signs. Let us look at a simple signaling system, maybe something that is used by animals. There is a matrix which somehow links each referent of the world, whatever can be referred to, to a signal. One can still see this design in current human language, in the form of the lexical matrix. In current human language, the lexical matrix links word meaning with word form. A lot of entries in this matrix are zero, which means a given word meaning is not linked to a given word form, but some of the entries are linked. You can think of this matrix as a binary matrix, where either you are linked or you are not linked, or you can have gradations in this matrix, where one association is stronger than another.

In order to define speaking and hearing, one has to go from this lexical matrix to two other matrices, which are two stochastic matrices that describe probabilities. So there is a mathematical transformation that generates out of this lexical matrix a matrix that encodes how a speaker uses language. This is basically: if I want to say a certain referent, what is the probability that I use a certain signal for it? And the other way around: if I receive a certain signal, what is the probability that I associate a certain referent with it? These two matrices need not be identical, but they can be similar, and they are both related to the lexical matrix. So the way communication then works is that there is a speaker who wants to communicate referent *i* and uses a certain signal *j*. The speaker matrix tells him or her which signal to use. And then the receiver will receive signal *j*, and the hearer matrix will tell the receiver which referent to associate with it. That is the basic mathematical ingredient of our simple language model here.
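To make this concrete, here is a minimal sketch in Python of how the two stochastic matrices might be derived from a lexical matrix. The particular normalization (rows for speaking, columns for hearing) and the example entries are illustrative assumptions, not the lecture's exact formulas.

```python
import numpy as np

# Hypothetical lexical matrix A: rows = referents, columns = signals.
# A[i, j] > 0 means referent i is associated with signal j (with some strength).
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

# Speaker matrix P: P[i, j] = probability of using signal j for referent i
# (normalize each row of A).
P = A / A.sum(axis=1, keepdims=True)

# Hearer matrix Q: Q[j, i] = probability of inferring referent i from signal j
# (normalize each column of A, then transpose so rows index signals).
Q = (A / A.sum(axis=0, keepdims=True)).T
```

Each row of `P` and of `Q` is then a probability distribution, which is what makes them stochastic matrices.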

And then we use evolutionary game theory. Game theory was developed by John von Neumann and Oskar Morgenstern in 1944 to describe human behavior and economic behavior, but John Maynard Smith applied it to animal behavior and linked it to evolutionary processes. So we assume that whenever a speaker and a hearer have correct communication about a certain referent, they both get a point. So we make the very simple assumption that communication is advantageous to both of us. I could make it more complicated, and at some stage I will have to, where I would assume I give you information and it only helps you, it doesn’t help me; or I tell you to do something which is bad for you, so I manipulate you, and that helps me but not you. We can extend the model in these directions, but at the moment I stay with the model that assumes it helps both of us. So we assume a cooperative model.

Then I can write down the payoff function, which is essentially the probability that I will use a certain signal to convey a certain meaning and you will associate exactly the same signal with this meaning; I average over all the signals and over all the referents, and then I take the average over the case where I speak to you and the case where you speak to me. This is how you can derive this payoff function. Now I use evolutionary dynamics in conjunction with this payoff function.
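The verbal description above can be sketched as a small function; the averaging over both speaking directions follows the text, but the exact form and the function name are assumptions.

```python
import numpy as np

def payoff(P1, Q1, P2, Q2):
    """Average communicative payoff between two individuals.

    P: speaker matrix (referent -> signal probabilities)
    Q: hearer matrix (signal -> referent probabilities)
    The success probability when 1 speaks to 2 about referent i with
    signal j is P1[i, j] * Q2[j, i]; we sum over referents and signals,
    then average the two speaking directions.
    """
    one_speaks = np.sum(P1 * Q2.T)
    two_speaks = np.sum(P2 * Q1.T)
    return 0.5 * (one_speaks + two_speaks)
```

For two individuals sharing a perfect one-to-one lexicon over n referents, this payoff equals n, one point per referent.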

So ‘evolutionary dynamics’: what does that mean? There is a group of individuals; everyone starts with a random lexical matrix, so there are no associations to start off with. Everyone talks to everyone else, and there is the same probability that one person talks to another person. For each successful communication there is a payoff of one point, as I indicated in the function before. Individuals produce children proportional to their payoff. This is very important, by the way: language ability has to have consequences for biological fitness. If it doesn’t, then natural selection has no ability to shape our language instinct. So in some sense, language must have an impact on, say, survival probability. If survival probability increases, we may also have more children. And then finally, I will assume that children learn the lexical matrix of their parents. At first this is a convenience, because it is very hard to formulate a mathematical theory where you can actually learn from everybody, but we are working on this at the moment. So one could say that predominantly I learn my language from my parents and I also have other input, but here, in the mathematical model, at first we assume that you get it *just* from your parents.

If you could learn language randomly from anyone in the population, that would actually destroy the fitness effect, because there would be no reward for speaking well. If I want to learn language from lots of people, I should also preferably learn from those who communicate well. So fitness can be rewarded in two different ways: either directly biologically, by learning from the parents, or by preferentially learning from those who communicate well, based on reputation. So you can run this on a computer. It’s not just a computer simulation; we can also write down equations and analyze them in detail. But in a computer simulation you could say that there are three individuals here in your population at time zero, and they have some random entries, which overall don’t make sense. Then you run these evolutionary dynamics for some generations, and you will find that certain associations begin to emerge, and if you run for long enough, you may find that everybody uses the same signal to convey the same meaning. And it is arbitrary, because whether the first signal is associated with the first referent or with any other referent is just something that happens. Whether or not the population arrives at this final stage of a coherent lexicon depends on how *well* children learn from their parents. This is exactly what I want to analyze.
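A toy version of such a simulation might look like the sketch below. The population size, the sampling-based learning step, and all parameter values are illustrative assumptions, not the lecture's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R, S = 30, 5, 5      # population size, number of referents, number of signals
GENS, K = 100, 10       # generations; utterances a child observes per referent

def norm_rows(A):
    # Turn each row into a probability distribution (uniform if the row is empty).
    s = A.sum(axis=1, keepdims=True)
    return np.where(s > 0, A / np.where(s > 0, s, 1), 1.0 / A.shape[1])

def speaker(A):
    # P[i, j]: probability of using signal j for referent i.
    return norm_rows(A)

def hearer(A):
    # Q[j, i]: probability of inferring referent i from signal j.
    return norm_rows(A.T)

pop = rng.random((N, R, S))   # random initial lexical matrices: no shared associations

for _ in range(GENS):
    P = np.array([speaker(A) for A in pop])
    Q = np.array([hearer(A) for A in pop])
    # F[a, b]: expected number of successful communications when a speaks to b.
    F = np.einsum('aij,bji->ab', P, Q)
    # Everyone talks to everyone else, in both directions; exclude self-talk.
    fit = (F + F.T).sum(axis=1) - 2 * np.diag(F)
    parents = rng.choice(N, size=N, p=fit / fit.sum())  # children proportional to payoff
    # Each child learns by observing K utterances per referent from its parent
    # and counting which signals the parent used (an imperfect learning step).
    pop = np.array([
        np.array([np.bincount(rng.choice(S, size=K, p=speaker(pop[p])[i]),
                              minlength=S) for i in range(R)], dtype=float)
        for p in parents
    ])
```

Run long enough, such a population typically converges on a coherent, arbitrary lexicon; how reliably depends on how well children learn from their parents, here controlled by `K`.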

So you could say children make certain errors during language learning, and one simple kind of error is that they do not learn *all* lexical items. So there is a certain lexical item, and maybe it is so infrequent that children don’t get it, and it’s lost. Then we can calculate, given the probability *u* that children do not learn a specific association, which somehow measures the accuracy of this cognitive ability to learn lexical items, the total number of items that can be maintained in a population. We can show that the maximum lexicon size is given by *1/(2u)*. So the accuracy of acquiring words gives you the total number of words. We can also calculate the distribution of words that are known by people: some people know a lot of words, other people know few words. It is a Gaussian distribution, and the average number of words known by an individual is the total number of words divided by Euler’s constant *e*. So this is a specific mathematical model for this one type of error. But of course this one type of error is not really the most general type of error, because what you would like to do is allow children to learn incorrect associations. This really gets to the problem of how to learn the meaning of a word, because children might have certain guesses and may get it wrong in the beginning, but they may refine their guesses and get it right later. Here we are doing the mathematical analysis just at the moment, and we find in preliminary results that what you really need, if you have this kind of mistake, is a mechanism to sort out conflicts. Otherwise the communicative ability of a population will diverge. For example, if you have ambiguities, you want to resolve those ambiguities, at least to some degree. As we know, lexicons have ambiguities, and we cannot get rid of them; that is also what we find here, but you want to reduce them to increase communicative ability.

The other thing that is very important is that there must be something like a necessary reduction of expectation, in the sense that the child must know which are the potential referents. If there is a word, what *could* it label? The child cannot possibly be totally open-minded about what it could refer to. So we can calculate, in terms of the mathematical model here, for a given probability of mistaking one referent for another, the total number of potential referents that the child can consider.

So I think that, in the same way that some people like to talk about Universal Grammar, one could make an argument for a Universal Lexicon. There have to be some innate components to solving the task of learning the meaning of a word.

The next extension of our theory was to assume that there is an error matrix, essentially a noisy channel, that links speaker and hearer. For example, I say ‘lion’ and you understand ‘banana.’ That is an unlikely mistake, but other mistakes are more likely. This is also important for animal communication, because signals may not be arbitrarily well resolved from each other. We have a setup here that is totally equivalent, or at least superficially equivalent, to Shannon’s information theory. There is again the matrix that determines speaking, and this is like Shannon’s encoding matrix. Then we have an error matrix, which gives the probability that one signal will be mistaken for another; this is Shannon’s noisy channel. And then we have our hearer matrix, which is the decoding.
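The speaker-channel-hearer pipeline can be sketched by chaining the three matrices; the particular confusion matrix below is an invented example, not data from the lecture.

```python
import numpy as np

# Toy 3-signal system. U is the confusion (error) matrix of the noisy channel:
# U[j, k] = probability that signal j is perceived as signal k.
P = np.eye(3)                    # speaker (encoding): referent i -> signal i
Q = np.eye(3)                    # hearer (decoding): signal k -> referent k
U = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# Probability of successful communication about each referent i:
# chain speaker -> noisy channel -> hearer, as in Shannon's
# encoding / noisy channel / decoding pipeline.
success = np.einsum('ij,jk,ki->i', P, U, Q)
total_fitness = success.sum()
```

Even with a perfect lexicon, the channel caps the per-referent success probability, which is the starting point for the error limit discussed next.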

If we analyze this now in a population setting with fitness, which is different from information theory, then we find that there is an error limit. This is essentially the observation that there is a maximum fitness, which is achieved by using a small number of signals to describe the most relevant objects in the world around you. Adding more signals to the system reduces the overall fitness of the individual. In such a setup, therefore, natural selection *prefers* a limited repertoire. This is interesting because I think we will all agree that animal communication in natural situations is based on limited repertoires, perhaps because increasing the number of signals doesn’t really increase communicative ability, given the chance of mistaking signals for each other.

How did *human* language deal with this error limit? The obvious answer is that this error limit can be overcome by using signals that consist of sequences of basic units, which are phonemes. So human language works with sequences of phonemes. Mathematically, what we can show, and this is similar to a theorem proved by Shannon in information theory, is that the maximum achievable fitness of this communication system increases exponentially with the code length, with the word length, essentially. Of course, you do not want to have arbitrarily long codes, because then the rate of communication goes down. So if you want to optimize both, you have words of intermediate length. But what I want to suggest here is that human language overcame this error limit by sequencing phonemes, and therefore has the ability to generate large numbers of words which can be fairly well resolved from each other.
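A back-of-the-envelope illustration of this point: the number of possible words grows exponentially in word length, while per-word transmission accuracy only decays geometrically. The phoneme count and accuracy figure are made up for illustration, not taken from the lecture's model.

```python
# With n basic phonemes, a word of length l is one of n**l possibilities,
# so the signal space grows exponentially with l. If each phoneme is
# transmitted correctly with some fixed probability, whole-word accuracy
# decays only geometrically in l.
n_phonemes = 10
per_phoneme_accuracy = 0.99

for l in (1, 2, 4, 8):
    n_words = n_phonemes ** l
    word_accuracy = per_phoneme_accuracy ** l
    print(f"length {l}: {n_words} possible words, accuracy {word_accuracy:.3f}")
```

So at length 8 there are already a hundred million distinguishable words while per-word accuracy is still above 92%, which is why sequencing beats the error limit.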

A fascinating question for me is: do animals have words? Really, that depends on what you would call a word, because there are certain aspects to a word, some of which you will definitely find in animal communication. If you say a word is an arbitrary sign, then I think you would say that animals also have arbitrary signs. If words are sequences of phonemes, then you could say birdsong also, in some sense, makes use of sequencing sounds. We don’t really know how important that is for the meaning, because the meaning of birdsong seems to be fairly standard. Then you could say that the meaning of a word in human language also depends on context, and here it is much *less* clear whether animals have the same ability in their natural communication systems. Again, I remind you of the situation where a child in the one-word stage says something like the word *kitty* and can use it, actually, to denote many different meanings: ‘there is the kitty,’ or ‘where is the kitty?’ Still, it is just a one-word utterance. I am not sure if there is an example of animal communication where they can do that. I wonder if the ability of children to do this is basically a consequence of the fact that they have listened to their parents’ syntactic structure and have seen that this one word can occur in different syntactic contexts, and therefore, without actually making use of the syntactic structure themselves, they make use of the context dependence.

So again, for me a most fascinating question is: do animals have words? And I would like to know what experts like John Smith or Dorothy Cheney would say about this. Animal communication: bees, most remarkably, can communicate to each other the location of food and the amount of food. It is like a three-dimensional analog system. And then there is the famous example of animal communication, vervet monkeys, worked out by Dorothy Cheney: they have a handful of signals, and maybe the most interesting ones are ‘eagle,’ ‘snake,’ and ‘leopard’; there are more of these, and as you know, they can use the signals to induce a certain behavior. So for example, ‘leopard,’ and the monkeys jump up into a tree. I understand that the young have to learn these signals and get them wrong initially, and I also understand that there is an anecdote that these signals can sometimes be used for deceptive purposes. One monkey has a banana, another one shouts “Leopard!”, the first monkey drops the banana and jumps into the bush, and the other one takes the banana.

So, as you know, human language has an interesting design feature which is referred to as duality of patterning. It means that human language makes use of combinatorics on two different levels: sequences of phonemes form words (I talked about this already), but then there is another stage where combinatorics comes into play, and this is that sequences of words form sentences. In the next part of my talk, I would like to discuss the evolution of syntactic communication.

Let me define syntactic communication in the most trivial way, just to say that syntactic communication is whenever signals consist of components that have their own meaning. Non-syntactic communication is the opposite: you use signals that cannot be decomposed into parts with their own meanings. I ask: under what condition does natural selection see the advantage of syntactic communication? Imagine a very simple scenario: say you have two nouns, *lion* and *monkey*, and two verbs, *running* and *sleeping*, and you want to refer to the events ‘lion running,’ ‘monkey running,’ ‘lion sleeping,’ and ‘monkey sleeping.’ A non-syntactic communication system would have four arbitrary signals for these things. A syntactic communication system would have signals for the components of the events, so for *lion* and *monkey* and for *running* and *sleeping*.

Now I basically build a population-dynamics evolutionary model and ask when natural selection sees the advantage of syntactic communication. Obviously, syntactic communication allows a larger repertoire and allows one to formulate new messages that have not been learned beforehand. So our sentences are syntactic signals and our words are non-syntactic signals, if you take words as listemes: all the words we have to learn, but the sentences we say are new messages. Syntax also allows us to use a signal in different contexts. So there are clear advantages to syntactic communication. The interesting observation, however, was that in our evolutionary models we find that natural selection can only favor syntactic communication if the number of relevant messages exceeds a certain threshold, so that individual components can be used in many different messages. There is a certain mathematical condition that one can write down, and if the social interactions of the group do not fulfill this condition, then natural selection prefers non-syntactic communication over syntactic communication. So only as the social complexity of interactions increased and more and more messages became relevant could natural selection see why it would be better to use syntactic communication. Otherwise, you are better off sticking with the non-syntactic communication system, which obviously does not give you unlimited expressibility. We call this the “syntax threshold” in evolutionary theory.
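The counting intuition behind the syntax threshold can be sketched very simply; this toy signal-count comparison is an illustration, not the lecture's actual condition.

```python
def signals_needed(n_nouns, n_verbs, syntactic):
    """How many distinct signals a repertoire requires.

    Non-syntactic: one holistic signal per noun-verb event (n * m).
    Syntactic: one signal per component word (n + m).
    """
    return n_nouns + n_verbs if syntactic else n_nouns * n_verbs
```

With the lecture's 2 x 2 example both systems need four signals, so syntax buys nothing; with, say, 30 nouns and 30 verbs a syntactic system needs 60 signals against 900 holistic ones, which is the regime where syntax can pay off.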

Now, in the final part of my talk, I would like to discuss with you the evolution of grammar. This is work together with Natalia Komarova, an applied mathematician at Princeton, and xxx, a computational linguist at Chicago. First, let us think: what is grammar? I think Chomsky called grammar the computational system of language. It is essentially the part that allows us to make infinite use of finite means. There is an architecture proposed by the linguist Ray Jackendoff describing what grammar is in the human language capacity. You have phonological rules, and these phonological rules are linked to hearing and speaking. There is an interface that links these phonological rules to syntactic rules, and an interface that links these syntactic rules to conceptual or semantic rules, which then determine perception and action. Grammar is essentially an overarching rule system that includes all of these. So, if you like, grammar is a rule system that generates a mapping between hearing and speaking, which is our signal formulation, and perception and action, very much as we had before in the lexical matrix.

So in the most general way, you can think of grammar as a rule system that generates such a matrix, linking referents to signals. Mathematically, the conventional representation of grammar is a rule system that generates a subset of strings, of the set of sentences, of the integers if you like. So you could imagine the set of all possible sentences, and you could describe everything as a binary language, because, as you know, computers encode everything that way. So you can enumerate all possible sentences. There are infinitely many, but we can enumerate them. Some of them make sense, others don’t make sense at all. And then we can say that a grammar is a rule system that tells you which of the sentences make sense and which of the sentences do not.
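As a toy illustration of grammar-as-subset, here is a membership rule over binary strings. The rule itself, that a sentence is grammatical iff it has the form 0^n 1^n, is a classic textbook context-free language, used here purely as a stand-in for "the sentences that make sense."

```python
def grammatical(sentence: str) -> bool:
    """Toy grammar over binary strings: accept exactly the strings 0^n 1^n, n >= 1."""
    n = len(sentence)
    half = n // 2
    return (n > 0 and n % 2 == 0
            and sentence[:half] == "0" * half
            and sentence[half:] == "1" * half)
```

The grammar here is nothing but a decision rule carving a subset out of the enumerable space of all strings, which is exactly the abstract picture described above.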

And now, what is the process of grammar acquisition? The observation is that children reliably acquire the grammar of their native language by hearing grammatical sentences, and the information available to the child does not uniquely determine the underlying grammatical rules. This is called the poverty of the stimulus. This is not actually a matter of debate, but a mathematical fact. This was first observed by Chomsky, and then made rigorous by Gold, who formulated it as a theorem. Imagine that I have a rule system that can generate certain integers, and I give you examples of my integers, because, as I said, sentences can always be enumerated. After a number of examples (and you can choose how many examples you would like to hear; I will give you as many as you wish), I ask you what my rule is. Or I ask you to give me some examples back. We can show mathematically that there is no way you could solve this problem if you had no preformed expectation about what my rule could be, or some preformed expectation about the most complicated rule I could possibly consider. This is a fundamental part of learning theory. People like Scott Weinstein have worked on this and written books about exactly this mathematical foundation of learning theory. From that perspective, the necessity of innate components is undebatable.

So what Chomsky then said with respect to language acquisition is that children could not guess the correct grammar if they had no preformed expectation, and this innate expectation is Universal Grammar. In that context, grammar acquisition works in the following way: there is an environmental input, which consists of sample sentences, for example, and then there is a learning procedure by which the child evaluates this environmental input. This learning procedure tells the child to choose one of the candidate grammars that are available to the child in the search space. So this is the total set of hypotheses that can occur to the child during language learning. And this search space could be finite or infinite. If it is infinite, then the child must have a prior distribution over which candidate grammars are more likely than others.

What I would like to propose now is that Universal Grammar, what Chomsky called Universal Grammar, consists exactly of the rule system that generates the search space, together with the mechanism for how to evaluate the input. So in this context Universal Grammar is equivalent to the mechanism of language acquisition. If I then say Universal Grammar has to be innate, I am not saying anything controversial, because you all agree that a mechanism for learning language has to be there, because you learn language. But *how* you learn language is something you must already know; it cannot itself be an outcome of the learning mechanism. What, then, in reality goes into this Universal Grammar? This is one of the most fascinating questions of cognitive science. It seems clear that it is not only linguistic structure that must be part of Universal Grammar, but also very general cognitive abilities like theory of mind. I think you could not learn the meaning of a word if you did not have a model for the theory of mind of the person who teaches you.

We will now proceed with a finite search space, because this is what we are able to formulate so far. We can extend these mathematical models to an infinite search space with a prior distribution; I do not think this would greatly change the qualitative nature of the results I am presenting. So now I link this concept of language acquisition and Universal Grammar to the population dynamics of grammar acquisition. I ask: what conditions does Universal Grammar have to satisfy for a population to evolve and maintain coherent grammatical communication? This is the real evolutionary question. We would like to know which cognitive abilities must be in place such that a population of individuals will have a chance to come up with, and maintain, a grammatical communication system.

In order to do this, I have to introduce a few technical aspects. So, imagine there is a grammar *i* and a grammar *j*. These are rule systems that generate sets of sentences; the sets may have a certain overlap. I will define, with the number *aij*, the probability that a speaker of grammar *i* says a sentence that is compatible with somebody who uses grammar *j*. This tells you the pairwise relationship between grammars. It will be very important for the task of learning which grammar is actually being spoken by a certain person, as you will appreciate very soon.
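As a toy illustration of this definition (the grammars, their sentence sets, and the integer encoding below are all hypothetical, chosen just for the sketch), *aij* can be computed as the fraction of grammar *i*'s sentences that grammar *j* also accepts:

```python
# Toy illustration: treat each grammar as a finite set of "sentences",
# with integers standing in for enumerated sentences. The entry a_ij is
# the probability that a random sentence of grammar i is compatible
# with grammar j.

def compatibility(grammar_i, grammar_j):
    """Fraction of grammar_i's sentences that grammar_j also accepts."""
    return len(grammar_i & grammar_j) / len(grammar_i)

g1 = {1, 2, 3, 4}        # hypothetical grammar: the sentences it generates
g2 = {3, 4, 5, 6, 7, 8}  # another hypothetical grammar

a12 = compatibility(g1, g2)  # a g1-speaker understood by a g2-listener
a21 = compatibility(g2, g1)

print(a12)  # 2 of g1's 4 sentences are shared -> 0.5
print(a21)  # 2 of g2's 6 sentences are shared -> 0.333...
```

Note that *aij* need not equal *aji*: the matrix of pairwise relationships is in general not symmetric.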

So I have introduced another *a* matrix, and I apologize for always using the letter ‘a’ for matrices; this is very different from the matrix I had before. This *a* matrix now specifies the pairwise relationships between the candidate grammars in the search space. It does not tell you the full configuration of the search space; it only considers the pairwise relationships. Now, this *a* matrix will be a consequence of your favorite theory of grammar acquisition, for example principles and parameters, or optimality theory. You don’t really know what this matrix looks like. Therefore we will do a trick that physicists use whenever they come across such a problem. There was a time when they wanted to describe how the components of an atomic nucleus interact with each other, and the interactions are very complicated, so what they did was to assume: let’s make them random. Let’s start with a random matrix and see whether we can make some headway this way. So we will assume that this matrix here is a random matrix, and ask how the system behaves as if it were a random matrix with certain properties. Again, it is the same as before with learning the lexicon: you must assume that communication has some consequence for fitness. The way we do it with grammar is actually very natural, because we now have these interactions between the two grammars. If one speaker uses grammar *i* and another uses grammar *j*, then the payoff, which is the chance of mutual communicative compatibility, is just the average of whenever *i* says something and *j* understands, and whenever *j* says something and *i* understands. Take the average of the two. So we have a link between this *a* matrix and evolutionary gain: communicative success.
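A minimal sketch of this construction (the matrix entries are random by assumption, as in the talk; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of candidate grammars in the search space (illustrative)

# Random compatibility matrix: a[i, j] = probability that a sentence of
# grammar i is compatible with grammar j; the diagonal is 1, since a
# grammar is fully compatible with itself.
a = rng.uniform(0.0, 1.0, size=(n, n))
np.fill_diagonal(a, 1.0)

# Payoff for an i-speaker meeting a j-speaker: the average of the two
# directions of understanding, F[i, j] = (a[i, j] + a[j, i]) / 2.
F = (a + a.T) / 2

print(np.allclose(F, F.T))  # the payoff matrix is symmetric: True
```

The payoff matrix *F* is symmetric even though the underlying *a* matrix is not, because the payoff averages both directions of understanding.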

Now, as before, we write down the population dynamics of grammar acquisition. We assume that everybody talks with everybody else with some probability. Payoff translates into fitness, reproduction is proportional to fitness, children receive input from their parents and develop their own grammar. Everybody has the same Universal Grammar; that is what we assume in the beginning. So it is a very autonomous mathematical model where everybody has a particular mechanism of language acquisition, and we ask: is this mechanism of language acquisition sufficiently powerful that the population will converge on a coherent communication system?

The mathematical equation that we looked at has the following form: it is a system of differential equations, *ẋi = Σj fj xj qji − Φ xi*, where *xi* is the frequency of all those who use grammar *i*, and the sum over *xi* is normalized to one, so that the system does not consider changes in population size. The fitness *fi* of all those who use grammar *i* is a function of what everybody else in the population uses: it is the sum over all *j* of *xj*, the frequency of those who use grammar *j*, times the payoff that somebody who uses grammar *i* gets from somebody who uses grammar *j*. This fitness determines the reproductive rate of all those individuals who use grammar *i*, and *qij* is the probability that a learner will acquire grammar *j* from a teacher with grammar *i*. This is where your learning mechanism comes into play. Perfect learning would be *qii = 1*, and you would never make a mistake; but you never learn perfectly, so you may have a slightly different grammar from the grammar of your parents, for example.
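These dynamics can be sketched by simple numerical integration. The symmetric payoff matrix, the learning-accuracy value, and the initial frequencies below are illustrative choices for the sketch, not values from the talk:

```python
import numpy as np

def language_dynamics(x, F, Q, dt=0.01, steps=20000):
    """Euler-integrate dx_i/dt = sum_j f_j x_j Q[j, i] - phi * x_i,
    where f = F @ x gives each grammar's fitness and phi = x . f is the
    average fitness of the population."""
    for _ in range(steps):
        f = F @ x                      # fitness of each grammar
        phi = x @ f                    # average fitness
        x = x + dt * ((f * x) @ Q - phi * x)
        x = np.clip(x, 0.0, None)
        x /= x.sum()                   # keep frequencies normalized
    return x

# Symmetric toy example: 3 grammars, pairwise overlap 0.2, learning
# accuracy q = 0.95, with learning errors spread evenly over the
# other candidate grammars.
n, overlap, q = 3, 0.2, 0.95
F = np.full((n, n), overlap); np.fill_diagonal(F, 1.0)
Q = np.full((n, n), (1 - q) / (n - 1)); np.fill_diagonal(Q, q)

x0 = np.array([0.5, 0.3, 0.2])  # unequal initial frequencies
x = language_dynamics(x0, F, Q)
print(x)  # the initially most common grammar predominates
```

With learning accuracy this high, the population converges on the grammar that started out most common, which is the behavior described below.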

We have this term *Φ* here, which is the average fitness of the population; it is the term that makes sure that the overall population size stays constant. But it has an interesting interpretation in our model: it is the grammatical coherence. It is precisely the probability that if one person says a random sentence, another person will understand that sentence. For us, as theoretical evolutionary biologists counting sheep, it is interesting to see the relation that this language equation has to other equations in evolutionary biology. On the one hand you have the quasispecies equation, which was designed by Manfred Eigen for the chemical evolution of the origin of life, and on the other you have the so-called replicator equation, which is the fundamental equation of frequency-dependent selection and evolutionary game theory. Our equation is on the continuum between these two equations, generalizing both.

Now, what we see for the language equation is that there are two kinds of equilibrium solutions. One is where everybody uses a different grammar, a kind of “after the Tower of Babel” situation in which nobody understands anyone else; the other is where one grammar predominates. Which of these solutions is stable? It depends on Universal Grammar, on the mechanism that individuals use to learn the language. That is what you want to calculate.

First, we will work toward a random matrix, but we start by making it even simpler: a very symmetric matrix where we assume that all grammars are equally good and all grammars have the same distance from each other. The chance that a person who uses grammar 1 says a sentence compatible with grammar 1 is one, so it is obvious that the diagonal here is 1, and the pairwise overlap between any two different grammars is a constant. We do this because we can solve it analytically, and later we will extend it. If you have this matrix for the pairwise relationship among grammars, then the learning mechanism will generate a *Q* matrix that has the same structure: *q* is the probability of learning the grammar correctly and *p* is the probability of learning it incorrectly. There are *n+1* equilibria of this equation: *n* asymmetric solutions, in addition to the one symmetric solution, which is no coherent communication. We can then write down equations that tell us when the asymmetric solutions come into existence and when they are stable. I shouldn’t really go into all these details, but let me show you a bifurcation diagram, which has the following form. There is the accuracy of language acquisition *q*, which is the probability that children learn exactly the language of their parents, the grammar of their parents. If this accuracy is below a certain threshold, the only solution you have is the one where every grammar is equally likely to be adopted, so there is no coherent communication. But if this accuracy goes above a certain threshold, then the asymmetric solutions come into existence and also become stable (this yellow branch is the stable branch of the solution here), and they refer to the situation where the population has settled into one predominant grammar. Which of the *n* grammars in the search space it is, is arbitrary.
So there is symmetry breaking in the system: you go from a uniform distribution over all candidate grammars to one grammar dominating. And then the uniform solution loses stability here, so *q > q1* is a necessary condition for grammatical coherence and *q > q2* is a sufficient condition. The coherence threshold can be formulated in the following way: if *q > q1*, then Universal Grammar can produce coherent communication in the population, and you can calculate this *q1* for this simple example.
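The threshold behavior can be seen numerically by sweeping the learning accuracy *q* and measuring the equilibrium coherence *Φ*. This is a sketch under illustrative assumptions (the integration scheme, matrix sizes, and overlap value are my choices, not from the talk):

```python
import numpy as np

def equilibrium_coherence(n, overlap, q, steps=30000, dt=0.01):
    """Integrate the language dynamics for the fully symmetric model and
    return the grammatical coherence phi = x . (F x) at the end."""
    F = np.full((n, n), overlap); np.fill_diagonal(F, 1.0)
    Q = np.full((n, n), (1 - q) / (n - 1)); np.fill_diagonal(Q, q)
    rng = np.random.default_rng(1)
    x = rng.dirichlet(np.ones(n))      # random initial frequencies
    for _ in range(steps):
        f = F @ x
        phi = x @ f
        x = np.clip(x + dt * ((f * x) @ Q - phi * x), 0.0, None)
        x /= x.sum()
    return x @ (F @ x)

# Sweep learning accuracy q: below the coherence threshold all grammars
# stay roughly equally frequent (low phi); above it, one grammar
# predominates and coherence is high.
n, overlap = 10, 0.3
low = equilibrium_coherence(n, overlap, q=0.5)
high = equilibrium_coherence(n, overlap, q=0.99)
print(low, high)  # coherence jumps once q crosses the threshold
```

Plotting the equilibrium against a fine grid of *q* values reproduces the shape of the bifurcation diagram described above.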

More generally, we can say something about the maximum complexity of the search space. The learning accuracy *qn* is a declining function of the size of the search space, and *qn > q1* is an implicit condition for the maximum complexity of the search space that is compatible with coherent communication in the population. So that is a condition that Universal Grammar has to fulfill such that there can be coherent communication.

I have not yet talked about an actual learning mechanism, but now I would like to do so. I would like to discuss two learning mechanisms that are arguably boundaries for the learning mechanisms actually used by the child. One is the simplest learning mechanism you can imagine, the so-called memoryless learner, and the other is a complicated one; our cognitive abilities are better than the memoryless learner, but not sufficient to perform as well as the other learning mechanism I will describe. The memoryless learner works in the following way: you start with a randomly chosen candidate grammar in your search space, and you stay with the current grammar as long as you receive input that is compatible with it. You change to a different grammar if a sentence is not compatible, and you stop after a certain number of sentences.
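A sketch of this learner (the search space of toy grammars, their shared “core” sentence, and the sample sizes are all hypothetical, constructed so that wrong guesses are eventually contradicted):

```python
import random

def memoryless_learner(teacher, grammars, num_sentences, rng):
    """Memoryless learner: hold a randomly chosen candidate grammar; on
    each teacher sentence the current grammar cannot parse, jump to a
    different random candidate. Grammars are modeled as sentence sets."""
    current = rng.randrange(len(grammars))
    for _ in range(num_sentences):
        sentence = rng.choice(sorted(grammars[teacher]))   # teacher speaks
        if sentence not in grammars[current]:              # incompatible input
            current = rng.choice(
                [g for g in range(len(grammars)) if g != current])
    return current

# Hypothetical search space: 5 grammars sharing a common core sentence
# {0}, each with its own private sentences.
rng = random.Random(42)
grammars = [{0, 10 * i + 1, 10 * i + 2} for i in range(5)]

hits = sum(memoryless_learner(0, grammars, 50, rng) == 0
           for _ in range(1000))
print(hits / 1000)  # empirical accuracy of acquiring the teacher's grammar
```

Running the same experiment with fewer sentences, or with more candidate grammars, lowers the empirical accuracy, which is the intuition behind the threshold result discussed next.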

And if you have this learning mechanism, then you can calculate the coherence threshold that Universal Grammar has to satisfy, which is that the number of sentences the child receives has to be greater than a constant times the size of the search space. If this condition holds, then the mechanism of language acquisition can induce coherent communication in a population. And if we now draw the bifurcation diagram for a random *a* matrix, it looks much more complicated than what I showed before. There is again something corresponding to the uniform solution, where all grammars are more or less equally likely, and then individual grammars have different branches that reflect when they can take over the population; for a given language-learning accuracy there is a subset of all the possible grammars in your search space that can dominate. We know now that the threshold equation I showed also applies to this situation. This is a consequence of a result in random matrix theory proven only very recently by a mathematician at Temple University in Philadelphia.

The other learning mechanism I would like to discuss is the batch learner. A batch learner is like xxx with infinite memory: it memorizes all the sentences it receives, and afterwards it chooses the candidate grammar that is most compatible with all this input. Our cognitive capabilities are certainly not as good as that. The coherence threshold of the batch learner is that the number of input sentences has to exceed a constant times the logarithm of the size of the search space. And whatever mechanism we actually use is most likely between these two boundaries.
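For contrast with the memoryless learner, here is a sketch of a batch learner over the same kind of hypothetical toy search space (grammar sets, sample sizes, and tie-breaking rule are my assumptions):

```python
import random

def batch_learner(teacher, grammars, num_sentences, rng):
    """Batch learner: memorize all input sentences, then pick the
    candidate grammar compatible with the largest share of them
    (ties broken at random)."""
    sample = [rng.choice(sorted(grammars[teacher]))
              for _ in range(num_sentences)]
    def score(g):  # how many memorized sentences grammar g can parse
        return sum(s in grammars[g] for s in sample)
    best = max(score(g) for g in range(len(grammars)))
    return rng.choice([g for g in range(len(grammars)) if score(g) == best])

rng = random.Random(7)
# Same style of hypothetical search space as before: a shared core
# sentence {0} plus private sentences per grammar.
grammars = [{0, 10 * i + 1, 10 * i + 2} for i in range(5)]

# Even with a much smaller input budget than the memoryless learner,
# the batch learner is almost always right: its required sample grows
# only like log(n), not like n.
hits = sum(batch_learner(0, grammars, 10, rng) == 0 for _ in range(1000))
print(hits / 1000)
```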

So we could extend this model in several ways; partly we are trying to do this, and partly we hope that others will. We can formulate it in a spatial context, and then you can ask how, in different locations, different grammars become predominant. So far we have described a homogeneous, well-mixed population, in which one grammar will dominate if the accuracy of the language acquisition mechanism is high enough; but in different regions of space, the population will settle into different grammars. Then one has to ask whether, in this setup, the boundaries between languages would be stable. One can also formulate these dynamics for small populations. The deterministic equations always describe large populations, but one would like to describe small populations with a stochastic equation; we have started to analyze this. And you can also generalize the model to assume that different grammars have different performances. In what I have shown you so far, each grammar was equally good, also for the random matrix. But you can assume that certain grammars can express certain things but not others, and then you can think of the population searching for better and better adapted grammars in a large search space of possible grammars. Then you would have a cultural evolution trying to adapt toward a grammar that has better and better expressibility, or less and less ambiguity. In that sense you can describe the cultural evolution of grammar, and also compare the ideas of this model to those of Xxx, from his article *Language changes*, from historical linguistics. So this is one way to make connections to observations about language change.

Then, what we did already do is study the evolution of Universal Grammar in a biological sense. In the model so far I have assumed that everyone has the same Universal Grammar, and asked whether it gives rise to coherence or not. But what you would like to allow is variation in the mechanism of language acquisition, that is, variation of Universal Grammar, which is then not so universal. So suppose people differ in their genetic ability to deal with Universal Grammar, and ask how, under mutation, natural selection changes Universal Grammar. In order to do this you have to study equations of the following type: there would be two different Universal Grammars in the population, and you ask which one is favored by natural selection. What we find so far is that natural selection acting on Universal Grammars leads to a limited period of language acquisition. On the one hand you want to receive a large input, because then you learn accurately; but then you also have to learn for a very long time. You can analyze precisely the situation where everybody has the same learning mechanism and the same search space, and the only difference is how much input is considered. If you do this, you find that natural selection prefers an intermediate amount of input, which is one idea to suggest why there could be a limited language acquisition period in children: because it is the evolutionary optimum. We also find, in some sense (but this is very hard to make rigorous, at least at the moment), that natural selection leads to intermediate search spaces. You could say: if I want to maximize the accuracy of language learning, I should just be born with whatever language we are speaking, because that is the most perfect way xxx. So you could survive xxx, but obviously it may just be impossible to do it. But then you will say, okay, so natural selection would reduce the size of the search space as much as possible.
The complication, however, is this: imagine that there are two kinds of people, ones with a very limited search space and ones with a larger search space. Now somebody makes a cultural invention about language, say invents the concept of a subclause, and the larger search space allows you to learn this, while the smaller search space does not. So you have natural selection for maintaining flexibility. I would like to make this more precise mathematically; it is hand-waving at the moment. But in other areas we know very well that evolution pays a big price for remaining flexible. So that could also be a reason why we remain flexible for language acquisition; the price that we pay is essentially the inaccuracy, or difficulty, of learning language.

The very last thing I would like to mention is that, similar to the syntax threshold I described, there is also something like a grammar threshold. Imagine you have a population of speakers who now have sentence types that are relevant to their performance, and imagine there is a finite number of relevant sentence types. Then you could say one strategy to learn language is: don’t search for rules at all, just learn it by heart; memorize the sentence types. Of course, with our current understanding of language this is stupid, because we know we don’t do this. But I would like to remind you that we have an enormous memory capacity associated with our language instinct, and this is precisely what we do for words. And, arguably, the distinction between grammar and lexical items is not so clear, and certain grammatical rules we do memorize as lexical items.

So we can then ask: how does this competition between so-called list-makers and rule-finders come out? Rule-finders are those described in the last part of the talk, who have a search space and look for underlying rules; list-makers are those who just take note of all the structures they receive and memorize them. You can calculate that rule-finders out-compete list-makers only if the number of relevant sentence types is above a certain threshold, which depends on the size of the search space that is induced by Universal Grammar. It also depends, of course, on the learning mechanism that you use, but to be fair I used the best possible learning mechanism for the rule-finders.

In summary, I have talked about arbitrary signs, the components of the lexical matrix of human language, and then about how natural selection can shape the two aspects of the duality of patterning in human language: word formation, where words are sequences of phonemes, and syntactic signals, where our sentences consist of components that have their own meaning. And in the final part of my talk I presented some ideas about how to begin thinking about the evolution and natural selection of grammars and Universal Grammar. Thank you very much.