Doing a Linguistic Simulation (Using R)
In this squib, I will explain how and why I did what I did in my paper in the International Review of Pragmatics, focusing not so much on the results (which you can read in the paper), but rather on the decisions I had to make. In brief, it is a sort of blog entry. I basically aim to provide the kind of help I would have liked to have when I started working on the diachronic simulation of present perfects.
The article mainly presents the results, but for want of space it neither really allows you to redo the experiments, nor does it specify the underlying technicalities. This is what I try to lay out here. (I presuppose that you know the article at least a bit.) I cannot give a full introduction to GNU R here. I try to explain the code well enough for you to understand what is going on even if you do not know the language. But if you want to do simulations yourself, get a good book on R from your local library, or check the web for video tutorials. There are many of them around.
If you have GNU R installed, you can just copy and paste the code snippets into the R command line; otherwise, check out the instructions here.
Disclaimer: I am no programmer, and have only very limited knowledge of statistics. If you find errors, please feel free to enlighten me by mail.
How I came to do a simulation
I had been thinking for quite some time about a method to simulate the aoristic drift of the present perfect. There is of course a lot of literature on game-theoretic approaches to such domains in linguistics, and there is even a simulation package by Gerhard Jäger targeted specifically at linguists: EvolOT.
If you can frame what you want to model in Bidirectional OT, I urge you to check out this link. There is no good reason not to use it, even though the default gnuplot output can sometimes look not as good as one might hope (but nothing that one could not configure, I guess).
One of the difficulties was that my problem did not seem to make sense from a BiOT perspective. Therefore (or so I thought at the time), I had to go for another approach and do the whole thing on my own.
The basic intuition I tried to explore was that the frequency of use has an effect on the meaning of the present perfect (vs. the simple past); yet I did not know exactly how that could be expressed. At the beginning, I went for a rather straightforward markedness hypothesis (one form is marked, the other one unmarked), but that did not get me very far. It took me some time to notice that - if one is serious about frequency - I would somehow have to incorporate a frequency distribution into my model. For the formal semanticist that I am, this looked like a very frightening step. But eventually, I took the blue pill.
So, I assumed that the distribution had something to do with current relevance, and I was also familiar with Merin's work, which allows one to quantify relevance. And then, at some point, I actually drew a probability distribution. I did it on paper, so I cannot show you the result. Initially, I thought that I would play it safe and go for a normal distribution. However, after some research, it turned out that - since I thought of Current Relevance as a percentage (i.e., a value between 0 and 1) - I would need a beta distribution instead. A beta distribution has predefined upper and lower limits.
I decided to use R for doing the plots I needed, principally because I already knew Baayen's book Analyzing Linguistic Data a little bit (I know that there are other books covering the use of R for linguists, but I happen to know only this one), and because R
- contains most of the functions I needed (and so, it minimises effort);
- has excellent graphics capabilities;
- has a huge community, with excellent online resources; and finally
- did not look too complicated for a non-programmer like me.
R also has at least one big disadvantage: it is not particularly fast. But since my simulation is spectacularly simple, this is nothing I needed to worry about. If you want to run complicated multi-agent simulations, you might want to go for another, faster programming language.
For instance, plotting a beta-distribution is extremely simple:
curve(dbeta(x,5,5), # the density function of the beta distribution
xlim=c(0,1)) # sets the lower (viz. upper) limit to 0 (viz. 1)
R ignores everything on a line after a #; that is why we can put comments there. It is always a good idea to comment what you have done. And this is what the above gives as output:
Technically, a beta-distribution has two parameters, named α and β, respectively.
dbeta(x, α, β)
You can toy around a bit with these two values to get a feeling for the difference it makes when you vary them; in brief, if α = β, the distribution will have its highest point (its mode) at 0.5; otherwise, it will lean towards 0 (if α < β) or towards 1 (if α > β). You can also have a look at the Wikipedia page on the beta distribution, which features the following plot:
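Alternatively, you can get a feel for the parameters without leaving R by overlaying a few curves yourself (the parameter values below are just illustrative, and ylim is set by eye so that the more peaked curves are not cut off):
curve(dbeta(x,5,5), xlim=c(0,1), ylim=c(0,4), ylab="density") # symmetric: mode at 0.5
curve(dbeta(x,2,8), add=TRUE, col="red") # α < β: leans towards 0
curve(dbeta(x,8,2), add=TRUE, col="blue") # α > β: leans towards 1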
Making the simulation work
Now I had determined a suitable probability distribution for Current Relevance. So far so good. But to get diachronic dynamics, there must be some (possibly very small) difference between generations, which then has wide-ranging consequences as time goes by. How can we get such dynamics into the picture? My idea was that speakers and hearers do not have identical probability distributions, but that speakers overestimate theirs. What exactly this looks like is stated in the paper, so we will concentrate here on how we can implement the simulation.
For every distribution, R comes equipped with
- the probability density function, that is, how much probability mass is to be found at any specific value (in our case, dbeta, which we have already seen above);
- the cumulative distribution function (CDF), that is, how much probability mass is to be found below a specific value (in our case, pbeta);
- the quantile function, which is the inverse of the CDF (in our case, qbeta); a small check follows right below.
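For instance, you can see that pbeta and qbeta undo each other (this quick check is mine, not something from the paper):
pbeta(0.7, 5, 5) # how much probability mass lies below 0.7
qbeta(pbeta(0.7, 5, 5), 5, 5) # going there and back returns 0.7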
I needed functions for production and grammar inference, that is, functions that
- given a value of current relevance, indicate how much probability mass lies below and above it (this is the production part: the percentage of simple pasts vs. present perfects);
- given a percentage of simple pasts and present perfects, tell us where the threshold value n of current relevance will be situated (this is the grammar inference part).
Now, luckily, it happens that in very simple cases (the unimodal ones) the production side corresponds to the cumulative distribution function pbeta, and the inference side to the quantile function qbeta. So we could just take those. But for the sake of practice, let us see how one can create a custom function in R.
production <- function(x) {
pbeta(x,5.5,4.5) # values of α and β slightly shifted wrt hearer distribution
}
inference <- function(x) {
qbeta(x,5,5) # same parameters for α and β as we have seen above
}
This indicates that both production and inference are functions that take one argument. It is not strictly necessary to create these functions in this case, but we will need to do so later anyway.
Let us now look at how we can use this to do our simulation. The basic principle of Iterated Learning Models is that one learns based on the input provided by production, and one does this a certain number of times. Suppose that our initial value is 0.999; the speakers will produce given this value, so we will call production(0.999); then the hearers will make an inference based on the output of this function.
# creating intermediate assignments:
x <- production(0.999)
inference(x)
# doing the same in functional style:
inference(production(0.999))
Given our values, either version returns 0.9979835. This is lower than the initial value of 0.999, as we expected. Now, we need to check what happens in the longer run. Therefore, we will iterate this whole procedure 40 times (and check afterwards whether this is enough).
# initialize values for loop
n <- .999 # the initial value for n
N <- c(n) # this is where we will store the values we will plot; we add n
k <- 40 # the number of iterations
while(k>0){
n <- inference(production(n)); # make a learning cycle, assign the result to n
N <- c(N, n); # add the new n to the end of the list N
k <- k-1; # decrease the counter by one
}
plot(N)
We start with an initial value of n at 0.999, and then make a first inference-production round. The result is added to the list we plot, which already contains the initial value 0.999. We then decrease the counter, and the whole process starts over.
And we get the following result:
So, the value for n drops until it hits 0. We could now add all sorts of bells and whistles, like better labels for the axes, and a legend. I will not bother with this here. So let us move on to more complicated stuff, namely multimodal distributions.
How to do multimodal distributions
The mode is the value that appears most frequently in a data set. In our example above, the mode is at 0.5. Of course, it may happen that two or more values share the same, highest frequency. This is what one calls a multimodal distribution (as opposed to the unimodal distribution displayed above).
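If you prefer not to read the mode off a plot, you can also locate it numerically; here is a small check (of mine, not from the paper) on the beta(5,5) density from above:
# find the point where the beta(5,5) density is highest
optimize(function(x) dbeta(x, 5, 5), interval = c(0, 1), maximum = TRUE)
# $maximum is (up to numerical precision) 0.5, as expected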
Preliminaries
The key to making a multimodal distribution in my simulation is the idea that a multimodal distribution is made up of at least two unimodal distributions, each of which represents a genre. Since they are all distinguished with respect to current relevance, each genre can be represented as a beta distribution, ranging from 0 to 1.
There is only one tiny trick to it: the multimodal distribution needs to be a probability distribution, which means that its total probability mass has to sum to 1 (or to 100%). However, each individual genre - as a beta distribution - is itself a probability distribution (and its mass already adds up to 1). Therefore, you have to divide by the number of individual genres composing the multimodal distribution.
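You can convince yourself of this with integrate(): each beta density on its own integrates to 1, and so does a sum of two of them once you divide by 2 (the parameter values here are only for illustration):
integrate(function(x) dbeta(x, 5, 5), 0, 1) # 1
integrate(function(x) (dbeta(x, 5, 5) + dbeta(x, 2, 8))/2, 0, 1) # still 1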
Now, we can rewrite our speaker and hearer functions as follows:
# the speaker probability distribution
dmyspeaker <- function(x) {
(dbeta(x,18.5,3) + dbeta(x,4.5,22.5))/2 # Division by 2 is required
# to keep it a probability
# distribution
}
# the hearer probability distribution
dmyhearer <- function(x) {
(dbeta(x,20,3.5) + dbeta(x,3.5,20))/2 # Division by 2 is required
# to keep it a probability
# distribution
}
curve(dmyspeaker(x),xlim=c(0,1))
curve(dmyhearer(x),xlim=c(0,1), add=T, # we don't want to overwrite, but to add
col="red" # red colour to distinguish the two plots
)
We can see that the speaker's probability distribution is shifted to the right. In our example, we have two individual beta distributions as ingredients; therefore we divide by 2. If we had three or more, that part would have to be adjusted.
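If you ever need three genres or more, it may be convenient to let R do the division for you. Here is a minimal sketch of how one might generalise this (the function name dmixture and the third parameter pair are my invention for illustration, not something from the paper):
# a mixture of arbitrarily many beta densities; each pair c(α, β) is one genre
dmixture <- function(x, params) {
Reduce(`+`, lapply(params, function(p) dbeta(x, p[1], p[2]))) / length(params)
}
# for instance, with three hypothetical genres:
genres <- list(c(18.5, 3), c(4.5, 22.5), c(10, 10))
curve(dmixture(x, genres), xlim=c(0,1))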
So far, so good. But the density function is not what we need (other than to visualise whether we have chosen the parameters well, such that the speaker overrates his contribution…); we also need a production and an inference function. It turns out that the production function is simple, since we can simply add together the two CDFs and divide the result by 2:
# the production function, based on the speaker probability distribution
production <- function(x) {
(pbeta(x,18.5,3) + pbeta(x,4.5,22.5))/2
}
The quantile function is less obvious. Helpful people on the R mailing list pointed me to the distr package, which contains what I needed, namely the function "AbscontDistribution". The distr package is not included in the standard R installation, so you will need to install it separately. This is actually very easy: just type the following at your R command line:
install.packages("distr")
Then you will be prompted for a mirror; choose one close to your location, and everything will be taken care of automagically. Once this is done, you have to load the package with
library(distr)
Finally, we can define our inference function:
## use generating function "AbscontDistribution" from the "distr"
## package. You could put it directly into the inference function, but
## that makes it MUCH slower
D <- AbscontDistribution(d = dmyhearer, low = 0, up = 1, withStand = TRUE)
# the inference (learning) function, based on the hearer function
inference <- function(x){
q(D)(x)
}
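Before running the simulation, a small sanity check (mine, not from the paper) does not hurt: distr also gives us the CDF of D via p(), so we can verify that q(D) really is its inverse, up to the precision of the discretisation:
p(D)(0.7) # probability mass of the hearer distribution below 0.7
q(D)(p(D)(0.7)) # should give back (approximately) 0.7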
Now we are set to do the simulation.
The simulation
The actual simulation is pretty much the same as what we had in the unimodal case: we will have a series of inference-production cycles, which we will iterate until something interesting happens (or not).
Therefore, we can basically take the same code we had for the unimodal distribution (but this time, the definitions of production and inference are different).
# initialize values for loop
n <- .999 # the initial value for n
N <- c(n) # this is where we will store the values we will plot; we add n
k <- 200 # the number of iterations
while(k>0){
n <- inference(production(n)); # make a learning cycle, assign the result to n
N <- c(N, n); # add the new n to the end of the list N
k <- k-1; # decrease the counter by one
}
plot(N)
Now, we see that - if we start with 0.999 - the curve does not drop all the way down to 0, but seems to converge to something like 0.6. We also see that, by default, R only shows the area where there actually are values. We will see below how we can correct this behaviour.
Another question we may ask is what happens with other starting values, and what the global behaviour of the system is. We can explore this if we plot not just one starting value, but several of them at regular intervals - let's say one at 0.1, another at 0.2, another at 0.3, etc., up to 1. The easy - but tedious - way would be to repeat the code snippet above, starting the loop again with another starting value. This is some copy-and-paste work, but for ten values it might be doable. However, there is a less verbose way of achieving it, and I will show how it works by using 100 different starting points (which would be quite tiresome to write out).
The key to doing that is the sapply function. It takes a vector as its first argument and a function as its second, and then applies the function to each element of the vector, one by one:
sapply(vector, function)
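A toy call (which has nothing to do with our simulation) shows the principle:
sapply(c(1, 2, 3, 4), function(x) x^2) # returns 1 4 9 16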
Our function will be the evolutionary while-loop (which we will have to turn into a function), and the vector will be composed of the different starting points. Let's start with the function. We can rewrite the while-loop as follows:
# evol is the function that wraps the evolution of each separate
# starting value of n, and checks what happens through k generations
evol <- function(n) {
k <- generations; # the counter, or: the number of generations to be tracked
N <- c(n); # the list we will plot
while(k>0){
n <- inference(production(n)); # the inference-production cycle
N <- c(N, n); # update N with result
k <- k-1; # decrease counter
}
# points() draws onto the plot that is already open, so previous
# trajectories are not overwritten:
points(N)
return(N)
}
The evol function wraps the while-loop, plus the initialisation of the counter k (from the global variable generations) and of the vector N we want to plot. It draws the points onto the plot, and returns the vector of values N (in case we want to inspect what happened to the values).
Notice that in the loop, I have not used plot, but points (I could also have used curve with add=T). The difference is that plot creates a new plot, erasing anything you had before, whereas points (and curve with add=T) add to the existing one.
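A toy illustration of the difference (again unrelated to the simulation itself):
plot(1:10, (1:10)^2) # opens a fresh plot
points(1:10, (1:10)^1.5) # adds a second series to the same plot
plot(1:10, sqrt(1:10)) # starts a new plot; the old one is gone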
Let us now look at how we can construct the vector of starting points. Obviously, we could just have written it out:
vector <- c(0, 0.01, 0.02, 0.03) # and so on
But this really is annoying. Imagine you needed to check 1000 starting points! Fortunately, there is an easier way to do that, with the seq function:
seq(start-value, end-value, step)
seq(0, 1, 0.01)
So, the last expression gives us a vector that starts with 0, then repeatedly increases the value by 0.01 and adds the result, until we obtain a vector with 101 members (0, 0.01, 0.02, … 0.98, 0.99, 1). We can now put the pieces together and write:
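You can quickly convince yourself that the vector has the right shape:
length(seq(0, 1, 0.01)) # 101
head(seq(0, 1, 0.01)) # 0.00 0.01 0.02 0.03 0.04 0.05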
generations <- 100
# initialize values for loop
n <- .999 # the initial value for n
N <- c(n) # this is where we will store the values we will plot; we add n
k <- generations # the number of iterations
while(k>0){
n <- inference(production(n)); # make a learning cycle, assign the result to n
N <- c(N, n); # add the new n to the end of the list N
k <- k-1; # decrease the counter by one
}
plot(N,ylim=c(0,1)) # ylim sets the upper and lower end of the y-axis
sapply(seq(0,1,.01), evol)
And this is what it looks like:
Notice the ugly Moiré pattern, particularly in the upper half. In practice, it is better to plot fewer trajectories (I did 10 in the publication), to get a cleaner overall look, and to label the axes properly.
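For what it is worth, here is a minimal sketch of how one might produce such a cleaner version, with only ten trajectories and labelled axes (it assumes the definitions of evol, production and inference from above, and is not the exact code behind the figure in the paper):
generations <- 100
plot(NULL, xlim=c(1, generations+1), ylim=c(0,1), # empty canvas with fixed axes
xlab="generation", ylab="n")
invisible(sapply(seq(0, 1, .1), evol)) # only ten starting points, drawn by evol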
Let us now look at what the graphic itself shows us. In the previous plots, the values either dropped or remained stable. Here, some values actually rise. How come? Is there something wrong with our code? Normally, speaker overestimation should make the value of n drop, and yet in some places it rises, even though the speaker overestimates each individual curve, as we saw above. Now, this might be an instance of numerical error. But it might just as well be a genuine property of the system. Can we know which one it is?
Indeed we can. The secret is to calculate the intercept (as Grégoire Winterstein was kind enough to explain to me) between the speaker's and the hearer's CDFs. The idea is the following: if the speaker's probability distribution has more probability mass above n than the hearer's, the value of n will go down; if they are equal, n will be stable; if the hearer's distribution has more probability mass above n than the speaker's, n will rise.
This is once again something we can plot. The CDF (in R, for a beta distribution, pbeta) gives us the probability mass up to some point; so we can use this as a basis.
# the speaker's CDF
pmyspeaker <- function(x) {
(pbeta(x,18.5,3) + pbeta(x,4.5,22.5))/2
}
# the hearer's CDF
pmyhearer <- function(x) {
(pbeta(x,20,3.5) + pbeta(x,3.5,20))/2
}
# pmyspeaker gives what is below x; I want to know what is the
# probability mass above it, which is 1-pmy{speaker,hearer}
pintercept <- function(x) {
(1 - pmyspeaker(x)) - (1 - pmyhearer(x))
}
curve(pintercept, xlim=c(0,1)) # plot the function, with values for n from 0 to 1
We see that, most of the time, the speaker's CDF is higher than the hearer's. However, the plot is not very legible, mainly because we cannot see the zero line very well. Crucially, we would like to know what happens in the interval where the values rise. We can add a straight line with the following command:
# inserts a straight line at 0, which is horizontal, and dotted
abline(h=0, lty="dotted")
And we are not interested in the whole area from 0 to 1. The picture above makes it rather clear that the values should drop over wide ranges (as, in fact, they do); we want to zoom in on the specific region where they rise, which is roughly between 0.44 and 0.60.
plot(pintercept, .44, .605, # gives us the boundaries
main = "Detail of the plot above", # gives the plot a heading
ylab = "difference", # label the y-axis
xlab = "n") # label the x-axis
abline(h=0, lty="dotted") # add a dotted baseline for x=0
So it turns out that in the region where n rose, the speaker's probability mass above n is indeed smaller than the hearer's, and that is why n rises there. Here, it comes in handy that R auto-adjusts the axis range: as the y-axis makes clear, the difference between the two CDFs is actually very small.
We now know what we wanted to know. Let us see how we can present what we know to others in a decent way, and how we can save the result of our work. You are not obliged to work on the R command line; you can also create a file with your favourite text editor and tell R to execute your commands. For instance, copy and paste the code below into a file that you call "intercept.r". Then, in your R command line, call
source("/path/to/intercept.r")
(where you replace /path/to with the place where the file is located on your computer, e.g., source("C:\\stuff\\intercept.r") or source("/home/gummybear/intercept.r"))
# we will create an svg file called "intercept.svg"
svg("intercept.svg")
# we define the functions we will plot (the same as above; repeated here
# so that the file also works in a fresh R session)
pmyspeaker <- function(x) {
(pbeta(x,18.5,3) + pbeta(x,4.5,22.5))/2
}
pmyhearer <- function(x) {
(pbeta(x,20,3.5) + pbeta(x,3.5,20))/2
}
pintercept <- function(x) {
(1 - pmyspeaker(x)) - (1 - pmyhearer(x))
}
# put two plots one below the other
par(mfrow=c(2,1))
# plot the function, add a heading and axis-labels
plot(pintercept,
main = "Subtracting Hearer CDR from Speaker CDR",
ylab = "difference",
xlab = "n")
# Add some lines to make clearer how things work
abline(h=0, lty="dotted")
abline(v=0, lty="dotted")
abline(v=1, lty="dotted")
abline(v=0.455, lty="dotted")
abline(v=0.5986, lty="dotted")
# and add some arrows to make the system behaviour explicit:
arrows(x0=0.82, y0=0.04, x1=.78, y1=0.04, length = 0.05, angle = 25,
code = 2, col = par("fg"), lty = par("lty"),
lwd = par("lwd"))
arrows(x0=0.51, y0=0.04, x1=.55, length = 0.05, angle = 25,
code = 2, col = par("fg"), lty = par("lty"),
lwd = par("lwd"))
arrows(x0=0.32, y0=0.04, x1=.28, length = 0.05, angle = 25,
code = 2, col = par("fg"), lty = par("lty"),
lwd = par("lwd"))
# end of first plot
# start second plot
plot(pintercept, .44, .605,
main = 'Detail of the first plot',
ylab = 'difference',
xlab = 'n')
# note that it does not make any difference if you use "", or '' for
# the text
abline(h=0, lty="dotted")
abline(v=0.455, lty="dotted")
abline(v=0.5986, lty="dotted")
# close the device, in order to make sure the svg gets properly written:
dev.off()
After having run source on the file, you should obtain an svg file called intercept.svg, which should look like the following:
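If you prefer another output format, the other standard devices (pdf(), png(), …) work in exactly the same way; for instance (the file name is of course up to you, and pintercept and the pmy… functions are assumed to be defined as above):
pdf("intercept.pdf") # or: png("intercept.png")
plot(pintercept, 0, 1)
abline(h=0, lty="dotted")
dev.off()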
Applications to other areas
I assume that the same type of modelling might apply to other instances of speaker-hearer conflict in natural languages. One area where this seems to make sense might be sound change.1 Clear pronunciation would appear to be in the hearer's interest rather than the speaker's (for whom it constitutes a source of articulatory effort). Similarly, homonymy seems (at least to me) to be something that makes life easier for the speaker, at the cost of putting a strain on the hearer.
That's all, folks. I hope that what I have written contributes to understanding how exactly the underlying mechanism of the paper works. Maybe it can even encourage you to do a simulation yourself. After all, it is not difficult, once you know exactly what you want to do (but then, doing research is principally about finding out what exactly one wants to do).
Footnotes:
1. But then, I am no phoneticist, nor a phonetician.