How to irritate a linguist, Part 4

The conditional probability of “dog” is higher if the preceding word is “my” than if the preceding word is “artichoke.”


Here’s the closest that we come to “complexity” in linguistics: take a big sample of some language and build a statistical model of the conditional probabilities of all two-word sequences. (A “conditional” probability is the probability of some word given that the preceding word was X; the conditional probability of dog is higher if the preceding word is my than if the preceding word is artichoke.) For that statistical model, you can calculate something called perplexity. It’s as close as linguists come to having any notion of the “complexity” of a language. Here’s a bit of the Wikipedia page on perplexity:
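To make the “conditional probability” idea concrete, here’s a toy sketch in Python. The eight-word corpus is invented purely for illustration; a real model would be estimated from a big sample, and would need smoothing for unseen pairs:

```python
from collections import Counter

# A tiny invented corpus; any large text sample works the same way.
corpus = "my dog saw my dog chase my cat".split()

# Count all two-word sequences, and how often each word appears
# as the first element of a pair.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def cond_prob(word, prev):
    """Estimate P(word | prev) from bigram counts."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(cond_prob("dog", "my"))         # high: "my dog" occurs often
print(cond_prob("dog", "artichoke"))  # zero: "artichoke dog" never occurs
```

In this toy corpus, “dog” follows “my” two times out of three, while “artichoke dog” never occurs at all, which is exactly the contrast in the example above.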

In natural language processing, perplexity is a way of evaluating language models. A language model is a probability distribution over entire sentences or texts.

Using the definition of perplexity for a probability model, one might find, for example, that the average sentence x_i in the test sample could be coded in 190 bits (i.e., the test sentences had an average log-probability of -190). This would give an enormous model perplexity of 2^190 per sentence. However, it is more common to normalize for sentence length and consider only the number of bits per word. Thus, if the test sample’s sentences comprised a total of 1,000 words, and could be coded using a total of 7.95 bits per word, one could report a model perplexity of 2^7.95 = 247 per word. In other words, the model is as confused on test data as if it had to choose uniformly and independently among 247 possibilities for each word.
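The arithmetic in that Wikipedia passage is easy to check for yourself: perplexity is just 2 raised to the average number of bits per word. A minimal sketch (the probability list here is made up for illustration):

```python
import math

def perplexity(word_probs):
    """Perplexity = 2 ** (average negative log2-probability per word)."""
    n = len(word_probs)
    bits = -sum(math.log2(p) for p in word_probs)
    return 2 ** (bits / n)

# The Wikipedia example: 7.95 bits per word gives 2**7.95, about 247.
print(round(2 ** 7.95))  # 247

# Sanity check: if the model assigns every word probability 1/2,
# the perplexity is exactly 2 -- a uniform choice between 2 options.
print(perplexity([0.5, 0.5, 0.5]))  # 2.0
```

That last check is what the closing sentence of the quote means: a perplexity of 247 says the model is, on average, as uncertain as a uniform choice among 247 words.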

