How to smile your way through the Parisian transit strike: Citymapper

The Internet has given us Trump, revenge porn, and catfishing; in recompense, it has also given us free on-line versions of a number of historical French dictionaries, and a way to weather public transportation strikes with a smile.

Executive summary: there’s an app called Citymapper, available for iPhone and Android, that does an excellent job of staying on top of metro, train, and bus line operating hours.  Want to know about (1) linguistic trivia associated with strikes in French, and (2) public attitudes about the current action sociale?  Read on.

One of the things that I find very striking about Paris is that although the building located at any particular spot might change, the function carried out there can remain constant over centuries.  Millennia, even.  For example: the spot where Notre Dame de Paris stands has been a place of worship since the Druids were there.  The Palais de justice was the residence of the Roman administrator, and then the palace of the early French kings, before becoming the center of the French court system.  And, most relevant to today’s ravings: the Parisian City Hall stands on the spot from which the city has been governed for as long as Paris has been run by its bourgeois.

City Hall–in French, L’Hôtel de ville–is located on the Right Bank of Paris.  Although the Right Bank is very much the seat of Parisian power today, it started as mostly swampland.  (That fact figures into how the city was taken by the Romans–a story for another time.)  The expansion of Paris from the Left Bank to the Right in the early Middle Ages started with the area where the Hôtel de Ville is located today.  It was an early area of business, and the riverbank–la grève–in front of its current location was a gathering spot for laborers looking for work.  As the story goes (and I’m sorry that I can’t give you a citation for this, but I think that I ran across it in Metronome), over time the word for the place where laborers gathered became associated with strikes by laborers.

There’s some documentary evidence for this association.  Let’s work our way backward.  The Internet has given us Trump, revenge porn, and catfishing; in recompense, it has also given us free on-line versions of a number of historical French dictionaries.  Les voilà.  Starting with the 8th edition of the Dictionnaire de l’Académie française, published 1932-1935, we have the following.  The first sentence translates as “A level, flat surface covered with gravel or sand, running along the edge of a sea or a large river”:

[Screenshot: the entry for grève in the Dictionnaire de l’Académie française, 8th edition]
In the second paragraph (which I did not translate), they’re not shitting about the executions.  Notable ones that took place there include that of Jacques de Molay, the last grandmaster of the Knights Templar, who was burnt at the stake there on March 18th, 1314; and that of Robert-François Damiens, who was drawn and quartered there on March 28th, 1757.  (The event was extensively documented.  If you have a copy of Michel Foucault’s Discipline and Punish: The Birth of the Prison on your bookshelf, you’ll find an accurate description of the event in the first chapter.  The savagery was difficult to imagine–one of the professional executioners went into retirement after participating.)

Continuing back in time to the 18th century, we have this from Jean-François Féraud’s Dictionnaire critique de la langue française, published 1787-1788.  It contains the definition “level and sandy beach”:

[Screenshot: the entry for grève in Féraud’s Dictionnaire critique de la langue française]
Linguists will notice the prescriptiveness of the entry, which includes the observation that the verbal form of the word, which means “to harm,” is “not often used outside of the Palace, and in ordinary language is not good style,” as well as the facts that (1) Richelet found it a bit old (Phil d’Ange, who was Richelet?), (2) Trév. says that it was becoming a bit outdated (Phil d’Ange: Trév.??), and (3) the Academy includes it without comment.  Do note that he is talking about a verb, not about the “(river) bank” sense of grève.

Finally, going back to Jean Nicot’s Thresor de la langue française, published in 1606, we have the following, which includes words that I believe to mean “gravel, sand” (gravier and arena):

[Screenshot: the entry for grève in Nicot’s Thresor de la langue française]
Nicot’s entry includes another meaning of the noun, which I think is a part of a suit of armor that goes on the legs (compare the English greave).

If you haven’t been reading the news from France lately: public transport workers in and around Paris have been on strike for the past six weeks.  A public transport strike in these parts does not mean a complete cessation, but rather a diminution, of service.  A given metro line might be operating at half capacity, or maybe only one train in three is running; those services might be available only during the morning and evening rush hours (en heures de pointe), or just in the evening.  Trains are packed to bursting, electric scooter rentals are maxed out; Uber is running, but the automotive traffic is so heavy that a 30-minute ride can easily take an hour.  As I write this in mid-January of 2020, the exceptionally convenient low-cost mobility that is such a delight of normal life in the City of Light is only a fond memory.

Are Parisians frustrated by the disruptions caused by the strike?  Of course.  Are they complaining about it a lot?  Not really.  Here are typical comments from my friends about the motivation for the strikes–a proposed reorganization of the admittedly convoluted French retirement system:

  1. The reforms won’t hurt me, personally–but, I’m worried for my child.
  2. The transportation workers are striking for all of us.
  3. The strike has to screw up Paris, or it won’t have any effect.

The comments reflect some underlying widespread French attitudes about their famous work stoppages: (1) Everybody has to earn a living, and (2) Your strike may be screwing up my life today, but my strike will be screwing up yours tomorrow.  So: in general, people are pretty tolerant of this kind of thing.

…and with that, I’m off to check Citymapper to find the best way to get to the Musée de la paléontologie et de l’anatomie comparée–one of the three best museums in the world, in my humble but reasonably informed opinion.

The picture of an écartèlement (“drawing and quartering” in English) at the top of this page is of a bas relief from northeastern Spain.

Conflict of interest statement: I don’t have any.  Citymapper does not pay me, nor do they offer me free services.

What computational linguists actually do all day: The read-between-the-lines edition

Watch a movie like Arrival and you’ll get the impression that linguists spend their professional lives sitting around speculating about Sanskrit etymologies and the nature of the relationship between language and reality.  I’m not saying that we never do such things, but, no: that’s not how we spend our typical workdays.  I’m a computational linguist, which among other things means that what I do involves computers, which among other things means that I spend a certain amount of my time sitting around writing computer programs that do things with language.  Often, those programs are doing things that do not look very…exciting.  Not to the untrained eye, at any rate.

For other glimpses into the daily life of computational linguists, click here.

Case in point: yesterday I wanted to see how the statistical characteristics of language are affected by different decisions about what you consider a “word.”  You would think that the word “word” would be easy to define–in fact, not only do linguists not agree on what a word is, but you would have a hard time getting all linguists to agree that words even exist.  (One of the French-language linguistics books that I have my nose stuck in the most is Maurice Pergnier’s Le mot, “The Word.”  The first 50 pages (literally) are devoted to theoretical controversies around the question of whether words actually exist–or not. Want a good English-language discussion of the issues?  See Elisabetta Jezek’s The lexicon: An introduction.)

So, yesterday I got to thinking about one of the questionable cases in English: contracted negatives of modal verbs.  Here’s what that means.

In English, there is a small number of frequently-occurring verbs that can (and do) get negated not by the separate word not, but by adding a special ending, spelled -n’t:

  • is/isn’t
  • did/didn’t
  • have/haven’t
  • could/couldn’t
  • would/wouldn’t
  • does/doesn’t

Note that British English has another form:

I’ve not

…which means I haven’t.

Now, if you care about statistics, you care about counting things.  Think about how you would count the numbers of words in these examples:

  1. I want to go.
  2. I do want to go.
  3. I do not want to go.
  4. I don’t want to go.

(3) and (4) are both perfectly acceptable ways of negating (1) and (2).  How would they affect a program that counts the number of words?  It depends.  Here are the straightforward cases: if (1) has four words (I, want, to, and go), then (2) has five (add do to the previous four), and (3) has six (add not to the previous five).

The questionable case is (4).  You could make a reasonable argument that don’t is a single word.  You also could make a reasonable argument that don’t should be counted as two words.  But, which two words?  A reasonable person could propose do and n’t–just split the “stem” do from the negative n’t.  
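To make the “it depends” concrete, here’s a quick sketch in Python of the two counting policies (my illustration, not part of the original post–the function names are invented):

```python
import re

def count_whitespace(sentence):
    # Policy A: don't is one word -- just count whitespace-separated tokens
    return len(sentence.split())

def count_split_negatives(sentence):
    # Policy B: split the contracted negative n't off its stem, then count
    return len(re.sub(r"(\w)(n't)\b", r"\1 \2", sentence).split())

print(count_whitespace("I don't want to go."))       # 5
print(count_split_negatives("I don't want to go."))  # 6
```

Under Policy A, sentence (4) has five words; under Policy B, it has six–the same count as its uncontracted twin, sentence (3).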

Fine.  But, let’s look at a little more data:

  1. I will go.
  2. I will not go.
  3. I won’t go.
  4. I can go.
  5. I cannot go.
  6. I can’t go.

Clearly (1) has three words–I, will, and go–and (2) adds one more, with not.  What about (3), though?  Is it inconsistent to count will not as two words, but won’t as one?  Maybe.  If you’re going to split it into two “words,” what are they?  Presumably wo and n’t?  But, what the hell is wo?  Is it the same “word” as will?  Notice that we’ve now had to start putting “word” in “scare quotes,” which should tell you that knowing what, exactly, a “word” is isn’t quite as simple as it might appear at first glance.  Think about this: in science, you need to know what, exactly, the thing that you’re studying is–which implies that you can recognize the boundary between one of those things and another.

What’s the right answer?  Hell, I don’t know.  I do know this, though: if you’re interested in the statistics of language (wait–what’s you’re?  Hell, what’s what’s?), then you have to be able to count things, so you have to make some decisions about where the boundaries between them are.  My issue du moment is actually not choosing between the options, but rather seeing what the consequences of those specific decisions would be for the resulting statistical measures, so I need to be able to test the effects of different ways of splitting things up (or not), so I need to write some code…

What you see below is me using a computational tool called a “regular expression” to find words that have a negative thing attached at the end (e.g. n’t) and separate the negative thing from the rest of the word.  So, given an input like didn’t, I want my program to (1) recognize that it has a negative thing at the end, and (2) split it into two parts: did, and n’t.  Grok (see the English notes for what grok means) the code (code means “instructions in a programming language”–here I’m using one called Perl), and then scroll down past it for an explanation of how it illustrates a piece of advice that I often give to students…

# this assumes input from a pipe...
while (my $line = <>) {

    print "Input: $line";

    # this doesn't work--why?
    #$line =~ s/\b(wo|ca|did|could|should|might)(n't)\b$/\$1 $\2)/gi;
    # works...
    #$line =~ s/(a)(n)/a n/gi;
    # this does what I want...
    #$line =~ s/(a)(n)/$1 $2/gi;
    # works...
    #$line =~ s/(ca)(n't)/$1 $2/gi;
    # works...
    #$line =~ s/(ca|wo)(n't)/$1 $2/gi;
    # works...
    #$line =~ s/\b(ca|wo)(n't)\b/$1 $2/gi;
    # works...
    #$line =~ s/\b(ca|wo|did)(n't)\b/$1 $2/gi;
    # works...
    #$line =~ s/\b(ca|wo|did|could)(n't)\b/$1 $2/gi;
    # works...
    #$line =~ s/\b(ca|wo|did|could|should|might)(n't)\b/$1 $2/gi;
    # works...
    #$line =~ s/\b(ca|wo|did|had|could|should|might)(n't)\b/$1 $2/gi;
    # works...
    #$line =~ s/\b(ca|wo|did|had|have|could|should|might)(n't)\b/$1 $2/gi;
    # works...
    #$line =~ s/\b(ca|wo|do|did|has|had|have|could|should|might)(n't)\b/$1 $2/gi;
    # and finally: this pretty much looks like what I started with, but
    # what I started with most definitely does NOT work...  what the fuck??

    $line =~ s/\b(ca|wo|do|did|has|had|have|would|could|should|might)(n't)\b/$1 $2/gi;

    print $line;

} # close while-loop through input

The “regular expressions” in this code are the things that look like this:

s/\b(wo|ca|did|could|should|might)(n't)\b$/\$1 $\2)/gi

…or, in the case of a much shorter one, like this:

s/(a)(n)/a n/gi

(Note to other linguists: yes, I know that technically, the regular expression is just the part between the first two slashes–i.e., (a)(n) in the second example.  Don’t hate on me–I’m trying to make this at least somewhat clear.)  The lines that start with # are my notes to myself–the “reading between the lines” that you have to do to see how irritating it can be to troubleshoot this kind of thing.
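For readers who don’t speak Perl: the working substitution translates almost character-for-character into Python (my translation, not part of the original code).  As for why the first, commented-out version fails, my best guess is its replacement side–the backslash in \$1 makes Perl substitute a literal dollar sign rather than the captured stem, and there’s a stray closing parenthesis–though I won’t swear that’s the whole story.

```python
import re

# The stems whose contracted negatives we want to split off
NEG = re.compile(
    r"\b(ca|wo|do|did|has|had|have|would|could|should|might)(n't)\b",
    re.IGNORECASE,
)

def split_negatives(line):
    # \1 is the stem, \2 is the contracted negative n't
    return NEG.sub(r"\1 \2", line)

print(split_negatives("I won't go, and she didn't either."))
# I wo n't go, and she did n't either.
```

Same alternation of stems, same two capture groups, same word-boundary anchors; only the surrounding syntax changes.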

A regular expression is a way of describing a set of strings.  What makes it “regular”–a mathematical term–is that those strings can only exhibit a very limited number of relationships among their parts.  In particular, that limited set of relationships does not include some phenomena that are very important in language, such as agreement between subjects and verbs–think of Les trois soeurs de ma grand-mère m’ont toujours aimé, “my grandmother’s three sisters have always loved me.”  The issue here is that regular expressions can only describe sequences of things that you might think of as “next to” each other; here, the plural subject les trois soeurs, which requires the third-person-plural form ont of the verb avoir, is separated from that verb by ma grand-mère, which on its own would require the third-person-singular form a.  (Linguists: I know.)

Regular expressions, and the “regular languages” that they can describe, became important in linguistics when B.F. Skinner (yes, the famous psychologist) wrote a book about the psychology of language proposing an account that, mathematically speaking, amounted to nothing more powerful than a regular language.  This claim caught the attention of one Noam Chomsky, who wrote a book review pointing out the inadequacy of regular languages as a description of human language.  The review brought him a lot of notice, and he went on to develop the ideas in it into the most widespread and influential linguistic theory since the Tower of Babel.  Today, if you’ve only heard of one linguist, it’s almost certainly Chomsky.

Chomsky’s critique of “regular languages” included the observation that there are perfectly natural things that can be said in any human language that can’t be described by a regular language.  For example:

Me, my brother, and my sister went to William and Mary, Indiana University, and Virginia Tech, respectively.

The problem that this illustrates for regular languages is that they don’t have a mechanism for accounting for the fact that you can have sentences where you have a list of things in an early part of the sentence, and then must have a list of things of the same length in a later part of the sentence.  Don’t believe me? Go read a book on “formal languages,” and then try it.
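The standard textbook illustration of this point (my example, not the post’s) is the language aⁿbⁿ–some number of a’s followed by exactly the same number of b’s.  A regular expression can demand “a’s, then b’s,” but it has no way to demand that the two counts match:

```python
import re

# A regular expression can require "one or more a's, then one or more b's"...
a_then_b = re.compile(r"a+b+")

# ...so it accepts the strings of a^n b^n that we want:
print(bool(a_then_b.fullmatch("aaabbb")))  # True (3 a's, 3 b's)

# ...but it cannot enforce the matched-length requirement, so it also
# accepts strings that a^n b^n excludes:
print(bool(a_then_b.fullmatch("aaabb")))   # True (3 a's, only 2 b's)
```

No cleverness in the pattern fixes this: enforcing equal counts takes a more powerful (context-free) grammar, which is exactly what the respectively sentence illustrates for natural language.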

Linguistic geekery

Regular expressions are pretty natural tools for people who work with textual data, and they’re especially natural for linguists.  This is a surprise to a lot of computer scientists, some of whom are masters of regular expressions, but some of whom find them irritatingly bewildering.  It turns out that if you take a course on the “formal foundations” of linguistics, i.e. its groundings in logic and set theory, you will run across regular languages, which makes regular expressions pretty easy to learn.  And, for textual data, they are really useful even despite their limits–so much so that Perl, the programming language I used above, was designed from the start around making it easy to use regular expressions to “parse” textual data.  So, when I found myself wanting to rip through a bunch of textual data and find the negative things like n’t, Perl and its regular expressions were a logical choice.


I love a good monosyllable III: Fleam

Fleam: An instrument for opening veins in bloodletting.

One of the oddities of the lexicon–that is, the set of words that you know–is that you keep learning it for pretty much your entire life.  This is quite different from anything else that you know about your native language: you know almost everything that you will ever know about the phonology and syntax of your language quite early in your childhood.  In contrast, learning new words can continue to your dying day.

Of course, the rate of learning new words changes over the course of the lifespan.  Young toddlers may learn fewer than 5 words a month; between the ages of 2 and 6 years, children are probably learning more like 30 words a month.  When children enter school, the range of semantic classes that they are learning shifts in the direction of abstract words, away from the concrete ones that formed most of their vocabulary acquisition up to that point.  If you go to college (la fac in French), you will probably see another spurt in your learning of new words; by the time you finish, you will know most of the vocabulary that you’re ever going to have.  Certainly not all of it, though.  If you were to graph the number of words that you know over time, it would look something like this–fast growth early in life, followed by slow growth later in life, but no end to the growth.  (Note that the numbers for vocabulary size are not realistic.  Total vocabulary size by age 22 will be much larger than I have indicated, probably on the order of 30,000 words.)

Figure source: me. I generated it using a programming language called R. NOTA BENE: the vocabulary size figures are completely unrealistically low–the total vocabulary size should be something like 30,000 words, assuming a college-educated native speaker.

So: I’m 58 years old, and I spent a really long time in college and graduate school, but I am still learning new words in my native language.  Some recent ones:

  • morganatic: relating to marriage between an aristocrat and a non-aristocrat, such that the issue of the marriage does not inherit ranks, titles, and the like.
  • aramid: a group of synthetic materials used to make textiles and plastics.
  • mephitic: nasty-smelling.
  • irredentism: political policy of claiming territories occupied by members of your ethnic group (think Hitler in the Sudetenland), or that were historically part of your political group (think what Hungary would like to do with Transylvania).

What doesn’t happen very often, though: I don’t learn a new monosyllable in my native language very often.  Thus, when I run across one, it tickles me.  So, when a recent trip to New Orleans found me in a pharmacy museum, I was delighted to come across this exhibit:

Where the fuck does this come from?  Let’s go look. Merriam-Webster does not have an entry for fleam, although it does have one for fleam tooth:

A sawtooth shaped like an isosceles triangle

The Online Etymology Dictionary gives me this:

“sharp instrument for opening veins in bloodletting,” late Old English, from Old French flieme (Modern French flamme), from Medieval Latin fletoma, from Late Latin flebotomus, from Greek phlebotomos “a lancet”

So: I never would have guessed it, but it turns out to be historically related to our word phlebotomy, and in fact precedes it in English by centuries.  And thus the pathetic life of a fat, bald old man is made happy by learning a new one-syllable word…

Geeky linguist notes

  1. I’ve given 30,000 words as the size of a typical college-educated adult’s vocabulary.  Take that with a grain of salt–counting the size of someone’s vocabulary is really hard, for a lot of reasons.  You can find a good discussion of them in Elisabetta Jezek’s book The lexicon: An introduction.
  2. I calculated cumulative vocabulary size to age 22 (i.e. approximately the completion of an American college education) using the rate of growth that I gave in the post for the 2-to-6 age range, because that was the only age range for which I could find numbers.  This results in a drastic underestimate of total vocabulary size–by age 22, it gives just a bit over 7,600 words.  With slow growth after leaving college, there is no fucking way to hit 30,000 in a human lifetime.
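For the curious, here’s that back-of-the-envelope calculation sketched in Python (my sketch, with the rates assumed in the post: roughly 5 words/month before age 2, 30 words/month thereafter; the exact total depends on where you start the clock, so this lands a bit under the figure quoted above):

```python
# Cumulative vocabulary size under the post's assumed learning rates:
# ~5 words/month before age 2, ~30 words/month from age 2 onward.
def cumulative_vocab(age_years, toddler_rate=5, later_rate=30):
    toddler_months = min(age_years, 2) * 12
    later_months = max(age_years - 2, 0) * 12
    return toddler_months * toddler_rate + later_months * later_rate

print(cumulative_vocab(22))  # 7320 -- nowhere near 30,000
print(cumulative_vocab(80))  # 28200 -- still short, even at age 80
```

Even extended to age 80, the constant-rate assumption never reaches 30,000–which is the point: either adults learn faster than the 2-to-6 rate at some stage, or the 30,000 estimate needs a grain of salt, or (most likely) both.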