What computational linguists actually do all day: The read-between-the-lines edition

Watch a movie like Arrival and you’ll get the impression that linguists spend their professional lives sitting around speculating about Sanskrit etymologies and the nature of the relationship between language and reality.  I’m not saying that we never do such things, but, no: that’s not what we do with our typical workdays.  I’m a computational linguist, which among other things means that what I do involves computers, which among other things means that I spend a certain amount of my time sitting around writing computer programs that do things with language.  Often, those programs are doing things that do not look very…exciting.  Not to the untrained eye, at any rate.

For other glimpses into the daily life of computational linguists, click here.

Case in point: yesterday I wanted to see how the statistical characteristics of language are affected by different decisions about what you consider a “word.”  You would think that the word “word” would be easy to define–in fact, not only do linguists not agree on what a word is, but you would have a hard time getting all linguists to agree that words even exist.  (One of the French-language linguistics books that I have my nose stuck in the most is Maurice Pergnier’s Le mot, “The Word.”  The first 50 pages (literally) are devoted to theoretical controversies around the question of whether words actually exist–or not. Want a good English-language discussion of the issues?  See Elisabetta Jezek’s The lexicon: An introduction.)

So, yesterday I got to thinking about one of the questionable cases in English: contracted negatives of modal verbs.  Here’s what that means.

In English, there is a small number of frequently-occurring verbs that can (and do) get negated not by a separate word like not, but by adding a special ending, spelled -n’t:

  • is/isn’t
  • did/didn’t
  • have/haven’t
  • could/couldn’t
  • would/wouldn’t
  • does/doesn’t

Note that British English has another form:

I’ve not

…which means I haven’t.

Now, if you care about statistics, you care about counting things.  Think about how you would count the numbers of words in these examples:

  1. I want to go.
  2. I do want to go.
  3. I do not want to go.
  4. I don’t want to go.

(3) and (4) are both perfectly acceptable ways of negating (1) and (2).  How would they affect a program that counts the number of words?  It depends.  Here are the straightforward cases: if (1) has four words (I, want, to, and go), then (2) has five (add do to the previous four), and (3) has six (add not to the previous five).

The questionable case is (4).  You could make a reasonable argument that don’t is a single word.  You also could make a reasonable argument that don’t should be counted as two words.  But, which two words?  A reasonable person could propose do and n’t–just split the “stem” do from the negative n’t.  
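If you want to see what that choice does to your counts, here is a minimal sketch in Python (the code later in this post is Perl, but the idea is identical).  The tokenizer is deliberately naive, and the splitting rule is just the do + n’t proposal above:

```python
def count_words(sentence, split_contractions=False):
    """Count words, optionally splitting -n't off as its own token.

    Deliberately naive: whitespace tokenization, ASCII apostrophes only
    (real text often has curly ones), punctuation stripped from token edges.
    """
    tokens = [t.strip(".,!?") for t in sentence.split()]
    if split_contractions:
        new_tokens = []
        for t in tokens:
            if t.lower().endswith("n't"):
                new_tokens.extend([t[:-3], t[-3:]])  # don't -> do + n't
            else:
                new_tokens.append(t)
        tokens = new_tokens
    return len(tokens)

print(count_words("I do not want to go."))                          # 6
print(count_words("I don't want to go."))                           # 5
print(count_words("I don't want to go.", split_contractions=True))  # 6
```

Same sentence, same meaning, and the word count changes depending on a decision that has nothing to do with the data itself.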

Fine.  But, let’s look at a little more data:

  1. I will go.
  2. I will not go.
  3. I won’t go.
  4. I can go.
  5. I cannot go.
  6. I can’t go.

Clearly (1) has three words–I, will, and go.  (2) adds one more, with not.  What about (3), though?  Is it inconsistent to count will not as two words, but won’t as one?  Maybe.  If you’re going to split it into two “words,” what are they?  Presumably wo and n’t?  But, what the hell is wo?  Is it the same “word” as will?  Notice that we’ve now had to start putting “word” in “scare quotes,” which should tell you that knowing what, exactly, a “word” is isn’t quite as simple as it might appear at first glance.  Think about this: in science, you need to know exactly what the thing you’re studying is, which implies that you can recognize the boundary between one of those things and another.

What’s the right answer?  Hell, I don’t know.  I do know this, though: if you’re interested in the statistics of language (wait–what’s you’re?  Hell, what’s what’s?), then you have to be able to count things, so you have to make some decisions about where the boundaries between them are.  My issue du moment is actually not choosing between the options, but rather seeing what the consequences of those specific decisions would be for the resulting statistical measures, so I need to be able to test the effects of different ways of splitting things up (or not), so I need to write some code…

What you see below is me using a computational tool called a “regular expression” to find words that have a negative thing attached at the end (e.g. n’t) and separate the negative thing from the rest of the word.  So, given an input like didn’t, I want my program to (1) recognize that it has a negative thing at the end, and (2) split it into two parts: did, and n’t.  Grok (see the English notes for what grok means) the code (code means “instructions in a programming language”–here I’m using one called Perl), and then scroll down past it for an explanation of how it illustrates a piece of advice that I often give to students…

# this assumes input from a pipe...
while (my $line = <>) {

print "Input: $line";

# this doesn't work--why?
#$line =~ s/\b(wo|ca|did|could|should|might)(n't)\b$/\$1 $\2)/gi;
# works...
#$line =~ s/(a)(n)/a n/gi;
# this does what I want...
#$line =~ s/(a)(n)/$1 $2/gi;
# works...
#$line =~ s/(ca)(n't)/$1 $2/gi;
# works...
#$line =~ s/(ca|wo)(n't)/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did|could)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did|could|should|might)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did|had|could|should|might)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did|had|have|could|should|might)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|do|did|has|had|have|could|should|might)(n't)\b/$1 $2/gi;
# and finally: this pretty much looks like what I started with, but
# what I started with most definitely does NOT work...  what the fuck??

$line =~ s/\b(ca|wo|do|did|has|had|have|would|could|should|might)(n't)\b/$1 $2/gi;

       print $line;

} # close while-loop through input

The “regular expressions” in this code are the things that look like this:

s/\b(wo|ca|did|could|should|might)(n't)\b$/\$1 $\2)/gi

…or, in the case of a much shorter one, like this:

s/(a)(n)/a n/gi

(Note to other linguists: yes, I know that technically, the regular expression is just the part between the first two slashes–i.e. (a)(n) in the second example.  Don’t hate on me–I’m trying to make this at least somewhat clear.) The lines that start with # are my notes to myself–the “reading between the lines” that you have to do to see how irritating it can be to troubleshoot this kind of thing.
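For readers who don’t speak Perl, here is a sketch of the final, working substitution in Python’s re module.  The differences are pure surface syntax: Perl’s $1 $2 become \1 \2, the /i flag becomes re.IGNORECASE, and re.sub replaces all matches by default, like Perl’s /g:

```python
import re

# The working substitution from the end of the Perl code above, in Python.
pattern = re.compile(
    r"\b(ca|wo|do|did|has|had|have|would|could|should|might)(n't)\b",
    re.IGNORECASE,
)

def split_negatives(line):
    # \1 and \2 put back the stem and the negative ending, separated by a space.
    return pattern.sub(r"\1 \2", line)

print(split_negatives("I didn't say she wouldn't go."))
# I did n't say she would n't go.
```

(One caveat: like the Perl, this assumes ASCII apostrophes; text copied from the web often has curly ones.)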

A regular expression is a way of describing a set of things.  What makes it “regular”–a mathematical term–is that those things can only occur in a very limited number of relationships.  In particular, that limited set of relationships does not include some phenomena that are very important in language, such as agreement between subjects and verbs–think of Les trois soeurs de ma grand-mère m’ont toujours aimé, “my grandmother’s three sisters have always loved me.”  The issue here is that regular expressions can only describe sequences of things that you might think of as “next to” each other; les trois soeurs is separated from the verb avoir, which must be in the third-person plural form ont, by ma grand-mère, which, if it were the subject, would require the third-person singular form a.  (Linguists: I know.)

Regular expressions, and the “regular languages” that they can describe, became important in linguistics in the 1950s, when B.F. Skinner (yes, the famous psychologist) wrote a book, Verbal Behavior, proposing that human language could be explained in behaviorist terms.  That claim caught the attention of one Noam Chomsky, who wrote a review pointing out its inadequacies–and who, around the same time, showed mathematically that finite-state devices, the machinery behind regular expressions, cannot describe everything that happens in a human language.  The review brought him a lot of notice, and he went on to develop those ideas into the most widespread and influential linguistic theory since the Tower of Babel.  Today, if you’ve only heard of one linguist, it is almost certainly Chomsky.

Chomsky’s critique of “regular languages” included the observation that there are perfectly natural things that can be said in any human language that can’t be described by a regular language.  For example:

Me, my brother, and my sister went to William and Mary, Indiana University, and Virginia Tech, respectively.

The problem that this illustrates for regular languages is that they have no mechanism for requiring that a list of things early in the sentence be matched by a list of the same length later in the sentence.  Don’t believe me? Go read a book on “formal languages,” and then try it.
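You can see the problem the moment you try to check the “respectively” constraint in code: you end up counting, and counting is exactly the memory that a finite-state pattern does not have.  A toy checker, with purely illustrative comma-based “parsing” (it only handles Oxford-comma lists–which, conveniently, is what keeps William and Mary in one piece):

```python
def respectively_ok(sentence):
    """Check that the subject list and the destination list match in length.

    Toy comma-based parsing, Oxford-comma lists only: splitting on ", and "
    rather than bare " and " is what keeps "William and Mary" in one piece.
    """
    before, _, after = sentence.partition(" went to ")
    after = after.rstrip(".").removesuffix(", respectively")
    subjects = before.replace(", and ", ", ").split(", ")
    schools = after.replace(", and ", ", ").split(", ")
    return len(subjects) == len(schools)

print(respectively_ok(
    "Me, my brother, and my sister went to William and Mary, "
    "Indiana University, and Virginia Tech, respectively."))  # True
```

The length comparison at the end is the part no regular expression can do for you.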

Linguistic geekery

Regular expressions are pretty natural tools for people who work with textual data, and they’re especially natural for linguists.  That surprises a lot of computer scientists, some of whom are masters of regular expressions, but some of whom find them irritatingly bewildering.  If you take a course on the “formal foundations” of linguistics–its groundings in logic and set theory–you will run across regular languages, and that makes regular expressions pretty easy to learn.  And for textual data, they are really useful despite their limits–so much so that Perl, the programming language I’m using here, was designed in large part around making it easy to use regular expressions to “parse” textual data.  So, when I found myself wanting to rip through a bunch of textual data and find the negative things like n’t, Perl and its regular expressions were a logical choice.


Why the fuck would you…

So, I’m wandering around backstage in a theater trying to keep my cousins from making me help build props when I come across the following sign on a storage locker:

…and I wonder: why the fuck would you tell someone to spray a paint can?

I slowly digest the bright-red color of the cabinet. I slowly digest the “FLAMMABLE” signs. I slowly digest the fact that I am apparently becoming senile.

A paint can:

Picture source: shop.thepurplepaintedlady.com

A can of spray paint:

The verb “to spray:”

An apparently senile computational linguist [PHOTO OMITTED TO PROTECT PUBLIC SENSIBILITIES]

Nelson Algren’s “The man with the golden arm” and a problem in semantic theory

She’s the kind got the sort of heart you can walk in ‘n out of with boots on.

He brushed his shot glass off the table and stood up.

When I took the GREs–the Graduate Record Examinations, the test that you take in the US when you want to go to graduate school–I scored in the top 1 percent on vocabulary.  I say that not to brag, but to give you some quantitative measure for when I say that in English, I know a lot of words.  That doesn’t mean that I never have to look anything up, though.

Molly could not see him weaving against the table out there in the dark while he was trying to understand to himself whether it was time for him to leave, before she saw him, or time to go to her before he lost her again.

Eeyore is frustrated. The subtitle, “Le mot n’est pas la chose,” says “The word is not the thing.”

From a linguist’s point of view, the challenge of definition is not to say what a thing is.  (Please, no hate mail–yes, I know that we define words, not things.)  Rather, the challenge of definition is to say what it is not.  I don’t mean this in a Saussurean sense, necessarily, but just from a practical point of view: tell me what a chair is.  OK, I get that you are not talking about a bed.  But, is what you are describing distinguishable from a couch?  How about from a bench?  A loveseat?  A stool?  A recliner?  A doll-sized chair?  A toilet? The table below gives you an example of the kinds of definitional gymnastics that you find yourself going through in such exercises.  I have adapted this from Sandrine Zufferey and Jacques Moeschler’s Initiation à l’étude du sens : sémantique et pragmatique, the best introductory text on semantics that I’ve seen thus far.  Unfortunately my copy is sitting at the base of the Rocky Mountains right now, so I made up the details.  Oh, yeah–and unlike their table, mine’s in English.

|                                 | chair | stool | armchair | couch | loveseat | bench |
|---------------------------------|-------|-------|----------|-------|----------|-------|
| must have back                  |   x   |       |    x     |   x   |    x     |       |
| armrests                        |       |       |    x     |   x   |    x     |       |
| room for two people but no more |       |       |          |       |    x     |       |
| room for more than two people   |       |       |          |   x   |          |   x   |
| can have as few as three legs   |       |   x   |          |       |          |       |
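The same differential idea can be coded up directly: represent each word as a set of features, and its definition is whatever separates it from its neighbors.  The feature assignments below are as made-up as the ones in my table–the point is the method, not the furniture facts:

```python
# Hypothetical feature assignments, mirroring the (made-up) table above.
FEATURES = {
    "chair":    {"back"},
    "stool":    {"three_legs_ok"},
    "armchair": {"back", "armrests"},
    "couch":    {"back", "armrests", "seats_three_plus"},
    "loveseat": {"back", "armrests", "seats_two_max"},
    "bench":    {"seats_three_plus"},
}

def distinguishers(a, b):
    """The features that tell word a and word b apart (symmetric difference)."""
    return FEATURES[a] ^ FEATURES[b]

print(sorted(distinguishers("couch", "loveseat")))  # ['seats_three_plus', 'seats_two_max']
print(sorted(distinguishers("chair", "armchair")))  # ['armrests']
```

If the symmetric difference between two words ever comes out empty, your definitions have failed to tell them apart–which is exactly the trap the table is trying to avoid.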

He felt a sickening sort of shame, this was just the way he wished not to be in finding her again: broke, sick and hunted.  What was it someone had said of her long ago?  “She’s the kind got the sort of heart you can walk in ‘n out of with boots on.”

So, today I wake up at 4 AM, as I often do.  Normally I start my day with the American news, but the country that I love so much is falling apart so quickly these days that I felt like I needed a few minutes to prepare myself before facing the latest revelations regarding Trump helping Putin with his little Ukrainian problem.  I pulled out the novel that I’ve almost finished–Nelson Algren’s The man with the golden arm.  I laid it down last night at a point where our hero, on the lam from the coppers, has gone looking for his lost love in a bar in an even seedier part of town than his own.  There’s a sort of burlesque show in the bar, and he spots his flower in the chorus line.  He is in big trouble, he’s starting to jones for his next fix (that’s junkie slang: he is going into withdrawal and needs a hit of morphine: broke, sick and hunted), and he is truly at the end of his rope.  A lifesaver: he’s found his girl.  But: as she leaves the stage, he knows full well that he does not want her to see him like this.

Then the act was done and she was gone, they were all gone as if they hadn’t been there at all.  As though the whole act had been a kickback from an overcharge, something he’d formed in his brain out of beer fumes and smoke.

Herbert Terrace’s book on the topic. Spoiler alert: “not as far as I can tell.”

Being a linguist and knowing the primacy of not specification, but rather differentiation, in matters of definition, it bugs the shit out of me that I know lots of words such that I know what category of thing they are, but I could not begin to tell them apart from other things of the same class–by very venerable linguistic theory, this should not happen.  For example: I know that amaryllis, dahlia, and freesia are all flowers, but I could not point any of those three out to you on a bet.  I know that opal, tourmaline, and amethyst are gemstones, but again–hand me three gemstones and ask me if one of them is a tourmaline or not, and I’m just gonna scratch my beard and excuse myself to go to the bathroom.  (Minus the beard-scratching, that last tactic for dealing with social discomfort turns out to be a pretty plausible example of how people end up claiming that they have taught chimpanzees American Sign Language.  A story for another time, perhaps.)

Yet went weaving heavily through smoke and fumes toward the tiny dressing room offstage.

Wearing army brogans on his feet.

OK, so… I already know that brogans are a kind of footwear–it’s not like I’ve never run into the word before.  But, I couldn’t tell you what kind.  The character is a recently-discharged World War II veteran, and his brogans have been mentioned many times in this novel.  From other references over the course of the novel to his heavy-footed walking, I infer that they are…well, heavy.  But, Algren didn’t say a few sentences earlier that his love was “the kind got the sort of heart you can walk in ‘n out of with boots on,” and then specify what kind of footwear he’s wearing as he walks into her dressing room after not having seen her for months, by accident.  (Algren was a treasure of the post-war American novel–he doesn’t do shit like that by accident.  A French connection: he was Simone de Beauvoir’s other lover.  Of course she left him for Sartre, who had translated Algren’s novel Never come morning into la langue de Molière.)

So, off I go to the dictionary.  And to Wikipedia.  And to Google Images, too, ’cause it is sometimes a damn fine resource for jury-rigged visual definitions.  (A little topical reference there: jury-rigged, which means something like “improvised with whatever happens to be at hand,” is said to be derived from the wartime slang term to jerry-rig.)  What I find: a brogan is a low-topped boot.  The picture at the top of the page shows a pair of WWII-era US Army brogans.  The gaiters worn above them were made redundant when combat boots became standard issue–they’re higher, so you don’t need the gaiters to “blouse” your trouser legs.  A contemporary reader would have known what Algren meant; reading the book today–it was written before I was born, a very long time ago–I knew that brogans were footwear, but hadn’t a clue what kind.  So: top 1 percent on the vocabulary portion of the GRE (don’t be too impressed–I was around the 50th percentile on math, maybe even lower), but I had to look a word up.

That’s being a linguist for you… The beauty of it is that you’re constantly immersed in your data, and the horror of it is that you’re constantly immersed in your data.  As far as definitions go: as my colleague Orin Hargraves, a fine lexicographer, pointed out to me while we were working on our paper Three dimensions of reproducibility in natural language processing–in which we and a cast of thousands of other colleagues proposed a set of definitions for talking about the results of experiments–trying to propose definitions might be somewhat pointless anyway, since in the end word meanings are determined by how they are used within the structure of the language, not by any prescriptive authority.  Did my linguisticness interfere with my enjoyment of Algren’s finely-wrought prose?  Did it actually make me more aware of its beautiful craftsmanship?  I don’t know.  What I do know: now I’m going to go see what happens when he gets to her dressing room.


Want to know more about the myriad complications of thinking about definitions?  See Elisabetta Ježek’s excellent book The lexicon: An introduction.  Source of the picture of a pair of brogans at the top of the page: Eastman Leather Clothing Blog, blog.eastmanleather.com/view-post/the-us-combat-boot.

English notes

He was trying to understand to himself whether it was time for him to leave, before she saw him, or time to go to her before he lost her again. 

…is weird.  I have never heard the construction understand to [someone].  A quick search on Sketch Engine, purveyor of fine linguistic corpora and the tools for searching them, reveals nothing similar (yes, I did a Word Sketch, too):

[Screenshot: Sketch Engine search results, 2019-10-04]

What you do on Saturday night if you have no life whatsoever

That’s a whole lotta accents…

If you have no life whatsoever, what you do on Saturday night is (a) study French verb conjugations, and (b) binge-watch the excellent Netflix series Criminal: France–and not necessarily in that order, either.

I’ve recently been working on the passé simple, a French tense that’s used in some genres of writing, but only very rarely in the spoken language.  I love les chapeaux chinois (circumflex accents), and one of the nice things about the passé simple is that it uses them.  Specifically, they appear in the nous and vous forms: nous aimâmes/finîmes/prîmes, vous aimâtes/finîtes/prîtes.

Find a verb with a circumflex accent in the stem, and it gets really fun.  So, it’s Saturday night, and I’m sitting on the back porch smoking a cigarette and doing some exercises on the French Verb Forms iPhone app (no, I am not sponsored by Netflix, French Verb Forms, or Apple–I pay for that stuff just like everyone else), when I am presented with the verb apprêter “to prepare” to conjugate: Circumflex City!

How to write a personal statement for a grad school application

There is a bit of an art to writing a personal statement for a graduate school application. Here’s how to do it.

Applying to a graduate program means filling out a lot of paperwork–and writing a thing or two yourself. One of those things is called a personal statement, and there is a bit of an art to writing one.  Here’s some advice for doing it.

The first thing to know about a personal statement is this: it’s not actually personal.  Your goal in a “personal statement” is not to tell the admissions committee who you are “as a person,” but rather to take advantage of this opportunity to speak to them to show that you would be a good fit for their program.

What that means: you want the admissions committee member who is reading your statement to finish saying this to themself: oh–they could work with our faculty member Dr. Zipf [insert some actual faculty member of the institution in question, unless you’re applying to my institution].   (The pronoun themself is explained in the English notes below.)

How you lead them to that happy conclusion: don’t tell them, but show them.  Here are some things that you can do:

  1. State that you are interested in one or two specific areas of research of that department.
  2. State that you became interested in that topic (or those topics) while doing a research project on it…
  3. or, if you have not done research on that topic, then that you got interested in it/them while doing research on some other topic and coming across a paper on the topic by some member of the faculty of the department to which you are applying.
  4. List some areas of specialization within that topic or some related topics that you would be interested in working on, where those specializations or related topics are actually areas of research that members of the department to which you are applying work within.

Why I say one or two: you very much want to avoid a situation where (a) only one person in the department works on a topic, and (b) you don’t know it, but that person is getting ready to retire/move to another institution/begin a three-year period as the Associate Dean for Reproducibility, or something.  You avoid that situation by either (a) talking about a topic that two or more people in the department actually work on, or (b) talking about more than one topic.

Now, you may be asking yourself: what if I can’t find anyone in the department who works on my area of interest? The answer:

If you cannot find anyone in the department who works in your area of interest, then that department is not a good fit for you.

…and that’s exactly what the department wants to know.  In fact, if you apply to a graduate school and they don’t accept you, it is entirely reasonable to assume until proven otherwise that they’re not rejecting you, but just don’t see their department as the right place for you.

Need to know how to ask for a letter of recommendation for graduate school?

Click here.

This post is written on the basis of my time on the admissions committee of a medium-sized graduate program in computational biology.  If you have other perspectives/opinions on the subject, please add them to the comments below!

English notes

When you get deep into the weeds of the English language, one of the things that you run into is dialectal variation in pronoun use.  For example:

Dative pronouns in conjoined subject noun phrases: In the Pacific Northwest region of the United States, if you have a subject with two or more people joined by a conjunction (e.g. and or or), then the pronouns are in the dative form, not the subject form.  For example, look at these contrasts:

  • I’m going to the store.  (subject)
  • He’s going to the store.  (subject)
  • Me and him are going to the store. (dative)
  • Him and me are going to the store. (dative)
  • Anaïs is going to the store. (subject)
  • They are going to the store. (subject)
  • Anaïs and them are going to the store. (dative)

Even in the Pacific Northwest, you don’t have to talk this way–it’s pretty regionally specific, and people will understand you just fine if you say he and I are going to the store.  But, if you are in that part of the country, you have to be able to understand it.

Atypical reflexive pronouns: Other oddnesses have to do with the reflexive forms of pronouns.  For example, in my dialect, the third-person plural forms they/them/their are used if you don’t know the gender of the referent.  Straightforward enough–that usage goes back centuries in English. But: in a reflexive context (i.e. when the subject is doing something to itself or for itself), you get a variety of forms, depending on number:

  1. You want the admissions committee member who is reading your statement to finish saying this to themself: oh–they could work with our faculty member Dr. Zipf [insert some actual faculty member of the institution in question, unless you’re applying to my institution].  That is obscure enough that it does not even show up in Merriam-Webster’s online dictionary.
  2. My aunt and uncle bought themselves a new copy of the compact edition of the Oxford English dictionary. This plural form is totally standard American English.
  3. My aunt and uncle each bought themselfs a new pair of sunglasses. …and that one, again, does not show up in Merriam-Webster.

This raises a question: how would someone who doesn’t speak a dialect like this say (1) and (3)? I’m pretty sure that in (3), they would say themselves.  But, (1)?  I don’t know another way of saying it–native speakers?

The picture at the top of this post is of Oxley Hall on the Ohio State University campus. I had the pleasure of getting a master’s degree in linguistics there in the 1990s. Mostly we hung out in the basement analyzing spectrograms, but we would occasionally sneak up into the tower.  Fun.


Why doing the laundry makes me happy

Doing the laundry will make you happy if you spend sufficient time contemplating the zombie apocalypse.

What will suck about the zombie apocalypse is….well, everything, really. For example: when the zombie apocalypse comes, most people will be completely filthy most of the time. For a while, you’ll at least be able to scavenge clean clothes–you won’t have many opportunities to bathe, but let’s face it: Old Navy will not be the first store to be looted. Eventually the clean clothes will all be gone. Eventually the day will come when you’ll strip a coat off of a reeking zombie whose head you’ve just smashed like a watermelon and be happy that you have something to keep yourself warm.

Today I woke up at 5:30–late for me–and headed down to the basement laundry room. Then I went to work–in clean underwear, clean jeans, and a clean t-shirt from the 2007 Association for Computational Linguistics meeting in Prague. (I learned to say gde je stan’ce metra–where is the subway station–which was undeniably useful. I also learned to ask questions about the National Theater, which amused the taxi drivers but did not accomplish much else.)

When you compare it with how bad life is going to suck during the zombie apocalypse, doing the laundry was actually pretty fun. Going to work in clean clothes was a pleasure, as it is every day, and it always will be if you spend sufficient time contemplating the zombie apocalypse.  There’s a reason I’m the happiest person you know. Hell, I’m the happiest person you don’t know.  Think about it.

English notes

In American English, “like a watermelon” is a common simile for describing actions of crushing, smashing, and the like.
How I used it in the post: The day will come when you’ll strip a coat off of a reeking zombie whose head you’ve just smashed like a watermelon and be happy that you have something to keep yourself warm.

Language geekery: similes versus analogy

Simile and analogy are similar (is that a pun? if so, it’s not a very sophisticated one), but they’re not quite the same.  Analogy starts with focusing on similarity between unlike items, and then typically is followed by pointing out the differences between them.  In contrast, simile does not require any actual similarity between the unlike items, and does not include pointing out the differences.

Thus, the heuristic Detached roles is like a Hearst & Schütze super-category, but not constructed on a statistical metric, rather on underlying semantic components. (Source: Litkowski, Kenneth C. “Desiderata for tagging with WordNet synsets or MCCA categories.” Tagging Text with Lexical Semantics: Why, What, and How? (1997).)

A recursive transition network (RTN) is like a finite-state automaton, but its input symbols may be RTNs or terminal symbols. (Source: Goldberg, Jeff, and László Kálmán. “The first BUG report.” In COLING 1992 Volume 3: The 15th International Conference on Computational Linguistics, vol. 3. 1992.)

Therefore, a conversation is like a construction made of LEGO TM blocks, where you can put a block of a certain type at a few places only.  (Source: Rousseau, Daniel, Guy Lapalme, and Bernard Moulin. “A Model of Speech Act Planner Adapted to Multiagent Universes.” Intentionality and Structure in Discourse Relations (1993).) Note that a native speaker probably would have put this somewhat differently.  Where the authors say where, a native speaker might have said where you can only put a block of a specific type at a few places, or more likely, except that you can put a block of a specific type only specific places.

Given all of that: is this an analogy, or a simile? The day will come when you’ll strip a coat off of a reeking zombie whose head you’ve just smashed like a watermelon and be happy that you have something to keep yourself warm.  Scroll down past the gratuitous Lisa Leblanc video for the answer.

I sometimes use this blog to try out materials for something that I will be publishing.  This brief description of how to use analogy is intended for a book about writing about data science.  I would love to know what parts of it are not clear.  (My grandmother will tell me how great it is, so no need for you to bother with that.)

Answer: it’s a simile.  Note that we’re not asserting any difference between the way that you’re going to smash the zombie’s head and the way that you would smash a watermelon: a reeking zombie whose head you’ve just smashed like a watermelon.  Note also that we are not then contrasting the way that you’re going to smash the zombie’s head and the way that you would smash a watermelon.  Simile, not analogy.




How to irritate a linguist, Part 5: English irregular past-tense verb practice

You just stand there, silent–trying not to glare balefully, either at your new interlocutor or–more likely–at the back of your departing host.  This is a mistake.

You’re at a party, sipping a dark beer and minding your own business, when the host introduces you to another cheerful attendee.  “Biggie, meet Zipf.  He’s a linguist.”  …and disappears.

You just stand there, silent–trying not to glare balefully, either at your new interlocutor or–more likely–at the back of your departing host.  This is a mistake.

Conditional probability is the likelihood of some event given some other event.  For example: the probability of the word barf being said is, in the absence of any other information, equal to the frequency of the word barf being said divided by the frequency of any word whatsoever being said.  For example: I went to the Sketch Engine web site, your home for fine linguistic corpora and the tools to search them with, and searched a collection of 15.7 billion words of English scraped from the Web in 2015 and found that the word barf occurred 0.12 times per million words.  In other words: in the absence of any other information about what’s being said, you can expect that you will run into the word barf once every 8 million words or so.

If dogs are being talked about, the situation changes.  If you look only in the vicinity of the word dog, then the frequency of barf is 2.41 times per million words.  In other words, when dogs are under discussion, you will run into the word barf every 415,000 words or so.  So: the probability of the word barf is 0.12 occurrences per million words, and the conditional probability of the word barf given that you have seen the word dog is 2.41 occurrences per million words.

An aside: it isn’t necessarily the case that having seen some word tells you anything about the probability of seeing another word.  For example, the probability of the word barf and the probability of the word barf given that you have seen the word the are probably equal.  When the probability of some event (say, seeing some word) and the probability of that event given some other event (say, having seen some other word) are equal, we say that they are conditionally independent.  When the probability of some event is not the same as the probability of that event given some other event, we say that they are conditionally dependent.  
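If you’d like to see numbers like these computed rather than just asserted, here’s a minimal sketch.  Only the 15.7-billion-word corpus size and the per-million figures come from the post; the raw counts below are invented for illustration, since the arithmetic is the point:

```python
# Toy illustration of frequency per million words and the "once every
# N words" conversion.  The counts are invented; the real numbers in
# the post come from Sketch Engine's 2015 web corpus.

def per_million(count, total_words):
    """Occurrences per million words."""
    return count / total_words * 1_000_000

def once_every(freq_per_million):
    """Roughly how many words you read between occurrences."""
    return round(1_000_000 / freq_per_million)

total = 15_700_000_000        # words in the whole corpus
barf_count = 1_884            # invented count: works out to 0.12 per million
near_dog_total = 50_000_000   # invented: words in the vicinity of "dog"
barf_near_dog = 120           # invented count: works out to 2.4 per million

print(round(per_million(barf_count, total), 2))            # baseline frequency
print(once_every(0.12))                                    # ~8.3 million words
print(round(per_million(barf_near_dog, near_dog_total), 2))  # near "dog"
print(once_every(2.41))                                    # ~415,000 words
```

Same arithmetic as in the text: divide a count by the corpus size to get a frequency, and invert a per-million frequency to get the expected gap between occurrences.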

So, you’re at a party, sipping a dark beer and minding your own business, when the host introduces you to another cheerful attendee.  “Biggie, meet Zipf.  He’s a linguist.”  …and disappears.  You just stand there, silent–trying not to glare balefully, either at your new interlocutor or–more likely–at the back of your departing host.  This is a mistake, for the following reason: when you just stand there silently, you let the other person establish the grounds of the conversation.  (Note that I’m assuming a party in the United States, where we find silence uncomfortable, and thus there will, indeed, be a conversation.)

The probability of someone saying Wow–irregular verbs, huh?  Aren’t they weird?  …is equal to the frequency of Wow–irregular verbs, huh?  Aren’t they weird?  …being said, divided by the frequency of anything whatsoever being said.  In other words: vanishingly small.  However, the probability of someone saying Wow–irregular verbs, huh?  Aren’t they weird? …given that you have just been introduced to someone as a linguist is not vanishingly small–it is much, much larger than vanishingly small.

Just to be sure that we’re all paying attention here:

  1. Suppose that the probability of the word barf is not equal to the probability of the word barf given that the word dog has been said.  These two words are:
    1. conditionally dependent
    2. conditionally independent
  2. Suppose that the probability of the word barf is equal to the probability of the word barf given that the word the has been said.  These two words are:
    1. conditionally dependent
    2. conditionally independent
  3. The probability of someone saying Wow–irregular verbs, huh?  Aren’t they weird? is much higher if they have just been told that you are a linguist than the probability of someone saying Wow–irregular verbs, huh?  Aren’t they weird? …given no additional information.  Those two events are:
    1. conditionally dependent
    2. conditionally independent

Answers: (1) conditionally dependent, (2) conditionally independent, (3) conditionally dependent.

What’s so irritating about this?  The answers to that question are probably as numerous as the number of linguists in the world (which is to say: not enormous, but not zero, either), but here are my top three explanations:

  1. Your question is looking for a specific answer–yes, they sure are weird–but I do not, in fact, think that English irregular past-tense verbs are weird, so I feel pressured into lying, so fuck you.
  2. Talking about what’s interesting about English irregular past-tense verbs (I said interesting, not weird) would require me finding a napkin and a pen with which to draw on it, and no one seems to carry pens anymore, so I would have to wander around the party like a bumbling idiot, breaking up innumerable conversations while I looked for one, plus I have facial hair, so I really need my napkin.
  3. A reasonable linguist would suspect that engaging with you on this question will lead to an annoying conversation about linguistic complexity, and that would really ruin their evening–which, given that I was just sipping a dark beer and minding my own business, seems pretty unfair.

English is the language of my profession, and I know an enormous number of non-native speakers who can read and write it close to perfectly.  But: drink a couple of beers at the Association for Computational Linguistics convention meet-and-greet, get into an animated conversation about the inability of Big Data to demonstrate causality, and anyone will start to trip over irregular forms.  If you’re speaking English, that’s probably going to mostly involve irregular past-tense verbs.  But: practice makes perfect (or at least better), so: let’s practice!

Today we’ll look at irregular past-tense verbs that follow a specific pattern.  In this pattern, a verb with the vowel [i] (International Phonetic Alphabet) in the present tense has the vowel [ε] in the past tense.  Examples:

  • feed/fed
  • lead/led
  • meet/met
  • read/read

Notice that I’m grouping these verbs by pronunciation, not by spelling–our goal here is to help you develop spoken habits.  (Mécanisation–thanks, Phil d’Ange!)  The astute reader (OK, a linguist) might also have noticed that those verbs all end with one of the two English “alveolar oral stop consonants:” that is, with a t or a d.  Other verbs that have the [i]-in-the-present-[ε]-in-the-past pattern may add a t or a d:

  • feel/felt
  • creep/crept
  • keep/kept
  • kneel/knelt
  • leave/left
  • mean/meant
  • leap/leapt (leaped is also possible)
  • cleave/cleft (cleaved and clove are also possible)
  • flee/fled
  • sleep/slept
  • sweep/swept
  • weep/wept
  • deal/dealt
  • dream/dreamt (dreamed is also possible)
  • plead/pled (pleaded is also possible, and I think common these days, at least in the US)

OK: practice time!  Here are some sentences that include past-tense verbs of the [i]-in-the-present-[ε]-in-the-past pattern.  Read them out loud, replacing the present-tense verb in parentheses with the past-tense form.  (In some of these examples, it’s actually a past participle that happens to have the same form as the past tense.)

All examples are from the New York Times story Trump and Putin have met five times.  What was said is a mystery, by Peter Baker, published January 15th, 2019.  I have edited some of them for clarity, e.g. by replacing they with Trump and Putin.

The first time that Trump and Putin (meet) was in Germany.

Each of the five times President Trump has (meet) with Mr. Putin since taking office, he has fueled suspicions about their relationship.

The unusually secretive way he has handled these meetings has (leave) many in his own administration guessing what happened and piqued the interest of investigators.

At the height of the campaign, his son, son-in-law and campaign chairman had (meet) at Trump Tower with Russians on the promise of obtaining dirt on Mrs. Clinton from the Russian government.

Their most famous meeting came on July 16, 2018, in Helsinki, where they talked for more than two hours accompanied only by interpreters.  The Kremlin later reported that the leaders reached important agreements, but American government officials were (leave) in the dark.  American intelligence agencies were (leave) to glean details about the meeting from surveillance of Russians who talked about it afterward.
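If you’d rather drill with a computer than with a party guest, a few lines of code would do it.  This is just a sketch: the present-to-past table is hand-built from the lists above, and the parenthesized fill-in-the-blank format mirrors the exercises:

```python
import re

# Present -> past mapping for the [i]-in-the-present-[ε]-in-the-past
# verbs listed above (including the ones that add a t or a d).
PAST = {
    "feed": "fed", "lead": "led", "meet": "met",
    "read": "read",  # same spelling, different vowel
    "feel": "felt", "creep": "crept", "keep": "kept", "kneel": "knelt",
    "leave": "left", "mean": "meant", "leap": "leapt", "flee": "fled",
    "sleep": "slept", "sweep": "swept", "weep": "wept", "deal": "dealt",
    "dream": "dreamt", "plead": "pled",
}

def answer(sentence):
    """Replace each parenthesized present-tense verb with its past tense.

    Verbs that aren't in the table are left as-is, parentheses and all.
    """
    return re.sub(r"\((\w+)\)",
                  lambda m: PAST.get(m.group(1), m.group(0)),
                  sentence)

print(answer("The first time that Trump and Putin (meet) was in Germany."))
# The first time that Trump and Putin met was in Germany.
```

Of course, this only checks the spelling; saying the answers out loud is still up to you, which is the whole point of the drill.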

The picture at the top of this post is an MRI of the vowel [i] being pronounced.  Source: I don’t remember, but if you care, I’ll look it up.  Enjoying the How to irritate a linguist series?  Here are the previous episodes.