Zipf’s Law English: reduction

Spoken American English can be very difficult to understand. Here’s a video to help you cope with one of the problems therewith.

Walking out of the exam on oral comprehension during the testing for the Diplôme approfondi de langue française a couple months ago, I found a very unhappy-looking young man waiting for the elevator.  Are you OK?  He shook his head glumly: I flunked again, I know it.  I made sympathetic noises.  Was this your first time taking the test?  I responded in the affirmative.  He gave me a look of pity–clearly the expectation was that I was going to find the experience as brutal as he had.  Repeatedly, apparently.

Indeed, the oral comprehension exam got me my worst score out of the whole test.  Spoken French and spoken English can both be brutally difficult to understand if they’re not your native language, and for many of the same reasons.  One of those is their sets of vowels–both languages have vowel “inventories” (the technical term) that are shared by relatively few languages.  Another is a process called reduction, which leads to things having a range of ways that they could be pronounced, some of which are less distinct than others.  For example, in French, some unstressed vowels are optional in casual spoken language, so that cheveux is often pronounced chveux, matelot can be pronounced matlot, and so on.  Furthermore, the sounds that are “left behind” can be changed as a result, so that, for example, the j in je becomes pronounced as ch when je suis is “reduced” to chuis.  So, when I describe this as becoming “less distinct,” think about this.  In French, there are these two words, and the difference between them is the sound of j versus the sound of ch:

  • le jar: secret language, argot
  • le char: chariot; in Canada, car.

When j becomes ch, as in chuis, the difference between the two sounds goes away, and in that sense, a “reduced” word is less distinct from other words than it might have been.

Reduction processes are rampant in spoken American English, and they can make the language pretty difficult to understand if you’re not a native speaker.  I’m trying my hand at putting some videos together that aim to help people learn to understand these reductions.  You can find the first one, on the topic of the reduction of let me to lemme, at the link below.  If you’re as mystified by spoken American English as I am by spoken French, check it out–I’d love to have feedback on what does and doesn’t work, whether that be here on this blog, or in the Comments section on YouTube.  Unfortunately, I haven’t figured out the whole subtitle thing, and I’d like to know to what extent that does or doesn’t interfere with the effectiveness (or lack thereof) of the video.  Any input at all would be appreciated, though!

Vocabulary hoax and social stratification

Recommended readings on language: the Great Eskimo Vocabulary Hoax, and what’s on the fourth floor.

Apropos of nothing, here is a blog post with this week’s suggested readings for a class that I’m teaching.  Some of them are quite interesting and not at all technical.  In particular, the Great Eskimo Vocabulary Hoax piece by Geoffrey Pullum talks about issues that have come up multiple times in the Comments section of this blog, and the Social stratification of (r) in New York City department stores piece by William Labov is quite fascinating if you’re interested in language and society.

Suggested readings for Weeks 2 and 3, natural language processing

English notes

apropos of nothing: used to introduce a new topic that isn’t related to anything that’s previously been under discussion.  Examples:

How it was used in the post: Apropos of nothing, here is a blog post with this week’s suggested readings for a class that I’m teaching.  

My last assclown


Winter will be past before we know it.  I’ll see the chestnuts blooming in the Place Cambronne on my way home from work (on my way to work, I study vocabulary, and don’t notice them), and rejoice in the knowledge that they will survive even the zombie apocalypse.  Not far behind will be National Poetry Month.  In anticipation of that, and after a long weekend of contemplating what exactly it means to have a thin-skinned assclown, a man who rages in response to tweets and threatens the press when he doesn’t like their reporting, with his fingers on the most powerful nuclear arsenal in the world, I propose a timely bit of Robert Browning.  Follow this link if you’d like to hear a pretty good recording thereof.  It’s pretty disturbing in and of itself, and all the more so with Trump in the presidency.  I gave commands;  then all smiles stopped together. There she stands  as if alive….Notice Neptune, though…thought a rarity, which Claus of Innsbruck cast in bronze for me!  (Rough translation: I had her killed.  Hey, look at this great thing that I have!)

My Last Duchess

Robert Browning

That’s my last Duchess painted on the wall,
Looking as if she were alive. I call
That piece a wonder, now; Fra Pandolf’s hands
Worked busily a day, and there she stands.
Will’t please you sit and look at her? I said
“Fra Pandolf” by design, for never read
Strangers like you that pictured countenance,
The depth and passion of its earnest glance,
But to myself they turned (since none puts by
The curtain I have drawn for you, but I)
And seemed as they would ask me, if they durst,
How such a glance came there; so, not the first
Are you to turn and ask thus. Sir, ’twas not
Her husband’s presence only, called that spot
Of joy into the Duchess’ cheek; perhaps
Fra Pandolf chanced to say, “Her mantle laps
Over my lady’s wrist too much,” or “Paint
Must never hope to reproduce the faint
Half-flush that dies along her throat.” Such stuff
Was courtesy, she thought, and cause enough
For calling up that spot of joy. She had
A heart—how shall I say?— too soon made glad,
Too easily impressed; she liked whate’er
She looked on, and her looks went everywhere.
Sir, ’twas all one! My favour at her breast,
The dropping of the daylight in the West,
The bough of cherries some officious fool
Broke in the orchard for her, the white mule
She rode with round the terrace—all and each
Would draw from her alike the approving speech,
Or blush, at least. She thanked men—good! but thanked
Somehow—I know not how—as if she ranked
My gift of a nine-hundred-years-old name
With anybody’s gift. Who’d stoop to blame
This sort of trifling? Even had you skill
In speech—which I have not—to make your will
Quite clear to such an one, and say, “Just this
Or that in you disgusts me; here you miss,
Or there exceed the mark”—and if she let
Herself be lessoned so, nor plainly set
Her wits to yours, forsooth, and made excuse—
E’en then would be some stooping; and I choose
Never to stoop. Oh, sir, she smiled, no doubt,
Whene’er I passed her; but who passed without
Much the same smile? This grew; I gave commands;
Then all smiles stopped together. There she stands
As if alive. Will’t please you rise? We’ll meet
The company below, then. I repeat,
The Count your master’s known munificence
Is ample warrant that no just pretense
Of mine for dowry will be disallowed;
Though his fair daughter’s self, as I avowed
At starting, is my object. Nay, we’ll go
Together down, sir. Notice Neptune, though,
Taming a sea-horse, thought a rarity,
Which Claus of Innsbruck cast in bronze for me!

English notes

assclown: “someone who, wrongly, thinks his actions are clever, funny, or worthwhile.”  “someone who seeks an audience’s enjoyment while being slow to understand how it views him.”  A specific kind of asshole, defined as “A person counts as an asshole, when and only when, he systematically allows himself to enjoy special advantages in interpersonal relations out of an entrenched sense of entitlement that immunizes him against the complaints of other people.”  Sources: John Kelly on the Strong Language blog, and Aaron James, in his book Assholes: a theory of Donald Trump.

Fra: “used as a title equivalent to brother preceding the name of an Italian monk or friar” (Merriam-Webster).  My best guess is that it’s used here to suggest that the Duke thinks that the painter was overly familiar (brother) with his wife, and/or that his wife was overly familiar with the painter.

familiar: a word with at least two parts of speech (adjective, of course, but also noun).  In the (attempt at an) explanation above, it’s used with this range of meanings, again from Merriam-Webster: (a) being free and easy (as in the familiar association of old friends); (b) marked by informality (as in the familiar essay).

For my money, I really need to get more sleep

I can’t sleep, which leads to tokenization issues and the definition of “for my money.”

I don’t sleep well.  That is to say: I don’t sleep very much.  Not at night, anyway.

In the best-case scenario, the middle of the night, when in theory I should be sleeping, is my time to study vocabulary or to read.  In the worst-case scenario, the middle of the night is when I return emails from people who are in North America, and therefore awake.

Tonight’s email brought a help-wanted ad from the School of Informatics at the University of Edinburgh, posted by the amazing Mirella Lapata.  (I say “amazing” because her paper with Regina Barzilay at the Association for Computational Linguistics annual meeting in 2005 opened my eyes to the possibilities for inventive evaluation strategies in computational linguistics in a way that my eyes had not previously been opened.)  For my money, the University of Edinburgh’s graduate program in computational linguistics is the best in the world, so I forwarded Mirella’s email to the students in our program, most of whom are not computational linguists, but most of whom would be quite suited for one of the advertised jobs in the School of Informatics.  I added the following introduction to the email:

Picture source: me.

This got me the following response from one of my students in the US (and therefore awake):

Picture source: also me.

Now, I love getting this kind of question, for many reasons.  It lets me repay the apparently endless patience of my colleagues in France for my crappy command of their language.  It lets me be the person who knows the answer to a question about language, which in French happens exactly never.  It gives me a socially acceptable excuse for talking about language, which I enjoy way more than is cool.  It suggests that someone actually both read and thought about what I wrote.  (You pick whichever one you think portrays me in the best light.)  In fact, I love that kind of question so much that I will often go out and find naturally-occurring examples, which, like any good linguist these days, I do on the Interwebs.  A trip to the Sketch Engine web site and a search of the Open American National Corpus found me these:

Picture source: screen shot of the Sketch Engine web site.

…which, of course, like most things of interest, leads to a question. In this case, the question is: what’s wrong with the Sketch Engine web site?  Where did all of those spaces come from?  

The answer: there’s nothing wrong with the Sketch Engine web site.  Part of any analysis of written data is choosing an answer to this question: what is a word?  It’s not typically obvious what the answer is.  Give students in a beginning language processing class this sentence, and ask them what the words are:

My dog has fleas.  

(For reasons that are obscure to me but that I think have something to do with playing the ukulele, that is a famous sentence.)  Ask them what the words are, and the first answer will be anything separated by white space:

My dog has fleas.

…at which point they quickly realize that they’ve just posited that fleas. is a word, and they modify their hypothesis, to be anything separated by white space and stripped of punctuation: 

My dog has fleas .

(I’m not making this up–in fact, I did it in class last Tuesday.)  Next they figure out that they probably want My and my to be considered the same word, which means that they need to do something about the case of letters, and if they speak any of the bazillion languages that have more inflectional morphology (example in a minute) than English does, then they might want to do something with aller/allais/allai/allasse, etc.
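The three hypotheses that the students go through can be sketched in a few lines of Python.  This is only an illustration, not what any particular tool does: the function names are made up for this example, and the punctuation handling is deliberately crude (it only peels ASCII punctuation off the end of a token).

```python
import string

def split_on_whitespace(text):
    """Hypothesis 1: a word is anything separated by white space."""
    return text.split()

def split_off_punctuation(text):
    """Hypothesis 2: also separate trailing punctuation from each token."""
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel ASCII punctuation marks off the end of the chunk, one at a time.
        while chunk and chunk[-1] in string.punctuation:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens

def normalize_case(tokens):
    """Hypothesis 3: treat My and my as the same word by lowercasing."""
    return [token.lower() for token in tokens]

sentence = "My dog has fleas."
print(split_on_whitespace(sentence))                    # fleas. stays glued together
print(split_off_punctuation(sentence))                  # the period becomes its own token
print(normalize_case(split_off_punctuation(sentence)))  # My and my now match
```

Lowercasing is the simplest possible answer to the My/my problem; it does nothing for inflected forms like aller/allais, which need something more like a dictionary of word forms.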

Things get pretty complicated pretty quickly, though.  Suppose that you’re dealing with English.  What do you do with

wouldn’t don’t haven’t didn’t

Seems pretty straightforward–you want something like this:

would n’t do n’t have n’t did n’t

…except that it’s not straightforward at all, because then you have to propose

wo n’t

…which people generally aren’t happy about.
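To see where wo n’t comes from, here is a minimal sketch of the “split off n’t” rule as a regular expression.  The pattern and the function name are assumptions made for this illustration, not the rule that any specific tokenizer uses.

```python
import re

# A token that ends in n't (straight or curly apostrophe), with at least one
# character in front of it. The stem is whatever is left over.
NEGATION = re.compile(r"^(?P<stem>.+)(?P<neg>n['’]t)$", re.IGNORECASE)

def split_negation(token):
    """Split a contracted negation into [stem, n't]; leave other tokens alone."""
    match = NEGATION.match(token)
    if match is None:
        return [token]
    return [match.group("stem"), match.group("neg")]

for word in ["wouldn’t", "don’t", "haven’t", "didn’t", "won’t"]:
    print(word, "->", split_negation(word))
```

The rule has no idea that wo is not a word of English; it treats won’t exactly like wouldn’t.  Getting will n’t instead would take a special case for that one word.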

The table of contents of “Le mot,” by Maurice Pergnier. The point of the picture is that the first 46 pages of the book address the various arguments for and against the whole idea of the word. Picture source: me.

There are a variety of ways to answer these sorts of questions, and it does actually matter.  From a practical point of view, the choices that you make about how you do this–the process is called tokenization–are important enough that they affect the performance of computer programs that do things with language.  (Here’s a recent paper on the topic.)  From a theoretical point of view, your choice takes a position on a hugely controversial topic in linguistics: what a word is.  (The best discussion of the controversy that I’m aware of is in the book Le mot, by Maurice Pergnier.)

So, why are those spaces there in the Sketch Engine output?  Let’s look at it again:

Picture source: screen shot of the Sketch Engine web site.

One of the immediately obvious things is that they have “tokenized” the punctuation off, so that “personal growth” becomes “ personal growth ” and (1995) becomes ( 1995 ).  The next thing that you might notice is that there is some ambiguity in the output.  Look at what happens to that’s and people’s…

…which become

that ‘s and people ‘s

Now we have two ‘s …and they are different, but look the same.  What is a computer program to do with that?  Welcome to my world.  Nobody said that computational linguistics was going to be all about suicide prevention and curing cancer, right?
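A small sketch makes the problem concrete.  The regular expression below is an assumption made for illustration (it is not Sketch Engine’s tokenizer): it splits off ’s and treats every other punctuation mark as its own token.

```python
import re

# A token is: an apostrophe followed by s (the clitic), OR a run of letters
# and digits, OR any single character that is neither a letter/digit nor a space.
TOKEN = re.compile(r"['’‘]s\b|\w+|[^\w\s]")

def tokenize(text):
    """Return the tokens of text, with punctuation and 's split off."""
    return TOKEN.findall(text)

tokens = tokenize("that’s (1995) people’s")
print(tokens)

# Both clitics come out as the very same string, so nothing in the output
# says which one is a reduced verb and which one is a possessive marker.
clitics = [token for token in tokens if token == "’s"]
print(len(clitics))
```

Once the text has been cut up this way, telling the two ’s tokens apart takes information from the context, which is typically the job of a later step such as part-of-speech tagging.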

The Regina Barzilay and Mirella Lapata paper that I mentioned above:

Regina Barzilay and Mirella Lapata. 2005. Modeling Local Coherence: An Entity-based Approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 141-148. Ann Arbor.

The declaration of competing interests: I don’t have any.  Sketch Engine doesn’t pay me–I pay them, and I get a hell of a lot of use out of it.  

French notes

— Je vous regardais tout à l’heure, vous étiez marants tous les deux le flicmane et vous.

— A tes yeux, dit la veuve Mouaque.

— A mes yeux?  Quoi, “à mes yeux”?

— Marants, dit la veuve Mouaque.  A d’autres yeux, pas marants.

— Les pas marants, dit Zazie, je les emmerde.

Raymond Queneau, Zazie dans le métro

How would you say for my money in French, or more generally, label something as someone’s opinion, yours or otherwise?  There are a lot of options, and unfortunately, I don’t know the status of any of them with respect to register of language, contexts in which they are or aren’t appropriate, etc.  Here’s what I’ve come across so far, and I should point out that I also don’t know which of these can only gracefully be used to introduce your own opinion, versus which could also be used to introduce someone else’s opinion.  I’ll also mention (and then I’ll shut up) that of all of these, I’ve heard the first one (à mon avis) the most, the second one exactly once (in Raymond Queneau’s Zazie dans le métro), and the rest never, as far as I know.  If any of you native speakers out there can offer suggestions about when and where to use which of these, it would be great.

  • à mon avis
  • à mes yeux
  • selon moi 
  • à mon gré
  • de mon point de vue
  • d’après moi
  • d’après mon point de vue
  • à mon sens
  • de l’aveu de qqn: I think that this one implies something negative, along the lines of “as Chomsky himself admits,” as opposed to relaying an opinion about which you’re not necessarily making any judgment one way or the other.

In which I repay a random act of kindness by being a jerk

“Wakarimasen” means “I don’t understand” or “I don’t know.”

I was walking down the street in Tokyo this morning when a fellow foreigner acknowledged my existence.

This is a far rarer occurrence than you might think in this country with a very low immigration rate, where running into another “Western” foreigner is pretty uncommon outside of tourist areas, and you might expect that it would lead to at least a smile, if not an actual conversation.  I’ve had many occasions when Japanese who spoke some English struck up random chats with me, but I’ve noticed that the few foreigners who you run into in Japan will, in general, resolutely avoid meeting your eyes.  (Note that I’m talking about foreigners who live here–not tourists.)  Why?  I can only guess.  OK, my guess: foreigners here in Japan struggle so very hard to integrate themselves into the culture that I suspect that they’re loath to, in some sense, admit that they are “others” by sharing in the otherness of some random visitor such as myself.

So, when a clearly foreign guy caught my eye and smiled at me this morning on my way back from a morning visit to the neighborhood shrine, I was so surprised that I don’t think I smiled back.  Then I felt like a total jerk.  Maybe being someone who lives here–you don’t come out of the very busy Ochanomizu station at that time of the morning unless you’re going to work, so I’m guessing that he does–he’s used to getting that reaction from other foreigners.  Still: I felt like even more of an asshole than I usually do.

French notes

le sanctuaire shinto: Shinto shrine

English notes

to meet someone’s eyes: to look directly into someone’s eyes, acknowledging the contact.

How it was used in the post: I’ve noticed that the few foreigners who you run into will, in general, resolutely avoid meeting your eyes.

to be loath to: to be deeply unwilling to do something.  (Definition adapted from Merriam-Webster.)

to loathe: to dislike to the point of disgust. 

Keeping track of the difference between these two is actually quite difficult even for native speakers.  You can read an article about the history of the problem here on the Merriam-Webster web site.  There are two parts to it.  One is keeping straight the fact that the verb ends with an e, and the adjective doesn’t.  The other is that the verb is pronounced with the th of this and the, while the th of the adjective can be pronounced with the th of this and the, or with the th of thin.    

How this showed up in the post: foreigners here in Japan struggle so very hard to integrate themselves into the culture that I suspect that they’re loath to, in some sense, admit that they are “others” by sharing in the otherness of some random visitor such as myself.

Why you can’t unsee things: compositionality and Haussmann’s apartment buildings

You can unscrew a lightbulb, you can unplug your monitor, and you can unbuckle your suspenders, so why can’t you unsee things?   It has to do with the prefix un- when it’s attached to verbs.  In order to be able to un- a verb:

  • The verb has to refer to changing the state of something.  So, you can undress yourself (changing your state from being dressed to not), you can unclog a pipe (changing its state from being clogged to not), and you can unlock a door (changing its state from being locked to not).
  • The state has to be reversible.  So, you can dress/undress yourself, you can clog/unclog a pipe, and you can lock/unlock a door.  But: you can bake a cake, but can’t unbake it; you can dry a shirt, but as far as I know, you can’t undry it; you can break an egg, but you can’t unbreak it.

So: you can see something, but you can’t unsee it, because when you see something, you’re not changing its state, and that’s the sine qua non of verbs that can take un-.
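The two conditions amount to a little decision procedure, which can be written down as a sketch.  Everything here is hypothetical: the Verb class and the hand-assigned True/False labels are invented for the illustration, with the labels simply copied from the examples above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verb:
    lemma: str
    changes_state: bool  # does the verb change the state of its object?
    reversible: bool     # can that change of state be undone?

def can_take_un(verb):
    """A verb can take un- only if it changes a state AND the change is reversible."""
    return verb.changes_state and verb.reversible

# Labels assigned by hand, following the examples in the text.
VERBS = [
    Verb("dress", changes_state=True, reversible=True),
    Verb("lock", changes_state=True, reversible=True),
    Verb("bake", changes_state=True, reversible=False),
    Verb("break", changes_state=True, reversible=False),
    Verb("see", changes_state=False, reversible=False),
]

for verb in VERBS:
    verdict = "un" + verb.lemma if can_take_un(verb) else "no un- form"
    print(verb.lemma, "->", verdict)
```

The hard part, of course, is not the function but the labels: deciding whether a verb changes a state, and whether that change is reversible, is exactly the linguistic judgment that the code takes for granted.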

Ack–data!  I almost forgot that I’m an empiricist!  In fact, the verb to unsee occurs a lot.  Its frequency is 0.02 occurrences per million words in the enTenTen13 corpus (19.7 billion words of English, available on the Sketch Engine web site).  But, it’s cool: it doesn’t mean to undo the seeing of something.  When we talk about unseeing things, we’re usually talking about the very fact of not being able to unsee them, and what that actually means is this: we can’t forget them, and/or we can’t move beyond whatever we learned from what we saw.
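To get a feel for what a frequency like that means in a corpus of that size, here is the back-of-the-envelope arithmetic.  The function name is made up for the example, and since the reported frequency is rounded to two decimal places, the result is only a rough estimate of the real count.

```python
def approximate_occurrences(frequency_per_million, corpus_size_words):
    """Convert a per-million-words frequency into an approximate raw count."""
    return frequency_per_million * corpus_size_words / 1_000_000

corpus_size_words = 19.7e9    # size of enTenTen13 as given in the post
frequency_per_million = 0.02  # reported frequency of unsee (a rounded figure)

estimate = approximate_occurrences(frequency_per_million, corpus_size_words)
print(round(estimate))  # roughly 394 occurrences
```

In other words, a frequency that looks tiny per million words still adds up to a few hundred examples in a corpus this big.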

In fact, the interwebs are full of talk about things that can’t be “unseen.”  Some examples:

Why does unsee work so well for this use, when it can’t have the meaning that you would think it would?  I suspect that it’s precisely because (a) it’s basically an impossible verb, and (b) it’s used only to describe an impossible action.  And, the fact that the meaning of unsee is not the meaning of see plus the meaning of un- is important here.  We’ve talked often about the basic principle of compositionality–the idea that meaning in language comes from something like “adding together” the meanings of different things.  Here is a case where the meaning is clearly not compositional–to unsee something, were it possible, would not be what it is if it were compositional.  (Were it possible is explained below in the English notes.)  So: cool, if you think that it’s cool to violate the expectations of linguistics, computer science, and philosophy.  (I do think it’s cool, but maybe that’s why I’m single.)

What I can’t unsee: pierres d’attente.  I took a guided tour of Haussmannian Paris the other day.  What that means: the enormous redesign of Paris in the 3rd quarter of the 19th century, when huge swaths of the city were torn down and rebuilt into the stereotype that you’re thinking of when you visualize Paris today.  (See here for a post about the typical Haussmannian streets and how they relate to your ability to survive the zombie apocalypse in Paris, as well as here for a post about the typical Haussmannian apartment buildings and how they, too, relate to your ability to survive the zombie apocalypse in Paris.)

The new Haussmannian buildings went up in the order in which their lots were appropriated, the old buildings torn down, and the new buildings financed.  That meant that it was often the case that buildings were put up that one day would have neighbors, but didn’t yet.  In anticipation of the need to line up with adjacent buildings–lining up with things was very important in Haussmann’s Paris–the front-facing walls of the buildings had projections that were meant to facilitate alignment with future neighbors.  So, pierre d’attente: “waiting stone,” I guess.  (I think they can also be called pierres d’accord.)

Now, at some point, architects realized that if you have pierres d’attente sticking out of the side of your building, they catch rain, and then it can run into your walls, and that is most definitely not a good thing for your building.  So, people started cutting them off, which is why you will see things like this:

Apartment building with pierres d’attente removed. Picture source: me, on the rue La Fayette.

But: not everyone was happy about this.  Haussmannian apartment buildings are part of our patrimoine, and pierres d’attente are part of Haussmannian apartment buildings, so those pierres d’attente are part of our patrimoine, and no asshole should be cutting them off, right?  Point taken, and cutting off your pierres d’attente is apparently no longer allowed.  But, hey, this is France, and we’re logical–so, what you can do is, you can cut them so that there’s a pente, a slope, on the top edge.  (I just had to throw the French word in there, on account of the fact that when I memorized it, I thought that I would never, ever get to use it–and there, my friends, is a very concrete example of Zipf’s Law in action.)

The guided tour was great.  Seulement voilà (the thing is)…the tour guide explained pierres d’attente to us, and now I can’t stop seeing them.  It’s OK–frankly, the more there is to occupy my fevered little brain, the better…

English notes

Anglophone students of French whine about the French subjunctive, and frankly, I’m not sure that Francophone professors are thrilled about teaching it to us, but: the fact is, English has a subjunctive mood, too.  Or, more accurately: it can.  This varies quite a bit by dialect, but English can have a subjunctive, in at least the following circumstance: talking about things that are not real at the moment.  For example, here are some options, with and without the subjunctive:

  • If I were you, I wouldn’t tell him to fuck off–he’s a lot bigger than you are.
  • If I was you, I wouldn’t tell him to fuck off–he’s a lot bigger than you are.

You can recognize the subjunctive by the weird agreement of If I were you, rather than If I was you.  Both are correct, and most Americans would say If I was you, but If I were you is more natural in my dialect.  (I come from a relatively obscure area in the northwest of the country.)

  • Would you prefer that he give you a pat on the back, or a kick in the ass?
  • Would you prefer that he gives you a pat on the back, or a kick in the ass?

Again, you can recognize the subjunctive by the weird agreement of he give you versus he gives you.  

How the subjunctive was used in the post: Here is a case where the meaning is clearly not compositional–to unsee something, were it possible, would not be what it is if it were compositional.  I chose obscenity-laden examples to make clear that this isn’t a formality thing–the subjunctive is just more natural in my dialect.  Again, most American speakers of English would use the version of each of these sentences without the subjunctive, but both are fine.  I have no idea how this works in the United Kingdom–can any of you Brits comment on this?


If Paris were full of the living dead

Who among us has not looked across the majestic sweep of the Place de la Concorde, up the stretch of the Champs Elysées, or through the luxurious Luxembourg Gardens and wondered: what will this place look like when it’s overrun by zombies?


I first published this on November 13, 2015, from Denver, Colorado. Not long afterwards, phone calls and texts started coming in fast and furious: relatives who were hearing about the Islamic State terrorist attacks that would kill 130 people and injure another 368 that evening.  The post didn’t seem so funny in that context, and I took it down after an evening of trying to reach family and friends in Paris.  14 months later, Paris has brushed off her shoulders and kept walking, as she always does, and I am ready to play my infinitesimally small part in that.

Who among us has not looked across the majestic sweep of the Place de la Concorde, up the stretch of the Champs Elysées, or through the luxurious Luxembourg Gardens and wondered: what will this place look like when it’s overrun by zombies?  Who among us has not looked down an unending line of the 7-story Haussmannian apartment blocks that make Paris look like Paris and thought: it would really suck to have to clear 7-story building after 7-story building–with optional basement–of zombies…

The English Wikipedia page on zombies is quite long, and discusses zombies from every angle that one could think of–folklore, the evolution of the zombie archetype, the zombie in modern fiction, the significance of the zombie apocalypse, and the zombie in popular culture–each with its sections and subsections.  In contrast, the French Wikipedia page on zombies is pretty much just this sentence:

Un zombie (ou zombi) est, dans le folklore, un mort-vivant ou un individu infecté d’un virus nuisible à certaines parties du cerveau.

Of course, even with just one sentence, Zipf’s Law brings us some new vocabulary items:

  • le mort-vivant: living dead.
  • nuisible: harmful, damaging, injurious; pest.

I have no idea what it means that there is a long English Wikipedia page on zombies and a very short French one.  Probably something profound about France and America, but I don’t know what.  I do know this: I hate zombies.

About 14 months later, the French Wikipedia page on zombies is considerably longer, and I’ve reached a new level in my thinking about the relationship between zombies and those Haussmannian apartment buildings: they will contain the zombies nicely, so they’re actually going to be a big help in recovering from the zombie apocalypse.  However, I’m leaving this post as it was on November 13th, 2015–a fond memory of a more insouciant time.