The Holy Brotherhood

Nobody who’s anybody walks in LA.

The Missing Persons song got it right: nobody walks in LA.

It’s time to renew my visa, which means a flight to Los Angeles to render myself to the French consulate tomorrow morning (you are assigned to one consulate or another depending on where in the US you live–mine is in Los Angeles), which means that I spent three hours this afternoon photocopying every @#$% document that the application requires, arranging them all in my little French plastic sleeve in the exact order in which they appear on the instructions page on the consulate web site, imploring the poor lady at FedEx to take my mug shot in such a way that I might appear adorable, or at least not hideous; and walking.  Possibly the Missing Persons lyrics should have been “nobody who’s anybody walks in LA,” ’cause I wasn’t actually the only one. There was the enormously, enormously, enormously obese white woman wearing a halter top and a muumuu, sitting in front of a house that must have cost several million dollars (I shit you not), with all of her belongings in three very chaotic-looking shopping carts, singing softly to herself.  The black lady of my age or so sitting at an empty table in Starbucks, staring at nothing, her lips silently moving as her legs twitch like… well, I suck at analogies, but the poor lady’s legs twitched non-stop.  The oddly-well-groomed-despite-wearing-shorts-and-sneakers-with-tube-socks white guy of my age or so pacing the sidewalk with a blank canvas under his arm, becoming increasingly agitated as he stops by my table again and again to ask if it’s not the case that the car parked in front of the cookie shop is there illegally.  The thin black woman of my age or so (what the fuck is going on with the people my age in LA??) sitting on a bench, waving her hands and having an animated conversation with someone visible only to herself; on her lap is a checklist on which is written חֶבְרָה קַדִישָא, which is Aramaic for “The Holy Brotherhood,” which is the term for a Jewish volunteer burial society.  
(Just don’t fucking ask why I can read Aramaic well enough to catch things written on random strangers’ checklists, OK?)

The streets of Paris are full of beggars (see this post for information on why that’s the case, and why it has been the case for centuries).  What the streets of Paris are not full of, though, is vulnerable psychotic people.  Why?  In the United States, we have no national health care system.  In France, there is a national health care system.  Want to know which other first-world countries don’t have national health care systems?  None.  And what are the Republicans hot to do?  Get rid of the closest to national health care that we’ve ever been able to get.  Vote in 2018…

The folks at the consulate were super-nice, and I’m happily re-established in Paris–legal until the end of April, yay!

English notes

I shit you not: I’m not kidding you; I’m telling you the truth.

I woulda figured something Athabaskan…

Since Bigfoot is mostly sighted in Oregon and Washington and was apparently captured in Alaska…

Pre-contact distribution of the Athabaskan languages. CC BY 2.0.

The Athabaskan languages are a family of languages native to North America.  Currently there are about 53 of them left.  They’re spoken in Alaska and northwestern Canada, in pockets of the coastal areas of Oregon and Washington, and in the southwestern United States.  Since Bigfoot is mostly sighted in Oregon and Washington and was apparently captured in Alaska, I woulda figured that he would speak something Athabaskan, but apparently not…

I woulda figured explained in the English notes below. #cleaningthebasement

English notes: Colloquially, one meaning of to figure is to guess or to think.  Some examples:





Estimate your vocabulary size

When you figure out how to draw a representative sample of language, notify the linguists, because we sure as hell haven’t figured it out…

Want to see an application of Zipf’s Law?  Go to this web site, where you can get an estimate of your vocabulary size in any of 21 different languages.  I don’t know the details of how they come up with these estimates, but as an indicator of their accuracy (or lack thereof), I can tell you that my percentile placement on their English-language test was the same as my percentile placement on the GRE (the exam that you take if you want to go to graduate school in the United States).  It estimated my French vocabulary size at just over 7,600, which seems reasonable–I would guess that I’ve been learning about 3,000 words a year for almost three years, of which I probably forget about a third due to not running into them a second time (Zipf’s Law: 50% of the language that you will run into today consists of words that occur only very rarely–but that do, indeed, occur), which would work out to just under 6,000; add in another 500 for the one semester of French that I took in college (“university” to you French-speakers in the audience) and you get within about 15% of their estimate, which is close enough for a guess.  (According to the web site, this lands me in the top 44% or so–of whom?  No clue.)
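If you'd like to check that back-of-the-envelope arithmetic yourself, here is a minimal Python sketch. The numbers are the guesses from the paragraph above ("almost three years" is rounded up to three), and the variable names are just for illustration.

```python
# Rough guesses from the post, not measurements.
WORDS_PER_YEAR = 3000        # words learned per year
YEARS = 3                    # "almost three years", rounded up
FORGOTTEN_FRACTION = 1 / 3   # share of words forgotten
COLLEGE_WORDS = 500          # one semester of college French
SITE_ESTIMATE = 7600         # what the web site reported

retained = WORDS_PER_YEAR * YEARS * (1 - FORGOTTEN_FRACTION)
own_estimate = retained + COLLEGE_WORDS
gap = (SITE_ESTIMATE - own_estimate) / SITE_ESTIMATE

print(round(retained))       # 6000
print(round(own_estimate))   # 6500
print(f"{gap:.1%}")          # 14.5%
```

With these round numbers the home-made guess lands roughly 14 to 15 percent below the site's figure.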

How would you use Zipf’s Law to do this kind of estimate?  Remember what the curve described by Zipf’s Law looks like:
Zipf’s Law: a small number of words occur very frequently, while the vast majority of words occur very rarely–but, they do occur. Credit: @ASvanevik.

One way to use this to estimate a vocabulary size would be to figure out how far to the right (towards 100) a person can “go,” so to speak.  If someone can’t reliably understand words above a rank of, say, 20, there is a massive number of words that they don’t know.  On the other hand, if someone can reliably understand words in the 90-100 range, their vocabulary is enormous.  How do you turn enormous into an actual number?  I have no clue how they do that–quantifying vocabulary size is hugely difficult, and as far as I know, it’s not possible to do it precisely for anyone, even for very, very young children. The SWAG approach would be to figure out the rank at which you stop recognizing words reliably, and then calculate the number of words above that rank. The Devil would, of course, be in the details–what texts would you use to determine your curve? Load those texts heavily with scientific journal articles about linguistics and someone like me would probably do pretty well–load them heavily with scholarly analyses of metaphors for love in Finnish epic sagas and I would probably do pretty poorly. Use a representative sample, you say? When you figure out how to draw a representative sample of language, notify the linguists, because we sure as hell haven’t figured it out…
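Here is what the SWAG approach might look like as a Python sketch. To be clear, this is not how the web site does it (I have no idea how they do it). It assumes that you already have a word list sorted from most to least frequent, which is exactly the hard part. The function name, the band size, and the 50% threshold are all made up for illustration.

```python
import random

def estimate_cutoff_rank(ranked_words, knows_word, band_size=1000,
                         sample_size=20, threshold=0.5, seed=0):
    """Walk down a frequency-ranked word list one band at a time.

    In each band, test a random sample of words. The first band where
    the recognition rate drops below the threshold marks the cutoff,
    and the rank where that band starts is the vocabulary estimate.
    """
    rng = random.Random(seed)
    for start in range(0, len(ranked_words), band_size):
        band = ranked_words[start:start + band_size]
        sample = rng.sample(band, min(sample_size, len(band)))
        recognized = sum(1 for word in sample if knows_word(word))
        if recognized / len(sample) < threshold:
            return start
    return len(ranked_words)

# Toy data: 10,000 fake "words" named by rank, and a fake learner
# who knows exactly the 3,000 most frequent ones.
ranked = [f"w{i}" for i in range(10000)]

def toy_learner_knows(word):
    return int(word[1:]) < 3000

estimate = estimate_cutoff_rank(ranked, toy_learner_knows)
print(estimate)  # 3000
```

Real learners don't have a clean cutoff like the toy one does, and the estimate is only as good as the texts behind the ranking.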


Want to know some of the many technical details that make quantifying vocabulary size more or less impossible, even in principle?  See pages 22-28 of my colleague Elisabetta Jezek’s book The lexicon: An introduction.

English notes 

hugely: an adverb meaning “very.”  Is it English?  It first appeared in the language in the 12th century (along with archangel, asleep, dittany, lion, whoredom, and welkin–how cool is Merriam-Webster’s “Time Traveler” feature, and WTF is dittany??).  Have you ever come across it before?  Quite likely not–here are the relative frequencies of hugely and very:

Screen shot from the Google Ngram Viewer.

…but, it’s hard to argue that it’s not part of the language.

Want to see some cool shit?  Click on the version of the graph that you see below.  Do I REALLY think that this is cool?  Yes.  Is that fact related to the shockingly large number of times that I’ve been divorced?  I would imagine so.


Palimpsest upon palimpsest

Dear Dr. Zipf,

Good day to you.  My name is [name removed to protect the guilty] and I am a Ph.D. in [field removed to protect the guilty] at [a hospital which shall remain nameless].  I need to learn how to use natural language processing to process the electronic medical record and provide data that can be used for analysis.  As you are an expert in this field I thought I would email you and ask for your assistance.  Are there any books or training courses out there that can help me learn biomedical natural language processing in a few weeks.   Any help you can provide will be greatly appreciated.  Please let me know.

Warmest Regards,

[Name removed to protect the guilty]

In a few weeks… In a few weeks…

Dear Dr. X,

Biomedical natural language processing is super-simple, and I would be surprised if you couldn’t learn it in a few weeks.  You might find this book helpful:

Cohen, Kevin Bretonnel, and Dina Demner-Fushman. Biomedical natural language processing. Vol. 11. John Benjamins Publishing Company, 2014.

Please let me know if I can be of any further assistance.

Warmest Regards,

Beauregard Zipf, PhD

Dear Dr. X,

The doctoral students in our graduate program typically spend five years learning biomedical natural language processing.  Personally, I’ve spent my entire career learning biomedical natural language processing, beginning with spending a number of years as a medic in the military, where I learned the “biomedical” part. I mostly did physiological monitoring–hemodynamics, electrophysiology, stuff like that.  I later got a bachelor’s degree in linguistics (double major in English, actually), as well as a master’s degree in linguistics, and a PhD in linguistics, which is how I picked up the “language” part.  Along the way I learned to program–the hard way, which is to say by making more mistakes than you could possibly imagine, from the painful to the just plain embarrassing.  (That’s the “processing.”) Since then, I’ve spent years trying to figure this stuff out, and I still wouldn’t say that I know very much about it.  But, hey, you’ve got a PhD in [redacted], so, yeah–you should be able to pick this up in a few weeks.  You might find this book helpful:

Cohen, Kevin Bretonnel, and Dina Demner-Fushman. Biomedical natural language processing. Vol. 11. John Benjamins Publishing Company, 2014.

Warmest Regards,

Beauregard Zipf, Registered Cardiovascular Technologist, Advanced Cardiac Life Support instructor, EMT, PhD

Dear Dr. X,

  • Are there any books or training courses out there that can help me learn biomedical natural language processing

What an interesting question–thank you for bringing it up.  When I Googled the words biomedical natural language processing, the first hit I got was this:


Cohen, Kevin Bretonnel, and Dina Demner-Fushman. Biomedical natural language processing. Vol. 11. John Benjamins Publishing Company, 2014.


Looks like it might be relevant?


Best wishes,


Hi, Dr. X,

It’s nice to hear from you.  You might find this book helpful:


Cohen, Kevin Bretonnel, and Dina Demner-Fushman. Biomedical natural language processing. Vol. 11. John Benjamins Publishing Company, 2014.


Please let me know if I can be of any further assistance.
Best wishes,



English notes
palimpsest:  “writing material (such as a parchment or tablet) used one or more times after earlier writing has been erased” (Merriam-Webster).  Back in the day, writing was mostly done on parchment, and parchment was expensive, so in the monasteries that preserved much of the ancient writing that we have today, it wasn’t uncommon to scrape the ink off of parchment if you didn’t really care about what was written on it, and write something on it that you did care about.  If you’re lucky, though, today we can recover an earlier text from the impressions that it left behind on the parchment, and there are some texts that are only known from a palimpsest.  As examples, Wikipedia lists most of Cicero’s De republica, as well as the oldest Koranic variant in existence.

The two complaints of Americans in France: Part I

One of the things with the biggest effect on what language people will speak to you in Paris comes from the fact that if you’re a tourist, you’re mostly interacting with people in some sort of customer service role. 


Today is Wednesday, and Wednesday is market day in my neighborhood, and I need a liter of milk. Normally I would pop into the supermarket across the street for that kind of thing, but if you want good milk–and if you want to support the little things that make life here what it is–you get your milk from a cheesemonger.  (Cheesemonger explained in the English notes below.)  The Wednesday market has plenty of cheesemongers, so under the metro tracks I went (I’m right by the elevated portion of the #6 line), and a cheesemonger I found.  Bingo: lots of bottles of milk.  I got in line.

The two most common complaints that I hear from Americans who have visited Paris:

  1. Nobody there speaks English!
  2. I tried and tried to speak French with them, but everybody just answered me in English…

Contradictory, right?  How can they both be impressions that are shared by so many people?  Seriously–I can’t tell you how many times I’ve heard both of these complaints.  Actually, they both reflect the same truth: that what determines the language that people will use with you here is super-complicated.  Briefly: you have to think about which language will be used in the context of every single interaction that you have.  That interaction takes place with specific people trying to do specific things under a specific amount of pressure.  Those people come into those interactions with specific amounts of background in the two languages, and with specific amounts of tolerance for embarrassment.  One of the implications of this complicated interaction is that the same person may use a different language with you in different contexts, and different people may use different languages with you in the same context.  This is so complicated that it will take multiple posts to explain–hence, the title of this post: The two complaints of Americans in France: Part I.

One of the things with the biggest effect on what language people will speak to you here comes from the fact that if you’re a tourist, you’re mostly interacting with people in some sort of customer service role.  The hotel desk clerk, the counter girl at the Monoprix (they’re almost all girls), and most of all, the waiter–these are people who have to deal with a lot of people, and deal with them quickly.  In a situation like this, people will use whatever language they think will be most efficient for interacting with you.  Your efforts to speak French are actually very much appreciated, but if that counter person or hotel clerk thinks that they’ll be able to take care of your needs and move on to taking care of the next person’s needs most quickly in English, then that’s what they’ll speak with you–if they can.  Not everyone here is functional in English (and why would we be??), but if they can, and if they’re in a hurry, they’ll speak English with you if your French isn’t up to a super-efficient interaction.

The lady in line in front of me chez the cheesemonger started asking questions–in English.  It was clearly her native language; it was clear that she was struggling to frame her questions simply and clearly–and slowly; and it was clear that the cheesemonger was not getting it, and was not happy.  A deep breath, eyebrows down, and a worried look on his face.  No problem–I speak English natively and I am passionné du fromage (crazy about cheese), so I jumped into the conversation.  The relative strengths of some bleus were discussed; the significance of Mont d’Or in the cycle of the French year was summarized–the cheesemonger was happy to talk about his wares, as long as he could do it in a language that was shared across both sides of the counter.  Euros were handed over, cheese was handed over in return, and the nice tourists went away, tickled with both the experience and the anticipation of some good cheese-eating.

I asked for, and received, my liter of milk.  On an impulse, I picked up a small St-Félicien. The cheesemonger handed me my bag–and a small, wrapped package.  A little something to thank you for the translation, he said.  Would he have been happy to speak English with these folks, if he could?  More than happy.  Was he worried that these non-French-speaking tourists were going to throw his entire waiting line into disarray?  Absolutely.  Did it all turn out fine, with no hurt feelings on anyone’s part?  Clearly.  A tiny little moment in the cheesemonger’s day, the tourists’ day, and my day–and yet, pretty illustrative of the complexities of the question of who will speak what language to you, under what circumstances.  That waiter who impatiently responds to your carefully-rehearsed-but-nonetheless-halting French in English?  If it weren’t the lunch rush, he might very well be up for having a long conversation with you about the rognons de veau à la sauce moutarde–in your halting French.  But, in the context of a busy lunch hour, he’s going to go with whichever language works out most efficiently for getting your order taken and moving on to the next table.

The small, wrapped package contained a cheese.  Just a little guy–I’ve included my sunglasses and key in the photo to provide some scale.  But, based on what I had ordered, this was a perfect choice–similar to the kind of cheese that he knows I like, ’cause I just bought some (a Saint-Félicien); but, different, in the subtle kinds of ways that lovers of French cheese savor (it’s probably a Saint-Marcellin or a Pélardon (I’ll know when I eat it)).  Scroll down for the English notes.  Sorry, no French notes today–gotta jump on the train to get my convention d’accueil so that I can RENEW MY VISA!  🙂


English notes

cheesemonger, fishmonger, hate-monger, war-monger: English has a number of words that end with -monger.  The basic meaning of this affix is that it is someone who sells something specific.  So, a fishmonger sells fish (there are a few of them in the market under the metro tracks; I understand that if they lop the head off of your fish for you, you’re supposed to tip them a euro), while a cheesemonger sells cheese.

You also see this affix in words referring to people who try to spread something amongst people.  A war-monger is a proponent of war; a hate-monger tries to get people to hate other people.  Scroll down to see examples of all of these in use; be aware that the spelling of these words can be variable with respect to whether or not they’re written as one word, and if they are written as one word, variable as to whether or not it’s hyphenated.

The worst kind of war-monger, for my money–a guy who won’t fight, and whose kids won’t fight, either. (For context: I spent nine and a half years in the US Navy.)
Fisherman buying fish on the way home...!

Computational linguistics–it’s not all beer and pétanque: French text normalization

…and then I start to notice stuff like this going across the screen…
Zipf’s Law: a small number of words occur very frequently, while the vast majority of words occur very rarely–but, they do occur. Credit: @ASvanevik.

Zipf’s Law in brief: languages have a small number of words that occur very, very frequently (English examples: the, I, to, and; French examples: de, la, et, à), and an enormous number of words that occur very rarely–but, they do occur.  Consequence for people trying to learn a second language: every day of your life, you’re going to come across words that you don’t know.
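Want to see that shape for yourself? Here is a minimal Python sketch that counts the words in a text and ranks them by frequency. The sample sentence is made up, and the tokenizer (lowercase, runs of letters only) is a deliberately crude assumption. Point it at a real book and the pattern gets much more dramatic.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, pull out runs of letters, and count them."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)

sample = "the cat and the dog and the bird saw the hangnail"
counts = word_counts(sample)

# Rank 1 is the most frequent word, rank 2 the next, and so on.
for rank, (word, count) in enumerate(counts.most_common(), start=1):
    print(rank, word, count)

# Words seen exactly once: the start of the long tail.
rare = [word for word, count in counts.items() if count == 1]
print(len(rare), "of", len(counts), "distinct words occur only once")
# 5 of 7 distinct words occur only once
```

Even in eleven words, a couple of items (the, and) do most of the work, and most of the distinct words show up exactly once.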

Now, even if you suck at math as badly as I do, Zipf’s Law isn’t that hard to wrap your head around–clearly words like the, to, and and are super-frequent, while words like hangnail and glory are not, and clearly there are hella more words (see the English notes below for an explanation of hella) like hangnail and glory than there are words like the, to, and and.  However: understanding something in the abstract and really feeling it in your gut are two very different things, right?  I mean, you don’t actually expect to get smacked in the face by Zipf’s Law on a regular basis.

And yet…

Translation: ‘”I’m going to do a little fast” versus “I’m going to fuck a young guy”: On the importance of the circumflex accent.’ 3.8 thousand likes, 8.4 thousand retweets, and thanks to Phil dAnge for telling me what it means before I got myself into even MORE trouble with this one than I already had.

Computational linguistics isn’t all drinking beer in Prague and dancing in the park with pretty girls long after midnight in Bulgaria.  In fact, far more of it than anyone would guess is writing computer programs to process your data into a form that would let you actually do something fun with it.  When you’re working with French, a common step in this processing is to remove the accents from all of the letters.  Yes, this seems like heresy.  Yes, this creates ambiguity–jeûne (a fast) becomes jeune (a youth), châsse (reliquary–God, how I love casually dropping that one into a conversation) becomes chasse (hunt), and répéter and repéter both become repeter (Um, Zipf… you have to say répéter (to repeat); otherwise, you’re saying repéter (to fart again).)  But, overall, the increase in ambiguity that comes from the deletion of accents (part of a set of techniques known as normalization of your text) is well-recompensed by the fact that you essentially (and probably counter-intuitively) get rid of a lot of potential errors by your program that way.  For example, I work with biomedical data, so I might want to be able to process something like this:

Mon père a perdu l’ordonnance de médicament.

…and alert someone that this guy needs attention.  Seulement voilà–the thing is–if I expect everything to be spelled correctly, then I’ll miss this:

Mon père a perdu l’ordonnance de medicament.

…and this:

Mon pere a perdu l’ordonnance de médicament.

…and this:

Mon pere a perdu l’ordonnance de medicament.

…and you don’t want to miss the fact that some little old guy (let’s say, me, but less bald and fat) has lost his prescription, right?  So, if you’re working with French, you will probably–quite early in your processing–remove all of the accents from everything.  Once you’ve done that, all four of the previous sentences look the same to the computer program, and you only have to write enough code to deal with one of them.
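For the curious, here is one common way to do that accent removal, sketched in Python with the standard library's unicodedata module. It is an illustration of the general technique, not the code that I actually ran. The function name is made up.

```python
import unicodedata

def strip_accents(text):
    """Split each accented letter into base letter + accent mark (NFD),
    then throw away the accent marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

variants = [
    "Mon père a perdu l’ordonnance de médicament.",
    "Mon père a perdu l’ordonnance de medicament.",
    "Mon pere a perdu l’ordonnance de médicament.",
    "Mon pere a perdu l’ordonnance de medicament.",
]

normalized = {strip_accents(v) for v in variants}
print(normalized)
# {'Mon pere a perdu l’ordonnance de medicament.'}

# The price you pay: the fast and the youth are now the same word.
print(strip_accents("jeûne") == strip_accents("jeune"))  # True
```

Four spellings go in, one comes out, which is the whole point. The last line shows the ambiguity that you accept in exchange.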

So, the other day I’m sitting in the lab getting ready for a meeting the next morning, and I’m working with French medical data, and I realize that I need some non-medical text so that I have something with which to compare my medical data.  (Something to compare my medical data with, in spoken English.)  I find some French novels available for free. I start running them through my program….and then I start to notice stuff like this going across the screen:

s’est-il \xe9cri\xe9, et il m’a saut\xe9 au cou. Les autres aussi m’ont embrass\xe9

Not intelligible (at least to a human), but not shocking, either.  Here’s the thing: in their guts, computers were made to deal with the English alphabet, and only the English alphabet.  Consequently, if you want to work with other languages, you have to find ways to “encode” their letters into a form that the computer can deal with.  Those strings of backslashes, numbers, and weird letters are one way of doing that “encoding.”
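A small Python sketch of what is going on with \xe9, for illustration only. In the Latin-1 (ISO-8859-1) encoding, the single byte E9 is the letter é. Whether the novels that I downloaded were actually stored as Latin-1 is an assumption here. The sketch only covers the short sequences like \xe9, and the sample line is adapted from the one above with a plain apostrophe.

```python
# The bytes as they might sit in the file: \xe9 is one byte, value 0xE9.
raw = b"s'est-il \xe9cri\xe9, et il m'a saut\xe9 au cou."

# Read as Latin-1, every 0xE9 byte turns back into an e-acute.
text = raw.decode("latin-1")
print(text)  # s'est-il écrié, et il m'a sauté au cou.

# Read as UTF-8, the same bytes are not valid text at all, which is
# one way to end up staring at backslashes instead of letters.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

Telling the program which encoding to expect is another route to readable text. It is not the route taken in this post, which goes for the quick search-and-replace.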

Thelenota ananas, a species of sea cucumber. Credit: By Leonard Low from Australia – Flickr, CC BY 2.0.

So: what to do?  The easy way out of this would be to replace all of those strings with the appropriate unaccented character.  Easy enough: saut\xe9 seems likely to have been sauté once upon a time, and embrass\xe9 was probably embrassé, so replace all of the sequences \xe9 with e, and you’re good to go.  Not nearly as glamorous as eating sea cucumbers in Hangzhou, but not super-difficult, either.

Then something like this flashes across your screen:

…l’abbaye de la Chapelle poss\xfc\xbe\x8e\x86\x94\xbcdait deux bergeries à Oye, et recevait la d\xfc\xbe\x99\xa6\x94\xbcme de…

With d\xfc\xbe\x99\xa6\x94\xbcme, we are clearly deep into the “long tail” of the Zipfian distribution.  But, being obsessive memorizers of the at-minimum 10 words a day that we come across in our daily lives but don’t know, and being in the midst of Les Misérables, the word dîme (tithe) comes to mind immediately, and seeing the word abbaye (abbey) in the vicinity is all that we need to confirm it.  So, now we know that \xfc\xbe\x99\xa6\x94\xbc should be replaced with i, and we go on about our business.


So, yes: Zipf’s Law is a thing–your day is full of words that are very, very rare, but that do occur.  And, yes: obsessive memorization of French vocabulary will occasionally get you out of a tight spot at work.  And, yes: computational linguistics is not all drinking beer in Prague and dancing in the park with pretty girls in Bulgaria long past midnight–but, when it is, it is so, so good!  And, when it’s not–it’s hella better than digging ditches!

English notes

hella: an extremely casual intensifier, it means “very” or “a lot of,” depending on whether it’s modifying an adjective or a noun.

Equivalent to very:

hella goofy means very goofy

hella cute means very cute

Equivalent to a lot of:

hella people means a lot of people

hella times means a lot of times, many times

Wanna try this at home?  Here’s the code–copy and paste it into an .Rmd file.  Don’t know what an .Rmd file is?  Don’t try this at home.

---
title: "GutenbergR demo"
author: "KBC"
date: "9/25/2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#install.packages("gutenbergr") # only need to do this the first time that you run the script
library(gutenbergr)

# NOTE: the variable names french_books and books are reconstructions;
# the names in the original listing did not survive.

# get list of French-language books
# (alternative: gutenberg_metadata %>% dplyr::filter(language == "fr"))
french_books <- gutenberg_works(languages = "fr", only_text = TRUE,
 rights = c("Public domain in the USA.", "None"), distinct = TRUE,
 all_languages = FALSE, only_languages = TRUE)

# retrieve the contents of those books
# books <- gutenberg_download(french_books$gutenberg_id) # everything on the list
books <- gutenberg_download(c(796, 799)) # 796: Stendhal, 799: Verne

# fix at least some of the characters
books$text <- gsub('\xe9', 'e', books$text)
books$text <- gsub('\xfc\xbe\x8d\xb6\x94\xbc', 'c', books$text)
books$text <- gsub('\xfc\xbe\x8e\x86\x94\xbc', 'e', books$text)
books$text <- gsub('\xe0', 'a', books$text)
books$text <- gsub('\xfc\xbe\x98\x96\x94\xbc', 'e', books$text)
books$text <- gsub('\xfc\xbe\x98\xa6\x98\xbc', 'u', books$text)
books$text <- gsub('\xfc\xbe\x8d\x86\x98\xbc', 'o', books$text)
books$text <- gsub('\xfc\xbe\x99\xa6\x94\xbc', 'i', books$text)
books$text <- gsub('\xfc\xbe\x8e\x96\x98\xbc', 'u', books$text)
books$text <- gsub('\xfc\xbe\x99\x96\x94\xbc', 'i', books$text)
books$text <- gsub('\xfc\xbe\x8c\xa6\x94\xbc', 'a', books$text)
books$text <- gsub('\xfc\xbe\x98\xa6\x84\xbc', '--', books$text)
books$text <- gsub('\xfc\xbe\x98\xa6\x88\xbc', '', books$text)
books$text <- gsub('\xfc\xbe\x8e\x96\x8c\xbc', '', books$text)
```

How can you be 17 years old and 4 months old?

Humans are so good at “resolving” ambiguities that they usually don’t even notice them. Computers, though–computers have no such abilities, unless their designers give them to them.


One of the properties of every known human language is that it is ambiguous.  Being “ambiguous” means that something can have more than one interpretation.  Humans are so good at “resolving” ambiguities (i.e., figuring out the intended interpretation) that we rarely notice them, but in fact almost everything that you will hear/read or say/write today will be ambiguous in some way or another.

Humans are indeed quite good at resolving ambiguities.  If you want to get a computer program to do anything whatsoever with language, though, you have to give it the ability to deal with ambiguity–computer programs are just as incapable of ignoring ambiguity as humans are capable of resolving it.  So, one of my standard exercises for students in natural language processing (treatment of language by computers) courses is to have them go through some texts and find the ambiguities.  I typically have them do that with cartoons, since their humor is often based on playing with ambiguities.  Tomorrow, though, I’ll be teaching at the EUROLAN “summer school” on biomedical natural language processing, so I feel obligated to give the students a biomedical example.  Here’s what it’ll be.  It’s a text that would be completely typical in a health record (but it is not from an actual patient).  I read through it until I found 10 ambiguities, and then stopped–so, you should be able to find at least 10 points of ambiguity here–in just the first two sentences:

CLINICAL HISTORY: This prolonged video/EEG was performed on a 17 year and 4 month-old female.  This study was done to completion of Phase I surgical evaluation

TECHNICAL SUMMARY: The patient underwent…

Now, if you’re a normal human, you will not, in fact, be able to find 10 ambiguities in this text–we just don’t notice them, for the most part.  And that, in fact, is the point of the exercise.  I’ll follow the exercise with an illustration of those 10 points of ambiguity, many–or most–of which the students won’t have noticed.  Their computer programs, though–their computer programs won’t be able to miss them, and it’s their very ubiquity that beginning researchers need to have pounded into their heads.

See how many you can come up with, and then watch this space for the (or, at least, some) answers!