Estimate your vocabulary size


Want to see an application of Zipf’s Law?  Go to this web site, where you can get an estimate of your vocabulary size in any of 21 different languages.  I don’t know the details of how they come up with these estimates, but as an indicator of their accuracy (or lack thereof), I can tell you that my percentile placement on their English-language test was the same as my percentile placement on the GRE (the exam that you take if you want to go to graduate school in the United States).  It estimated my French vocabulary size at just over 7,600, which seems reasonable.  I would guess that I’ve been learning about 3,000 words a year for almost three years, of which I probably forget about a third due to not running into them a second time (Zipf’s Law: 50% of the language that you will run into today consists of words that occur only very rarely–but that do, indeed, occur), which would work out to just under 6,000.  Add in another 500 for the one semester of French that I took in college (“university” to you French-speakers in the audience) and you get about 6,500, which is within 15% or so of their estimate.  (According to the web site, this lands me in the top 44% or so–of whom?  No clue.)

https://www.arealme.com/vocabulary-size-test/en/
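
If you want to check that back-of-the-envelope arithmetic yourself, here is a minimal sketch in Python.  The variable names are mine, and the inputs are just the round numbers from the paragraph above; nothing here reflects how the web site actually computes its estimate.

```python
# Round numbers taken from the paragraph above; the names are illustrative.
words_per_year = 3000       # rough learning rate
years = 3                   # "almost three years", rounded up
retained_fraction = 2 / 3   # about a third gets forgotten
college_semester = 500      # one semester of college French
site_estimate = 7600        # the web site's figure, rounded down

own_estimate = words_per_year * years * retained_fraction + college_semester
gap = (site_estimate - own_estimate) / site_estimate

print(f"own estimate: {own_estimate:.0f} words")
print(f"gap relative to the site's figure: {gap:.1%}")
```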

How would you use Zipf’s Law to do this kind of estimate?  Remember what the curve described by Zipf’s Law looks like:

[Image: Zipf’s Law curve, from https://medium.com/@ASvanevik/how-i-learned-german-in-30-days-df7b7ff85654]
Zipf’s Law: a small number of words occur very frequently, while the vast majority of words occur very rarely–but, they do occur. Credit: @ASvanevik.
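
For the programmers in the audience, here is a rough sketch of that curve in Python.  It assumes the textbook idealization of Zipf’s Law, in which a word’s frequency is proportional to 1 divided by its rank; real corpora only approximate this.  The function names and the 100-word toy vocabulary are my own inventions for illustration.

```python
def zipf_frequencies(n_words, exponent=1.0):
    """Relative frequency of each rank 1..n_words under an idealized Zipf curve."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_words + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def coverage(frequencies, top_k):
    """Share of all running words accounted for by the top_k most frequent words."""
    return sum(frequencies[:top_k])


frequencies = zipf_frequencies(100)  # toy vocabulary of 100 word types
for k in (1, 10, 50, 100):
    print(f"top {k:>3} words cover {coverage(frequencies, k):.1%} of the text")
```

The point of the toy example is the shape: a handful of high-ranked words account for a large share of what you read, and the long tail accounts for the rest.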

One way to use this to estimate a vocabulary size would be to figure out how far to the right (towards 100) a person can “go,” so to speak.  If someone can’t reliably understand words above a rank of, say, 20, there is a massive number of words that they don’t know.  On the other hand, if someone can reliably understand words in the 90-100 range, their vocabulary is enormous.  How do you turn enormous into an actual number?  I have no clue how they do that–quantifying vocabulary size is hugely difficult, and as far as I know, it’s not possible to do it precisely for anyone, even for very, very young children. The SWAG (scientific wild-ass guess) approach would be to figure out the rank at which you stop recognizing words reliably, and then count the words up to that rank. The Devil would, of course, be in the details–what texts would you use to determine your curve? Load those texts heavily with scientific journal articles about linguistics and someone like me would probably do pretty well–load them heavily with scholarly analyses of metaphors for love in Finnish epic sagas and I would probably do pretty poorly. Use a representative sample, you say? When you figure out how to draw a representative sample of language, notify the linguists, because we sure as hell haven’t figured it out…
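
Here is what that guess might look like as code.  This is a sketch under loud assumptions: I am pretending that we have a frequency-ranked word list cut into bands of 1,000 words, and that we quizzed someone on a few words sampled from each band.  The band size, the 50% threshold, the function names, and the quiz results are all made up, and none of this is a claim about how the web site does it.

```python
def cutoff_estimate(band_results, band_size=1000, threshold=0.5):
    """Count ranks up to the first band where recognition falls below threshold.

    band_results holds (recognized, tested) pairs, most frequent band first.
    """
    known_ranks = 0
    for recognized, tested in band_results:
        if recognized / tested < threshold:
            break
        known_ranks += band_size
    return known_ranks


def weighted_estimate(band_results, band_size=1000):
    """Alternative: credit each band in proportion to the share recognized."""
    return round(sum(band_size * recognized / tested
                     for recognized, tested in band_results))


# Hypothetical quiz: 10 words sampled from each of four 1,000-word bands.
quiz = [(10, 10), (8, 10), (5, 10), (1, 10)]
print(cutoff_estimate(quiz))
print(weighted_estimate(quiz))
```

Note that the two functions disagree on the same quiz, which is the Devil in the details again: the answer depends on what you count as recognizing words “reliably.”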

[Image: book cover]
Source: smile.amazon.com

Want to know some of the many technical details that make quantifying vocabulary size more or less impossible, even in principle?  See pages 22-28 of my colleague Elisabetta Jezek’s book The Lexicon: An Introduction.


English notes 

hugely: an adverb meaning “very.”  Is it English?  It first appeared in the language in the 12th century (along with archangel, asleep, dittany, lion, whoredom, and welkin–how cool is Merriam-Webster’s “Time Traveler” feature, and WTF is dittany??).  Have you ever come across it before?  Quite likely not–here are the relative frequencies of hugely and very:

[Image]
Screen shot from the Google Ngram Viewer.

…but, it’s hard to argue that it’s not part of the language.

Want to see some cool shit?  Click on the version of the graph that you see below.  Do I REALLY think that this is cool?  Yes.  Is that fact related to the shockingly large number of times that I’ve been divorced?  I would imagine so.

[https://books.google.com/ngrams/interactive_chart?content=hugely%2Cvery&case_insensitive=on&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t4%3B%2Chugely%3B%2Cc0%3B%2Cs0%3B%3Bhugely%3B%2Cc0%3B%3BHugely%3B%2Cc0%3B.t4%3B%2Cvery%3B%2Cc0%3B%2Cs0%3B%3Bvery%3B%2Cc0%3B%3BVery%3B%2Cc0]

Palimpsest upon palimpsest

Dear Dr. Zipf,

Good day to you.  My name is [name removed to protect the guilty] and I am a Ph.D. in [field removed to protect the guilty] at [a hospital which shall remain nameless].  I need to learn how to use natural language processing to process the electronic medical record and provide data that can be used for analysis.  As you are an expert in this field I thought I would email you and ask for your assistance.  Are there any books or training courses out there that can help me learn biomedical natural language processing in a few weeks.   Any help you can provide will be greatly appreciated.  Please let me know.

Warmest Regards,

[Name removed to protect the guilty]


In a few weeks… In a few weeks…


Dear Dr. X,

Biomedical natural language processing is super-simple, and I would be surprised if you couldn’t learn it in a few weeks.  You might find this book helpful:

Cohen, Kevin Bretonnel, and Dina Demner-Fushman. Biomedical natural language processing. Vol. 11. John Benjamins Publishing Company, 2014.

Please let me know if I can be of any further assistance.

Warmest Regards,

Beauregard Zipf, PhD


Dear Dr. X,

The doctoral students in our graduate program typically spend five years learning biomedical natural language processing.  Personally, I’ve spent my entire career learning biomedical natural language processing, beginning with spending a number of years as a medic in the military, where I learned the “biomedical” part. I mostly did physiological monitoring–hemodynamics, electrophysiology, stuff like that.  I later got a bachelor’s degree in linguistics (double major in English, actually), as well as a master’s degree in linguistics, and a PhD in linguistics, which is how I picked up the “language” part.  Along the way I learned to program–the hard way, which is to say by making more mistakes than you could possibly imagine, from the painful to the just plain embarrassing.  (That’s the “processing.”) Since then, I’ve spent years trying to figure this stuff out, and I still wouldn’t say that I know very much about it.  But, hey, you’ve got a PhD in [redacted], so, yeah–you should be able to pick this up in a few weeks.  You might find this book helpful:
 

Cohen, Kevin Bretonnel, and Dina Demner-Fushman. Biomedical natural language processing. Vol. 11. John Benjamins Publishing Company, 2014.

 
Warmest Regards,

Beauregard Zipf, Registered Cardiovascular Technologist, Advanced Cardiac Life Support instructor, EMT, PhD


Dear Dr. X,

  • Are there any books or training courses out there that can help me learn biomedical natural language processing

What an interesting question–thank you for bringing it up.  When I Googled the words biomedical natural language processing, the first hit I got was this:

 

Cohen, Kevin Bretonnel, and Dina Demner-Fushman. Biomedical natural language processing. Vol. 11. John Benjamins Publishing Company, 2014.

 

Looks like it might be relevant?

 

Best wishes,

Zipf


Hi, Dr. X,

It’s nice to hear from you.  You might find this book helpful:

 

Cohen, Kevin Bretonnel, and Dina Demner-Fushman. Biomedical natural language processing. Vol. 11. John Benjamins Publishing Company, 2014.

 

Please let me know if I can be of any further assistance.
Best wishes,

 

Zipf


English notes
palimpsest:  “writing material (such as a parchment or tablet) used one or more times after earlier writing has been erased” (Merriam-Webster).  Back in the day, writing was mostly done on parchment, and parchment was expensive, so in the monasteries that preserved much of the ancient writing that we have today, it wasn’t uncommon to scrape the ink off of parchment if you didn’t really care about what was written on it, and write something on it that you did care about.  If you’re lucky, though, today we can recover an earlier text from the impressions that it left behind on the parchment, and there are some texts that are only known from a palimpsest.  Among the examples that Wikipedia lists are most of Cicero’s De republica and the oldest Koranic variant in existence.

Speak to us of drinking, not of marriage


A Basque joke about the alleged difficulty of the Basque language: The Devil wanted to tempt the Basques to sin, so he decided to learn to speak Basque.  He quit after seven years, only having learned the word “no.”  The Devil did better learning Basque than I’ve done learning French, because I still don’t know how to say “no” in French.  My stumbling block: the second clause in a contrast.  My father speaks Portuguese, but I don’t.  We have pinot noirs in Oregon, but not Brouillys.

Jean Girodet’s magisterial Pièges et difficultés de la langue française to the rescue.  According to Girodet, the issue comes up in what he calls ellipticals.  In this situation, he says that literary language tends to prefer non, while the spoken language tends to prefer pas: 

Dans les tours elliptiques, la langue littéraire préfère en général non, la langue familière pas.

He gives these examples:

Non | Pas
Veut-on réformer la société ou non | Qu’il travaille ou pas, je m’en moque !
Il néglige son travail, moi non. | Elle aime le ski, moi pas.
Cette parole est d’un marchand et non d’un prince. | J’irai en voiture, pas à pied.
Il habite une villa, non loin de Cimiez. | Il tient un café, pas loin d’ici.
Il veut créer un art tout nouveau, pourquoi non ? | Partir tout de suite ? Pourquoi pas, après tout.

OK, good so far: you can use either, with non sounding more literary, and pas sounding more casual.  But, why do you occasionally run into both of them together??  Here’s a clear elliptical in Girodet’s sense of the word: the refrain of the song Parlez-nous à boire, “Speak to us of drinking (not of marriage).”  There are many recordings of it available (sometimes with minor differences in the lyrics), but my favorite du moment is this one from the film Southern Comfort.  Lyrics follow, from CajunLyrics.com:

Oh parlez-nous à boire, non pas du marriage
Toujours en regrettant, nos jolies temps passé

Si que tu te maries avec une jolie fille,
T’es dans les grands dangers, ça va te la voler.

Si que tu te maries avec une vilaine fille,
T’es dans les grands dangers, faudra tu fais ta vie avec.

Oh parlez-nous à boire, non pas du marriage
Toujours en regrettant, nos jolies temps passé

Si que tu te maries avec une fille bien pauvre,
T’es dans les grands dangers, faudra travailler tout la vie.

Si que tu te maries avec une fille qu’a de quoi,
T’es dans les grands dangers, tu vas attraper des grandes reproches.
Fameux, toi grand vaurien, qu’a tout gaspillé mon bien
Oh parlez-nous à boire, non pas du marriage.

Source: cajunlyrics.com

Native speakers, can you help this poor, lost anglophone?  (Note: I’m guessing that jolies temps passé should be jolis temps passés, but what do I know?)

My source for the Basque joke: I don’t remember, but it’s probably one of Mario Pei’s many books.  Pei was a linguist who wrote tons of popular-press books about language between the 1930s and the 1970s or so.  Running across one of them in a used bookstore was the first time I ever heard of “linguistics.”  After a lifetime of mostly keeping quiet about my unending obsessions with language, the feeling was like what gay friends have described to me when they first learned that they weren’t the only guys in the world who wanted to have sex with other men.

Just in case you were wondering why your rabbit looks like it does

Black lab, yellow lab, chocolate lab, meth lab.

When I’m in the US, I live in the Wild West, and that means rabbits.  Where there are rabbits, there are probably man-eating rabbits, and I hate them.  So, the chart explaining rabbit coat coloration that you see above intrigued me–to survive the man-eating rabbits, you must be able to spot them, and you can’t always rely on seeing their long, sinister ears protruding from the grass, so you need to know their coat colors.  But, how do those particular genes explain the devilishly sly diversity of color and pattern that you see in the illustration?

For context, let me give you the rundown (as I understand it–bear in mind that I’m a linguist, not a geneticist) on Labrador retrievers:

[Image: black lab, yellow lab, chocolate lab, meth lab]
Picture source: https://goo.gl/Zxb4Dk

  • Labs come in three colors: black, “chocolate,” and yellow.
  • Which color they are is determined by two genes.
  • One gene determines whether your hair is black or “chocolate.”
  • The other gene determines whether or not your hair has any pigment (think of pigment as the molecule that actually has the color) at all.
  • If you have the form of the gene (the “allele”) that allows your hair to have a color, then you will be either black or “chocolate” (assuming that you are a Labrador retriever).
  • If you have the form of the gene (the “allele”) that keeps your hair from having any pigment at all, then regardless of which form of the black-versus-chocolate gene you have, you will be yellow–yellow being what a Labrador retriever hair looks like if it doesn’t have any pigment deposited therein.
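
The rundown above can be sketched as a few lines of Python.  Loud assumptions: the letters B/b and E/e are conventional labels that I am supplying, not something from the list above, and I am assuming that the black form (B) and the pigment-allowing form (E) are the dominant ones, so a dog needs two copies of e to come out yellow.  The function names are mine, and again, I’m a linguist, not a geneticist.

```python
from collections import Counter
from itertools import product


def coat_color(color_gene, pigment_gene):
    """Coat color from two gene pairs, e.g. coat_color("Bb", "Ee")."""
    if "E" not in pigment_gene:   # no pigment deposited: yellow, whatever the color gene says
        return "yellow"
    return "black" if "B" in color_gene else "chocolate"


def offspring_colors(parent1, parent2):
    """Tally coat colors over every combination of alleles from two parents.

    Each parent is a (color_gene, pigment_gene) pair such as ("Bb", "Ee").
    """
    counts = Counter()
    for b1, b2, e1, e2 in product(parent1[0], parent2[0], parent1[1], parent2[1]):
        counts[coat_color(b1 + b2, e1 + e2)] += 1
    return counts


# Two black labs that each carry the chocolate and the no-pigment forms.
print(offspring_colors(("Bb", "Ee"), ("Bb", "Ee")))
```

Two genes with two forms each are enough to get all three coat colors out of a single pair of black parents.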

My point being: you don’t actually need to have a large amount of genetic variability to get a large amount of “phenotypic” variability (in this case, variability in appearance)–actually, very few things are affected by a single gene.  Rather, most traits are affected by a combination of a number of different genes.


OK, so: how do those rabbits come about?  They differ not just in their colors, but in the pattern of those colors.  Here’s a reasonable guess.

The odd data point in that graphic is the Himalayan.  Everybody else is monochrome, but the Himalayan has a color difference between his (I’m pretty sure that rabbits are generically male, probably due to the known viciousness of the man-eating variety–le lapin anthropophage in French, el conejo antropófago in Spanish, Lepus anthropophagos in Latin, I think, but I couldn’t swear to it) extremities and his…well, everything else.

[Image]
A Siamese cat with a baby. Note that the cat is not eating the baby—as far as I know, there is no such thing as a man-eating Siamese cat. Picture source: https://www.youtube.com/watch?v=JAC-s8cWJxQ

You’ve seen that pattern before–in Siamese cats, for instance.  My understanding is that the distribution–lighter towards the center, darker at the extremities–is related to reduced blood flow in said extremities.  The reduced blood flow gives you a reduced temperature, and that has some effect or another on the deposition of pigment.  (As I said, don’t quote me on this–I’m a linguist, not a Siamese cat expert.)  Looking at the rabbit that way, you wonder: OK, dark on the extremities and light on the rest, but which dark?  Which light?  Why doesn’t the rabbit have the same colors as a Siamese cat, for instance?  (Think of the evolutionary advantage for a rabbit who looked like a cat–it would be soooo much easier to get humans to take you in, in which case if you were the man-eating variety of rabbit, you could just gobble those overly-trusting humans right down.)

I went digging around for evidence for this explanation for the coloration patterns in Siamese cats.  I found a few papers on a group of related temperature-sensitive tyrosinase mutations that are associated with eye color differences in a range of Siamese cats and Himalayan mice and a rare mink discovered on a ranch in Nova Scotia–and with albinism in humans. (As an albino, your likelihood of going blind due to a lack of protective pigment in the iris and the retina is high–and that’s why we spend your tax dollars on studies of Himalayan mice.)  I found a paper on a temperature-sensitive tyrosinase mutation in a human with the following: white hair in the warmer areas (scalp and axilla) and progressively darker hair in the cooler areas (extremities) of her body. I haven’t tracked it down to the fur color question in Siamese cats, though.  Still think I just make this shit up?  Here’s the paper on the mink found on the ranch in Nova Scotia.  I mean, yeah, I make up the zombies and the man-eating rabbits–but, the rest of the stuff is “for reals,” as the kids say.

[Image]
Picture source: https://goo.gl/jR2r1Y

Look to the left, look to the right: if the colors in the figure are true to life, the Himalayan rabbit extremities are the color of the rabbit to the left, while the center is the color of the rabbit to the right.  (I am cursed to always remember a scene from an autobiography that I read when I was a kid.  The author has been arrested by the NKVD and finds himself in their notorious Lubyanka prison.  Whenever a prisoner is taken from one room to another, the machine-gun-toting guards intone step to the left, step to the right: attempt to escape.  The NKVD were murderous fuckers, and the threat was entirely believable.  Hence: look to the left, look to the right.)  Likely cause of the pattern of the Himalayan: temperature-dependent pigment deposition gradient of whatever pigment the chinchilla and albino rabbits have or do not have.

Yes, I have been known to spend my Saturday mornings looking for scientific literature on the topic of pigmentation deposition in Siamese cats when I could have been taking a walk in the beautiful fall weather.  This is probably related to why I get divorced so often.  French notes below–no English notes today.


French notes

le dépôt: deposition, in the sense of deposition of a substance.  This seems to be what would be used to talk about pigment deposition.  For example:  La synthèse et le dépôt de mélanines continuent jusqu’à ce que la structure interne ne soit plus visible, on parle alors de mélanosome de stade IV.  (biologiedelapeau.fr)  

le gisement: deposit, in the sense of a deposit of minerals, of archeological finds, and the like.  I haven’t been able to find any examples of it being used in a medical or biological context to refer to deposition of pigments in the skin.

[Image: Épistasie récessive / Épistasie dominante]
The same thing that we saw in Labrador retrievers: one gene for color, one gene for pigment deposition, and you get three kinds of coats. Faute d’orthographe: dépot should be dépôt.  Source: Bernadette Féry, http://slideplayer.fr/slide/181167/

[Image: Exemple 3: sidérose hépatique, accumulation de pigments ferriques]
With the correct spelling dépôt: Deposition of exogenous or endogenous iron. Picture source: http://slideplayer.fr/slide/5124382/, author unknown.

[Image: Fig. 1, gisement non conventionnel à gauche versus gisement conventionnel à droite]
Picture source: Marc Durand, https://goo.gl/3GcDDN.

 

[Image]
Source: Alain Muret.

 

Matching Game III: Zombies and visa renewals

Today’s depressing vocabulary items are brought to you by Olivier Peru and Sophian Cholet’s magisterial bande dessinée Zombies. The non-depressing vocabulary item is a prerequisite for getting my French visa renewed. Don’t think that ANY of these vocabulary items are non-depressing? First World Problems, baby, First World Problems… (First World Problem explained in the English notes below.)

[Images: the ten pictures for the matching game]


English notes

First World Problem: Something that could only count as a problem if the rest of your life is better than that of most people on the planet.  Examples of First World Problems that I’ve had recently:

  • When my long-awaited new iPhone finally showed up, it was the wrong color.
  • The Singaporean noodles in the United lounge came in really small containers–like, two mouthfuls.
  • I didn’t get surclassé (upgraded) on a cross-country flight.

The better I get at distinguishing my First World Problems from real problems, the happier I get, and I’m already the happiest person you know…

Emporter versus emmener

Two ways to say “to take” in French.

[Image]
Source: http://web.fu-berlin.de/phin/phin51/p51t1.htm

I am of the “write about what you DON’T know” philosophy, and I sure as hell don’t know how to speak French.  So: today, here are two words that native speakers of English (say, me) tend to have trouble with in French: emporter  and emmener.   They both can be translated as to take, but they get used in different contexts.

First, I recommend that you check out this video on the topic from the Learn French with Pascal YouTube series.  Pascal’s explanations are always clear, he always has good examples, and he will give you native speaker pronunciations.  For example, emmener can be pronounced with or without the medial e, and he demonstrates both of them.  Scroll down after you’ve watched the video, and I’ll give you a bunch of examples from the Sketch Engine web site.

[https://youtu.be/xrDcv8KIf3o]

Pascal’s take on these two verbs is that you use them as follows:

  •  emmener in a situation where the thing being taken can move on its own.  He lists people and animals as the two kinds of things with which you would use emmener.
  • emporter when the thing that is being moved cannot move on its own–for example, a package.
[Image]
Source: http://olalachamonix.overblog.com/2014/01/amener-emmener-apporter-emporter.html

Let’s see how this holds up in practice.  As we’ll see, it seems to be the case that these are more like heuristics than absolute rules; more probabilistic than deterministic.  In other words: the observations hold true more often than not, but there can be some variability.  To find these examples, I went to the Sketch Engine web site.  It allows you to search multiple corpora (singular corpus)–that is, collections of language that have been analyzed in some way.  I used two of them: the DGT French corpus, which is intended to support translations and therefore gives us English equivalents, and the frTenTen corpus, which contains 9.9 billion words scraped from the Web.  When I got my results back, I randomized their order so that I wouldn’t be biased towards any particular sets of documents.
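
The randomization step can be sketched in a few lines of Python.  This assumes that the search results have already been exported as a plain list of lines; the function name and the sample lines are made up for illustration, and this is not a Sketch Engine feature.

```python
import random


def shuffled_results(lines, seed=None):
    """Return a shuffled copy of the search results, leaving the input untouched.

    Passing a seed makes the order reproducible, which helps if you want to
    come back to the same sample later.
    """
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical exported lines, grouped by source document as they might arrive.
results = ["doc A, hit 1", "doc A, hit 2", "doc B, hit 1", "doc C, hit 1"]
for line in shuffled_results(results, seed=2017)[:3]:
    print(line)
```

Reading from the top of a shuffled list keeps you from looking only at whichever documents happened to come first.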

  • Objet: exemption de l’exigence d’emporter un document de transport et une déclaration du transporteur pour certaines quantités de marchandises dangereuses définies sous 1.1.3.6 (n1). 
    • Subject: Exemption from the requirement to carry a transport document and a shippers’ declaration for certain quantities of dangerous goods as defined in 1.1.3.6 (n1).
    • Comment: these are documents, therefore not capable of moving themselves, therefore emporter.
  • Les voyageurs ne peuvent emporter dans leur bagage à main que des marchandises dangereuses destinées à leur usage personnel ou professionnel. 
    • Only dangerous goods for personal or own professional use are permitted to be carried in hand luggage.
    • Comment: we’re talking about dangerous goods of some sort, and apparently those dangerous goods do not include, say, tigers (which are capable of movement on their own), so: emporter.
  • Et au lieu d’emporter la pizza, j’ai eu envie de manger sur place, pour changer un peu…
    • Comment: it’s a pizza that’s being (or not) transported, therefore emporter.
  • Où est-ce que je nous ai emmenés 
    • Comment: the object pronoun is “us,” therefore the transportees are animate (alive), therefore they are capable of moving themselves, and therefore the verb is emmener.  
  • Indique-moi juste le chemin de ta villa, je t’y emmène.
    • Comment: the thing being taken somewhere can show something, so it is animate and sentient, so it can move on its own, so the verb is emmener.
  • La vie de Caroline est monotone, et sans surprise : chaque matin son père l’emmène à l’école, et le soir une étudiante pas très sympa vient la chercher.
    • Comment: Caroline is human, so she can move on her own, so the verb is emmener.
  • Sécuriser les appâts afin qu’ils ne puissent pas être emmenés par les rongeurs.
    • Secure bait blocks so that they cannot be dragged away by rodents.
    • Comment: I have no clue why this is emmener.  By Pascal’s rule, since the things being moved–les appâts–are not capable of moving themselves, this should be emporter.
  • Le véhicule est alors emmené au moteur jusqu’à l’enceinte de mesure, en utilisant au minimum la pédale d’accélérateur.
    • The vehicle is then driven to the measuring chamber with a minimum use of the accelerator pedal.
    • Comment: maybe this is emmener because a vehicle is capable of moving under its own power (so to speak)?
  • Dans les 5 minutes qui suivent l’achèvement de l’opération de préconditionnement décrite au paragraphe 5.2.1., le capot-moteur est fermé et le véhicule est emmené hors du banc à rouleaux et est parqué dans la zone d’imprégnation.
    • Within five minutes of completing the preconditioning operation specified in paragraph 5.2.1. above the engine bonnet shall be completely closed and the vehicle driven off the chassis dynamometer and parked in the soak area.
    • Comment: another example of emmener with a vehicle.
[Image: Geluck cartoon]
Perhaps “emporter” despite being animate because he’s being carried, rather than moving under his own steam? Source: http://cocochanel58.blogspot.com
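
To put a number on “more probabilistic than deterministic,” here is a small sketch that scores Pascal’s heuristic against the nine corpus examples above.  The labels are my own hand-coding, and one of them is a judgment call that I want to flag loudly: I have marked the vehicles as not able to move on their own, since Pascal lists only people and animals.  Flip that label and the score changes.

```python
def predicted_verb(can_move_on_its_own):
    """Pascal's heuristic: emmener for self-movers, emporter for everything else."""
    return "emmener" if can_move_on_its_own else "emporter"


# (thing being taken, can it move on its own?, verb observed in the corpus)
# The True/False labels are hand-coded assumptions, not corpus annotations.
EXAMPLES = [
    ("document de transport", False, "emporter"),
    ("marchandises dangereuses", False, "emporter"),
    ("pizza", False, "emporter"),
    ("nous", True, "emmener"),
    ("te", True, "emmener"),
    ("Caroline", True, "emmener"),
    ("appâts", False, "emmener"),
    ("véhicule", False, "emmener"),   # judgment call: vehicles coded as inanimate
    ("véhicule", False, "emmener"),
]


def mismatches(examples):
    """Examples where the observed verb differs from what the heuristic predicts."""
    return [(thing, observed) for thing, can_move, observed in examples
            if predicted_verb(can_move) != observed]


if __name__ == "__main__":
    misses = mismatches(EXAMPLES)
    print(f"{len(EXAMPLES) - len(misses)} of {len(EXAMPLES)} examples follow the heuristic")
    for thing, observed in misses:
        print(f"exception: {thing} with {observed}")
```

Nine examples is far too few to conclude anything, but it does show which cases a native speaker would need to explain.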

There are other verbs that refer to taking stuff places–apporter, amener, ramener–but this is about all my little head can handle for one day.  Native speakers: have at it in the Comments section, please!

French spelling errors I


If you’re a computational linguist, the sentence that your boss never wants to hear from you is this: we need to spend six months writing a program to fix the spelling errors in this @#$@#$% data.  And yet: spelling errors or similar sources of unexpected inputs are a problem with every domain of computational science that I’m aware of.  Even super-highly-edited text has some residue of spelling errors and other problems.  For example, back in the days when there were still phone books, even they had a non-zero rate of spelling errors.  Not a high rate–but, not zero, either.

You don’t really believe that even when people are paying really, really, really close attention to how they write, they still screw up?  Read on.  I’ll come right out and admit that I’m not sure what the first word is in the picture below, but that’s not even what I’m talking about…

[Image]