Matching game IV: Zipf’s Law in French

Zipf’s Law is why if someone is looking for a web page and types “dogs in marseilles” into the query box, your search engine should pay no attention to the word “in,” some attention to “dogs,” and quite a bit of attention to “marseilles.” 

Zipf’s Law describes the frequencies of words: there is a very, very small number of words that occur very, very often, and a very, very large number of words that occur very, very rarely, but they do occur.  This blog is focused on one of the consequences of Zipf’s Law: if you are seriously studying a second language, you are going to run into words that you don’t know every day for the rest of your life.
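If you’d like to see that shape for yourself, here is a minimal sketch in Python.  The sentence in it is a toy example that I made up for illustration, not real data, and a text this tiny can only hint at the pattern; paste in any longer text that you have on hand to see it more clearly.

```python
from collections import Counter

# Toy text, invented for illustration only; swap in any longer text.
text = (
    "the dog saw the cat and the cat saw the dog "
    "in the park near the old harbor of marseilles"
)

# Count how often each word occurs, then rank from most to least frequent.
counts = Counter(text.split())
ranked = counts.most_common()

# Words that occur exactly once: the long tail of rare words.
hapaxes = [word for word, count in ranked if count == 1]
top_word, top_count = ranked[0]

print(f"{len(text.split())} words in all, {len(counts)} distinct")
print(f"most frequent: {top_word!r}, {top_count} times")
print(f"{len(hapaxes)} words occur only once: {hapaxes}")
```

Even here, one word accounts for a big share of the text, while most of the distinct words show up exactly once.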

You know how the matching game works: we have words in English, words in French, and we match them.  Today’s words (and a tiny bit of grammar) are taken from the discussion of Zipf’s Law in the book Recherche d’information: Applications, modèles et algorithmes, by Massih-Reza Amini and Éric Gaussier, second edition.  Recherche d’information is information retrieval, the task of finding documents in response to an information need: what Google does for you every day.

One of the great embarrassments of linguistics is that information retrieval is mostly about language, in the sense that what you’re looking for is mostly web pages with stuff written on them, and you use words to find them.  And yet most of the work of information retrieval is done without anything that looks very much like doing anything with language.  At its heart, the technology of information retrieval is almost entirely counting and very simple arithmetic: nothing linguistic there.  You could think of that very simple arithmetic as taking advantage of Zipf’s Law.  It is used to figure out things like the fact that if someone is looking for a web page and types dogs in marseilles into the query box, your search engine should pay no attention to the word in, some attention to dogs, and quite a bit of attention to marseilles when it decides which web pages to put at the top of the search results.

Scroll down to find today’s vocabulary items, and click on the pictures of the relevant pages from Amini and Gaussier’s book if you’d like to see those words in context.  As for me: a second cup of coffee, go over these flashcards, and then off to the lab.  Today’s goal: explain why researchers calculated the ratio of vocabulary size to length of conversation of a bunch of soldiers, after chasing them through the woods, catching them, depriving them of food and sleep, and then interrogating them.
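For the curious, here is a rough sketch of the kind of counting and very simple arithmetic that gets a search engine to ignore in and care about marseilles.  One common version is called inverse document frequency (IDF): the fewer pages a word appears on, the more it counts.  I am assuming that formula for illustration; I am not claiming that it is the exact one in the book or the one that any particular search engine uses.  The four "web pages" and the helper name idf are made up.

```python
import math

# Four toy "web pages", invented for illustration only.
pages = [
    "dogs in the park",
    "cats in the house",
    "dogs in marseilles",
    "restaurants in paris",
]

def idf(word, docs):
    """Inverse document frequency: log(total pages / pages containing the word)."""
    n_containing = sum(1 for doc in docs if word in doc.split())
    if n_containing == 0:
        return 0.0  # word never seen; this sketch simply gives it no weight
    return math.log(len(docs) / n_containing)

# Weight each word of the query by how rare it is across the pages.
query = "dogs in marseilles"
weights = {word: idf(word, pages) for word in query.split()}

for word, weight in weights.items():
    print(f"{word:12s} {weight:.2f}")
```

Since in shows up on every page, its weight is log(4/4) = 0: no attention at all.  The word dogs is on half of the pages and gets some weight, and marseilles, which is on only one page, gets the most.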


I included La fréquence du second mot because I’ve been trying to understand when to use second and when to use deuxième.  If I understand the Académie’s Dire/Ne pas dire page correctly, the Académie would prefer that this be deuxième, but not even the Académie thinks that it’s mandatory to make the distinction:

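On peut, par souci de précision et d’élégance, réserver l’emploi de second aux énoncés où l’on ne considère que deux éléments, et n’employer deuxième que lorsque l’énumération va au-delà de deux. Cette distinction n’est pas obligatoire.  (In English: one may, for the sake of precision and elegance, reserve second for statements in which only two elements are considered, and use deuxième only when the enumeration goes beyond two. This distinction is not obligatory.)

On veillera toutefois à employer l’adjectif second, plus ancien que deuxième, dans un certain nombre de locutions et d’expressions où il doit être préféré : seconde main, seconde nature, etc., et dans des emplois substantivés : le second du navire.  (In English: one should nevertheless take care to use the adjective second, which is older than deuxième, in a certain number of set phrases and expressions where it is to be preferred: seconde main, seconde nature, etc., and in its uses as a noun: le second du navire.)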

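As the web site puts it: C’est pour cela qu’on parle de la Seconde Guerre mondiale parce qu’on espère qu’il n’y en aura pas de troisième !  (In English: that is why we speak of the Seconde Guerre mondiale, because we hope that there won’t be a third one!)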

