Matching game IV: Zipf’s Law in French

Zipf’s Law is why if someone is looking for a web page and types “dogs in marseilles” into the query box, your search engine should pay no attention to the word “in,” some attention to “dogs,” and quite a bit of attention to “marseilles.” 

Zipf’s Law describes the frequencies of words: there is a very, very small number of words that occur very, very often, and a very, very large number of words that occur very, very rarely, but they do occur. This blog is focused on one of the consequences of Zipf’s Law: if you are seriously studying a second language, you are going to run into words that you don’t know every day for the rest of your life.
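If you'd like to see that shape for yourself, here is a minimal sketch in Python. The sentence in `TOY_TEXT` is something I made up for illustration, and `rank_frequency` and `hapax_share` are hypothetical helper names, not anything from a library. Zipf's Law is usually stated as a word's frequency being roughly inversely proportional to its rank; a text this short can't demonstrate that, but it already shows the lopsidedness: one word does a lot of the work, and most words show up exactly once.

```python
from collections import Counter

# Made-up toy text, for illustration only; a real experiment needs a large corpus.
TOY_TEXT = (
    "the dog saw the cat and the cat saw the bird "
    "in the old garden of the quiet village"
)

def rank_frequency(text):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]

def hapax_share(text):
    """Return the fraction of distinct words that occur exactly once."""
    counts = Counter(text.lower().split())
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(counts)

for rank, word, count in rank_frequency(TOY_TEXT)[:3]:
    print(rank, word, count)
print("share of words seen only once:", hapax_share(TOY_TEXT))
```

Swap in a long text in the language you're studying and the share of once-only words is the part that matters for learners: those are the words you keep running into for the first time.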


You know how the matching game works: we have words in English, words in French, and we match them. Today’s words (and a tiny bit of grammar) are taken from the discussion of Zipf’s Law in the second edition of Recherche d’information: Applications, modèles et algorithmes, by Massih-Reza Amini and Éric Gaussier. Recherche d’information is information retrieval, the task of finding documents in response to an information need: what Google does for you every day.

One of the great embarrassments of linguistics is that information retrieval is mostly about language, in the sense that what you’re looking for is mostly web pages with stuff written on them, and you use words to find them. And yet most of the work of information retrieval is done without anything that looks very much like doing something with language. At its heart, the technology of information retrieval is almost entirely counting and very simple arithmetic: nothing linguistic there. You could think of that very simple arithmetic as taking advantage of Zipf’s Law. It is what figures out that if someone is looking for a web page and types dogs in marseilles into the query box, your search engine should pay no attention to the word in, some attention to dogs, and quite a bit of attention to marseilles when it decides which web pages to put at the top of the search results.

Scroll down to find today’s vocabulary items, and click on the pictures of the relevant pages from Amini and Gaussier’s book if you’d like to see those words in context. As for me: a second cup of coffee, go over these flashcards, and then off to the lab. Today’s goal: explain why researchers calculated the ratio of vocabulary size to length of conversation of a bunch of soldiers, after chasing them through the woods, catching them, depriving them of food and sleep, and then interrogating them.
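To make the dogs in marseilles example concrete, here is a sketch of one common piece of that counting and simple arithmetic: inverse document frequency (IDF), the logarithm of the number of documents divided by the number of documents that contain the word. I'm not claiming this is the exact formula Amini and Gaussier use. The four "web pages" in `PAGES` are invented for illustration, `idf` is a hypothetical helper name, and returning 0.0 for a word that appears nowhere is my own assumption.

```python
import math

# Invented toy collection, for illustration only.
PAGES = [
    "dogs in the park",
    "cats in the house",
    "dogs in marseilles",
    "birds in the garden",
]

def idf(term, pages):
    """Inverse document frequency: log(total pages / pages containing term)."""
    containing = sum(1 for page in pages if term in page.lower().split())
    if containing == 0:
        # Assumption: a word found in no page gets no weight at all.
        return 0.0
    return math.log(len(pages) / containing)

for word in "dogs in marseilles".split():
    print(word, round(idf(word, PAGES), 3))
```

In this toy collection, in appears on every page and gets a weight of zero, dogs appears on half of them and gets a middling weight, and marseilles appears on only one and gets the highest weight of the three. No grammar, no meaning: just counting and a logarithm.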


I included La fréquence du second mot because I’ve been trying to understand when to use second and when to use deuxième. If I understand the Académie’s Dire/Ne pas dire page correctly, the Académie would prefer that this be deuxième, since a list of words ranked by frequency goes well beyond two items, but not even the Académie thinks that it’s mandatory to make the distinction:

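On peut, par souci de précision et d’élégance, réserver l’emploi de second aux énoncés où l’on ne considère que deux éléments, et n’employer deuxième que lorsque l’énumération va au-delà de deux. Cette distinction n’est pas obligatoire.

(In English: One may, for the sake of precision and elegance, reserve second for statements in which only two items are considered, and use deuxième only when the enumeration goes beyond two. This distinction is not obligatory.)

On veillera toutefois à employer l’adjectif second, plus ancien que deuxième, dans un certain nombre de locutions et d’expressions où il doit être préféré : seconde main, seconde nature, etc., et dans des emplois substantivés : le second du navire.

(In English: One should nevertheless take care to use the adjective second, which is older than deuxième, in a number of set phrases and expressions where it is to be preferred: seconde main (second-hand), seconde nature (second nature), etc., and in uses as a noun: le second du navire (the ship’s second-in-command).)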

academie-francaise.fr/second-deuxieme

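As the CarriereOnline.com web site puts it: C’est pour cela qu’on parle de la Seconde Guerre mondiale parce qu’on espère qu’il n’y en aura pas de troisième ! (That is why we speak of the Seconde Guerre mondiale, the Second World War: because we hope that there won’t be a third one!)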
