Internet search with highly inflected languages

I tend to wake up early.  One of the things that I do with my free time in the morning is study French.  Today I was flipping through a dictionary and ran across the word thésauriser.  “How nice,” I thought—“a verb about putting things in a thesaurus.”  Wrong—thésauriser is an intransitive verb meaning “to hoard money,” or a transitive verb meaning “to hoard.”

Having come across such a weird and wonderful verb, I wondered whether it actually gets used.  Being a 21st-century corpus linguist, I turned to Google.
One of the huge shifts in corpus linguistics, loosely definable as the study of language using naturally occurring data (as opposed to, say, making up illustrative sentences from your own head), is the availability of huge amounts of searchable texts on the Internet, typically via Google.

One of the “problems” with using Google as a corpus search tool these days is that it will often give you links to definitions of a word, rather than examples of actual usage.  So, the first couple of pages of results were links to definitions.  However, this list also included a link to dico-proverbes.com, a web site that gives you proverbs that use your word of interest.  Here I found Qui sait économiser, sait thésauriser, and A thésauriseur, héritier gaspilleur.  Translating a proverb is risky in any language, but even without a translation, we can note that the verb is, indeed, intransitive in these uses: you don’t explicitly use the word “money.”  (If any bilingual French/English speakers would like to chip in translations in the comments, it would be much appreciated!)

One of the first links is actually to a French Wikipedia entry for the nominalization thésaurisation, which is explained as a technical term in economics: La thésaurisation est un terme technique économique décrivant une accumulation de monnaie pour en tirer un profit ou par absence de meilleur emploi, et non par principe d’économie ou d’investissement productif.

The site 1mot.fr informs me that it is a valid French Scrabble word, and gives a long list of inflections that can be added to it, as well as a list of smaller valid Scrabble words that can be made from its letters.

Yet another site, dico-rimes.com, gives a list of words that rhyme with it.

If you’ve been following the links so far, you’ve noted that in my attempts to find actual examples of usage, as opposed to metalinguistic information, I’ve been drifting away from the infinitive thésauriser and searching for inflected forms–mostly unsuccessfully (in that I’m still getting metalinguistic stuff).  So, how do I find actual examples of usage?  I tried searching on inflected forms that I thought might have lower frequencies and/or be less likely to be headings for their own entries in sites like the proverb-finding, rhyme-finding, definition-finding, etc. sites.  The third person plural present tense thésaurisent got me to a YouTube video titled Ceux qui THESAURISENT DETRUISENT l INTERET GENERAL et ce n est PAS BANAL (“Those who hoard, destroy the public good, and that’s not trite”–in French, it sort of rhymes). My French is definitely not good enough to understand the poor audio on the videotaped monologue (which I’m not actually sure is entirely in French), so I can’t say whether it’s insightful, or insane ramblings. A link to a footnote in The papers of Alexander Hamilton on Google Books finds me Cet example frappant peut s’appliquer à tous les hommes industrieux, depuis l’artiste célebre ou le chef de manufacture qui thésaurisent peut-être dix milles francs chaque année, jusqu’à l’artisan grossier qui n’épargne qu’un écu.   Note that here it’s not intransitive, and a specific amount of money is named.
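For anyone who wants to try the same trick more systematically, here is a minimal sketch of generating inflected forms to search on, rather than typing them one at a time.  It assumes a regular first-group (-er) verb like thésauriser, covers only three tenses, and assumes a search engine that accepts quoted terms joined with OR.  The function names are made up for illustration.

```python
# Endings for a regular French -er verb. This is an illustrative subset
# (present, imperfect, future), not a full conjugation table.
ER_ENDINGS = {
    "present": ["e", "es", "e", "ons", "ez", "ent"],
    "imperfect": ["ais", "ais", "ait", "ions", "iez", "aient"],
    "future": ["erai", "eras", "era", "erons", "erez", "eront"],
}


def inflect_er_verb(infinitive, tenses=("present", "imperfect", "future")):
    """Return the distinct inflected forms of a regular -er verb.

    Hypothetical helper: it does not handle spelling changes
    (e.g. verbs in -ger or -cer) or irregular verbs.
    """
    if not infinitive.endswith("er"):
        raise ValueError("expected an infinitive ending in -er")
    stem = infinitive[:-2]
    forms = []
    for tense in tenses:
        for ending in ER_ENDINGS[tense]:
            form = stem + ending
            if form not in forms:  # some persons share a form
                forms.append(form)
    return forms


def build_or_query(forms):
    """Join forms into a query string of quoted terms separated by OR."""
    return " OR ".join(f'"{form}"' for form in forms)


forms = inflect_er_verb("thésauriser")
print(build_or_query(forms))
```

The infinitive itself is deliberately left out of the list, since that is the form most likely to pull up dictionary entries rather than usage.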

So, I’ve been investigating this verb for about 45 minutes, and still have only found two actual examples of usage on Google.  Any suggestions on Googling in French would be much appreciated!  (I should add that the Text Retrieval Conference sponsored a set of experiments on information retrieval in Spanish some years ago, and if memory serves, they concluded that the techniques that work for not-so-highly-inflected English work just as well on fairly-highly-inflected Spanish.  However, information retrieval is not corpus linguistics.)

I wish I could say that this has led to some hilarious misunderstandings

Edith Piaf was probably the most famous French singer of all time, at least outside of France.  It turns out that France is definitely the best place to buy her music–I picked up a 3-CD set at Les Invalides for ten euros.  Not surprisingly, Zipf’s Law strikes in Edith Piaf’s lyrics as much as anywhere else–I wasn’t able to get past the first line of her famous La vie en rose without consulting a dictionary.

According to the first French-language lyrics web site that I found, the first line of La vie en rose is as follows:

Des yeux qui font baiser les miens

This didn’t quite sound right to me.  The word that I thought I heard when I listened to the lyrics was not baiser, but baisser.  (The single s is pronounced z.  The double ss is pronounced s.)  When I looked up baisser, I discovered that it means “to lower.”  This made sense, in context–the first line would then mean Eyes that make mine lower (themselves).  What didn’t make quite as much sense–although, crucially, it wasn’t like it made no sense at all–was what the lyrics web site said: baiser. 

Baiser is a word that has always thrown me off.  As we learn in school, baiser means “to kiss.”  However, as I learned when I got to France, it also means “to fuck.”  I wish that I could say that this ambiguity has led to some hilarious misunderstandings in my life, but: no such luck.  Yet.

Zipf’s Law and the French alphabet

At no point in my college French 101 class did we learn the alphabet.  This turned out to be a problem when I got to France and needed to spell things (like, say, my name).

There are a number of French alphabet songs available on YouTube.  This one, from www.learnfrenchlab.com, is my current favorite.  The words go something like “I want the X, the X, the X of Y,” where “X” is a letter and “Y” is a noun or phrase that starts with that letter: “I want the A, the A, the A of ananas (pineapple).”  I learned some things from this song.  For example, in France I learned by necessity that K is “ka,” but I had no clue that J is “zhee.”

The song is an interesting example of how much phonology and orthographic variability we have to get kids to ignore in order for them to learn to read.  (That’s equally true in English, of course.)  For example: the name of the letter E in French is not even spellable in English.  We don’t have the sound at all (and I struggle with it constantly, both in terms of producing it and in terms of hearing it).  However, in the song, it’s illustrated with éléphant (elephant), in which it’s pronounced completely differently (like the “long” A in English).  The name of G is “zhey,” but the word illustrating G is grenouille (frog), in which the G is pronounced like the G in “green.”  You get the picture.

Of course, Zipf’s Law strikes in the alphabet song as much as anywhere else: there is nothing simple about children’s language.  (As my French tutor once pointed out, the passé simple is indeed used in French children’s books, even though it is obscure enough that we’re not taught it in French class and, as far as I know, it is no longer used in the spoken language.  As a result, she can’t use children’s books in the high school French classes that she teaches.)  Here are words that I either didn’t know, or didn’t know the gender of:

  • un ananas: pineapple.
  • la banane: banana.
  • le crocodile: crocodile.  (Are you noticing a pattern here?  Zipf’s Law strikes three times before I get past A, B, and C!)
  • le dauphin: dolphin.  (You probably know this word from history class as “heir apparent,” as did I—I had no idea that it also meant “dolphin.”)
  • la grenouille: frog.
  • un hippopotame: hippopotamus.
  • le kangourou: kangaroo.
  • le nounours: teddy bear.
  • le rigolo/la rigolote: joker, funny person; clown (pejorative); also an adjective—funny, amusing, bizarre, weird.
  • le serpent: snake.
  • le/la trompette: this is an interesting one, as the meaning varies by gender.  La trompette (feminine) is a trumpet.  Le trompette (masculine) is a trumpeter.
  • le zoo: zoo (but, it’s pronounced completely differently, of course).

That’s twelve new words, just to learn the alphabet.  Zipf’s Law will get you every time.