Internet search with highly inflected languages

I tend to wake up early.  One of the things that I do with my free time in the morning is study French.  Today I was flipping through a dictionary and ran across the word thésauriser.  “How nice,” I thought—“a verb about putting things in a thesaurus.”  Wrong—thésauriser is an intransitive verb meaning “to hoard money,” or a transitive verb meaning “to hoard.”

Having come across such a weird and wonderful verb, I wondered whether or not it actually gets used.  Being a 21st-century corpus linguist, I turned to Google.
One of the huge shifts in corpus linguistics, loosely definable as the study of language using naturally occurring data (as opposed to, say, making up illustrative sentences from your own head), is the availability of huge amounts of searchable texts on the Internet, typically via Google.

One of the “problems” with using Google as a corpus search tool these days is that it will often give you links to definitions of a word, rather than examples of actual usage.  So, the first couple of pages of results were links to definitions.  However, this list also included a link to dico-proverbes.com, a web site that gives you proverbs that use your word of interest.  Here I found Qui sait économiser, sait thésauriser, and A thésauriseur, héritier gaspilleur.  Translating a proverb is risky in any language, but even without a translation, we can note that the verb is, indeed, intransitive in these uses: you don’t explicitly use the word “money.”  (If any bilingual French/English speakers would like to chip in translations in the comments, it would be much appreciated!)

One of the first links is actually to a French Wikipedia entry for the nominalization thésaurisation, which is explained as a technical term in economics: La thésaurisation est un terme technique économique décrivant une accumulation de monnaie pour en tirer un profit ou par absence de meilleur emploi, et non par principe d’économie ou d’investissement productif.

The site 1mot.fr informs me that it is a valid French Scrabble word, and gives a long list of inflections that can be added to it, as well as a list of smaller valid Scrabble words that can be made from its letters.

Yet another site, dico-rimes.com, gives a list of words that rhyme with it.

If you’ve been following the links so far, you’ve noted that in my attempts to find actual examples of usage, as opposed to metalinguistic information, I’ve been drifting away from the infinitive thésauriser and searching for inflected forms–mostly unsuccessfully (in that I’m still getting metalinguistic stuff).  So, how do I find actual examples of usage?  I tried searching on inflected forms that I thought might have lower frequencies and/or be less likely to be headings for their own entries in sites like the proverb-finding, rhyme-finding, definition-finding, etc. sites.  The third person plural present tense thésaurisent got me to a YouTube video titled Ceux qui THESAURISENT DETRUISENT l INTERET GENERAL et ce n est PAS BANAL (“Those who hoard, destroy the public good, and that’s not trite”–in French, it sort of rhymes). My French is definitely not good enough to understand the poor audio on the videotaped monologue (which I’m not actually sure is entirely in French), so I can’t say whether it’s insightful, or insane ramblings. A link to a footnote in The papers of Alexander Hamilton on Google Books finds me Cet example frappant peut s’appliquer à tous les hommes industrieux, depuis l’artiste célebre ou le chef de manufacture qui thésaurisent peut-être dix milles francs chaque année, jusqu’à l’artisan grossier qui n’épargne qu’un écu.   Note that here it’s not intransitive, and a specific amount of money is named.
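For what it's worth, the strategy of searching on inflected forms rather than the infinitive can be sketched in a few lines of Python. This is a minimal sketch, not a real conjugator: the function names inflected_forms and build_query are made up for illustration, it only handles regular -er verbs with no stem changes (thésauriser qualifies; manger and commencer would not), and it assumes that the search engine accepts quoted terms joined by OR.

```python
# Endings for regular first-group (-er) French verbs.
# Assumption: the verb has no stem changes, so stem + ending is always valid.
PRESENT_ENDINGS = ["e", "es", "e", "ons", "ez", "ent"]
IMPERFECT_ENDINGS = ["ais", "ais", "ait", "ions", "iez", "aient"]


def inflected_forms(infinitive):
    """Return a de-duplicated list of some inflected forms of a regular -er verb."""
    if not infinitive.endswith("er"):
        raise ValueError("this sketch only handles regular -er verbs")
    stem = infinitive[:-2]
    forms = []
    for ending in PRESENT_ENDINGS + IMPERFECT_ENDINGS:
        form = stem + ending
        if form not in forms:  # several persons share the same written form
            forms.append(form)
    forms.append(stem + "é")    # past participle (masculine singular)
    forms.append(stem + "ant")  # present participle
    return forms


def build_query(forms):
    """Join the forms into one query string of quoted terms separated by OR."""
    return " OR ".join(f'"{form}"' for form in forms)


if __name__ == "__main__":
    print(build_query(inflected_forms("thésauriser")))
```

Running it prints a single query string that can be pasted into a search box. Dropping the forms most likely to be dictionary headwords (the participles, say) before building the query would be in the spirit of what I was doing by hand.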

So, I’ve been investigating this verb for about 45 minutes, and still have only found two actual examples of usage on Google.  Any suggestions on Googling in French would be much appreciated!  (I should add that the Text Retrieval Conference sponsored a set of experiments on information retrieval in Spanish some years ago, and if memory serves, they concluded that the techniques that work for not-so-highly-inflected English work just as well on fairly-highly-inflected Spanish.  However, information retrieval is not corpus linguistics.)

I wish I could say that this has led to some hilarious misunderstandings

Edith Piaf was probably the most famous French singer of all time, at least outside of France.  It turns out that France is definitely the best place to buy her music–I picked up a 3-CD set at Les Invalides for ten euros.  Not surprisingly, Zipf’s Law strikes in Edith Piaf’s lyrics as much as anywhere else–I wasn’t able to get past the first line of her famous La vie en rose without consulting a dictionary.

According to the first French-language lyrics web site that I found, the first line of La vie en rose is as follows:

Des yeux qui font baiser les miens

This didn’t quite sound right to me.  The word that I thought I heard when I listened to the lyrics was not baiser, but baisser.  (The single s is pronounced z.  The double ss is pronounced s.)  When I looked up baisser, I discovered that it means “to lower.”  This made sense, in context–the first line would then mean Eyes that make mine lower (themselves).  What didn’t make quite as much sense–although, crucially, it wasn’t like it made no sense at all–was what the lyrics web site said: baiser. 

Baiser is a word that has always thrown me off.  As we learn in school, baiser means “to kiss.”  However, as I learned when I got to France, it also means “to fuck.”  I wish that I could say that this ambiguity has led to some hilarious misunderstandings in my life, but: no such luck.  Yet.

Zipf’s Law and the French alphabet

At no point in my college French 101 class did we learn the alphabet.  This turned out to be a problem when I got to France and needed to spell things (like, say, my name).

There are a number of French alphabet songs available on YouTube.  This one, from www.learnfrenchlab.com, is my current favorite.  The words go something like “I want the X, the X, the X of Y,” where “X” is a letter and “Y” is a noun or phrase that starts with that letter: “I want the A, the A, the A of ananas (pineapple).”  I learned some things from this song.  In France I learned by necessity that K is “ka,” but I had no clue that J is “zhee.”

The song is an interesting example of how much phonology and orthographic variability we have to get kids to ignore in order for them to learn to read.  (That’s equally true in English, of course.)  For example: the name of the letter E in French is not even spellable in English.  We don’t have the sound at all (and I struggle with it constantly, both in terms of producing it and in terms of hearing it).  However, in the song, it’s illustrated with éléphant (elephant), in which it’s pronounced completely differently (like the “long” A in English).  The name of G is “zhey,” but the word illustrating G is grenouille (frog), in which the G is pronounced like the G in “green.”  You get the picture.

Of course, Zipf’s Law strikes in the alphabet song as much as anywhere else—there is nothing simple about children’s language.  (As my French tutor once pointed out, the passé simple, which we’re not even taught in French class as it’s pretty obscure and not used (that I know of) in the spoken language any more, is indeed used in French children’s books, with the result that my French tutor can’t use children’s books in the high school French classes that she teaches.)  Here are words that I either didn’t know, or didn’t know the gender of:

  • un ananas: pineapple.
  • la banane: banana.
  • le crocodile: crocodile.  (Are you noticing a pattern here?  Zipf’s Law strikes three times before I get past A, B, and C!)
  • le dauphin: dolphin.  (You probably know this word from history class as “heir apparent,” as did I—I had no idea that it also meant “dolphin.”)
  • la grenouille: frog.
  • un hippopotame: hippopotamus.
  • le kangourou: kangaroo.
  • le nounours: teddy bear.
  • le rigolo/la rigolote: joker, funny person; clown (pejorative); also an adjective—funny, amusing, bizarre, weird.
  • le serpent: snake.
  • le/la trompette: this is an interesting one, as the meaning varies by gender.  La trompette (feminine) is a trumpet.  Le trompette (masculine) is a trumpeter.
  • le zoo: zoo (but, it’s pronounced completely differently, of course).

That’s twelve new words, just to learn the alphabet.  Zipf’s Law will get you every time.

French emails of the day: travelling by train on a whim, and updating your operating system

I continue to need to look up at least one word per French email. Goal: be able to read at least one email per day without having to look anything up…

  • coup de tête: a sudden impulse.  When talking about soccer or violence, it’s a head-butt.  A couple of years ago I bought train tickets online, and ever since have gotten emails with sentences like Vous souhaitez partir à petits prix sur un coup de tête?
    Related phrases:
    • donner un coup de tête à qqn: to head-butt.
    • sur un coup de tête: on an impulse, on a whim.
  • dégressif: decreasing gradually.  The train company suggests that I Profitez d’un tarif dégressif en fonction du nombre de passagers voyageant ensemble (jusqu’à 6 passagers).
  • passer en: not sure what this means.  Anyone?  Here’s the sentence, from a work email about upgrading operating systems:
    Si, au retour de vacances, votre ordinateur vous propose de passer en
    Ubuntu 14.04.1, mieux vaut refuser, en attendant le retour de l’expert
    Ubuntu (Olivier) le 8 septembre !

The LIMSI-ILES web page and the greatest French word ever

When in France I work as a visiting researcher in the Groupe Information, Langue Ecrite et Signée (ILES).  I thought it would be nice if I could read my own group’s web page, so I opened up a dictionary and headed on over.  Perusing the page, I saw what must be the best French word ever: retweeté, “re-Tweeted.”  Note that I have no idea whatsoever how to pronounce it.

So, what else would one need to know in order to be able to read the ILES web page?

  • la modélisation: according to WordReference.com, it’s the establishment of a pattern, model, or template. modélisation et traitement automatique de la langue
  • une analyse: analysis; in a medical context, blood test.  leur analyse, leur compréhension ou leur production
  • précis(e): accurate, precise, exact, specific.  Can also be a book, in which case it’s a handbook, manual, or summary.

A day’s random French emails

Zipf’s Law hits me over and over again in emails.  I’m on various and sundry LIMSI and French natural language processing mailing lists, and usually have to look up something or other to read them.  Here are today’s words:

  • la candidature: application.  In politics, it’s candidacy.  An email about a plan to apply to host a conference started with this sentence (how like Zipf’s Law to strike in the very first sentence): Nous continuons à préparer une candidature pour la conférence JEP-TALN 2016.
  • intitulé: entitled, titled.  In the context of talking about a text, it’s a noun meaning “title” or “heading.” 
    Le jeudi 4 Septembre nous recevrons Ryo Nagata qui nous présentera un exposé intitulé: “Après un an — fruits du travail au LIMSI et après”. (Remember exposé from a previous post, also about a talk?)

Only two words ’cause I only have two French emails in my Inbox!

Zipf’s Law as applied to the vocabulary of pizza

When people visit me in Paris, they’re always surprised to see a wide variety of non-French restaurants: Chinese take-out abounds, as do Thai and Indian food.  The same is true here in Guatemala: the restaurants include a Mediterranean place, a Mexican place, a Korean tea house, and some really amazing bakeries.  My first meal in Guatemala this time was at a pizza joint that some of my coworkers like.  Zipf’s Law affects the vocabulary of pizza as much as the vocabulary of anything else.  Here are the words that I had to look up in order to understand the very first, most basic pizza on the list:

  • rodaja: a round slice; also a disk or a caster.
  • albahaca: basil.

There’s an excellent Guatemalan restaurant in town called Tres Tiempos.  I spent an evening there eating tamalitos and a sort of Guatemalan hotdog and looking up the words on the menu.  How could repollo possibly not mean “chicken”??

  • repollo: cabbage.  You probably learned the word col—so did I.  Don’t know where this one comes from.
  • rebozado: battered.

Incidentally, if you’re into language and you’re into food, you will want to check out Dan Jurafsky’s latest book, The Language Of Food: A Linguist Reads The Menu.  Dan received a MacArthur Genius Grant for his fascinating work in natural language processing, and his talk on ketchup at a NAACL meeting is probably most people’s favorite keynote speech ever.  If you start out at smile.amazon.com, you can donate part of the purchase price to Surgicorps.

The hand surgery screening interview in Spanish

The first day of a Surgicorps visit to Guatemala is taken up with screening all of the people who show up hoping for surgery for their children or themselves.  The surgeons and anesthesiologists are quite heroic, and will see everyone who shows up.  (People also trickle in through the back door all week—I haven’t seen the surgeons refuse to examine anyone, regardless of whether or not they come on the mass screening day.)  Some people are there from early in the morning until deep into the evening, waiting in line with their little children, or elderly mother, or just themselves, undoubtedly hungry and anxious about the outcome of the screening.  Similarly, the physicians mostly skip lunch and work until everyone has been seen.  It all makes for an intense day, and for the interpreters, it’s the busiest day by far, as well as the most unpredictable one in terms of what you’ll need to interpret about.

I prepped for this trip by focusing on the vocabulary of hand anatomy and hand surgery, and had the good luck to end up working with the hand surgeon on intake day.  I enjoyed working with him last year, in part because he’s as kind, patient, and sweet as you can imagine a person being, and in part because after a patient described what he was there for, the hand surgeon would almost always begin his response with “We can make this better,” and I LOVED translating that—every time I translated it, I felt less crushed by the weight of all of the pain and deformity around us and more buoyed by the possibility that people’s lives would be improved by our visit to Guatemala.  Despite my preparation, I had to learn to translate nine English words into Spanish, and one Spanish word into English, plus two more Spanish words or expressions into English when I got finished with the hand screening and moved into the anesthesia screening room.  These words are notable in that the list contains not just one, but two, counterexamples to the “one context, one meaning” hypothesis, which claims that ambiguous words are not ambiguous if you can establish the context in which they are being used:

  • rotate: hacer girar.
  • espina: thorn.  This came up in the context of a patient explaining that he had embedded a thorn deep into his hand.  It also turns out to mean “spine,” in the anatomical sense.  So much for the “one context, one meaning” hypothesis.
  • stabilize: estabilizar.  This was a tough one—it’s not even in my dictionary, and I had to go home and check WordReference.com to find it.  Related word: estable “stable.”
  • claw: so, this is a really tough one.  There are four (four) nouns that translate the English noun claw, and three of them have come up so far this week.
    • Pinza is a claw like a crab or lobster claw.  This one came up in the context of “claw deformity”: we saw a couple of patients with claw deformities of the hand (see photo, from Tumblr).
    • Garra is a claw like an eagle’s claw or a lion’s claw.  This one came up in the context of giving instructions for a hand therapy exercise.
    • Uña is the nail itself—this one came up in the context of a woman who wanted her toenails removed.  (Long story—they did need it.)
    • Finally, there’s another word, zarpa, that I haven’t figured out how to use yet.

    So much for the “one context, one meaning” hypothesis, once again!  Note that the previous example was ambiguous in the direction of Spanish to English, while this one is ambiguous in the direction of English to Spanish.

  • deformity: la deformidad.
  • ayuno: fasting.  Related expressions, which did come up later, over the course of the week: en ayunas or en ayuno (yes, the gender is different): “fasting,” or “before breakfast.”
  • trompa: in the context of anatomy, a duct or tube.  Trompa de Falopio: Fallopian tube.
  • hormigoso: ant-like; full of ants; ant-eaten; or, in this context, itchy.
  • cosquilloso: ticklish.
  • buzz: zumbar.
  • dissolve: disolver.

If you’ve read this far: how about a donation to Surgicorps?  It’s a wonderful group that does great work.

“Dialect” means pretty different things in English and Spanish

One of the first lessons of Linguistics 101 is: “Everyone speaks a dialect.”  We all come from somewhere, we all belong to some social class, we all have some gender, and all of these—plus many more things—affect our language.  To a linguist, there’s no such thing as a “standard” dialect any more than to a surgeon there’s such a thing as a “standard” anatomy—everyone varies.  To a linguist, there’s no real difference between a language and a dialect—as the redoubtable Ilse Lehiste put it, “A language is a dialect with an army and a navy.”

To a linguist, the term dialect generally refers to a form of language that is associated with a particular geographic area.  Although the terminology varies from linguist to linguist, we have other words for varieties of language associated with particular social groups (sociolect), people who share very specific activities (jargon), levels of formality and specific social contexts (register), and so on.  How you will speak at any given time is an interaction between the tendencies of your dialect, sociolect, the register, and so on.

Back to dialect: there is a cognate term in Spanish, but the denotation is quite different.  In Spanish, dialecto typically refers to an indigenous language.  As one of our surgeons, a native speaker of Spanish, put it the other day, referring to a patient in the recovery room: “even though I speak Spanish, I don’t understand him, because he speaks a dialect.”  This was a patient who spoke one of the indigenous languages of Guatemala.  Most of these languages belong to the Mayan language family.  There are about 29 Mayan languages, of which 21 are spoken in Guatemala by the large indigenous population—about 70% of Guatemalans are native.  (The rest are called Ladinos.)  The Mayan languages are about as similar to or different from each other as English and German or Spanish and French.  Patients who speak one of these languages and don’t also speak Spanish must bring someone who can interpret between their language and Spanish, and then one of the Surgicorps interpreters interprets between Spanish and English.  I’ve seen four patients like that so far in the past three days.

This leads to the question: if the Spanish word dialecto means an indigenous language, how do you say “dialect” in Spanish?  I don’t know the answer.  The Spanish Wikipedia page for dialecto discusses it as a technical term in linguistics, with the same basic meaning as the English sense.  (Note that as is the case in English, there is sometimes a large difference between the technical meaning of a term and the meaning of that term in the general language.)  There is a Spanish term jerga that translates roughly as “jargon.”  The Spanish Wikipedia page for jerga distinguishes it from “dialect” in that a jerga is associated with a social group or a profession, and is used either to obscure communication with out-group members (slang) or to enhance communication on technical subjects (jargon).  The search continues.

Guatemala is funny if you’re Bulgarian, but maybe not otherwise

Arrived in Guatemala Saturday morning after two flights spent obsessively studying medical vocabulary and reading about health care interpreting.  I almost made it out of the airport without having to check a dictionary, but Zipf’s Law humbled me in baggage claim, i.e. before I even made it out of the airport.  Oh, well.

Getting to Antigua involves about an hour and a half drive.  Much of it is through Guatemala City and its suburbs.  The road is pretty much solidly lined with small businesses, many of them with hand-painted signs–Guatemala City has some very, very low-income, rough areas.  (In general, travellers are advised by guidebooks to just stay out of Guatemala City and go somewhere else.)  I amused myself on the drive by taking pictures of words that I don’t know—there were far more such words than I could capture on my cell phone camera.  Probably the most linguistically interesting was a huge advertisement for vaginal suppositories that was notable for the fact that it used a vosotros verbal form, which you don’t often see north of here (I don’t, anyways).

  • faja: This was the word for the thing that your baggage comes in on–more specifically, faja de retiro de equipaje (baggage claim carousel).  Faja has a variety of meanings in my dictionary, none of which is “carousel.”  They mostly have to do with things that go around something–sash, girdle, bandage, newspaper wrapper.
  • cuota: I think this was meant (see photo) in its sense of a fee or dues.  It can also mean a quota or share of something, as well as a tuition fee.
  • capilla: An interesting word, and I don’t know how to interpret it in this case (see photo).  One meaning is a chapel.  However, my dictionary says that another is a “death house”—I’m not sure what that actually means, but I think it has something to do with people who have been condemned to death.  There are additional, quite different meanings, and each of these has related words and expressions:
    • hood, cape: a related word is capillo, meaning a baby cap, a baptismal cap, a hood, a cocoon, or a cigarette filter.  Note that this is an example of two words that differ only in gender and have different meanings.
    • death house: an expression related to this meaning is estar en capilla, which can mean “to be in the death house,” but also “to be on pins and needles.”
    • chapel: a related word is el capiller, a churchwarden or sexton.
    • proof sheet: not sure where this one comes from, and I don’t know of any related words.
  • almacén: I knew this word in its sense of “warehouse,” but it turns out that it can also mean store or department store, which was probably the intended sense here (see photo).
  • ladrillo: a brick or tile.

Why Guatemala is funny if you’re Bulgarian: Guatemala is a mountainous country with lots of curvy roads.  Curvy roads are marked with the word curvas (curves).  It will immediately be obvious to you why this is funny if you’re Bulgarian, or, indeed, from any Eastern European country that I’m aware of, Slavic-speaking or not.  If you’re not Bulgarian: it’s not blog-appropriate, so write to me if you want the joke explained.

 
