Accidental versus necessary

Every language has rules that don’t seem that explainable to a linguist (or anyone else, I presume). One such rule in French is that if you have a plural noun that’s modified by a prenominal adjective (i.e. one of the few French adjectives that come before the noun), then you use de, not the plural indefinite article des that you would otherwise expect: de belles maisons (“beautiful houses”), but des maisons blanches (“white houses”).

So, we know that French adjectives generally are postnominal (after the noun)—when do you put them in front of the noun? About.com suggests the acronym BAGS for remembering at least some of the adjectives that go before the noun:

  • Beauty
  • Age
  • Good and bad
  • Size (except grand for humans)

According to About.com, this phenomenon is related to inherent versus non-inherent properties of the noun that is being modified—a distinction similar to necessary versus accidental qualities, which, according to Pustejovsky’s The Generative Lexicon, is a distinction that goes back to Aristotle. Pustejovsky points out that adjectives describing necessary versus accidental qualities behave differently in the progressive aspect in English, with adjectives describing accidental qualities being grammatical in the progressive, while adjectives describing necessary qualities are not grammatical in the progressive—so that you can say (relevant adjectives in bold):

  • The horse is being gentle with her rider.
  • You’re being so angry again!
  • Stop being so impatient.

…but, according to Pustejovsky, you can’t say:

  • * John is being tall today.
  • * Aren’t you being beautiful tonight!
  • * Stop being so intelligent.

(In linguistics, the * before a sentence means that you can’t say it in the language in question. Note that there’s no claim that you can’t say Aren’t you beautiful tonight—the claim is only about the progressive aspect, indicated in these examples by the verb form being.  You are free to argue with Pustejovsky’s claims about whether or not the starred sentences are really ungrammatical.  Note also that there are French adjectives whose meaning changes depending on whether they’re prenominal or postnominal—more on those another time.)

So, why the de in front of plural nouns that have prenominal adjectives?  I have no idea.  The interesting thing to me is not so much the specifics of the rule (use de, not des, in front of a plural prenominal adjective) as what it suggests about the representation that underlies the rule, the qualities that the language has to have in order for a rule to be able to make reference to those qualities: in this case, something like the distinction between inherent versus non-inherent or necessary versus accidental qualities.
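For readers who think in code, the surface rule itself (as opposed to the representation underneath it) fits in a few lines of Python. This is only a sketch under stated assumptions: the function name plural_indefinite_np and its arguments are made up for illustration, the caller has to say whether the adjective is prenominal, and elision to d’ is approximated by checking for an initial vowel letter, so adjectives starting with h are not handled.

```python
# Vowel letters that trigger elision of "de" to "d'" (simplifying assumption:
# h-initial adjectives are ignored entirely).
VOWELS = "aeiouyàâéèêëîïôùûœ"

def plural_indefinite_np(noun, adjective=None, prenominal=False):
    """Build a plural indefinite noun phrase.

    Default article is "des"; before a prenominal adjective it is "de"
    (written "d'" before a vowel). Hypothetical helper for illustration.
    """
    if adjective is None:
        return f"des {noun}"
    if not prenominal:
        # Postnominal adjective: the ordinary plural article.
        return f"des {noun} {adjective}"
    if adjective[0].lower() in VOWELS:
        return f"d'{adjective} {noun}"
    return f"de {adjective} {noun}"

print(plural_indefinite_np("maisons", "belles", prenominal=True))
print(plural_indefinite_np("maisons", "blanches"))
```

Notice that the function cannot decide for itself whether an adjective is prenominal. That information has to come from somewhere else, which is exactly the point about what the rule needs to be able to refer to.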

The ethics of crowdsourcing for linguistic resource construction in French

One of the major trends in my field today is the use of Amazon Mechanical Turk (AMT) to create linguistic resources, particularly for natural language processing.  Using AMT, tasks that require human intelligence (for example, deciding which synonym of a word is being used in a particular context, or labeling a photograph with the things that it pictures, or deciding whether or not a web page is relevant to a search query) are given to humans in very small increments, usually with the goal of using the humans’ data to train a computer to do the same task.  It is a form of crowdsourcing: using the public to do a (typically large) job in (typically) small amounts, e.g. Wikipedia.

Karën Fort of the Sorbonne and Gilles Adda of LIMSI have researched the ethics of the AMT model for work and for remuneration.  The AMT model turns out to raise many issues, including a number of ethical ones.  Karën and Gilles have worked to develop a charter for ethical use of this and other crowdsourcing platforms.  (Full disclosure: Karën and Gilles and I published an editorial on the use of Amazon Mechanical Turk in our field in the journal Computational Linguistics.)   If you click on the picture, it will take you to a set of slides that Karën prepared for a talk on the subject.  Zipf’s Law strikes in the domain of ethics as much as anywhere else—here are some words that I had to look up to read the slides:

▪    une ombre: shade, shadow.
▪    la zone d’ombre: gray zone.
▪    promouvoir: to promote.
▪    le vaut bien: to be worth it.
▪    la plate-forme: platform.
▪    la myriadisation: crowdsourcing.
▪    délocalisé: outsourced.
▪    la foule: crowd, mob, masses.
▪    le travail parcellisé: microworking.
▪    découpé: cut into pieces.

That’s ten words just to get to slide 10 out of 30, but that’s about all I can handle in a single day—more Zipf’s Law words next time.

Les sous-langages: sublanguages

The lines represent growth in the number of word types as increasing numbers of tokens are observed. The blue line (BNRC) is unrestricted Bulgarian text. The red line (epicrises) is Bulgarian clinical documents. The clinical documents show lexical constraints–for a given number of tokens, the number of word types is much smaller, and tends toward finiteness.
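A curve like the ones described in the caption can be computed by counting how many distinct word types have been seen as the tokens accumulate. Here is a minimal sketch in Python. The function name type_token_curve, the step parameter, and the crude regular-expression tokenizer are assumptions for illustration, not the method used to produce the figure.

```python
import re

def type_token_curve(text, step=1000):
    """Return (tokens_seen, types_seen) pairs sampled every `step` tokens.

    Tokenization is deliberately crude: lowercase the text and take runs of
    word characters. A real study would use a proper tokenizer.
    """
    tokens = re.findall(r"\w+", text.lower())
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        # Record a point at every `step` tokens, and at the very end.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve

# Toy example: six tokens, three types.
print(type_token_curve("a b a c a b", step=2))
```

Running this over a sample of general text and a sample of clinical text, and plotting the two lists of pairs, would be one way to look for the flattening that the caption describes for the sublanguage.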

I had to look up all of these words today in order to be able to explain just one aspect of my research. One of the things that I work on is the topic of sublanguages (explained below). Looking for material on the subject in French, I came across the doctoral dissertation Sur la notion de sous-langage, by Roland Dachelet. Even in the context of discussing my own research, Zipf’s Law strikes.

  • le sous-langage: sublanguage.
  • le domaine: domain. A sublanguage is a variety of language associated with a specific domain—medicine, biology, weather, sports reporting.
  • spécialisé: specialized. Being related to a specific domain, a sublanguage is specialized.
  • la contrainte: constraint. Sublanguages are generally associated with constraints—constraints on the kinds of subjects and arguments that a verb in the domain can have, for instance; constraints on syntactic structures; constraints on the set of words.
  • le lexique: in this case, the set of words in a text—vocabulary. It has other meanings, too, such as bilingual dictionary. Typically the set of words in a sublanguage is constrained.
  • la morphologie: morphology (how words are put together).
  • une ambiguïté: ambiguity. The fundamental problem of language processing—if most things in language didn’t have multiple possible interpretations, computers could just look everything up.
  • la variabilité: variability. The other major problem of language processing—there are so many ways to express the same thing.
  • la caractérisation: characterization. The current challenge in sublanguages is to characterize them automatically—that is, with a computer, as opposed to a human doing it manually.
  • la syntaxe: syntax. This is how phrases are structured.
  • syntaxique: syntactic.
  • une analyse syntaxique: syntactic analysis.
  • la structure: structure. Syntax is mostly about structure.
  • la sémantique: semantics.
  • sous-jacent: below, underlying, implicit (the sense in which I need it). Important aspects of language, such as syntactic structure, are implicit in the sense that they are not visibly indicated in the stream of language.

Internet search with highly inflected languages

I tend to wake up early.  One of the things that I do with my free time in the morning is study French.  Today I was flipping through a dictionary and ran across the word thésauriser.  “How nice,” I thought—“a verb about putting things in a thesaurus.”  Wrong—thésauriser is an intransitive verb meaning “to hoard money,” or a transitive verb meaning “to hoard.”

Having come across such a weird and wonderful verb, I wondered whether or not it actually gets used.  Being a 21st-century corpus linguist, I turned to Google.

One of the huge shifts in corpus linguistics, loosely definable as the study of language using naturally occurring data (as opposed to, say, making up illustrative sentences from your own head), is the availability of huge amounts of searchable texts on the Internet, typically via Google.

One of the “problems” with using Google as a corpus search tool these days is that it will often give you links to definitions of a word, rather than examples of actual usage.  So, the first couple of pages of results were links to definitions.  However, this list also included a link to dico-proverbes.com, a web site that gives you proverbs that use your word of interest.  Here I found Qui sait économiser, sait thésauriser, and A thésauriseur, héritier gaspilleur.  Translating a proverb is risky in any language, but even without a translation, we can note that the verb is, indeed, intransitive in these uses: you don’t explicitly use the word “money.”  (If any bilingual French/English speakers would like to chip in translations in the comments, it would be much appreciated!)

One of the first links is actually to a French Wikipedia entry for the nominalization thésaurisation, which is explained as a technical term in economics: La thésaurisation est un terme technique économique décrivant une accumulation de monnaie pour en tirer un profit ou par absence de meilleur emploi, et non par principe d’économie ou d’investissement productif.

The site 1mot.fr informs me that it is a valid French Scrabble word, and gives a long list of inflections that can be added to it, as well as a list of smaller valid Scrabble words that can be made from its letters.

Yet another site, dico-rimes.com, gives a list of words that rhyme with it.

If you’ve been following the links so far, you’ve noted that in my attempts to find actual examples of usage, as opposed to metalinguistic information, I’ve been drifting away from the infinitive thésauriser and searching for inflected forms–mostly unsuccessfully (in that I’m still getting metalinguistic stuff).  So, how do I find actual examples of usage?  I tried searching on inflected forms that I thought might have lower frequencies and/or be less likely to be headings for their own entries in sites like the proverb-finding, rhyme-finding, definition-finding, etc. sites.  The third person plural present tense thésaurisent got me to a YouTube video titled Ceux qui THESAURISENT DETRUISENT l INTERET GENERAL et ce n est PAS BANAL (“Those who hoard, destroy the public good, and that’s not trite”–in French, it sort of rhymes). My French is definitely not good enough to understand the poor audio on the videotaped monologue (which I’m not actually sure is entirely in French), so I can’t say whether it’s insightful, or insane ramblings. A link to a footnote in The papers of Alexander Hamilton on Google Books finds me Cet example frappant peut s’appliquer à tous les hommes industrieux, depuis l’artiste célebre ou le chef de manufacture qui thésaurisent peut-être dix milles francs chaque année, jusqu’à l’artisan grossier qui n’épargne qu’un écu.   Note that here it’s not intransitive, and a specific amount of money is named.

So, I’ve been investigating this verb for about 45 minutes, and still have only found two actual examples of usage on Google.  Any suggestions on Googling in French would be much appreciated!  (I should add that the Text Retrieval Conference sponsored a set of experiments on information retrieval in Spanish some years ago, and if memory serves, they concluded that the techniques that work for not-so-highly-inflected English work just as well on fairly-highly-inflected Spanish.  However, information retrieval is not corpus linguistics.)
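One way to make this kind of search less ad hoc is to generate the inflected forms mechanically and search for each one in turn. Here is a minimal sketch in Python, assuming a fully regular -er verb (no spelling changes of the manger/mangeons kind, and no irregular verbs like aller) and covering only the present indicative. The function names are hypothetical, and wrapping a form in double quotes is just the common search-engine convention for requesting an exact match.

```python
# Present indicative endings for regular -er verbs, in the order
# je, tu, il/elle, nous, vous, ils/elles.
PRESENT_ENDINGS = ["e", "es", "e", "ons", "ez", "ent"]

def present_forms(infinitive):
    """Return the distinct present-tense forms of a regular -er verb."""
    if not infinitive.endswith("er"):
        raise ValueError("this sketch only handles regular -er verbs")
    stem = infinitive[:-2]
    forms = []
    for ending in PRESENT_ENDINGS:
        form = stem + ending
        if form not in forms:  # je and il/elle share a form
            forms.append(form)
    return forms

def exact_match_queries(infinitive):
    """Wrap each inflected form in double quotes for an exact-match search."""
    return [f'"{form}"' for form in present_forms(infinitive)]

for query in exact_match_queries("thésauriser"):
    print(query)
```

The same idea extends to other tenses by adding more ending tables, at the cost of having to deal with the irregularities that make French verbs what they are.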

I wish I could say that this has led to some hilarious misunderstandings

Edith Piaf was probably the most famous French singer of all time, at least outside of France.  It turns out that France is definitely the best place to buy her music: I picked up a 3-CD set at Les Invalides for ten euros.  Not surprisingly, Zipf’s Law strikes in Edith Piaf’s lyrics as much as anywhere else: I wasn’t able to get past the first line of her famous La vie en rose without consulting a dictionary.

According to the first French-language lyrics web site that I found, the first line of La vie en rose is as follows:

Des yeux qui font baiser les miens

This didn’t quite sound right to me.  The word that I thought I heard when I listened to the lyrics was not baiser, but baisser.  (The single s is pronounced z.  The double ss is pronounced s.)  When I looked up baisser, I discovered that it means “to lower.”  This made sense, in context–the first line would then mean Eyes that make mine lower (themselves).  What didn’t make quite as much sense–although, crucially, it wasn’t like it made no sense at all–was what the lyrics web site said: baiser. 

Baiser is a word that has always thrown me off.  As we learn in school, baiser means “to kiss.”  However, as I learned when I got to France, it also means “to fuck.”  I wish that I could say that this ambiguity has led to some hilarious misunderstandings in my life, but: no such luck.  Yet.

Zipf’s Law and the French alphabet

At no point in my college French 101 class did we learn the alphabet.  This turned out to be a problem when I got to France and needed to spell things (like, say, my name).

There are a number of French alphabet songs available on YouTube.  This one, from www.learnfrenchlab.com, is my current favorite.  The words go something like “I want the X, the X, the X of Y,” where “X” is a letter and “Y” is a noun or phrase that starts with that letter: “I want the A, the A, the A of ananas (pineapple).”  I learned some things from this song.  For example, in France I learned by necessity that K is “ka,” but I had no clue that J is “zhee.”

The song is an interesting example of how much phonology and orthographic variability we have to get kids to ignore in order for them to learn to read.  (That’s equally true in English, of course.)  For example: the name of the letter E in French is not even spellable in English.  We don’t have the sound at all (and I struggle with it constantly, both in terms of producing it and in terms of hearing it).  However, in the song, it’s illustrated with éléphant (elephant), in which it’s pronounced completely differently (like the “long” A in English).  The name of G is “zhey,” but the word illustrating G is grenouille (frog), in which the G is pronounced like the G in “green.”  You get the picture.

Of course, Zipf’s Law strikes in the alphabet song as much as anywhere else—there is nothing simple about children’s language.  (As my French tutor once pointed out, the passé simple, which we’re not even taught in French class as it’s pretty obscure and not used (that I know of) in the spoken language any more, is indeed used in French children’s books, with the result that my French tutor can’t use children’s books in the high school French classes that she teaches.)  Here are words that I either didn’t know, or didn’t know the gender of:

  • un ananas: pineapple.
  • la banane: banana.
  • le crocodile: crocodile.  (Are you noticing a pattern here?  Zipf’s Law strikes three times before I get past A, B, and C!)
  • le dauphin: dolphin.  (You probably know this word from history class as “heir apparent,” as did I—I had no idea that it also meant “dolphin.”)
  • la grenouille: frog.
  • un hippopotame: hippopotamus.
  • le kangourou: kangaroo.
  • le nounours: teddy bear.
  • le rigolo/la rigolote: joker, funny person; clown (pejorative); also an adjective—funny, amusing, bizarre, weird.
  • le serpent: snake.
  • le/la trompette: this is an interesting one, as the meaning varies by gender.  La trompette (feminine) is a trumpet.  Le trompette (masculine) is a trumpeter.
  • le zoo: zoo (but, it’s pronounced completely differently, of course).

That’s twelve new words, just to learn the alphabet.  Zipf’s Law will get you every time.

Another reason to love France: A guide to verb conjugation is on the best-seller list

If you’re lacking a good reason to love France today, here’s a fine one: the Bescherelle has been on the Amazon France bestseller list for 324 days.  It’s currently the 19th highest-sales item, and that’s down from its previous listing.

The written French verb is a marvelous thing, with inflections for person, number, tense, aspect, mood, and occasionally gender (e.g. adjectival past participles of verbs conjugated with être).  This works out to maybe 50 forms for every verb, with lots of irregularities that you just have to memorize.  The most popular reference is Bescherelle: La conjugaison pour tous (“Conjugation for everyone”).  Named after a 19th-century French lexicographer and grammarian, Bescherelle is actually a series of books on conjugation, grammar, and orthography (spelling), and it is so popular that the word Bescherelle is used in the modern language to refer to any guide to conjugation (or so Wikipedia tells me—I haven’t heard this usage).

As you can guess, French verb conjugation is a challenge, and the ability to do it correctly is the sign of a well-educated person.  In the United States, it’s tough to study French without a copy of 500 French Verbs.  (It’s a great book, but don’t buy the Kindle version–it’s probably the worst Kindle book I’ve ever seen.)  In contrast, the Bescherelle lists 12,000 (twelve thousand) verbs, or at least my 1997 edition does—I’m not sure what the number is for the most current version.  The African language scholar Laura Downing once told me that when things got boring around the office in France, her (French) co-workers would toss around the Bescherelle and take turns quizzing each other.  I can attest that when I was puzzling over an odd verb tense at work one day, my office mate Brigitte said: “There’s the Bescherelle!”

French emails of the day: travelling by train on a whim, and updating your operating system

I continue to need to look up at least one word per French email. Goal: be able to read at least one email per day without having to look anything up…

  • coup de tête: a sudden impulse.  When talking about soccer or violence, it’s a head-butt.  A couple of years ago I bought train tickets online, and ever since have gotten emails with sentences like Vous souhaitez partir à petits prix sur un coup de tête?
    Related phrases:
    • donner un coup de tête à qqn: to head-butt.
    • sur un coup de tête: on an impulse, on a whim.
  • dégressif: decreasing gradually.  The train company suggests that I Profitez d’un tarif dégressif en fonction du nombre de passagers voyageant ensemble (jusqu’à 6 passagers).
  • passer en: not sure what this means.  Anyone?  Here’s the sentence, from a work email about upgrading operating systems:
    Si, au retour de vacances, votre ordinateur vous propose de passer en
    Ubuntu 14.04.1, mieux vaut refuser, en attendant le retour de l’expert
    Ubuntu (Olivier) le 8 septembre !

The LIMSI-ILES web page and the greatest French word ever

When in France I work as a visiting researcher in the Groupe Information, Langue Ecrite et Signée (ILES).  I thought it would be nice if I could read my own group’s web page, so I opened up a dictionary and headed on over.  Perusing the page, I saw what must be the best French word ever: retweeté, “re-Tweeted.”  Note that I have no idea whatsoever how to pronounce it.

So, what else would one need to know in order to be able to read the ILES web page?

  • la modélisation: according to WordReference.com, it’s the establishment of a pattern, model, or template.  On the page: modélisation et traitement automatique de la langue.
  • une analyse: analysis; in a medical context, blood test.  On the page: leur analyse, leur compréhension ou leur production.
  • précis(e): accurate, precise, exact, specific.  Can also be a book, in which case it’s a handbook, manual, or summary.

A day’s random French emails

Zipf’s Law hits me over and over again in emails.  I’m on various and sundry LIMSI and French natural language processing mailing lists, and usually have to look up something or other to read them.  Here are today’s words:

  • la candidature: application.  In politics, it’s candidacy.  An email about a plan to apply to host a conference started with this sentence (how like Zipf’s Law to strike in the very first sentence): Nous continuons à préparer une candidature pour la conférence JEP-TALN 2016.
  • intitulé: entitled, titled.  In the context of talking about a text, it’s a noun meaning “title” or “heading.” 
    Le jeudi 4 Septembre nous recevrons Ryo Nagata qui nous présentera un exposé intitulé: “Après un an — fruits du travail au LIMSI et après”. (Remember exposé from a previous post, also about a talk?)

Only two words ’cause I only have two French emails in my Inbox!
