Computational linguistics–it’s not all beer and pétanque: French text normalization

Zipf’s Law: a small number of words occur very frequently, while the vast majority of words occur very rarely–but, they do occur. Credit: @ASvanevik.

Zipf’s Law in brief: languages have a small number of words that occur very, very frequently (English examples: the, I, to, and; French examples: de, la, et, à), and an enormous number of words that occur very rarely–but, they do occur.  Consequence for people trying to learn a second language: every day of your life, you’re going to come across words that you don’t know.

Now, even if you suck at math as badly as I do, Zipf’s Law isn’t that hard to wrap your head around–clearly words like the, too, and and are super-frequent, while words like hangnail and glory are not, and clearly there are hella more words (see the English notes below for an explanation of hella) like hangnail and glory than there are words like the, too, and and.  However: understanding something in the abstract and really feeling it in your gut are two very different things, right?  I mean, you don’t actually expect to get smacked in the face by Zipf’s Law on a regular basis.
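If you'd rather see the shape than take my word for it, here is a minimal sketch in R that tallies word frequencies. The sentence in `toy_text` is made up for illustration, as are the variable names; a real corpus shows the same pattern far more dramatically.

```r
# Toy text, invented for illustration only
toy_text <- "the cat and the dog and the bird saw the hangnail"

# Split on spaces and count how often each word occurs
words <- strsplit(toy_text, " ", fixed = TRUE)[[1]]
freq <- sort(table(words), decreasing = TRUE)

# Most frequent words first
print(freq)

# How many distinct words occur exactly once?
n_singletons <- sum(freq == 1)
print(n_singletons)
```

Even in eleven words, one or two words account for most of the text, and most of the distinct words show up exactly once.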

And yet…

Translation: ‘“I’m going to do a little fast” versus “I’m going to fuck a young guy”: On the importance of the circumflex accent.’ 3.8 thousand likes, 8.4 thousand retweets, and thanks to Phil dAnge for telling me what it means before I got myself into even MORE trouble with this one than I already had.

Computational linguistics isn’t all drinking beer in Prague and dancing in the park with pretty girls long after midnight in Bulgaria.  In fact, far more of it than anyone would guess is writing computer programs to process your data into a form that would let you actually do something fun with it.  When you’re working with French, a common step in this processing is to remove the accents from all of the letters.  Yes, this seems like heresy.  Yes, this creates ambiguity–jeûne (a fast) becomes jeune (a youth), châsse (reliquary–God, how I love casually dropping that one into a conversation) becomes chasse (hunt), and répéter and repéter both become repeter.  (Um, Zipf… you have to say répéter (to repeat); otherwise, you’re saying repéter (to fart again).)  But, overall, the increase in ambiguity that comes from the deletion of accents (part of a set of techniques known as normalization of your text) is more than repaid by the fact that you (probably counter-intuitively) get rid of a lot of potential errors in your program that way.  For example, I work with biomedical data, so I might want to be able to process something like this:

Mon père a perdu l’ordonnance de médicament.

…and alert someone that this guy needs attention.  Seulement voilà–the thing is–if I expect everything to be spelled correctly, then I’ll miss this:

Mon père a perdu l’ordonnance de medicament.

…and this:

Mon pere a perdu l’ordonnance de médicament.

…and this:

Mon pere a perdu l’ordonnance de medicament.

…and you don’t want to miss the fact that some little old guy (let’s say, me, but less bald and fat) has lost his prescription, right?  So, if you’re working with French, you will probably–quite early in your processing–remove all of the accents from everything.  Once you’ve done that, all four of the previous sentences look the same to the computer program, and you only have to write enough code to deal with one of them.
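Here is a minimal sketch of that step in R, assuming a short hand-picked list of accented French letters. The names `accented`, `unaccented`, and `strip_accents` are invented for this illustration, and the list is nowhere near complete (no capitals, no œ). The accented letters are written as \u escapes so that the snippet doesn't depend on how your file happens to be saved.

```r
# Accented letters and their plain counterparts, position by position.
# This list is an assumption: common lowercase French letters only.
accented   <- "\u00e0\u00e2\u00e4\u00e7\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f6\u00f9\u00fb\u00fc"
unaccented <- "aaaceeeeiioouuu"

# chartr() swaps each character in the first string for the one
# at the same position in the second
strip_accents <- function(x) chartr(accented, unaccented, x)

# The four spellings of the lost-prescription sentence
sentences <- c(
  "Mon p\u00e8re a perdu l'ordonnance de m\u00e9dicament.",
  "Mon p\u00e8re a perdu l'ordonnance de medicament.",
  "Mon pere a perdu l'ordonnance de m\u00e9dicament.",
  "Mon pere a perdu l'ordonnance de medicament."
)

normalized <- strip_accents(sentences)

# After normalization only one distinct sentence is left
print(unique(normalized))
```

The price is exactly the ambiguity described earlier: `strip_accents("je\u00fbne")` comes back as jeune.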

So, the other day I’m sitting in the lab getting ready for a meeting the next morning, and I’m working with French medical data, and I realize that I need some non-medical text so that I have something with which to compare my medical data.  (Something to compare my medical data with, in spoken English.)  I find some French novels available for free. I start running them through my program… and then I start to notice stuff like this going across the screen:

s’est-il \xe9cri\xe9, et il m’a saut\xe9 au cou. Les autres aussi m’ont embrass\xe9

Not intelligible (at least to a human), but not shocking, either.  Here’s the thing: in their guts, computers were made to deal with the English alphabet, and only the English alphabet.  Consequently, if you want to work with other languages, you have to find ways to “encode” their letters into a form that the computer can deal with.  Those strings of backslashes, numbers, and weird letters are what you see when the program can’t turn that “encoding” back into a letter: \xe9, for example, is a single byte whose value is E9 in hexadecimal.
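To make that concrete, here is a small sketch in R. It assumes the stray byte comes from Latin-1 (ISO 8859-1) text, where the byte E9 stands for é. That is a guess on my part, though it fits the reading of saut\xe9 as sauté. The variable names are made up for the example.

```r
# Build the five bytes of the word by hand: s, a, u, t, then 0xE9
raw_word <- rawToChar(as.raw(c(0x73, 0x61, 0x75, 0x74, 0xe9)))

# Tell R that the bytes are Latin-1 and ask for UTF-8 back
decoded <- iconv(raw_word, from = "latin1", to = "UTF-8")
print(decoded)

# The last character is now code point 233 (hexadecimal E9), e-acute
print(utf8ToInt(decoded))
```

When the source encoding really is known, converting like this keeps the accents instead of throwing them away, which leaves the decision about normalization for later.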

Thelenota ananas, a species of sea cucumber. Credit: By Leonard Low from Australia – Flickr, CC BY 2.0.

So: what to do?  The easy way out of this would be to replace all of those strings with the appropriate unaccented character.  Easy enough: saut\xe9 seems likely to have been sauté once upon a time, and embrass\xe9 was probably embrassé, so replace all of the sequences \xe9 with e, and you’re good to go.  Not nearly as glamorous as eating sea cucumbers in Hangzhou, but not super-difficult, either.

Then something like this flashes across your screen:

…l’abbaye de la Chapelle poss\xfc\xbe\x8e\x86\x94\xbcdait deux bergeries à Oye, et recevait la d\xfc\xbe\x99\xa6\x94\xbcme de…

With d\xfc\xbe\x99\xa6\x94\xbcme, we are clearly deep into the “long tail” of the Zipfian distribution.  But, being obsessive memorizers of the at-minimum 10 words a day that we come across in our daily lives but don’t know, and being in the midst of Les Misérables, the word dîme (tithe) comes to mind immediately, and seeing the word abbaye (abbey) in the vicinity is all that we need to confirm it.  So, now we know that \xfc\xbe\x99\xa6\x94\xbc should be replaced with i, and we go on about our business.
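As the substitutions pile up, one way to keep track of them is a small lookup table. What follows is a sketch under stated assumptions, not the script I actually ran: the helpers `bytes` and `clean_bytes` and the list `replacements` are hypothetical names, and the table holds only the three byte sequences discussed above. The patterns are built from raw byte values so that they behave the same whatever your locale is.

```r
# Build a string from raw byte values
bytes <- function(...) rawToChar(as.raw(c(...)))

# Each entry pairs a garbled byte sequence with its plain replacement
replacements <- list(
  list(from = bytes(0xe9), to = "e"),
  list(from = bytes(0xfc, 0xbe, 0x8e, 0x86, 0x94, 0xbc), to = "e"),
  list(from = bytes(0xfc, 0xbe, 0x99, 0xa6, 0x94, 0xbc), to = "i")
)

# Apply every replacement in turn, matching literal bytes
# (fixed = TRUE: no regex; useBytes = TRUE: compare byte by byte)
clean_bytes <- function(x, lookup) {
  for (entry in lookup) {
    x <- gsub(entry$from, entry$to, x, fixed = TRUE, useBytes = TRUE)
  }
  x
}

# The mystery word from the abbey sentence
garbled <- paste0("d", bytes(0xfc, 0xbe, 0x99, 0xa6, 0x94, 0xbc), "me")
print(clean_bytes(garbled, replacements))
```

Adding the next mystery sequence then means adding one line to the table rather than another copy of the substitution call.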


So, yes: Zipf’s Law is a thing–your day is full of words that are very, very rare, but that do occur.  And, yes: obsessive memorization of French vocabulary will occasionally get you out of a tight spot at work.  And, yes: computational linguistics is not all drinking beer in Prague and dancing in the park with pretty girls in Bulgaria long past midnight–but, when it is, it is so, so good!  And, when it’s not–it’s hella better than digging ditches!

English notes

hella: an extremely casual intensifier, it means “very” or “a lot of,” depending on whether it’s modifying an adjective or a noun.

Equivalent to very:

hella goofy means very goofy

hella cute means very cute

Equivalent to a lot of:

hella people means a lot of people

hella times means a lot of times, many times

Wanna try this at home?  Here’s the code–copy and paste it into an .Rmd file.  Don’t know what an .Rmd file is?  Don’t try this at home.

---
title: "GutenbergR demo"
author: "KBC"
date: "9/25/2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#install.packages("gutenbergr") # only need to do this the first time that you run the script
library(gutenbergr)
```

```{r}
# NOTE: the variable names were lost from this listing; french_books and
# book are stand-ins, not necessarily the names used originally.

# get list of French-language books
# (alternative that needs dplyr: gutenberg_metadata %>% filter(language == "fr"))
french_books <- gutenberg_works(languages = "fr", only_text = TRUE,
 rights = c("Public domain in the USA.", "None"), distinct = TRUE,
 all_languages = FALSE, only_languages = TRUE)

# retrieve the contents of those books
# book <- gutenberg_download(french_books$gutenberg_id) # every book in the list
book <- gutenberg_download(796) # Stendhal
# book <- gutenberg_download(799) # Verne

# fix at least some of the characters
book$text <- gsub('\xe9', 'e', book$text)
book$text <- gsub('\xfc\xbe\x8d\xb6\x94\xbc', 'c', book$text)
book$text <- gsub('\xfc\xbe\x8e\x86\x94\xbc', 'e', book$text)
book$text <- gsub('\xe0', 'a', book$text)
book$text <- gsub('\xfc\xbe\x98\x96\x94\xbc', 'e', book$text)
book$text <- gsub('\xfc\xbe\x98\xa6\x98\xbc', 'u', book$text)
book$text <- gsub('\xfc\xbe\x8d\x86\x98\xbc', 'o', book$text)
book$text <- gsub('\xfc\xbe\x99\xa6\x94\xbc', 'i', book$text)
book$text <- gsub('\xfc\xbe\x8e\x96\x98\xbc', 'u', book$text)
book$text <- gsub('\xfc\xbe\x99\x96\x94\xbc', 'i', book$text)
book$text <- gsub('\xfc\xbe\x8c\xa6\x94\xbc', 'a', book$text)
book$text <- gsub('\xfc\xbe\x98\xa6\x84\xbc', '--', book$text)
book$text <- gsub('\xfc\xbe\x98\xa6\x88\xbc', '', book$text)
book$text <- gsub('\xfc\xbe\x8e\x96\x8c\xbc', '', book$text)
```

7 thoughts on “Computational linguistics–it’s not all beer and pétanque: French text normalization”

  1. Another meaning for this word you like, “châsse”: in classic gangster slang, “les châsses” means “les yeux” (the eyes), preferably a woman’s. So you see, even gangsters in France knew how to spell. Well, in years past; now they can’t even read…


  2. He was a lightning bolt and a shooting star, four centuries ahead of his time. Rimbaud and Verlaine, when they discovered him, felt like the young Eric Clapton when he discovered Robert Johnson.
    But it will be a tough task for you to follow his old French.


  3. I’m glad I’m not the only one this kind of thing happens to. When I was a very young-looking 15-year-old in Mexico, I seem to have mixed up “non-carbonated” (sin gas) with “you fuck” (chingas). Or that’s what I figured out retrospectively from the look of shock I got.


    1. Don’t be too afraid: “chinga” is also used to mean “My God!” or “Damn it!” Once in Spain I wanted to say “I don’t want to embarrass this lady,” but the Spanish verb “embarazar” means to impregnate. I’ll let you imagine the roar of laughter.
