Computational linguistics–it’s not all beer and pétanque: French text normalization

Zipf’s Law curve: a small number of words occur very frequently, while the vast majority of words occur very rarely–but, they do occur. Credit: @ASvanevik, https://medium.com/@ASvanevik/how-i-learned-german-in-30-days-df7b7ff85654

Zipf’s Law in brief: languages have a small number of words that occur very, very frequently (English examples: the, I, to, and; French examples: de, la, et, à), and an enormous number of words that occur very rarely–but, they do occur. Consequence for people trying to learn a second language: every day of your life, you’re going to come across words that you don’t know.
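If you would rather see that than take my word for it, here is a minimal sketch in R. The sentence is a toy that I made up for the illustration, so it proves nothing about English. It only shows the shape of the thing: count the words, sort by frequency, and look at how many of them show up exactly once. (In its classic formulation, Zipf’s Law says that a word’s frequency is roughly inversely proportional to its rank in that sorted list.)

```r
# Toy text, invented for this sketch; any real check needs a large corpus.
toy_text <- "The cat and the dog and the bird saw the hangnail."

# Lowercase, split on anything that is not a letter, and drop empty pieces.
words <- strsplit(tolower(toy_text), "[^a-z]+")[[1]]
words <- words[nzchar(words)]

# Count each word and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# Words that occur exactly once: the start of the long tail.
rare_words <- names(freq)[freq == 1]
print(rare_words)
```

Even in eleven words, "the" hogs the top of the list while most of the vocabulary appears a single time. Swap in a whole novel and the tail just gets longer.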

Now, even if you suck at math as badly as I do, Zipf’s Law isn’t that hard to wrap your head around–clearly words like the, too, and and are super-frequent, while words like hangnail and glory are not, and clearly there are hella more words (see the English notes below for an explanation of hella) like hangnail and glory than there are words like the, too, and and. However: understanding something in the abstract and really feeling it in your gut are two very different things, right? I mean, you don’t actually expect to get smacked in the face by Zipf’s Law on a regular basis.

And yet…

Screenshot of a tweet. Translation: “‘I’m going to do a little fast’ versus ‘I’m going to fuck a young guy’: On the importance of the circumflex accent.” 3.8 thousand likes, 8.4 thousand retweets, and thanks to Phil dAnge for telling me what it means before I got myself into even MORE trouble with this one than I already had.

Computational linguistics isn’t all drinking beer in Prague and dancing in the park with pretty girls long after midnight in Bulgaria. In fact, far more of it than anyone would guess is writing computer programs to process your data into a form that would let you actually do something fun with it. When you’re working with French, a common step in this processing is to remove the accents from all of the letters. Yes, this seems like heresy. Yes, this creates ambiguity–jeûne (a fast) becomes jeune (a youth), châsse (reliquary–God, how I love casually dropping that one into a conversation) becomes chasse (hunt), and répéter becomes repéter. (Um, Zipf… you have to say répéter (to repeat)–otherwise, you’re saying repéter (to fart again).) But, overall, the increase in ambiguity that comes from the deletion of accents (part of a set of techniques known as normalization of your text) is well-recompensed by the fact that you essentially (and probably counter-intuitively) get rid of a lot of potential errors by your program that way. For example, I work with biomedical data, so I might want to be able to process something like this:

Mon père a perdu l’ordonnance de médicament.

…and alert someone that this guy needs attention. Seulement voilà–the thing is–if I expect everything to be spelled correctly, then I’ll miss this:

Mon père a perdu l’ordonnance de medicament.

…and this:

Mon pere a perdu l’ordonnance de médicament.

…and this:

Mon pere a perdu l’ordonnance de medicament.

…and you don’t want to miss the fact that some little old guy (let’s say, me, but less bald and fat) has lost his prescription, right? So, if you’re working with French, you will probably–quite early in your processing–remove all of the accents from everything. Once you’ve done that, all four of the previous sentences look the same to the computer program, and you only have to write enough code to deal with one of them.
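Here is a rough sketch of that step in base R. The function name `strip_accents` and its little table of letters are my own inventions for the illustration, not part of any package, and the table only covers the lowercase accented letters that are common in French (no capitals, no œ). The accented letters are written as `\u` escapes so that the example does not depend on how the script file itself is encoded.

```r
# Hypothetical helper: map common lowercase French accented letters to
# their unaccented counterparts, one character for one character.
strip_accents <- function(x) {
  # à â ä ç é è ê ë î ï ô ö ù û ü ÿ
  accented <- paste0("\u00e0\u00e2\u00e4\u00e7\u00e9\u00e8\u00ea\u00eb",
                     "\u00ee\u00ef\u00f4\u00f6\u00f9\u00fb\u00fc\u00ff")
  unaccented <- "aaaceeeeiioouuuy"
  chartr(accented, unaccented, x)
}

# The four spellings from above (p\u00e8re = père, m\u00e9dicament = médicament).
sentences <- c(
  "Mon p\u00e8re a perdu l'ordonnance de m\u00e9dicament.",
  "Mon p\u00e8re a perdu l'ordonnance de medicament.",
  "Mon pere a perdu l'ordonnance de m\u00e9dicament.",
  "Mon pere a perdu l'ordonnance de medicament."
)

normalized <- strip_accents(sentences)
print(unique(normalized))
```

After normalization, `unique()` has only one string left to report, which is the whole point: one pattern to look for instead of four.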

So, the other day I’m sitting in the lab getting ready for a meeting the next morning, and I’m working with French medical data, and I realize that I need some non-medical text so that I have something with which to compare my medical data. (Something to compare my medical data with, in spoken English.) I find some French novels available for free. I start running them through my program… and then I start to notice stuff like this going across the screen:

s’est-il \xe9cri\xe9, et il m’a saut\xe9 au cou. Les autres aussi m’ont embrass\xe9

Not intelligible (at least to a human), but not shocking, either. Here’s the thing: in their guts, computers were made to deal with the English alphabet, and only the English alphabet. Consequently, if you want to work with other languages, you have to find ways to “encode” their letters into a form that the computer can deal with. Those strings of backslashes, numbers, and weird letters are one way of doing that “encoding.”
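To make that a bit more concrete, here is a small sketch in R. It rests on an assumption that I can’t verify from the output alone: that the text was stored in Latin-1 (ISO-8859-1), where the single byte `e9` stands for é. If that is right, then `\xe9` is just R showing you a raw byte that it could not display, and `iconv()` can convert it to UTF-8, where the same letter takes two bytes.

```r
# One word from the output above, with the raw byte written as an escape.
raw_text <- "saut\xe9"
charToRaw(raw_text)   # 73 61 75 74 e9

# Assumption: the byte is Latin-1. Convert it to UTF-8.
decoded <- iconv(raw_text, from = "latin1", to = "UTF-8")
print(decoded)
charToRaw(decoded)    # 73 61 75 74 c3 a9
```

If the guess about the source encoding is wrong, `iconv()` will still hand back something, but it won’t be the letter that the author typed, so it is worth checking a few words by eye.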

Thelenota ananas, a species of sea cucumber. Credit: Leonard Low from Australia, Flickr, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=1552175

So: what to do? The easy way out of this would be to replace all of those strings with the appropriate unaccented character. Easy enough: saut\xe9 seems likely to have been sauté once upon a time, and embrass\xe9 was probably embrassé, so replace all of the sequences \xe9 with e, and you’re good to go. Not nearly as glamorous as eating sea cucumbers in Hangzhou, but not super-difficult, either.

Then something like this flashes across your screen:

…l’abbaye de la Chapelle poss\xfc\xbe\x8e\x86\x94\xbcdait deux bergeries à Oye, et recevait la d\xfc\xbe\x99\xa6\x94\xbcme de…

With d\xfc\xbe\x99\xa6\x94\xbcme, we are clearly deep into the “long tail” of the Zipfian distribution. But, being obsessive memorizers of the at-minimum 10 words a day that we come across in our daily lives but don’t know, and being in the midst of Les Misérables, the word dîme (tithe) comes to mind immediately, and seeing the word abbaye (abbey) in the vicinity is all that we need to confirm it. So, now we know that \xfc\xbe\x99\xa6\x94\xbc should be replaced with i, and we go on about our business.
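Once you have worked out a few of these by hand, the bookkeeping can be sketched as a small lookup table. What follows is only an illustration: the helper name `fix_bytes` is made up, the table holds just three of the sequences discussed here, and the matching is done byte-for-byte (`fixed = TRUE, useBytes = TRUE`) so that R does not try to interpret the odd bytes as characters.

```r
# Byte sequences seen in the output, paired with unaccented stand-ins.
# Only three entries for illustration; a real table would be longer.
patterns <- c("\xe9",
              "\xfc\xbe\x8e\x86\x94\xbc",
              "\xfc\xbe\x99\xa6\x94\xbc")
replacements <- c("e", "e", "i")

# Hypothetical helper: apply each replacement in turn, matching raw bytes.
fix_bytes <- function(x, patterns, replacements) {
  for (k in seq_along(patterns)) {
    x <- gsub(patterns[k], replacements[k], x, fixed = TRUE, useBytes = TRUE)
  }
  x
}

messy <- "poss\xfc\xbe\x8e\x86\x94\xbcdait la d\xfc\xbe\x99\xa6\x94\xbcme, saut\xe9"
cleaned <- fix_bytes(messy, patterns, replacements)
print(cleaned)
```

Keeping the pairs in two parallel vectors means that the next mystery sequence costs one new entry in each, not another copy-pasted line.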

Comic: Computational Linguists. Source: https://xkcd.com/114/

So, yes: Zipf’s Law is a thing–your day is full of words that are very, very rare, but that do occur. And, yes: obsessive memorization of French vocabulary will occasionally get you out of a tight spot at work. And, yes: computational linguistics is not all drinking beer in Prague and dancing in the park with pretty girls in Bulgaria long past midnight–but, when it is, it is so, so good! And, when it’s not–it’s hella better than digging ditches!

English notes

hella: an extremely casual intensifier, it means “very” or “a lot of,” depending on whether it’s modifying an adjective or a noun.

Equivalent to very:

Get you a girl who’s hella goofy, loyal, and will treat you right.. OH WAIT THATS MEEEE AHAAAA

— Female Struggles (@FemaleStruggIes) October 10, 2017

hella goofy means very goofy

Lmao damn I used to be hella cute. Wtf happened 😤

— Francine (@Francineee_H) October 10, 2017

hella cute means very cute

Equivalent to a lot of:

hella people get into relationships and lose their ambition

that’s super wack

— William⚡️Bolton (@WilliamBolton) October 9, 2017

hella people means a lot of people

I mean you DM’d him hella times so you tell us RT @miakhalifa: Is Gilbert Arenas even relevant anymore?

— Ben Simmons 2.0 (@BBB_4_Lyfe) October 9, 2017

hella times means a lot of times, many times

Wanna try this at home? Here’s the code–copy and paste it into an .Rmd file. Don’t know what an .Rmd file is? Don’t try this at home.

---
title: "GutenbergR demo"
author: "KBC"
date: "9/25/2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#install.packages("gutenbergr") # only need to do this the first time that you run the script
library(gutenbergr)
```

```{r}
# get list of French-language books
# alternative (needs dplyr loaded): french.book.ids <- gutenberg_metadata %>% filter(language == "fr")
french.book.ids <- gutenberg_works(languages = "fr", only_text = TRUE,
 rights = c("Public domain in the USA.", "None"), distinct = TRUE,
 all_languages = FALSE, only_languages = TRUE)
nrow(french.book.ids)
head(french.book.ids)
```

```{r}
# retrieve the contents of those books
#french.book.contents <- gutenberg_download(french.book.ids$gutenberg_id)
french.book.contents <- gutenberg_download(796) # Stendhal
#french.book.contents <- gutenberg_download(799) # Verne

summary(french.book.contents)
# fix at least some of the characters
french.book.contents$text <- gsub('\xe9', 'e', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8d\xb6\x94\xbc', 'c', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8e\x86\x94\xbc', 'e', french.book.contents$text)
french.book.contents$text <- gsub('\xe0', 'a', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x98\x96\x94\xbc', 'e', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x98\xa6\x98\xbc', 'u', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8d\x86\x98\xbc', 'o', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x99\xa6\x94\xbc', 'i', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8e\x96\x98\xbc', 'u', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x99\x96\x94\xbc', 'i', french.book.contents$text) 
french.book.contents$text <- gsub('\xfc\xbe\x8c\xa6\x94\xbc', 'a', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x98\xa6\x84\xbc', '--', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x98\xa6\x88\xbc', '', french.book.contents$text) 
french.book.contents$text <- gsub('\xfc\xbe\x8e\x96\x8c\xbc', '', french.book.contents$text)
#print(french.book.contents$text)
#length(french.book.contents$text)
print(french.book.contents$text[18:28])
```

7 thoughts on “Computational linguistics–it’s not all beer and pétanque: French text normalization”

phildange says:

October 10, 2017 at 7:59 am

Another meaning for this word you like, “châsse”. In classical gangsters slang “les châsses” means “les yeux”, preferably for a woman . So you see even gangsters in France knew how to spell . Well, yesteryears, now they can’t even read …

1. zipfslaw1 says:
  
  October 10, 2017 at 8:00 am
  
  OMG, that’s beautiful — thanks!
  
2. zipfslaw1 says:
  
  October 10, 2017 at 8:08 am
  
  BTW–you suggested Villon. I finally found a bilingual edition–it’s on my reading list!
  
phildange says:

October 10, 2017 at 8:32 am

He was a lightning and shooting star, 4 centuries before his time . Rimbaud and Verlaine, when they discovered him, felt like the young Eric Clapton when he discovered Robert Johnson .
But it will be a tough task for you to follow his old French .

Ellen Hawley says:

October 10, 2017 at 6:24 pm

I’m glad I’m not the only one this kind of thing happens to. When I was a very young-looking 15-year-old in Mexico, I seem to have mixed up “non-carbonated” (sin gas) with “you fuck” (chingas). Or that’s what I figured out retrospectively from the look of shock I got.

1. phildange says:
  
  October 10, 2017 at 6:37 pm
  
  Don’t be too afraid, “chinga”is also used to mean “My God !” or “Damnit!”. Once in Spain I wanted to say “I don’t want to embarrass this lady” but the Spanish verb “embarrassar” means impregnate, fecundate . I let you imagine the roll of laughters .
  
  1. Ellen Hawley says:
    
    October 10, 2017 at 8:07 pm
    
    Yup. A 15-year-old friend of mine–a girl–made that mistake, trying to explain that she was embarrassed.
    