
Zipf’s Law in brief: languages have a small number of words that occur very, very frequently (English examples: the, I, to, and; French examples: de, la, et, à), and an enormous number of words that occur very rarely–but, they do occur. Consequence for people trying to learn a second language: every day of your life, you’re going to come across words that you don’t know.
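If you'd rather see that than take my word for it, here is a minimal sketch in R. The sentence is a toy example that I made up, not real corpus data, and the variable names are just for illustration. Even in eleven words, one word shows up four times and most of the others show up exactly once.

```r
# Toy text: an invented sentence, not data from any real corpus.
text <- "the cat and the dog and the bird saw the hangnail"

# Lowercase the text and split it on anything that is not a letter.
words <- strsplit(tolower(text), "[^a-z]+")[[1]]
words <- words[words != ""]

# Count each word, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# How many words occur exactly once (the beginning of the "long tail")?
sum(freq == 1)
```

On a real novel the shape is the same, only far more lopsided: a handful of words at the top and a very long list of words that appear once.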
Now, even if you suck at math as badly as I do, Zipf’s Law isn’t that hard to wrap your head around–clearly words like the, too, and and are super-frequent, while words like hangnail and glory are not, and clearly there are hella more words (see the English notes below for an explanation of hella) like hangnail and glory than there are words like the, too, and and. However: understanding something in the abstract and really feeling it in your gut are two very different things, right? I mean, you don’t actually expect to get smacked in the face by Zipf’s Law on a regular basis.
And yet…

Computational linguistics isn’t all drinking beer in Prague and dancing in the park with pretty girls long after midnight in Bulgaria. In fact, far more of it than anyone would guess is writing computer programs to process your data into a form that would let you actually do something fun with it. When you’re working with French, a common step in this processing is to remove the accents from all of the letters. Yes, this seems like heresy. Yes, this creates ambiguity–jeûne (a fast) becomes jeune (a youth), châsse (reliquary–God, how I love casually dropping that one into a conversation) becomes chasse (hunt), and répéter and repéter both become repeter (Um, Zipf… you have to say répéter (to repeat); otherwise, you’re saying repéter (to fart again).) But, overall, the increase in ambiguity that comes from the deletion of accents (part of a set of techniques known as normalization of your text) is more than repaid by the fact that you essentially (and probably counter-intuitively) get rid of a lot of potential errors in your program that way. For example, I work with biomedical data, so I might want to be able to process something like this:
Mon père a perdu l’ordonnance de médicament.
…and alert someone that this guy needs attention. Seulement voilà–the thing is–if I expect everything to be spelled correctly, then I’ll miss this:
Mon père a perdu l’ordonnance de medicament.
…and this:
Mon pere a perdu l’ordonnance de médicament.
…and this:
Mon pere a perdu l’ordonnance de medicament.
…and you don’t want to miss the fact that some little old guy (let’s say, me, but less bald and fat) has lost his prescription, right? So, if you’re working with French, you will probably–quite early in your processing–remove all of the accents from everything. Once you’ve done that, all four of the previous sentences look the same to the computer program, and you only have to write enough code to deal with one of them.
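Here is a minimal sketch of that accent-removal step in base R. The function name strip_accents is hypothetical, the list of accented letters is a partial one that I chose for illustration (lowercase only), and the accented letters are written as \u escapes so that the script does not depend on how the file itself is encoded.

```r
# strip_accents() is a hypothetical helper, not from any package.
# It maps a (partial) list of lowercase accented French letters to
# their unaccented counterparts, one character at a time.
strip_accents <- function(x) {
  # a-grave, a-circumflex, a-diaeresis, c-cedilla, e-acute, e-grave,
  # e-circumflex, e-diaeresis, i-circumflex, i-diaeresis, o-circumflex,
  # o-diaeresis, u-grave, u-circumflex, u-diaeresis
  accented <- paste0("\u00e0\u00e2\u00e4\u00e7\u00e9\u00e8\u00ea\u00eb",
                     "\u00ee\u00ef\u00f4\u00f6\u00f9\u00fb\u00fc")
  unaccented <- "aaaceeeeiioouuu"
  chartr(accented, unaccented, x)
}

# The four spellings of the sentence from the text above.
variants <- c(
  "Mon p\u00e8re a perdu l'ordonnance de m\u00e9dicament.",
  "Mon p\u00e8re a perdu l'ordonnance de medicament.",
  "Mon pere a perdu l'ordonnance de m\u00e9dicament.",
  "Mon pere a perdu l'ordonnance de medicament."
)

normalized <- strip_accents(variants)

# After normalization, only one distinct sentence should remain.
unique(normalized)
```

A real pipeline would also need the uppercase letters, œ, and so on, or a transliteration library; this only shows the idea of collapsing the four variants into one.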
So, the other day I’m sitting in the lab getting ready for a meeting the next morning, and I’m working with French medical data, and I realize that I need some non-medical text so that I have something with which to compare my medical data. (Something to compare my medical data with, in spoken English.) I find some French novels available for free. I start running them through my program… and then I start to notice stuff like this going across the screen:
s’est-il \xe9cri\xe9, et il m’a saut\xe9 au cou. Les autres aussi m’ont embrass\xe9
Not intelligible (at least to a human), but not shocking, either. Here’s the thing: in their guts, computers were made to deal with the English alphabet, and only the English alphabet. Consequently, if you want to work with other languages, you have to find ways to “encode” their letters into a form that the computer can deal with. Those strings of backslashes, numbers, and weird letters are one way of doing that “encoding.”
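To make "encoding" a bit more concrete, here is a small sketch in R. It rests on an assumption that I can't verify from the output alone: that a sequence like \xe9 is a single byte from a Latin-1 (ISO-8859-1) file, where byte E9 stands for é. If that assumption holds, iconv() can turn the byte back into the letter. It says nothing about the longer sequences that show up later.

```r
# A line as it appeared on screen: \xe9 is one raw byte, hexadecimal E9.
raw_line <- "il m'a saut\xe9 au cou"

# Assumption: the bytes are Latin-1, where byte E9 stands for e-acute.
decoded <- iconv(raw_line, from = "latin1", to = "UTF-8")
print(decoded)

# Look at the bytes behind the 12th character (the accented letter)
# now that the string is stored as UTF-8.
charToRaw(substr(decoded, 12, 12))
```

The point is only that the same letter can be stored as different bytes depending on the encoding, and a program that guesses the wrong encoding shows you the bytes instead of the letter.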

So: what to do? The easy way out of this would be to replace all of those strings with the appropriate unaccented character. Easy enough: saut\xe9 seems likely to have been sauté once upon a time, and embrass\xe9 was probably embrassé, so replace all of the sequences \xe9 with e, and you’re good to go. Not nearly as glamorous as eating sea cucumbers in Hangzhou, but not super-difficult, either.
Then something like this flashes across your screen:
…l’abbaye de la Chapelle poss\xfc\xbe\x8e\x86\x94\xbcdait deux bergeries à Oye, et recevait la d\xfc\xbe\x99\xa6\x94\xbcme de…
With d\xfc\xbe\x99\xa6\x94\xbcme, we are clearly deep into the “long tail” of the Zipfian distribution. But, being obsessive memorizers of the at-minimum 10 words a day that we come across in our daily lives but don’t know, and being in the midst of Les Misérables, the word dîme (tithe) comes to mind immediately, and seeing the word abbaye (abbey) in the vicinity is all that we need to confirm it. So, now we know that \xfc\xbe\x99\xa6\x94\xbc should be replaced with i, and we go on about our business.

So, yes: Zipf’s Law is a thing–your day is full of words that are very, very rare, but that do occur. And, yes: obsessive memorization of French vocabulary will occasionally get you out of a tight spot at work. And, yes: computational linguistics is not all drinking beer in Prague and dancing in the park with pretty girls in Bulgaria long past midnight–but, when it is, it is so, so good! And, when it’s not–it’s hella better than digging ditches!
English notes
hella: an extremely casual intensifier, it means “very” or “a lot of,” depending on whether it’s modifying an adjective or a noun.
Equivalent to very:
Get you a girl who’s hella goofy, loyal, and will treat you right.. OH WAIT THATS MEEEE AHAAAA
— Female Struggles (@FemaleStruggIes) October 10, 2017
hella goofy means very goofy
Lmao damn I used to be hella cute. Wtf happened 😤
— Francine (@Francineee_H) October 10, 2017
hella cute means very cute
Equivalent to a lot of:
hella people get into relationships and lose their ambition
that’s super wack
— William⚡️Bolton (@WilliamBolton) October 9, 2017
hella people means a lot of people
I mean you DM’d him hella times so you tell us RT @miakhalifa: Is Gilbert Arenas even relevant anymore?
— Ben Simmons 2.0 (@BBB_4_Lyfe) October 9, 2017
hella times means a lot of times, many times
Wanna try this at home? Here’s the code–copy and paste it into an .Rmd file. Don’t know what an .Rmd file is? Don’t try this at home.
---
title: "GutenbergR demo"
author: "KBC"
date: "9/25/2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#install.packages("gutenbergr") # only need to do this the first time that you run the script
library(gutenbergr)
```

```{r}
# get list of French-language books
#french.book.ids <- gutenberg_metadata() %>% filter (language == "fr")
french.book.ids <- gutenberg_works(languages = "fr",
                                   only_text = TRUE,
                                   rights = c("Public domain in the USA.", "None"),
                                   distinct = TRUE,
                                   all_languages = FALSE,
                                   only_languages = TRUE)
nrow(french.book.ids)
head(french.book.ids)
```

```{r}
# retrieve the contents of those books
#french.book.contents <- gutenberg_download(french.book.ids$gutenberg_id)
french.book.contents <- gutenberg_download(796) # Stendhal
#french.book.contents <- gutenberg_download(799) # Verne
summary(french.book.contents)

# fix at least some of the characters:
# each byte sequence is replaced with the unaccented letter it stood for
french.book.contents$text <- gsub('\xe9', 'e', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8d\xb6\x94\xbc', 'c', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8e\x86\x94\xbc', 'e', french.book.contents$text)
french.book.contents$text <- gsub('\xe0', 'a', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x98\x96\x94\xbc', 'e', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x98\xa6\x98\xbc', 'u', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8d\x86\x98\xbc', 'o', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x99\xa6\x94\xbc', 'i', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8e\x96\x98\xbc', 'u', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x99\x96\x94\xbc', 'i', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8c\xa6\x94\xbc', 'a', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x98\xa6\x84\xbc', '--', french.book.contents$text)
# these two sequences are simply deleted
french.book.contents$text <- gsub('\xfc\xbe\x98\xa6\x88\xbc', '', french.book.contents$text)
french.book.contents$text <- gsub('\xfc\xbe\x8e\x96\x8c\xbc', '', french.book.contents$text)

#print(french.book.contents$text)
#length(french.book.contents$text)
print(french.book.contents$text[18:28])
```
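The long run of gsub() calls above could also be written as a lookup table plus a loop. This is only a sketch: fix_bytes is a hypothetical helper name, the table holds just four of the byte sequences from the script, and I have added fixed = TRUE and useBytes = TRUE (which the original script does not use) so that the patterns are matched as literal bytes.

```r
# Byte sequences copied from the script above, paired with their replacements.
# Two parallel vectors: from[k] is replaced by to[k].
from <- c("\xe9", "\xe0",
          "\xfc\xbe\x8e\x86\x94\xbc",
          "\xfc\xbe\x99\xa6\x94\xbc")
to   <- c("e", "a", "e", "i")

# fix_bytes() is a hypothetical helper, not part of gutenbergr.
fix_bytes <- function(x, from, to) {
  for (k in seq_along(from)) {
    # fixed = TRUE matches the pattern literally (no regular expressions);
    # useBytes = TRUE compares raw bytes rather than decoded characters.
    x <- gsub(from[k], to[k], x, fixed = TRUE, useBytes = TRUE)
  }
  x
}

# A made-up test line that combines two of the sequences seen in the post.
line <- "recevait la d\xfc\xbe\x99\xa6\x94\xbcme, et il m'a saut\xe9 au cou"
cleaned <- fix_bytes(line, from, to)
print(cleaned)
```

With the table in one place, adding the next strange sequence that Zipf's Law throws at you is a one-line change instead of another copy-pasted gsub() call.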
Another meaning for this word you like, “châsse”: in classic gangster slang, “les châsses” means “les yeux” (the eyes), preferably a woman’s. So you see, even gangsters in France knew how to spell. Well, in years past; now they can’t even read…
OMG, that’s beautiful — thanks!
BTW–you suggested Villon. I finally found a bilingual edition–it’s on my reading list!
He was a lightning bolt and a shooting star, four centuries ahead of his time. Rimbaud and Verlaine, when they discovered him, felt like the young Eric Clapton when he discovered Robert Johnson.
But it will be a tough task for you to follow his old French.
I’m glad I’m not the only one this kind of thing happens to. When I was a very young-looking 15-year-old in Mexico, I seem to have mixed up “non-carbonated” (sin gas) with “you fuck” (chingas). Or that’s what I figured out retrospectively from the look of shock I got.
Don’t be too afraid, “chinga” is also used to mean “My God!” or “Damn it!”. Once in Spain I wanted to say “I don’t want to embarrass this lady,” but the Spanish verb “embarazar” means to impregnate. I’ll let you imagine the roar of laughter.
Yup. A 15-year-old friend of mine–a girl–made that mistake, trying to explain that she was embarrassed.