What computational linguists actually do all day: The lexical frequency version

Tell someone that you’re a computational linguist, and the next thing out of their mouth is likely to be either:

  1. How many languages do you speak?, or…
  2. What’s that?

In theory, computational linguists spend their time thinking about fun questions like:

  1. Is natural language Turing-complete?
  2. The relationship, if any, between what we know about words (say, the word dog can be a noun or a verb, and it occurs more often with the words bark and leash than with the word meow) and what we know about the world (say, a dog is a canine, and might like to chase balls, and will eat cat shit if not instructed otherwise).
  3. How Zipf’s Law, which describes the fact that a small number of words are extremely common, while a large number of words are extremely rare, but do occur, might or might not be related to the mathematical phenomenon of the fractal.

In practice, we spend most of our time trying to figure out where we went wrong in writing some computer program or another.  (OK: that, and writing grant proposals.)  Think that being a computational linguist sounds glamorous?  Here’s how I spent my morning.


All I gotta do: go through a bunch of documents and count how often each word in that bunch of documents occurs.  Easy-peasy–barely hard enough for a homework in Computational Linguistics 101.
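Just how easy-peasy is it supposed to be? Here is a minimal sketch, in Python rather than the language of the script in the screenshots below. The names (`count_words`, `TOKEN`) and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions made for illustration, not what the actual script does.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase the text, keep runs of letters/apostrophes.
TOKEN = re.compile(r"[a-z']+")

def count_words(documents):
    """Count how often each word occurs across an iterable of strings."""
    frequencies = Counter()
    for text in documents:
        frequencies.update(TOKEN.findall(text.lower()))
    return frequencies

# Two toy "documents" standing in for the real bunch of files.
documents = ["The dog barks.", "The dog sleeps."]
for word, count in count_words(documents).most_common():
    print(f"{word}\t{count}")
```

Reading the documents off the disk is left out here on purpose; as the rest of the morning shows, that is where the trouble starts.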

Seulement voilà…

Screen Shot 2018-12-03 at 12.54.58

Easy enough to fix–I just failed to give the complete name of the program, and…. marde.

Screen Shot 2018-12-03 at 12.57.21

OK, easy enough to fix–I had written

Screen Shot 2018-12-03 at 12.58.58

…when I shoulda written

Screen Shot 2018-12-03 at 12.58.44

Shoulda: the typical spoken form of should have. 

(Note the square bracket near the end of the middle line–I had left it out.)  Great–avançons, alors.  But, no, fuckashitpiss:

Screen Shot 2018-12-03 at 13.03.54

Easy enough to fix–turns out I wrote this:

Screen Shot 2018-12-03 at 13.05.19

…when I shoulda written this:

Screen Shot 2018-12-03 at 13.06.35

(Note the dollar sign before the rightmost instance of words now.)  And so, on we go, but…

Screen Shot 2018-12-03 at 13.07.57

…and it’s easy enough to fix–I had written this:

Screen Shot 2018-12-03 at 13.08.57

…when I shoulda written this:

Screen Shot 2018-12-03 at 13.10.10

(Note the double quote before $frequencies{$words[$i]}\n";) …and now I’m wondering:

  1. These errors were all on one single line–what other horrors have I hidden in this code, and will they be as easy to find as those were?
  2. What the hell was I thinking when I wrote that line?  Was I thinking about the upcoming dissertation defense at 2 PM?  Was I thinking about Trump giving my country to China?  Was I thinking about tomorrow’s colonoscopy? Who the hell knows, really–whatever it was, it apparently wasn’t this line of code…

Mais returnons… Ah marde, but at least this one will be easy to fix…

Screen Shot 2018-12-03 at 13.14.14

…except that I verify the existence of the directory, and then get this:

Screen Shot 2018-12-03 at 13.16.03

…which is the exact same error that I got before.  So, I go back and look at my code, where I see this, and remember that my error message is supposed to print out the name of the directory that it couldn’t open, but it did no such thing:

Screen Shot 2018-12-03 at 13.22.00

…which is ’cause I never gave the program the name of the input directory.  So I take care of that, and also tell my program to print out the name of the directory that it couldn’t open if, in fact, it can’t open a directory–as we saw above, I had planned to do this, but of course left out that little detail:

Screen Shot 2018-12-03 at 13.25.31

…and now I experience a tiny little bit of success, because my program does not crash.  Seulement voilà, it doesn’t actually produce any output:

Screen Shot 2018-12-03 at 13.27.18

Note the lack of a bunch of lexical frequencies… So, I go back to my script, and I start looking around in the region of the program where I meant for the output to happen.  I don’t see anything obvious in that area, so: I go further up in the code, and start doing what I need to do to convince myself that the earlier parts of the program are working the way that I intended them to.  This means printing out the results at intermediate steps of the processing. The resulting code (leaving out a bunch of details) looks like this:

Screen Shot 2018-12-03 at 13.32.22

…which does nothing different than it was doing before, so I know that I need to go even further up in the program and, again, print stuff out as I go, resulting in this:

Screen Shot 2018-12-03 at 13.35.57

…which, when I run the script, produces this:

Screen Shot 2018-12-03 at 13.37.25

…which suggests to me that the directory exists, and that I’m opening it correctly, but that I am either (a) reading its contents incorrectly, or (b) making a mistake when I make a decision about whether or not to open each file.  A quick Google search finds the problem for me–I had written this:

Screen Shot 2018-12-03 at 13.41.16

…when I shoulda written this:

Screen Shot 2018-12-03 at 13.41.33

(Note that the text at the left end of the line was open, and now is opendir.)

Progress!  Now I get some output, but note the last line–I’m just getting a bunch of file names, and no word frequencies.  I can see the problem right away, though–I have the directory name right, and I have the file name right, but I need to combine them in order to be able to open the file.  Doing so gives me this code:

Screen Shot 2018-12-03 at 13.46.55

…which results in my script running successfully for a while, but then crashing, and I know exactly what causes said crash…

Screen Shot 2018-12-03 at 13.48.10

…and I know that it’s a bear to fix, and I’ve been working on this fucking task that’s barely difficult enough to make a good homework assignment, and now it’s time to go to the aforementioned dissertation defense, and… Soupire…

Meme source: https://imgur.com/gallery/fzbkRI8
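Looking back over the morning, the directory-handling fixes come down to three things: name the directory in the error message, list the directory with a directory-reading call rather than a file-opening one, and join the directory name to each file name before opening the file. A rough sketch of those three steps follows, in Python rather than the language of the script in the screenshots. The function name `read_directory` and the throwaway corpus are hypothetical, and the sketch says nothing about the final crash, whose cause the post does not give.

```python
import os
import sys
import tempfile

def read_directory(directory):
    """Return {file name: text} for every regular file in `directory`."""
    try:
        # Listing a directory is a different call from opening a file.
        names = sorted(os.listdir(directory))
    except OSError as error:
        # Name the directory in the message, so that an unset or mistyped
        # path is visible right away.
        sys.exit(f"Couldn't open directory '{directory}': {error.strerror}")

    texts = {}
    for name in names:
        # os.listdir returns bare names ("a.txt"), so join each one to the
        # directory before trying to open it.
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                texts[name] = handle.read()
    return texts

# Throwaway one-file corpus so that the sketch runs anywhere.
with tempfile.TemporaryDirectory() as corpus:
    with open(os.path.join(corpus, "a.txt"), "w", encoding="utf-8") as out:
        out.write("the dog barks\n")
    print(read_directory(corpus))
```

Called with an empty string, which is roughly what you get when the directory name was never passed in, the message shows a pair of empty quotes where the name should be, and that points straight at the real problem.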

 

 

Gratuitous picture of me and my cat

In which I can’t even get beyond the Introduction.

Your lexicon–the words that you know, and what you know about them–is unlike every other part of your knowledge of your native language in that it continues to grow over the course of your entire life.  By the time you’re a young child you know pretty much everything that you’re going to know about your language’s phonetics, phonology, morphology, and syntax.  Your lexicon, though–that continues to grow throughout your life.

Now imagine that you try to learn a second language as an adult.  Like everyone else who speaks that language, you’re going to be learning new words until you die.  But that’s going to be a lot more obvious to you than it is to people who speak it natively, because unlike them, you didn’t spend your entire youth learning the vocabulary of that language–start studying a language in your 50s, and you are literally 50 years behind a native speaker when it comes to learning the lexicon of the language in question.

If you’ve been reading this blog for a while, you know that you don’t have to work very hard to find words that you don’t know: Zipf’s Law, which describes the fact that a small number of words of a language are very, very common, while the rest occur only very rarely–but do occur–ensures that you will be running across new words just going about your daily life.
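Zipf’s Law is usually summarized as a word’s frequency being roughly inversely proportional to its frequency rank. If you want to poke at that on a text of your own, here is a minimal Python sketch. The function names (`rank_frequency`, `hapax_share`) and the toy sentence are made up for illustration, and a sentence this short is far too small to show the pattern itself; it only shows the bookkeeping.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

def hapax_share(tokens):
    """Fraction of distinct words that occur exactly once."""
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq == 1) / len(counts)

# Toy input; swap in the tokens of a real text to see anything interesting.
tokens = "the dog saw the cat and the cat saw the teddy bear".split()
for rank, word, freq in rank_frequency(tokens):
    # Under an idealized Zipf distribution, rank * freq stays roughly constant.
    print(rank, word, freq, rank * freq)
print(f"share of distinct words seen only once: {hapax_share(tokens):.2f}")
```

The share of distinct words that occur only once is the part that matters to a language learner: those are the words that you will keep running into for the first time.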

Living in France, I have no difficulty whatsoever running into 10 words that I don’t know every single day.  Ads on the metro, the services written on a window installer’s truck, the name of a street that I walk by on the way to the lab–that’s all it takes.  Living in the US, it’s a bit harder, but it’s totally doable–listening to the radio, watching something on YouTube, or listening to a book on tape will do it.  10 words a day, every day (except the month of December, which I spend reviewing the words that I learned from January to November), and mine de rien, you have a vocabulary of thousands of words.

And yet: as Zipf’s Law would suggest, I still have no problem whatsoever finding 10 new words a day to learn.  Case in point: today I wanted to figure out what the symbol ≠ means in the grammar book that I’m working through at the moment (Grammaire progressive du français : niveau perfectionnement, B2 – C2, by Maïa Grégoire and Alina Kostucki).  So, I went to the “front matter” of the book–the table of contents and stuff like that.  This involved reading the Introduction, where I ran across the following:

WordReference.com found me most of the relevant definitions, and yet: dictionaries being the beautiful but imperfect things that they are (like, say, my cat), it did let me down for a couple words: relever, and mécanisation. To wit:

…même avec un vocabulaire riche et une bonne connaissance de la grammaire, les résultats atteints sont souvent entravés par la persistance de fautes qui ont traversé les différents niveaux d’apprentissage. Bon nombre de ces difficultés tiennent à des interférences avec la langue d’origine et aucune grammaire “ générale ” ne peut prétendre en rendre compte.  D’autres, en revanche, relèvent de particularités de la langue française, mal perçues par les étudiants, et que nous tentons d’exposer de la façon la plus claire possible.

My best guess for an English-language equivalent of relever de would be “to arise from.”  Here are some examples of “to arise from” from Sketch Engine, purveyor of fine linguistic corpora and the tools for searching them:

  • The lectures focus on topics arising from research in science and technology.
  • The investigation arose from a referral from both Houses of the NSW Parliament.  (Arise is an irregular verb, with the past tense form arose.)
  • He blames Jews for the ills arising from the industrial revolution, e.g., class divisions and hatred.
  • Leukaemias are devastating diseases of the haemopoietic system that arise from aberrant stem or progenitor cells.  (Leukaemia and haemopoietic are the British English spellings of leukemia and hemopoietic.)

But: looking at WordReference, I don’t see to arise from as a possible translation of relever de, or vice versa.  Phil d’Ange?

The other problem word: la mécanisation.  The only translation of this word in WordReference is… “mechanization”!  What that means: I can only guess (see above about how your lexicon grows over the course of your entire life), and none of my guesses would make sense in this context.  Mechanized infantry is infantry equipped with armored vehicles to move itself around, and mechanized artillery is artillery equipped with its own transport system, but oral mechanization, as in the sample from my book?  I haven’t the faintest clew.  (That’s “clue,” for us Americans–something about the faintest clew just demands that you spell it like a Brit.)

À la partie théorique, située sur la page de gauche, correspond, sur la page de droite, une présentation en contexte (parfois illustrée) des points de grammaire, et une série d’exercices de réemploi : exercices à trous, transformations, mécanisation orale, écrit.

img_1847.jpg

Native speakers: can you show an anglophone some love?  (To show someone some love means to help them, to do something nice for them, to give them something.  Super-slangy.)

Finally, here is a gratuitous picture of a fat old bald guy and his cat Keiko.  As you can tell from the amount of light in the dwelling, the photo was taken in America, not in wintertime Paris.  The teddy bear on the floor is the property of my cat, and I suggest that you not touch it.

Conflict of interest statement: I have no conflicts of interest to declare.  I pay for a subscription to Sketch Engine, I bought the book, and WordReference is free to one and all.