What computational linguists actually do all day: The lexical frequency version

In practice, we spend most of our time trying to figure out where we went wrong in writing some computer program or another. 

Tell someone that you’re a computational linguist, and the next thing out of their mouth is likely to be either:

  1. How many languages do you speak?, or…
  2. What’s that?

In theory, computational linguists spend their time thinking about fun questions like:

  1. Is natural language Turing-complete?
  2. The relationship, if any, between what we know about words (say, the word dog can be a noun or a verb, and it occurs more often with the words bark and leash than with the word meow) and what we know about the world (say, a dog is a canine, and might like to chase balls, and will eat cat shit if not instructed otherwise).
  3. How Zipf’s Law, which describes the fact that a small number of words are extremely common, while a large number of words are extremely rare, but do occur, might or might not be related to the mathematical phenomenon of the fractal.

In practice, we spend most of our time trying to figure out where we went wrong in writing some computer program or another.  (OK: that, and writing grant proposals.)  Think that being a computational linguist sounds glamorous?  Here’s how I spent my morning.


All I gotta do: go through a bunch of documents and count how often each word in that bunch of documents occurs.  Easy-peasy–barely hard enough for a homework in Computational Linguistics 101.
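For the record, here is roughly what the task amounts to, as a minimal sketch in Python rather than the Perl that I was actually fighting with. The function name count_words, the tokenization rule, and the two toy documents are all my own assumptions for illustration; none of them come from the script in the screenshots.

```python
import re
from collections import Counter

def count_words(documents):
    """Return a Counter mapping each lowercased word to its total count."""
    frequencies = Counter()
    for text in documents:
        # Crude tokenization (an assumption): runs of letters, optionally
        # joined by straight apostrophes, after lowercasing.
        words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
        frequencies.update(words)
    return frequencies

# Two made-up documents, just to have something to count.
docs = ["The dog barks.", "The cat doesn't bark; the dog does."]
frequencies = count_words(docs)

# Print the most frequent words first, one word and its count per line.
for word, count in frequencies.most_common():
    print(f"{word}\t{count}")
```

A real version would need a less naive notion of what a word is, which is where most of the actual linguistics hides.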

Seulement voilà…

Screen Shot 2018-12-03 at 12.54.58

Easy enough to fix–I just failed to give the complete name of the program, and…. marde.

Screen Shot 2018-12-03 at 12.57.21

OK, easy enough to fix–I had written

Screen Shot 2018-12-03 at 12.58.58

…when I shoulda written

Screen Shot 2018-12-03 at 12.58.44

Shoulda: the typical spoken form of should have. 

(Note the square bracket near the end of the middle line–I had left it out.)  Great–avançons, alors.  But, no, fuckashitpiss:

Screen Shot 2018-12-03 at 13.03.54

Easy enough to fix–turns out I wrote this:

 

Screen Shot 2018-12-03 at 13.05.19

…when I shoulda written this:

Screen Shot 2018-12-03 at 13.06.35

(Note the dollar sign before the rightmost instance of words now.)  And so, on we go, but…

Screen Shot 2018-12-03 at 13.07.57

…and it’s easy enough to fix–I had written this:

Screen Shot 2018-12-03 at 13.08.57

…when I shoulda written this:

Screen Shot 2018-12-03 at 13.10.10.png

(Note the double quote before $frequencies{$words[$i]}\n";) …and now I’m wondering:

  1. These errors were all on one single line–what other horrors have I hidden in this code, and will they be as easy to find as those were?
  2. What the hell was I thinking when I wrote that line?  Was I thinking about the upcoming dissertation defense at 2 PM?  Was I thinking about Trump giving my country to China?  Was I thinking about tomorrow’s colonoscopy? Who the hell knows, really–whatever it was, it apparently wasn’t this line of code…

Mais returnons… Ah marde, but at least this one will be easy to fix…

Screen Shot 2018-12-03 at 13.14.14

…except that I verify the existence of the directory, and then get this:

Screen Shot 2018-12-03 at 13.16.03

…which is the exact same error that I got before.  So, I go back and look at my code, where I see this, and remember that my error message is supposed to print out the name of the directory that it couldn’t open, but it did no such thing:

Screen Shot 2018-12-03 at 13.22.00

…which is ’cause I never gave the program the name of the input directory.  So I take care of that, and also tell my program to print out the name of the directory that it couldn’t open if, in fact, it can’t open a directory–as we saw above, I had planned to do this, but of course left out that little detail:

Screen Shot 2018-12-03 at 13.25.31

…and now I experience a tiny little bit of success, because my program does not crash.  Seulement voilà, it doesn’t actually produce any output:

Screen Shot 2018-12-03 at 13.27.18

Note the lack of a bunch of lexical frequencies… So, I go back to my script, and I start looking around in the region of the program where I meant for the output to happen.  I don’t see anything obvious in that area, so: I go further up in the code, and start doing what I need to do to convince myself that the earlier parts of the program are working the way that I intended them to.  This means printing out the results at intermediate steps of the processing. The resulting code (leaving out a bunch of details) looks like this:

Screen Shot 2018-12-03 at 13.32.22

…which does nothing different than it was doing before, so I know that I need to go even further up in the program and, again, print stuff out as I go, resulting in this:

Screen Shot 2018-12-03 at 13.35.57

…which, when I run the script, produces this:

Screen Shot 2018-12-03 at 13.37.25

…which suggests to me that the directory exists, and that I’m opening it correctly, but that I am either (a) reading its contents incorrectly, or (b) making a mistake when I make a decision about whether or not to open each file.  A quick Google search finds the problem for me–I had written this:

Screen Shot 2018-12-03 at 13.41.16

…when I shoulda written this:

Screen Shot 2018-12-03 at 13.41.33

(Note that the text at the left end of the line was open, and now is opendir.)

Progress!  Now I get some output, but note the last line–I’m just getting a bunch of file names, and no word frequencies.  I can see the problem right away, though–I have the directory name right, and I have the file name right, but I need to combine them in order to be able to open the file.  Doing so gives me this code:
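The same trap exists outside of Perl. Here is a minimal Python sketch of the idea, not my actual script: a directory listing gives you bare file names, so each one has to be joined to the directory name before you can open it. The function name read_documents and the throwaway demo directory are assumptions made up for the example.

```python
import os
import tempfile

def read_documents(input_directory):
    """Yield (file name, contents) for each regular file in input_directory."""
    for file_name in sorted(os.listdir(input_directory)):
        # os.listdir returns bare names, so join each one with its
        # directory before trying to open it.
        path = os.path.join(input_directory, file_name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                yield file_name, handle.read()

# Demo on a temporary directory so that the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_directory:
    demo_path = os.path.join(demo_directory, "a.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("the dog barks")
    documents = list(read_documents(demo_directory))

print(documents)
```

Opening the bare file name only works if you happen to be running the script from inside the input directory, which is exactly the kind of accident that hides this bug.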

Screen Shot 2018-12-03 at 13.46.55

…which results in my script running successfully for a while, but then crashing, and I know exactly what causes said crash…

Screen Shot 2018-12-03 at 13.48.10.png

…and I know that it’s a bear to fix, and I’ve been working on this fucking task that’s barely difficult enough to make a good homework assignment, and now it’s time to go to the aforementioned dissertation defense, and… Soupire…

Meme source: https://imgur.com/gallery/fzbkRI8

 

 

Gratuitous picture of me and my cat

In which I can’t even get beyond the Introduction.

Your lexicon–the words that you know, and what you know about them–is unlike every other part of your knowledge of your native language in that it continues to grow over the course of your entire life.  By the time you’re a young child you know pretty much everything that you’re going to know about your language’s phonetics, phonology, morphology, and syntax.  Your lexicon, though–that continues to grow throughout your life.

Now imagine that you try to learn a second language as an adult.  Like everyone else who speaks that language, you’re going to be learning new words until you die.  But, that’s going to be a lot more obvious to you than it is to people who speak it natively, because unlike them, you didn’t spend your entire youth learning the vocabulary of that language–start studying a language in your 50s, and you are literally 50 years behind a native speaker when it comes to learning the lexicon of the language in question.

If you’ve been reading this blog for a while, you know that you don’t have to work very hard to find words that you don’t know: Zipf’s Law, which describes the fact that a small number of words of a language are very, very common, while the rest occur only very rarely–but do occur–ensures that you will be running across new words just going about your daily life.

Living in France, I have no difficulty whatsoever running into 10 words that I don’t know every single day.  Ads on the metro, the services written on a window installer’s truck, the name of a street that I walk by on the way to the lab–that’s all it takes.  Living in the US, it’s a bit harder, but it’s totally doable–listening to the radio, watching something on YouTube, or listening to a book on tape will do it.  10 words a day, every day (except the month of December, which I spend reviewing the words that I learned from January to November), and mine de rien, you have a vocabulary of thousands of words.

And yet: as Zipf’s Law would suggest, I still have no problem whatsoever finding 10 new words a day to learn.  Case in point: today I wanted to figure out what the symbol ≠ means in the grammar book that I’m working through at the moment (Grammaire progressive du français : niveau perfectionnement, B2 – C2, by Maïa Grégoire and Alina Kostucki).  So, I went to the “front matter” of the book–the table of contents and stuff like that.  This involved reading the Introduction, where I ran across the following:

WordReference.com found me most of the relevant definitions, and yet: dictionaries being the beautiful but imperfect things that they are (like, say, my cat), it did let me down for a couple words: relever, and mécanisation. To wit:

….même avec un vocabulaire riche et une bonne connaissance de la grammaire, les résultats atteints sont souvent entravés par la persistance de fautes qui ont traversé les différents niveaux d’apprentissage. Bon nombre de ces difficultés tiennent à des interférences avec la langue d’origine et aucune grammaire ” générale ” ne peut prétendre en rendre compte.  D’autres, en revanche, relèvent de particularités de la langue française, mal perçues par les étudiants, et que nous tentons d’exposer de la façon la plus claire possible.

My best guess for an English-language equivalent of relever de would be “to arise from.”  Here are some examples of “to arise from” from Sketch Engine, purveyor of fine linguistic corpora and the tools for searching them:

  • The lectures focus on topics arising from research in science and technology.
  • The investigation arose from a referral from both Houses of the NSW Parliament.  (Arise is an irregular verb, with the past tense form arose.)
  • He blames Jews for the ills arising from the industrial revolution, e.g., class divisions and hatred.
  • Leukaemias are devastating diseases of the haemopoietic system that arise from aberrant stem or progenitor cells.  (Leukaemia and haemopoietic are the British English spellings of leukemia and hemopoietic.)

But: looking at WordReference, I don’t see to arise from as a possible translation of relever de, or vice versa.  Phil d’Ange?

The other problem word: la mécanisation.  The only translation of this word in Word Reference is…”mechanization”!  What that means: I can only guess (see above about how your lexicon grows over the course of your entire life), and none of my guesses would make sense in this context.  Mechanized infantry is infantry equipped with armored vehicles to move itself around, and mechanized artillery is artillery equipped with its own transport system, but oral mechanization, as in the sample from my book?  I haven’t the faintest clew.  (That’s “clue,” for us Americans–something about the faintest clew just demands that you spell it like a Brit.)

À la partie théorique, située sur la page de gauche, correspond, sur la page de droite, une présentation en contexte (parfois illustrée) des points de grammaire, et une série d’exercices de réemploi : exercices à trous, transformations, mécanisation orale, écrit.

 

img_1847.jpg

Native speakers: can you show an anglophone some love?  (To show someone some love means to help them, to do something nice for them, to give them something.  Super-slangy.)

 

Finally, here is a gratuitous picture of a fat old bald guy and his cat Keiko.  As you can tell from the amount of light in the dwelling, the photo was taken in America, not in wintertime Paris.  The teddy bear on the floor is the property of my cat, and I suggest that you not touch it.

Conflict of interest statement: I have no conflicts of interest to declare.  I pay for a subscription to Sketch Engine, I bought the book, and Word Reference is free to one and all.

Becoming a computational linguist without double-majoring in linguistics and computer science

You’re an undergraduate, and you want to become a computational linguist? Here’s how to do it.

People who want to become computational linguists usually get a PhD in the subject.  Every once in a while, though, you run into someone who wants to study computational linguistics as an undergraduate.  In the United States, that means a student in what we call “college” and the rest of you call “university” (or, if you’re French, la fac’).  Undergraduate students in the US have one, and sometimes two, “majors”–the topic in which they will do the most coursework, and whose name will appear on their official paperwork when they graduate.  To “double-major” is to have two majors, rather than the usual one.  It’s not super-unusual to do this–I had a double major, in English and linguistics–but, it’s helpful to do a double major only if really necessary, as it’s a hell of a lot of work. 

If you’re getting a bachelor’s degree and want to be a computational linguist, a double major in computer science and linguistics is probably overkill.  (Overkill discussed in the English notes below.)  The most efficient way to become a computational linguist would be to get a degree in linguistics in a department that has computational linguists on the faculty, such as the University of Colorado at Boulder, or Ohio State University. If you want to try to become a computational linguist in a university that doesn’t have computational linguists in any department: first of all, your major should probably be linguistics, not computer science—computational linguists are a kind of linguist, right? (They are—I’m a computational linguist, and I’m a linguist.) You’ll want to do some coursework in the computer science department, but I wouldn’t actually recommend even a minor in computer science—that will probably require you to take some courses that won’t be the most useful ones for you, while taking up time that you could have been using to take courses that would be useful for you.

What should those courses be?  As many as possible from this list:

  • Corpus linguistics (usually offered in the linguistics department, but if your university doesn’t have such a course in the linguistics department, look for courses in the social science, communications, or media departments, possibly with names like “content analysis”)
  • Statistics (best in a linguistics or speech & hearing department–the traditional psychology department or agriculture school courses will kill you)
  • Machine learning (usually offered in a computer science department)
  • Natural language processing (presumably not what you meant by “computational linguistics,” or you would have said so)
  • Automatic speech recognition, if and only if you seriously think that you want to work in this area (often offered in the electrical engineering department)
  • Speech synthesis, if and only if you seriously think that you want to work in this area (again, often offered in the electrical engineering department)

Notice what’s not on this list: programming courses.  Take those if you know that you need them, but if you don’t know that you need them, then don’t take them.  Notice that I also haven’t said anything about linguistics courses: we’re assuming here that linguistics is your major, and you’re going to get a solid and well-rounded background in that field.

Picture source: Mariana Romanyshyn, Grammarly, Inc. https://www.slideshare.net/MarianaRomanyshyn/nlp-a-peek-into-a-day-of-a-computational-linguist-71510838


English notes:

overkill: doing way too much.  Examples:

How I used it in the post: If you’re getting a bachelor’s degree and want to be a computational linguist, a double major in computer science and linguistics is probably overkill. 

 

American English reading practice: John McCain, Trump, and torture

I’m a US military veteran, and proud of it. If anyone hates torture more than a military person, I don’t know who it is.

John McCain was shot down and held prisoner for 5 and a half years by the North Vietnamese. He never recovered physically from the frequent and lengthy torture sessions that he underwent. The son of an admiral, he was offered early release, but refused to be set free until all of his fellow prisoners were. Meanwhile, Trump avoided the draft, later bragged about it repeatedly in public, and attacked McCain repeatedly as a candidate and as president. Asshole.

Afin de travailler votre amerloque, voilà un reportage sur la torture, John McCain, et Trump.  On débute avec du vocabulaire, et puis je vous invite à suivre le lien vers l‘article dans son intégralité.

For more on a proud US military veteran’s opposition to Trump’s immoral ideas about torture, see this post.  Do you have corrections for my crappy French?  The Comments section awaits you.

Speaking out on torture and a Trump nominee, ailing McCain roils Washington

to speak out: to say something by way of a public statement, typically criticizing something.  Note that the preposition here is on, but it could also be about, and possibly others.

ailing: sick.  If English had the concept of langage soutenu, this would be soutenu, like many of the words in this article.

to roil: to stir up, to disturb, to put in a state of disorder (see Merriam-Webster, sense 2)

Sen. John McCain is 2,200 miles from Washington and hasn’t been on Capitol Hill in five months, but he showed this week that he remains a potent force in national politics and a polarizing figure within the Republican Party.

potent: powerful

polarizing: “to break up into opposing factions or groupings: a campaign that polarized the electorate” (Merriam-Webster, sense 3). Today’s Republican Party can generally be divided into people who like McCain, a war hero and basically OK guy right up to his recent death–versus immoral shitbags who cravenly support Trump no matter how low he stoops into the mud.  Thus: he’s a polarizing figure within the party.

But his declaration Wednesday in opposition to Gina Haspel, President Trump’s nominee for CIA director, has uniquely roiled the political scene. The denunciation has prompted reactions from fellow senators and a former vice president, as well as intemperate remarks from some Republicans aligned with Trump, including a White House aide.
to prompt: “to serve as the inciting cause of: evidence prompting an investigation” (Merriam-Webster, sense 3).

intemperate: not temperate, where “temperate” means “keeping or held within limits: not extreme or excessive: mild” and “marked by an absence or avoidance of extravagance, violence, or extreme partisanship” (Merriam-Webster, senses 2a and 2d).
It has revived the fierce debate over torture and its effectiveness in extracting information in the years since the Sept. 11 terrorist attacks — from a man who speaks from experience. McCain was held for 5½ years in a North Vietnamese prison, often deprived of sleep, food and medical care, after a jet he piloted was shot down over Hanoi.
No need for translation here, but for context, it’s worth knowing that McCain was a war hero and a staunch supporter of the US military–and hugely, vocally opposed to torture.  In contrast, Trump the draft-dodger (réfractaire, I think) has long advocated it.  Asshole.
Click here for the complete article in the Washington Post.

What a linguist would name a store if a linguist owned a store

to delight in: to really enjoy doing something; to like a thing very, very much. Examples:

Trump delights in insulting people who are less powerful than he is. Fucking bully–nothing more despicable than a fucking bully.

 

Trump delights in his ability to insult women’s appearance on the world stage. What a loser.

 

How it’s used on the sign: Delight in treasures old and new.

 

Brought to you by the Anglophone Association for the Promotion of Weird Prepositions.

Shit Interpreters Say

41393558_2170323476569085_6117008580353720320_n

Thanks to the person who posted this on my Facebook timeline–you shall remain anonymous, since you probably would not want it known that you know me far too well.


English notes

This little gem of humor about the realities of translation/interpretation uses a number of devices from very colloquial written English.  Three of them:

Wanna: want to. “Now I really wanna see a horrible faltering translation from one of these movies…”

Cuz: because.  Can also be written ’cause or cos, and cuz can also be “cousin.”

The thing is:  This is used to introduce an assertion that … hm… states some kind of problem or complication with whatever it is that is under discussion.  For example: Zipf, are you going to the lab meeting?  Well…the thing is, I double-booked myself at 1.  In the material, when the person says (I’m going to insert some punctuation, which will make it a lot easier to follow) the thing is, in one dialect this word is the name of a terrifying Demon but in a completely different language from the same area that… the “thing under discussion,” if you like, is the fact that the person is being expected to be able to translate this stuff (but there’s this complication related to the multiple possible meanings of the word in question).

Note that if you’re being really casual, you can shorten this to just thing is… omitting the “the.”

 

Vocabulary makeover, please

Zipf’s Law: The frequency of a word is roughly inversely proportional to its rank in a frequency-ordered list. Practically speaking, this means that an adult studying a second language will run across words that they don’t know every day of their life.
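To put numbers on that: in its usual formulation, Zipf’s Law says that the frequency of the word at rank r falls off as roughly 1/r^s, with the exponent s close to 1 for many corpora. Here is a minimal Python sketch of what that predicts. The function name zipf_expected_frequency and the count of 60,000 for the top-ranked word are made up for illustration, not measurements from any corpus.

```python
def zipf_expected_frequency(rank, top_frequency, exponent=1.0):
    """Expected count of the word at a given rank under an idealized Zipf curve."""
    # Frequency is proportional to 1 / rank**exponent, scaled so that
    # the rank-1 word has count top_frequency.
    return top_frequency / rank ** exponent

# Hypothetical corpus in which the most common word occurs 60,000 times.
for rank in (1, 2, 10, 100, 1000):
    expected = zipf_expected_frequency(rank, top_frequency=60000)
    print(f"rank {rank}: about {expected:.0f} occurrences")
```

The long tail is the point: under these assumptions the thousandth word still shows up dozens of times, and the words far beyond it keep showing up, just rarely enough that you haven’t met them yet.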

To paraphrase Newton: if I speak better French than other Americans, it is only because I spend more time memorizing vocabulary.  My daily, daily, daily morning ritual: with my first cigarette and cup of coffee, I memorize 10 new words.  Zipf’s Law being what it is, I don’t exactly have to go hunting for words that I don’t know—over the course of the day, I note down every new word that I come across, and the next morning, I pick 10 of them to cram into the small amount of remaining space in my much-abused brain.

My go-to dictionaries are WordReference.com, followed by the Farlex French dictionary app.  If I’m pretty sure that I need more context, I go to the Sketch Engine web site if I have Internet access, and Linguee if I don’t.  Pretty straightforward, my little routine.  Quotidian.  Mundane.

Every once in a while, though, it does not yield the desired result.  Case in point: capillotracté.  Not in Word Reference, not in Farlex French.  So: Google… which gets me definitions that I don’t understand, because they make reference to an expression that I don’t understand: tirer quelqu’un par les cheveux.  And so, dear Readers: can you help an amerloque out?


My odyssey started in a place where you don’t expect to see casual use of language: Le Figaro.  The Fig’ is one of the Big 3 French newspapers, along with Libération (left) and Le Monde (center).  As you have probably guessed, Le Figaro is to the right of center.  Like many conservative people, it gets excited about prescribing language usage.  I don’t get excited about prescribing language usage, but I do get excited about language, so although I subscribe to Le Monde (I’m a lefty myself, but I figure that I’ll get the most representative sample of vocabulary more towards the center), I will often go to the Fig’ to read its language articles.  As you might expect from prescriptivists, they tend to be…precise.  Clear.  Unambiguous.  (Si ce n’est pas clair, ce n’est pas français, right?  Harumph.)

So, I’m reading an article on the subject of how to refer to Line 1 of the Paris metro—ligne un, or ligne une?—when I come across a word that I don’t know. I promptly copy it, along with the context in which I saw it, onto an index card (something that does not exist in France–see this post on the mystery):

The next morning, I go to look it up–and find nothing. Word Reference: no love. The Farlex French dictionary app: nope. Fine–I go to Google. I find definitions there, but they all refer to an expression whose meaning is opaque to me: tirer quelqu’un par les cheveux. For example:

img_0545
img_0546

img_0547

How about it, native speakers?  Can you help an amerloque out?  I’d pull my hair out over this, but I’m already bald…

The rule dit capillotracté?  Ligne un, because it’s a number, not the indefinite article.  The indefinite article un/une is inflected for gender, but the number un is not.


French notes

l’amerloque: American, person or language; noun or adjective. Familier et péjoratif.  Wiktionnaire alleges that it comes from Amérique plus oque, providing no evidence; I therefore claim equal plausibility for my own little theory, which is that it comes from Amérique plus locuteur.  Examples from Wiktionnaire, from which I stole them quite gleefully ’cause I don’t like their etymology:

  • […] mais c’est pas un spectacle pour une dame, rigola le jeunot à la casquette amerloque. — (Léo Malet, Les rats de Montsouris, 1955)
  • Nom de Dieu, quand est-ce que tu vas arrêter de parler l’amerloque ? — (Sébastien Monod, Rue des Deux Anges, 2005)

English notes

makeover: “An overall treatment to improve something or make something more attractive or appealing.” (Source: American Heritage® Dictionary of the English Language, Fifth Edition. (2011). Retrieved September 11 2018 from https://www.thefreedictionary.com/makeover.)  There is an enormous quantity of makeover-themed TV shows.  Don’t judge me.

consult (noun): As a noun, this is stressed on the first syllable: CONsult.  A consult is when you send someone or something to an expert, typically in a medical context.  For example, if you go to your doctor and they are pretty sure that you are having a neurological problem, they might tell their clerk to set you up with a neurology consult.

In coming up with a title for this article, I thought about Vocabulary consult versus Vocabulary makeover.  The former would make a hell of a lot more sense, but since the word that I’m asking you to help me with has something to do with hair, I went for Vocabulary makeover.  Don’t like my choice?  Write your own fucking blog on the implications of the statistical properties of language for second-language learners.

to pull one’s hair out (over something): to have reached the point of frustration with a problem and still be unable to solve it.  Examples:

Bill, can you help me?  I’m pulling my hair out here… Every time I call the constructor, I get a “String Index Out Of Bounds” error, which makes no sense to me whatsoever…   

Dude, I’m pulling my hair out over this budget. Every time I try to include the annual COL increase for salaries, the spreadsheet doubles the amount allotted for travel to the American Medical Informatics Association annual meeting. What the FUCK??

How I used it in the post: How about it, native speakers?  Can you help an amerloque out?  I’d pull my hair out over this, but I’m already bald…