What computational linguists actually do all day: The debugging edition

We already knew that the patient had the primary, secondary, and tertiary stages of syphilis.

Tell someone you’re a computational linguist, and the next question is almost always this: so, how many languages do you speak?  This annoys the shit out of us, in the same way that it might annoy a public health worker if you asked them how many stages of syphilis they have.  (There are four.  When I was a squid (military slang for “sailor”), one of our cardiologists lost her cool and threw a scalpel.  It stuck in one of my mates’ hands.  We already knew that the patient had the primary, secondary, and tertiary stages of syphilis, so my buddy was one unhappy boy…)

Being asked “how many languages do you speak?” annoys us because it reflects a total absence of knowledge about what we devote our professional lives to.  (This is obviously a little arrogant–why should anyone else bother to find out about what we devote our professional lives to?  That’s our problem, right?  Nonetheless: the millionth time that you get asked, it’s annoying.)  It’s actually easier to explain what linguistics is in French than it is in English, because French has two separate words for things that are both covered by the word language in English:

  • une langue is a particular language, such as French, or English, or Low Dutch.
  • le langage is language as a system, as a concept.
[Image: interaction of tone with foot structure. Caption: No, I did not just make up “tone-bearing unit.”]

Linguists study the second, not the first.  People who call themselves linguists might specialize in vowels, or in words like “the,” or in how people use language both to segregate themselves and to segregate others.  Whatever it is that you do, you’re basing it on data, and the data comes from actual languages, so you might work with any number of them–personally, I wrote a book on a language spoken by about 30,000 people in what is now South Sudan.  The point of that work, though, is to investigate broader questions about langage, more so than to speak another language–that’s a very different thing.  I can tell you a hell of a lot about the finite state automata that describe tone/tone-bearing-unit mappings in that language, but can’t do anything in it beyond exchange polite greetings (and one very impolite leave-taking used only amongst males of the same age group).

So, if you’re not spending your days sitting around memorizing vocabulary items in three different regional variants of Upper Sorbian, what does a linguist actually do all day?  Here’s a typical morning.  I was trying to do something with trigrams (3-word sequences–approximately the longest sequence of words that you can include in a statistical model of language before it stops doing what you want it to do), when I ran into this:

Screen Shot 2018-03-28 at 04.01.05

Fixed that one, and then there was a problem with my x-ray reports (my speciality is biomedical languages)…

Screen Shot 2018-03-28 at 03.30.00

Fixed that one, and then…

Screen Shot 2018-03-28 at 03.26.09

…and your guess may well be better than mine on that one.  God help you if you run into this kind of thing, though…

Source: me.

…because that message about not having some number of elements (a) usually takes forever to figure out, and then (b) once you do figure it out, reflects some kind of problem with your data that is going to give you a lot of headaches before you get it fixed.
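For anyone who has not met trigrams before, here is a minimal sketch in Python of what counting them looks like. It is an illustration only, not the code from the screenshots above: the sample sentence is made up, the `trigrams` helper is a hypothetical name, and the whitespace tokenization is far cruder than anything you would use on real data.

```python
from collections import Counter


def trigrams(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    # zip stops at the shortest input, so lists shorter than 3 yield nothing.
    return zip(tokens, tokens[1:], tokens[2:])


# Made-up sample text; real work would start from a corpus file.
text = "the cat sat on the mat and the cat sat on the rug"
tokens = text.split()  # naive whitespace tokenization (an assumption)

counts = Counter(trigrams(tokens))

# Show the most frequent trigrams with their counts.
for gram, n in counts.most_common(3):
    print(" ".join(gram), n)
```

A sequence of 13 tokens yields 11 trigrams, which is why counts get sparse so quickly as the sequences get longer.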

I spend a lot of my day looking at things like this:

Source: me.

…which is a bunch of 0s and 1s describing the relationship between word frequency and word rank, plus what goes wrong when your data gets created on an MS-DOS machine, which I will have to fix before I can actually do anything with said data (see the English notes below for what “said data” means); or this…

Source: me.

…which tells me some things about the effects of “minor” preprocessing differences on type/token ratios–they’re not actually so minor; or this…

Source: Cohen, K. B., Verspoor, K., Fort, K., Funk, C., Bada, M., Palmer, M., & Hunter, L. E. (2017). The Colorado Richly Annotated Full Text (CRAFT) corpus: Multi-model annotation in the biomedical domain. In Handbook of Linguistic Annotation (pp. 1379-1394). Springer, Dordrecht.

…which tells me that either there are some errors in that data, or there is an enormous amount of variability between the official terminology of the field and the way that said terminology actually shows up in the scientific literature.  (See the leftmost blob–it indicates that there are plenty of cases of one-word terms that show up as more than 5 words in actual articles.  That is certainly possible–“disease in which abnormal cells divide without control and can invade nearby tissues” is 13 words that together correspond to the single-word term “cancer”–but I was surprised to see just how frequent those large discrepancies in lengths were.  In my professional life, I love surprises, but they also suggest that you’d better consider the possibility that there are problems with the data.)
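The earlier remark about “minor” preprocessing differences and type/token ratios is easy to see on a toy example. The following is a minimal sketch in Python, assuming a made-up sentence and three hypothetical preprocessing choices (none of this is the pipeline behind the figures above): the number of tokens stays the same, but the number of distinct types, and so the ratio, changes with each choice.

```python
import re


def type_token_ratio(tokens):
    """Distinct word forms (types) divided by total words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Made-up sample in the spirit of an x-ray report.
text = "The heart is normal. The lungs are clear; the heart size is normal"

# Three "minor" preprocessing variants (all illustrative assumptions):
raw = text.split()                          # split on whitespace only
lowered = text.lower().split()              # also fold case
stripped = re.findall(r"[a-z]+", text.lower())  # also drop punctuation

for name, tokens in [("raw", raw), ("lowered", lowered), ("stripped", stripped)]:
    print(f"{name:9s} tokens={len(tokens):2d} "
          f"types={len(set(tokens)):2d} ttr={type_token_ratio(tokens):.3f}")
```

On a 13-word sentence the differences look small, but the same choices applied to a whole corpus can shift the ratio enough to change what you conclude from it.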

So, yeah: it’s not like I can’t get my hair cut in Japanese, or explain how to do post-surgical hand therapy in Spanish, or piss off a con artist in Turkish (a story for another time)–but, none of those have anything to do with my professional life as a computational linguist.  That’s all about computing, which means computers, and I hate computers.  Ironic, hein?  Life is fucking weird, and I like it that way.

English notes

[Image: queneau exercices de figure. Caption: I think this is Queneau, but couldn’t swear to it. Source: it’s all over the place.]

said: a shorter way of saying “the aforementioned.”  Both of these are characteristic of written language, more so than of spoken language.  Even in writing, though, it’s pretty bizarre if you’re not a native speaker, which is why I picked it to talk about today.  A French equivalent would be ledit/ladite/lesdites (not sure about that last one–Phil dAnge?), which I have a soft spot for ’cause I learned it in Queneau’s Exercices de style.  

Trying to think of helpful ways to recognize this bizarre usage of said, I went looking for examples of said whose part of speech is adjectival.  Here are some of the things that I found:

  • As such, any dispute that you may have on goods purchased or services availed of should be raised directly with said merchant/s.
  • seemingly endless shopping list to conquer, a shrinking budget with which to do said shopping ~ and let’s face it: our businesses don’t run themselves while we’re visiting relatives.
  • This is a monumental pain in the ass — you don’t exactly trip over Notary Publics in today’s day and age — and I can only assume came from said company having a problem with identity once sometime in the last twelve years, and the president saying “fuck it.”

How it appears in the post:

  • …what goes wrong when your data gets created on an MS-DOS machine, which I will have to fix before I can actually do anything with said data;…
  • Either there are some errors in that data, or there is an enormous amount of variability between the official terminology of the field and the way that said terminology actually shows up in the scientific literature. 

debugging: A technical term in software programming that refers to finding and fixing problems in your program.  I used it in the title of today’s post because most of the illustrations that I gave of what I do all day are of irritating problems of one sort or another that I (really did) have to track down in the course of my day.  They don’t tell you in school that tracking down such things is literally about 80% of what any programmer spends their time doing.  Of course, any problem in a computer program is a problem that you created, so you can get irritated about them, but you most certainly cannot take your irritation out on anyone else…

8 thoughts on “What computational linguists actually do all day: The debugging edition”

  1. About halfway through that, my eyes started spinning in different directions and I had to ask the cat to read the rest and explain it to me. It was an experience that left me grateful that you do what you do so that I don’t have to.

      1. re: spinning eyes. I think if people who criticize Perl knew more they might compare R to it. Those are the same that would say that Perl reads like German sounds. They don’t appreciate either.

