June 2019 – Zipf's Law

What computational linguists actually do all day: The recursion edition

I know, I know: computational linguistics sounds like the world’s most glamorous profession, right? You imagine a bunch of geeks in hip glasses sitting around talking about Sanskrit is-aorist verbs, playing a little foosball after a free sushi lunch in the Google cafeteria, and then writing code to translate Jacques Prévert into idiomatic American English with a little stock ticker in the upper-right corner of their screen so that they can watch the value of their vested options go up, and up, and up, and…

In reality, I’m sitting in the international student dormitory of a well-known East Coast American university. Yesterday was a good day, because the shitwad in room D2 left his dirty dishes in the sink for the full 48 hours that let me feel fine about throwing the reeking things in the trash can.

But, then I realized something: for the book I’m writing, I can only get easy copyright releases for papers published in 2016 or later. That means that I need to do a serious analysis of what I’m citing in the book, which means…writing code (the instructions that make up a computer program) to go through a bunch of citations to figure out what year they were published, in which conference or journal, etc., etc., etc.

That means that I write stuff that looks like this:

open (IN, "/Users/transfer/Dropbox/Scripts-new/bioNLP.bib") || die "Couldn't open input file...\n";

…and then spend a lot of time looking at the error message “Couldn’t open input file”, ’cause I was missing the slash at the beginning of this:

/Users/transfer/Dropbox/Scripts-new/dummy.bib

…which I was happy to figure out, but didn’t really find all that interesting.

Then I spent a lot of time writing things like the following:

    if ($line =~ /title.*=\{(.*)\},$/) {
        $DEBUG && print "TITLE: $1\n";
        $entry{"TITLE"} = $1;
    }

…which wasn’t particularly difficult, but caused a little pinprick in my soul, ’cause I knew as I was writing it that it would mess up any time that I had a title with a curly-brace in it ({}), and practicing your profession shittily never feels good. For reasons that we need not go into, having curly-braces in the title of a work happens a hell of a lot more often than you might think, and fixing that little flaw would require writing something called a recursive function, which really shouldn’t be that complicated for a computational linguist (recursion is one of the fundamental properties of language (the picture at the top of this page is a humorous illustration of recursion (which is probably oxymoronic (and as you might have guessed, these embedded parentheticals are themselves an example of recursion (as is the second sentence of this post (an example, that is–not necessarily a humorous one (unlike the cartoon))))))), and yet still, is more than my little brain de pois chiche (garbanzo bean) can handle on a Sunday morning.
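For the curious, here is roughly what such a recursive function could look like. This is a minimal sketch, not the code from the script above: the sub names read_braced and extract_title are made up for illustration, and it assumes that the title value is wrapped in curly braces and that no brace in it is escaped with a backslash.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: given a string and
# the position of an opening brace, return the text inside that brace group
# and the position just past its closing brace. Returns an empty list if
# the braces never balance.
sub read_braced {
    my ($text, $pos) = @_;
    return unless substr($text, $pos, 1) eq '{';
    $pos++;                                  # step past the opening brace
    my $inner = '';
    while ($pos < length($text)) {
        my $char = substr($text, $pos, 1);
        if ($char eq '}') {                  # this group is closed
            return ($inner, $pos + 1);
        }
        if ($char eq '{') {                  # nested group: recurse into it
            my ($nested, $next) = read_braced($text, $pos);
            return unless defined $next;     # the nested group never closed
            $inner .= '{' . $nested . '}';
            $pos = $next;
            next;
        }
        $inner .= $char;
        $pos++;
    }
    return;                                  # ran out of text before the closing brace
}

# Hypothetical wrapper: find "title =" in a line and read its braced value.
sub extract_title {
    my ($line) = @_;
    # Stop just before the opening brace; \b keeps "booktitle" from matching.
    return unless $line =~ /\btitle\s*=\s*(?=\{)/gi;
    my ($title) = read_braced($line, pos($line));
    return $title;
}

print extract_title('title = {Parsing {BibTeX} with {P}erl},'), "\n";
# prints: Parsing {BibTeX} with {P}erl
```

The recursion happens where read_braced meets an opening brace inside a group: it calls itself to read the inner group, then carries on with the outer one.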

Then, in order to be able to see any actual output, I had to write code like the following:

        my $output = "";
        for my $field (@fields) {
            #print "$entry{$field}\t";
            $field .= $entry{$field} . "\t";
        }
        $field =~ s/\t$//;
        print "$field\n";
    }

…which was neither particularly challenging nor particularly interesting, but caused my program to crash quite rudely, ’cause for reasons that we need not go into, I should have written

        my $output = "";
        for my $field (@fields) {
            #print "$entry{$field}\t";
            $output .= $entry{$field} . "\t";
        }
        $output =~ s/\t$//;
        print "$output\n";
    }

That gave me the first thought I’d had all morning that was actually interesting, as I contemplated how hard I’m pretty sure that it would have been–how impossible I at least hope it would be, for the moment at any rate–for a computer to find and fix that particular bug.
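Finding that bug is one thing; another option is to write the step so that there is no accumulator variable to mix up with the loop variable in the first place. A minimal sketch, assuming that the entry hash and the field list look the way they do in the snippets above. The sub name row_for and the sample values are made up, and the defined-or operator (//) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: build one tab-separated row from an entry hash.
# join puts tabs only between values, so there is no trailing tab to strip
# and no accumulator variable to confuse with the loop variable.
sub row_for {
    my ($entry, @fields) = @_;
    return join "\t", map { $entry->{$_} // '' } @fields;
}

my %entry = (TITLE => 'A made-up title', YEAR => 2017, VENUE => 'A made-up venue');
print row_for(\%entry, qw(TITLE YEAR VENUE)), "\n";
```

Missing fields come out as empty strings here, which is a design choice for the sketch, not something the original script is known to do.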

Another half hour or so of work, and now I can actually see what I wanted to know, which is the venues where the works that I cite were published. This was useful, in that I noticed that one that should be heavily represented in my bibliography in fact barely figures there at all. But, what it meant was that I needed to Google hither and yon to find out how to search Google Scholar (we’re just getting more and more meta here all the time) by name of conference. Not particularly challenging; but, not particularly interesting, either.
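The counting itself is the easy part. Here is a minimal sketch of it, assuming that each citation has already been parsed into a hash with VENUE and YEAR keys. Those key names, the sub name tally, and the three entries are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many parsed entries share each value of
# one field (for example VENUE or YEAR).
sub tally {
    my ($field, @entries) = @_;
    my %count;
    $count{ $_->{$field} // 'UNKNOWN' }++ for @entries;
    return \%count;
}

# Made-up entries, standing in for what the parsing code would produce.
my @entries = (
    { VENUE => 'Venue A', YEAR => 2014 },
    { VENUE => 'Venue B', YEAR => 2017 },
    { VENUE => 'Venue A', YEAR => 2018 },
);

# Venues, most frequent first.
my $by_venue = tally('VENUE', @entries);
for my $venue (sort { $by_venue->{$b} <=> $by_venue->{$a} || $a cmp $b } keys %$by_venue) {
    print "$venue\t$by_venue->{$venue}\n";
}

# The copyright question: how many entries predate the 2016 cutoff?
my @older = grep { $_->{YEAR} < 2016 } @entries;
print scalar(@older), " entries published before 2016\n";
```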

This is a whiny post, right? Totally tongue in cheek, though. Actually, I have the incredible good luck to love what I do, and the book in question really is a labor of…a labor of love.

English notes

Something in this post that is perfectly fine English but that I probably would not have written if I didn’t spend a lot of time writing (poorly) in French these days:

I noticed that a publication venue that should be heavily represented in my bibliography in fact barely figures there at all.

An educated speaker of the langue de Molière will be aware that figurer sur une liste (to appear on a list) is perfectly natural (as far as I know) French. What I wrote is perfectly fine English, but I would suspect that it doesn’t occur very often, even in written academic or official English. Why did it pop out of my mouth (well…fingers) today? French-language interference, which is funny, ’cause in language teaching we often talk about first-language interference (carrying over aspects of the grammar of your native language, such that they fuck up your mastery of a foreign or second language), but I can’t recall ever running into the concept of second-language interference, and French is most definitely a second language for me, not my first. Go figure…

go figure is an expression that expresses surprise about something that you’ve just been talking about, or an assertion that you are about to make. How I used it in the post:

I can’t recall ever running into the concept of second-language interference, and French is most definitely a second language for me, not my first. Go figure…

How to design the methods for a data science, machine learning, or natural language processing project: Part I

Yay—I have a data science/machine learning/natural language processing project! Now what do I do??

I occasionally use this blog to try out materials for something that I will be publishing. This post is a casual version of something that will go into a book that I’m writing about…writing.

So, you’re going to do a data science project. Maybe you’re going to use natural language processing (processing: using a computer program to do something; natural language: human language, as opposed to computer languages) to analyze social media data because you want to find out how veterans feel about the medical care that they receive through the Veterans’ Administration. (Spoiler alert: a number of my buddies are vets, and they do indeed use the Veterans’ Administration health care system, and they both (a) are happy with it, and (b) recommend it to the rest of us.) Maybe you’re doing it as a project for a course; maybe you’re doing it as your first assignment at your high-paying brand-new data scientist job; maybe you’re planning to write a research paper for a journal on military health care. How do you go about doing it?

An excellent piece of advice when you’re trying to figure out how to do any research project: write out what you’re going to do, in prose, before you start doing it. As my colleague Graciela Gonzalez, of the Health Language Processing Laboratory at the University of Pennsylvania School of Medicine, puts it:

Most of us make some mistakes in the process of thinking through how we will test our hypothesis. The advantage of writing down what you’re going to do–the Methods section of a research paper, the design of your research project–before you do it is that when you see it on paper, spelled out explicitly and step by step, you will often notice the logical or procedural errors in what you were thinking, and then you won’t spend weeks making those errors before realizing that they were never going to get you where you wanted to go.

OK, so: you know that you’re going to write out your methods, very explicitly and in the order in which you will do them. But, how do you figure out what those methods should be?

An efficient way to go about this is to read research papers by other people who have done similar things. As you read them, you’re going to look for a general pattern–think of this as an example of the frameworks that we’ve talked about in other parts of this book. Returning to our example of using natural language processing to analyze social media data, you might go to PubMed/MEDLINE, the National Library of Medicine’s database of 27 million biomedical research articles, and search for papers that mention either natural language processing or text mining, and also have the words social media in the title or abstract. (That search finds a set of 190+ papers.)

Among the results of that search are three papers that study problems similar to yours: they’re using natural language processing to find women talking about their pregnancy, people talking about adverse reactions to drugs, or people talking about abuse of prescription medications–not exactly what you need to do, but similar. You’ll see two steps that are carried out in all of them. Look for where they’re mentioned in the abstracts of the three papers:

METHODS: Our discovery of pregnant women relies on detecting pregnancy-indicating tweets (PITs), which are statements posted by pregnant women regarding their pregnancies. We used a set of 14 patterns to first detect potential PITs. We manually annotated a sample of 14,156 of the retrieved user posts to distinguish real PITs from false positives and trained a supervised classification system to detect real PITs. We optimized the classification system via cross validation, with features and settings targeted toward optimizing precision for the positive class. For users identified to be posting real PITs via automatic classification, our pipeline collected all their available past and future posts from which other information (eg, medication usage and fetal outcomes) may be mined.

Sarker, Abeed, Pramod Chandrashekar, Arjun Magge, Haitao Cai, Ari Klein, and Graciela Gonzalez. “Discovering cohorts of pregnant women from social media for safety surveillance and analysis.” Journal of Medical Internet Research 19, no. 10 (2017): e361.

METHODS: One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies.

Sarker, Abeed, and Graciela Gonzalez. “Portable automatic text classification for adverse drug reaction detection via multi-corpus training.” Journal of Biomedical Informatics 53 (2015): 196-207.

METHODS: We collected Twitter user posts (tweets) associated with three commonly abused medications (Adderall(®), oxycodone, and quetiapine). We manually annotated 6400 tweets mentioning these three medications and a control medication (metformin) that is not the subject of abuse due to its mechanism of action. We performed quantitative and qualitative analyses of the annotated data to determine whether posts on Twitter contain signals of prescription medication abuse. Finally, we designed an automatic supervised classification technique to distinguish posts containing signals of medication abuse from those that do not and assessed the utility of Twitter in investigating patterns of abuse over time.

Sarker, Abeed, Karen O’Connor, Rachel Ginn, Matthew Scotch, Karen Smith, Dan Malone, and Graciela Gonzalez. “Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter.” Drug Safety 39, no. 3 (2016): 231-240.

Now we can abstract out the two steps that we found in all three papers:

1. The authors built a data set.
2. The authors used a technique called classification–a form of machine learning–to differentiate between the social media posts that did and did not talk about a person’s own pregnancy, or an adverse reaction to a medication, or abuse of prescription medications.

So, now you have a basic outline of your methodology. Your goal being to use natural language processing to investigate, using social media data, how veterans feel about the care that they receive through the Veterans’ Administration health care system, maybe your methodology will look like this:

1. Create a data set containing tweets in which veterans are talking about how they feel about the care that they receive in the VA health care system.
2. Use machine learning to classify those tweets into ones where the vets feel (a) positive, (b) negative, or (c) neutral about that care.

OK, so: now you can expand that. You’re quickly going to realize that Step 2–classifying those tweets–is actually going to require you to be able to do three classifications:

1. You have to be able to differentiate tweets written by veterans from tweets written by everybody else.
2. You have to be able to differentiate tweets where the vets are talking about the VA health care system from tweets where they’re talking about things other than the VA health care system.
3. You have to be able to classify whether the feelings that they express about the VA health care system are positive, negative, or neutral.
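Run in order, those three classifications form a cascade: a tweet only reaches the sentiment step if it gets through the first two. The sketch below shows that structure and nothing more. The keyword rules are crude placeholders standing in for the trained classifiers that a real project would need, and every name and pattern in it is invented for illustration.

```perl
use strict;
use warnings;

# Placeholder "classifiers": keyword rules standing in for trained models.
# Each stage is a label plus a test that returns true or false for a tweet.
my @stages = (
    [ written_by_veteran => sub { $_[0] =~ /\b(?:my service|when I served|as a vet)\b/i } ],
    [ about_va_care      => sub { $_[0] =~ /\bVA\b/ } ],
);

# Placeholder for the third classification: positive, negative, or neutral.
sub classify_sentiment {
    my ($tweet) = @_;
    return 'positive' if $tweet =~ /\b(?:great|grateful|happy)\b/i;
    return 'negative' if $tweet =~ /\b(?:awful|terrible|angry)\b/i;
    return 'neutral';
}

# Run the classifications in order; a tweet that fails an earlier stage
# never reaches the later ones.
sub classify_tweet {
    my ($tweet) = @_;
    for my $stage (@stages) {
        my ($name, $test) = @$stage;
        return "rejected: not $name" unless $test->($tweet);
    }
    return classify_sentiment($tweet);
}

print classify_tweet('As a vet, I am grateful for my VA doctor'), "\n";
# prints: positive
```

One consequence of chaining the steps this way is that mistakes compound: anything that the first stage wrongly rejects is lost to the other two, so each stage needs its own labeled data and its own evaluation.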

Now that you’ve started to flesh out your methodology, you realize something: creating that data set is going to take a really long time, since you essentially have to be able to label three different kinds of things in the social media posts. You have a finite amount of time and resources with which to do it, so how are you going to make that possible?

Faced with an enormous amount of work to accomplish with limited time and resources, the most sane approach is this: go to your supervisor, show them your detailed methods plan, and let them come to the conclusion that they had better either (a) give you a lot more resources, or (b) modify your assignment. Having gone through this multiple times over the course of my career, I can tell you that (b) is a hell of a lot more likely. What is the modified assignment going to look like? It’s probably going to be a reduction of the task to “just” the task of detecting tweets that were and weren’t written by veterans. Now you can go back to your outline, and modify it:

1. Create a data set containing tweets written by veterans, and tweets written by anybody else.
2. Use machine learning to classify those tweets into the ones that were written by veterans, and the ones that weren’t.

This is going to be hard enough, believe me. Here are some examples of what those tweets might look like–I made them up, but they’re totally plausible:

1. HM1 Zipf here, USS Biddle 1980-1982–BT3 Raven McDavid, you out there?
2. AFOSC raffle drawing at 1500–win that lawnmower and help us buy books for the squadron?
3. FTN today, FTN tomorrow, FTN and fuck Chief Chomsky til I get out this motherfucker
4. Mario Brothers, still nothin like it, bitchboys

Have you figured it out? Here are the answers:

1. Clearly written by a veteran.
2. Almost certainly written by the spouse of an active duty Air Force officer, so not written by a veteran.
3. Clearly written by a sailor who is still on active duty, so not written by a veteran.
4. No clue who it was written by, and/but there’s no reason whatsoever to think that it was written by a veteran, so it should be classified as not written by a veteran.

What’s that you say? It wasn’t clear to you at all? Think about this: if it wasn’t clear to you, it’s certainly not going to be clear to a computer program, so your classification step is going to be difficult. In fact, if it’s not clear to you, you’re going to have a hell of a difficult time building the data set–time to go back to your supervisor and ask for the resources to hire some veterans to help you out!

…and (4) raises a super-difficult question: what the hell counts as a reasonable experimental control for this research project? (Spoiler: I don’t know, and I have a doctoral degree in this particular topic.)

All of this to say:

1. Your redefined project is going to be plenty hard, thank you very much.
2. You wouldn’t know how crucial it was to redefine said project if you hadn’t started the process of writing out what exactly you’re going to do.

…and hell–you hadn’t even gotten to the “exactly” part yet! So: take Graciela’s point seriously, and write some things down before you start doing anything else.

…and now you can think about what you’re going to measure to figure out whether or not you were successful in doing what you were trying to do.
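One common choice for a classification task like this is to compare the program’s labels against hand-made ones and compute precision, recall, and F1 for the label that you care about (the first abstract above mentions optimizing precision for the positive class). A minimal sketch, assuming that both sets of labels are simple lists in the same order. The sub name evaluate and the four labels are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: compare predicted labels against hand-made ("gold")
# labels and return precision, recall, and F1 for one label of interest.
sub evaluate {
    my ($positive, $gold, $predicted) = @_;
    my ($tp, $fp, $fn) = (0, 0, 0);
    for my $i (0 .. $#{$gold}) {
        my $is_gold = $gold->[$i] eq $positive;
        my $is_pred = $predicted->[$i] eq $positive;
        $tp++ if  $is_gold &&  $is_pred;     # true positive
        $fp++ if !$is_gold &&  $is_pred;     # false positive
        $fn++ if  $is_gold && !$is_pred;     # false negative
    }
    # Guard against division by zero when a count is empty.
    my $precision = ($tp + $fp) ? $tp / ($tp + $fp) : 0;
    my $recall    = ($tp + $fn) ? $tp / ($tp + $fn) : 0;
    my $f1 = ($precision + $recall)
        ? 2 * $precision * $recall / ($precision + $recall)
        : 0;
    return ($precision, $recall, $f1);
}

# Made-up labels for four tweets.
my @gold      = qw(veteran other veteran other);
my @predicted = qw(veteran veteran other other);
printf "P=%.2f R=%.2f F1=%.2f\n", evaluate('veteran', \@gold, \@predicted);
# prints: P=0.50 R=0.50 F1=0.50
```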

Linguistic geekery: Raven McDavid was a dialectologist back in the day. He is said to be the inspiration for the Harrison Ford character in Raiders of the Lost Ark. Chomsky is Noam Chomsky, the most important (although not the best, in my humble opinion) linguist of the 20th century. Where they appear in the post:

HM1 Zipf here, USS Biddle 1980-1982–BT3 Raven McDavid, you out there?
AFOSC raffle drawing at 1500–win that lawnmower and help us buy books for the squadron?
FTN today, FTN tomorrow, FTN and fuck Chief Chomsky til I get out this motherfucker
Mario Brothers, still nothin like it, bitchboys

We who smell of anchovy pizza salute you

I thought how horrible it would be to kiss someone whose beard smelled of anchovy pizza.

The adventure of the moment has me in Philadelphia, one of the great eastern cities of the Colonial era, where I have the privilege of spending the summer as a guest of the Health Language Processing Laboratory of the University of Pennsylvania School of Medicine. I’m living in a dormitory on the edge of the campus, which is a beautiful part of town–a beautiful part of town that borders what we call in French a quartier sensible. In plain English: a shitty neighborhood. The grocery store that I go to is right on the border between the two, so the clientele is a mix of two very different populations: hip young college students and faculty, and the kinds of people who live in shitty neighborhoods. (Word to the wise: don’t go past 45th Street.)

So, yesterday I’m standing in line in said grocery store. In front of me is a smart-sounding young woman. Being an American, she’s talking on her cell phone. Being an American, she’s talking loudly enough for everyone else to hear. Being American, we’re all listening. (In France, we do not have loud conversations on the phone in public–rude, rude, rude.) She’s talking about what medical specialty she’ll be going into–it’s clear which side of the border she’s from.

In the line next to us is an old dude who clearly is from the other side of the border. He stands in line with his hands behind his back. Being an American, he’s fat. Ratty jeans, and his nasty t-shirt reveals a jailhouse tattoo on his arm that says FTW, which for those of us of a certain age stands for Fuck The World. His beard looks like it probably smells of anchovy pizza.

The smart-sounding young lady is talking about what medical specialty she’ll be going into. Funny thing is, she’s not just talking about it–she’s apologizing for it. Mom, dermatologists help people, too. I mean, at least in small ways, we can affect their lives a little bit. …No, it’s not like saving lives all day in the emergency room, but we can make a little bit of a difference for people.

So, I’m standing there in line, and I notice something: my favorite soap is on sale for a dollar off. Computational linguistics is not nearly as remunerative as one might think, and that dollar off could be translated into the luxury of a cup of coffee, so as the smart-sounding young lady says her goodbyes and hangs up, I think: I’m getting out of this line and I’m grabbing a bottle of soap, and the wait be damned. Then the nasty-looking old dude with his hands behind his back says something, and I think: hearing this one out is going to be worth spending an extra dollar for a bottle of soap–stay right where you are.

“Are you a doctor? You sound like a doctor.”
Well…I’m an intern. I just finished medical school.
“If my grandma has a melanoma, is that a bad thing?”
She should definitely go to a doctor. It can be fatal.
“Like, a skin doctor?”
Yes, a dermatologist.
“A…dermalologist?”
Yes, a dermatologist. Melanoma can be fatal, but if it’s caught in time, it could save her life.
“So, a skin doctor could save someone’s life? Who woulda thought?”

For the first time, she looked at him, right in the face. I’m guessing that she was thinking something like I was thinking: nobody as ancient as this old fuck has a living grandmother. What the hell?

“I sure do love my grandma,” he said. And he smiled.

She looked at him for a while. You could see the wheels turning. And then she smiled, too. Thank you, she said.

She paid for her groceries, and left. The old fuck turned to put his groceries on the counter, and I saw what was in those hands behind his back: a book about existentialist perspectives on psychoanalysis. I looked at his FTW tattoo. I thought how horrible it would be to kiss someone whose beard smelled of anchovy pizza. I bet his dead grandma didn’t mind, though.

to say something in plain English: to say something clearly, with no big words or complicated sentences. (Phil dAnge: is there a French equivalent?) Examples:

Can someone explain to me in plain English, how the fuck can you believe general Theo Magath gives a shit or two for Paradis and Walldians?

Why the heck are we portraying HIM as a hero and Zeke the villain? Have you forgotten what MARLEY is all about???

— 🦉 (@euthanasiaworks) June 10, 2019

Idea: quiz show where engineers have to describe things in plain english and “non engineers” buzz them out if they use jargon / don’t explain clearly.

— Prof Lucy Rogers – Inventor With A Sense Of Fun (@DrLucyRogers) June 18, 2019

The car rental industry is synonymous with poor customer service and badly-worded policies. It is time it started communicating in plain English for the benefit of customers https://t.co/e4wApzka6R pic.twitter.com/arrHbhsmGb

— VisibleThread (@VisibleThread) June 18, 2019

funny thing is: …means that you’re about to describe something that is strange about a situation. Could also be weird thing is, strange thing is, and you could also put the in front of it. Examples:

Funny thing is you think you’re a bigger asshole than I am, we both know I’m the bully in the relationship 🤒

— Palo (@dtxpalo) June 19, 2019

Funny thing is when I was attempting to learn Korean the first phrase I ever learned was “My head hurts” it was a sign SHHHHH JUST LET ME HAVE IT😭😭 #StrayKidsComeback pic.twitter.com/3BPSGUSQgB

— IkyHeadHurtsSKZCB~#PrayForSudan (@IkyRamen) June 19, 2019

Funny thing is, you make people feel special and happy but once you’re tired of them you tell them you are not ready. Oops. Don’t go around hurting people. 🍵

— Ryan not Bang🐺 (@christiangrayed) June 19, 2019

How I used it in the post: Funny thing is, she’s not just talking about it–she’s apologizing for it.

remunerative: adjective that means that something pays a good salary. How I used it in the post:

Computational linguistics is not nearly as remunerative as one might think, and that dollar off could be translated into the luxury of a cup of coffee, so…

(dollar) off: “Off” here means the amount of a reduction in price. Examples:

It’s Monday friends. Don’t forget about our Happy hour from 5pm till 6pm, where cocktails and wines are $2.00 off and beers and non alcoholic beverages are a dollar off. We open at 5pm. Look forward to serving you… https://t.co/srwYpuOLPi

— MarutiPDX (@MarutiPdx) June 10, 2019

It’s hard to have the Monday blues while enjoying a beer on this deck. Better yet- drafts are a dollar off every Monday! Come see us to get your week started on the right track! 🛤🍻🚂#happymonday #mondayfunday… https://t.co/bn1700epjx

— Whistle Hop Brewing (@WhistleHop) June 17, 2019

to hear something out: to listen to the end of a discourse of some kind–an idea, an explanation. Similar construction: to hear someone out, which means to listen until they have said all that they have to say. Examples:

I ain’t a sweet son but Popz, hear this out…

50 cheers for you Popz.

Happy Birthday! Thank you💙💙💙 pic.twitter.com/Sy2eMOHQVR

— Ⓚ yaaan (@keyyawnsheyn) June 14, 2019

Left: Racism is bad
Right: Racism is good
Centrist: Woah now lefty let’s hear this out. We gotta compromise and have some racism.

— former lettucehead (@shylekaw) June 13, 2019

Please daddy God, please stay with her tonight. Please hear her out tonight, I know that she’s ranting and praying to you tonight. Please.🙏

— r (@quellynunal) June 18, 2019

How I used it in the post: Then the nasty-looking old dude with his hands behind his back says something, and I think: hearing this one out is going to be worth spending an extra dollar for a bottle of soap–stay right where you are.

woulda: in informal spoken American English, “would have.”

sure do + verb: an emphatic construction. Has a very rural flavor.

Looks like [it] smells of anchovy pizza is a marvelously evocative description, and I wish I could say that I came up with it myself, but I didn’t: it’s from an old Berke Breathed cartoon, where Opus the penguin uses it to describe Bruce Springsteen.

Why doing the laundry makes me happy

Doing the laundry will make you happy if you spend sufficient time contemplating the zombie apocalypse.

What will suck about the zombie apocalypse is…well, everything, really. For example: when the zombie apocalypse comes, most people will be completely filthy most of the time. For a while, you’ll at least be able to scavenge clean clothes–you won’t have many opportunities to bathe, but let’s face it: Old Navy will not be the first store to be looted. Eventually the clean clothes will all be gone. Eventually the day will come when you’ll strip a coat off of a reeking zombie whose head you’ve just smashed like a watermelon and be happy that you have something to keep yourself warm.

Today I woke up at 5:30–late for me–and headed down to the basement laundry room. Then I went to work–in clean underwear, clean jeans, and a clean t-shirt from the 2007 Association for Computational Linguistics meeting in Prague. (I learned to say gde je stan’ce metra–where is the subway station–which was undeniably useful. I also learned to ask questions about the National Theater, which amused the taxi drivers but did not accomplish much else.)

When you compare it with how bad life is going to suck during the zombie apocalypse, doing the laundry was actually pretty fun. Going to work in clean clothes was a pleasure, as it is every day, and it always will be if you spend sufficient time contemplating the zombie apocalypse. There’s a reason I’m the happiest person you know. Hell, I’m the happiest person you don’t know. Think about it.

English notes

In American English, “like a watermelon” is a common simile for describing actions of crushing, smashing, and the like. Some examples:

The more I look at Maggie the more I want her to crush my head between her strong legs like a watermelon. This is the only way I want to go

— Nick (@champreignpapi) May 29, 2019

If u come at me with that “you’re not chubby or fat you look great!” fatphobic nonsense I will crush ur head like a watermelon between my chubby ass thighs.
Folks, we are not tolerating fat-bashing bullshit this summer.

— dirty gertie (@urcoolgrandma) June 6, 2019

They also keep your head from busting like a watermelon when it hits the pavement, Einstein.

— Joe Hilton (@JoeHilt65106621) June 6, 2019

If I’d ended up like Shiggy I’d probably be pissed at someone running around saying ‘ everything’s fine’ too
Especially if they’d crushed my mentors head like a watermelon

— 🌻 Infinity Girl 🌞 (@Bennjoon) June 6, 2019

i’m really about to smash your skull into smithereens just like a watermelon if you don’t watch what you say to me

— akira 。 (@dopporamu) June 4, 2019

How I used it in the post: The day will come when you’ll strip a coat off of a reeking zombie whose head you’ve just smashed like a watermelon and be happy that you have something to keep yourself warm.

Language geekery: similes versus analogy

Simile and analogy are similar (is that a pun? if so, it’s not a very sophisticated one), but they’re not quite the same. Analogy starts with focusing on similarity between unlike items, and then typically is followed by pointing out the differences between them. In contrast, simile does not require any actual similarity between the unlike items, and does not include pointing out the differences.

Thus, the heuristic Detached roles is like a Hearst & Schütze super-category, but not constructed on a statistical metric, rather on underlying semantic components. (Source: Litkowski, Kenneth C. “Desiderata for tagging with WordNet synsets or MCCA categories.” Tagging Text with Lexical Semantics: Why, What, and How? (1997).)

A recursive transition network (RTN) is like a finite-state automaton, but its input symbols may be RTNs or terminal symbols. (Source: Goldberg, Jeff, and László Kálmán. “The first BUG report.” In COLING 1992 Volume 3: The 15th International Conference on Computational Linguistics, vol. 3. 1992.)

Therefore, a conversation is like a construction made of LEGO TM blocks, where you can put a block of a certain type at a few places only. (Source: Rousseau, Daniel, Guy Lapalme, and Bernard Moulin. “A Model of Speech Act Planner Adapted to Multiagent Universes.” Intentionality and Structure in Discourse Relations (1993).) Note that a native speaker probably would have put this somewhat differently. Where the authors say where, a native speaker might have said where you can only put a block of a specific type at a few places, or more likely, except that you can put a block of a specific type only in specific places.

Given all of that: is this an analogy, or a simile? The day will come when you’ll strip a coat off of a reeking zombie whose head you’ve just smashed like a watermelon and be happy that you have something to keep yourself warm. The answer is further down.

I sometimes use this blog to try out materials for something that I will be publishing. This brief description of how to use analogy is intended for a book about writing about data science. I would love to know what parts of it are not clear. (My grandmother will tell me how great it is, so no need for you to bother with that.)

Answer: it’s a simile. Note that we’re not asserting any difference between the way that you’re going to smash the zombie’s head and the way that you would smash a watermelon: a reeking zombie whose head you’ve just smashed like a watermelon. Note also that we are not then contrasting the way that you’re going to smash the zombie’s head and the way that you would smash a watermelon. Simile, not analogy.

Yes, please–do volunteer to be a reviewer

Yes, you CAN volunteer to be a peer reviewer!

Get any two researchers together in a bar at the end of a day at any randomly chosen conference. They will get around to complaining about the difficulty of getting grant funding these days, service responsibilities in their institution, and how grad students don’t want to work as hard as we did back in the day. But, before that, they will complain about the real pain point of academic work: reviewing. (See the English notes below for an explanation of the expression “pain point.”)

“Peer review” is the process by which academic writing is considered for publication. The mechanics of it are this:

1. An author submits an article to a journal or conference.
2. An “associate editor” at the journal or an “area chair” at the conference finds reviewers who are willing to read and comment on the paper–your “peers.”
3. The reviewers read the paper, write up detailed comments on it, and make a suggestion regarding acceptance.
4. The associate editor or area chair makes a decision about the paper.

That decision in step 4 can take a number of forms, including outright acceptance (rare), rejection (not rare), and giving the author the option of making changes in response to the reviewers’ comments and resubmitting the paper, in which case steps 3 and 4 repeat. (They can repeat multiple times, too.)

At step 2, the associate editor or area chair needs to find three reviewers in the typical case–rarely fewer, and sometimes more. (I once submitted a paper to a journal for which I am the deputy editor-in-chief, and the editor who handled it had it reviewed by SIX reviewers–the most I have ever seen. To avoid the appearance of a conflict of interest, that made sense.)

Three reviewers per submission, and the big conferences in my area (computational linguistics) typically get between 1,000 and 2,500 submissions–that’s 3,000 to 7,500 reviews per conference. There are several big conferences in my area–assume five per year, and that’s 15,000 to 37,500 reviews that need to get written per year. And that’s just the conferences–journal publications are appearing faster than ever before in history, which is in itself not a surprise–most things are happening faster than ever before in history–but, the publication rate has been growing exponentially, and if you’ve been reading about Zipf’s Law for a while, you know that that’s fast. Journal submissions take quite a bit more time to review than conference papers, too–a conference paper in my field is typically limited to 8 pages, but most journals in my field no longer have page limits at all.

Just for grins, here are the page counts on my 5 most recent journal articles: 15, 8, 14, 24, and 12. The 8-pager was in a journal with a page limit–of 7 pages! We paid an extra-page fee.

Who writes those peer reviews? Well…your peers. You write your share of those 15,000 to 37,500 reviews, and the authors of those 5,000 to 12,500 papers write reviews of your papers, and… Well, it’s a huge workload. How huge, exactly? It’s hard to say what an average would be, but I have reviewed a couple hundred papers over the course of the past couple of years. Is that typical? Probably. And the conference papers come in bursts–conferences are deadline-driven, so all of the 1,000 to 2,500 submissions to an individual conference are being reviewed at once. A reviewer for a conference in my field is typically assigned 5 papers. Of course, there is a limited set of time slots when conferences can happen–they mostly take place during breaks in the academic year, so either during the summer, or around the end-of-year holidays. That means that their submission deadlines tend to cluster together, so you are probably reviewing for multiple conferences in the same time period. How many? I’ve written 14 in the past two weeks. I may actually have spent more time reviewing other people’s papers than working on my current grant proposal–and it’s the grant proposals that bring in my salary. Could I say no to review requests? Of course. But, it would not be fair to do so–while I’m reviewing those papers, someone else is reviewing mine.

…All of this en préambule to the answer to a question that I don’t get asked often enough: can you volunteer to be a reviewer? The answer: yes. Here’s a good example of a request that I got recently:

Dear Dr. Zipf:

I am a Ph.D. student at university name removed, majoring in computer science, under the supervision of advisor name removed. My main research fields are bioinformatics, deep learning, machine learning and artificial intelligence.
I have done some researches in bimolecular function prediction, Nanopore sequencing, fluorescence microscope super resolution, MD simulation, sequence analysis, graph embedding and catastrophic forgetting, which were published in journals, such as PNAS, NAR and Bioinformatics, and conferences, such as ISMB, ECCB and AAAI. Attached please find my complete CV about my background.

I am very interested in serving the community and acting as a reviewer for the manuscripts which are related to my background. I know you are serving as an associate editor for a number of journals, such as BMC Bioinformatics. If you encounter some manuscripts which are highly related to my background, feel free to refer me as a reviewer.

Thank you very much for your consideration! Have a nice day!

My response:

Hi, name removed,

Thank you for writing–it is always nice to see a volunteer for reviewing! However, I only handle articles on natural language processing, which seems outside of your areas of expertise. I would recommend that you send your CV, and a similar email, to associate editors who specialize in your areas. Your advisor could suggest some, and you could also look at the editorial board of relevant journals, especially ones in which you have published.

Thank you again for volunteering, and keep looking for opportunities–I am pretty sure that you will find them!

Best wishes,

Beauregard Zipf

Response to THAT:

Dear Dr. Zipf:

OK! Thank you very much for the clarification and the instruction! Have a nice day!

Notice what you do not see in this exchange: what people are afraid of, which is a response saying something along the lines of “who the hell do you think you are to dare to propose yourself as a reviewer?” Of the 200 emails that I probably plowed through that day, this offer might have been the only message that actually brought me a little joy–even though I couldn’t use this particular reviewer, I’m certain that someone else will. Yes: you can volunteer to be a peer reviewer!

French notes

en préambule (à): as a preamble, en guise d’introduction.

la relecture par les pairs: peer review. WordReference.com also gives l’évaluation par les pairs and l’inter-évaluation, but I’ve never actually heard that last one. Native speakers??

Want to read a French-language blog post about peer review in computational linguistics? Here’s one by my colleagues Karën Fort and Aurélie Névéol.

English notes

pain point: a marketing term referring to the problem that a salesman is going to try to solve for you by selling you his product. How I used it in the post: Before that, they will complain about the real pain point of academic work: reviewing.
