What computational linguists actually do all day during lockdown

For starters: how about if we fight to get the files open?

I know, I know: computational linguistics sounds like it would be the most glamorous job in the world, right? We have a dirty little secret, though: 85% of our time is spent just trying to read in, and clean up, data files.

The novel coronavirus covid-19 has me on lockdown just like anybody else. What to do all day? Well, the National Library of Medicine, the Allen Institute for Artificial Intelligence, and some other folks whose names escape me recently released a corpus (set of linguistic data) of scientific journal articles that might or might not be relevant to covid-19 research, named CORD-19, and asked the artificial intelligence people of the world to see what they could do with it. Great—what else would I do all day?

For starters, how about we fight to get the fucking files to open?
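Here is a minimal Python sketch of that fight. It assumes the articles arrive as one JSON file each (check your own copy of the corpus), and the function names are mine, made up for illustration. The idea is to skip any file that will not parse and keep a list of the casualties, rather than letting one bad file take down the whole run.

```python
import json
from pathlib import Path


def parse_document(text):
    """Return the parsed JSON, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def load_documents(directory):
    """Read every .json file in `directory`; return (documents, failures)."""
    documents, failures = {}, []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            # errors="replace" keeps stray bytes from killing the read
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            failures.append(path.name)
            continue
        parsed = parse_document(text)
        if parsed is None:
            failures.append(path.name)
        else:
            documents[path.name] = parsed
    return documents, failures
```

Calling load_documents on a folder gives you back the files that opened and the names of the ones that did not, so you know exactly how much of the corpus you are really working with.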

After that, we could clean some of the useless stuff out of the data—section headings (Introduction, Methods, Results…), puffery (important, significant), stuff like that…
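A rough sketch of that cleaning step in Python. The two word lists are hypothetical, seeded only with the examples mentioned above; you would extend them as you find more junk in your own data. The tokenizer is also an assumption: lowercase everything and keep hyphenated items like covid-19 in one piece.

```python
import re

# Hypothetical lists, seeded from the examples in the text; extend as needed.
SECTION_HEADINGS = {"introduction", "methods", "results"}
PUFFERY = {"important", "significant"}
USELESS = SECTION_HEADINGS | PUFFERY

# A token is a run of letters/digits, optionally joined by internal hyphens.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def remove_useless_words(text):
    """Lowercase, tokenize, and drop section headings and puffery."""
    return [t for t in TOKEN.findall(text.lower()) if t not in USELESS]
```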

Get rid of spaces and that sort of nonsense…
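For the spaces, something like this would probably do. It is a sketch, and the function name is made up: collapse every run of whitespace (spaces, tabs, newlines, no-break spaces) into a single space and trim the ends.

```python
import re


def squeeze_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    # For str patterns, \s also matches Unicode whitespace such as U+00A0.
    return re.sub(r"\s+", " ", text).strip()
```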

…and then make a TermDocumentMatrix, a “data structure” that lists all of the words in the corpus and the documents in which they occur (or all of the documents in the corpus and which words occur in them, depending on how you flip it).
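To make the idea concrete, here is a plain-Python approximation. It is not the TermDocumentMatrix implementation itself, just a sketch of the same data structure using nested counters, with function names of my own choosing. It assumes each document has already been turned into a list of tokens.

```python
from collections import Counter, defaultdict


def term_document_matrix(tokenized_docs):
    """Map each term to a Counter of {document id: occurrences}."""
    matrix = defaultdict(Counter)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            matrix[token][doc_id] += 1
    return matrix


def flip(matrix):
    """Turn term -> documents into document -> terms (or back again)."""
    flipped = defaultdict(Counter)
    for outer, inner in matrix.items():
        for key, count in inner.items():
            flipped[key][outer] = count
    return flipped
```

For example, with docs = {"d1": ["virus", "deer", "virus"], "d2": ["virus"]}, term_document_matrix(docs)["virus"] records two occurrences in d1 and one in d2, and flip gives you the per-document view of the same counts.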

We’ll try to make a word cloud, which will result in us watching several pages’-worth of error messages fly by (most of them removed for your viewing pleasure):

…and then we can finally see what’s in that data set, at which point we notice that there are some more words that we should be removing:

…and then make a little graph of the most frequent words, at which point we’ll realize that we should probably be removing things like plural markers on nouns, past-tense markers on verbs, and stuff like that:
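A sketch of counting the most frequent words after knocking off plural and past-tense markers. The suffix rules are deliberately crude and entirely my own invention for illustration; they will mangle plenty of words (species, for one), and a real stemmer such as the Porter stemmer does a far better job.

```python
from collections import Counter


def crude_stem(token):
    """Strip a few English plural and past-tense endings. Naive on purpose."""
    if len(token) <= 4:
        return token                      # leave short words alone
    if token.endswith("ies"):
        return token[:-3] + "y"           # studies -> study
    if token.endswith(("sses", "uses", "xes", "ches", "shes")):
        return token[:-2]                 # viruses -> virus
    if token.endswith("ed"):
        return token[:-2]                 # infected -> infect
    if token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]                 # patients -> patient
    return token


def most_frequent(tokens, n=10):
    """Return the n most common stems with their counts."""
    return Counter(crude_stem(t) for t in tokens).most_common(n)
```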

…and finally, we will record exactly what we did, in hopes of actually being able to do it on demand, which we obviously need to do immediately, or as soon as we fix the problems that we just identified, at any rate.

And there’s another lesson in this, too: 85% of any programmer’s time, be they computational linguists or not, is spent fixing the problems with code (computer instructions) that they have already written.

As is often the case with my articles about what I do for a living, this one sounds whiny. Don’t be fooled: I love my job, and consider myself the luckiest guy in the world to be able to do things that I love for a living. Now to fix those fucking problems…

Computational linguistics and the covid-19 coronavirus

I know, I know–you think that computational linguists spend their time sitting around discussing morphological typology and its implications for the monogenetic versus polygenetic hypotheses of the origin of language. And we do–after work, over beers.

For most of us, though, our professional life consists of trying to get computers to do things that involve language in some way.  One of the biggies is helping people find information. To teach a computer to do something like that, you need to have access to a lot of data to test the system that you are building.

Enter the covid-19 coronavirus.  There is suddenly an enormous amount of research being done on something that we did not even know about just 6 months ago. At the same time, there is a lot of research already published on other coronaviruses, and it would be idiotic to try to do research on a novel coronavirus without taking advantage of what we already know about the others. But, how can anyone go through the 15,000+ papers on coronavirii (spelling?) that are already in the US National Library of Medicine’s PubMed/MEDLINE database?

Enter computational linguists. Our field is sometimes considered a branch of artificial intelligence, and we work on computer programs to do things like summarize large sets of publications.  There are lots of things that you have to be able to do in order to do that–figure out what’s being talked about (coronavirus and medications? Coronavirus and transmissibility? Coronavirus and respiratory failure?); tell the difference between a positive statement, a negated statement, a speculative statement, and a negated speculative statement:

  1. Positive: The person-to-person transmission routes of 2019-nCoV included direct transmission, such as cough, sneeze, droplet inhalation transmission, and contact transmission, such as the contact with oral, nasal, and eye mucous membranes. (Source: this paper)
  2. Negated: The person-to-person transmission routes of 2019-nCoV did not include indirect transmission over the Internet. (I made this sentence up)
  3. Speculative: This other coronavirus might be specific to deer species. (From this paper published in 1995 about a different coronavirus)
  4. Negated speculative: This other coronavirus might not be specific to deer species. (I made this sentence up)

…and many other tasks that all have to be handled in order to solve the problem of summarizing those 15,000+ papers–and many other problems in getting computers to understand human language, too.
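To give a feel for the positive / negated / speculative distinction, here is a toy sketch in Python. It just looks for cue words, and the cue lists are hypothetical and far too short. Real systems also have to work out the scope of each cue (what exactly is being negated or speculated about), which this does not attempt.

```python
import re

# Hypothetical, very incomplete cue lists, for illustration only.
NEGATION_CUES = {"not", "no", "never", "without"}
SPECULATION_CUES = {"might", "may", "could", "possibly"}


def classify_statement(sentence):
    """Label a sentence by which kinds of cue words it contains."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    negated = bool(words & NEGATION_CUES)
    speculative = bool(words & SPECULATION_CUES)
    if negated and speculative:
        return "negated speculative"
    if speculative:
        return "speculative"
    if negated:
        return "negated"
    return "positive"
```

Run on the four example sentences above, this gives the four labels you would hope for, which says more about how carefully I picked the examples than about how hard the problem is.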

Like I said, though: in order to test our systems, we need data.  Enter a number of the big players in computational linguistics, who have created, and made freely available to the public, a large dataset of relevant papers. Their hope? That computational linguists around the world will dive into it, using it to develop and test tools for dealing with all of those papers.  Here’s an excerpt from the White House’s web site describing the effort to create and release the data, followed by a French-language appeal to the francophone computational linguistics community to work on it, sent out by my colleague Pierre Zweigenbaum. C’est parti…

“One of the most immediate and impactful applications of AI is in the ability to help scientists, academics, and technologists find the right information in a sea of scientific papers to move research faster. We applaud the OSTP, WHO, NIH and all organizations that are taking a proactive approach to use the most advanced technology in the fight against COVID-19,” said Dr. Oren Etzioni, Chief Executive Officer of the Allen Institute for AI. “The Allen Institute for AI, and particularly the Semantic Scholar team, is committed to updating and improving this important resource and the associated AI methods the community will be using to tackle this crucial problem.”

“It’s difficult for people to manually go through more than 20,000 articles and synthesize their findings. Recent advances in technology can be helpful here. We’re putting machine readable versions of these articles in front of our community of more than 4 million data scientists. Our hope is that AI can be used to help find answers to a key set of questions about COVID-19,” said Anthony Goldbloom, Co-Founder and Chief Executive Officer at Kaggle.

Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset



Here is a series of information retrieval / information extraction / text mining / question answering tasks launched on March 16 [1] on a topical subject:


over a collection of 29,000 articles (13,000 of them in full text) about coronaviruses (not just the “new” one, of course). The questions are listed under the “Tasks” heading, and each generic question is broken down into specific questions. See for example “What is known about transmission, incubation, and environmental stability?”

In addition, a corpus on Covid-19 (LitCovid) is updated continuously at the National Library of Medicine:
https://www.ncbi.nlm.nih.gov/research/coronavirus/ (1263 articles as
I write this message, versus 1120 two days earlier).

DBCLS in Tokyo has set up a space on its annotation management platform to centralize the information extracted from the LitCovid corpus, in the form of annotations:


All NLP specialists are therefore encouraged to apply their methods to these data and run them on Kaggle (CORD-19), to apply them to the LitCovid corpus, and to deposit the annotations on PubAnnotation.

Best regards,

Pierre Zweigenbaum.

[1] “Today, researchers and leaders from the Allen Institute for AI,
Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for
Security and Emerging Technology (CSET), Microsoft, and the National
Library of Medicine (NLM) at the National Institutes of Health released
the COVID-19 Open Research Dataset (CORD-19) of scholarly literature
about COVID-19, SARS-CoV-2, and the Coronavirus group.”

Coup de grâce: Don Marquis’s “freddy the rat perishes”

Some of the most traumatic experiences of my life have involved me killing a mouse.  Tonight was no exception.  Delivering the coup de grâce was miserable for me–besides being soft-hearted, I have an incredible phobia about dead animals–but I felt like after a long and hard-fought battle, the furry little warrior deserved it.

The experience brought to mind the end of this poem by Don Marquis.  It is all in lower-case because it has been typed by a cockroach.  Archy (yes, that’s the cockroach’s name) depresses the keys by jumping on them; this precludes ever hitting the shift key to make upper-case letters.  Many thanks to the Don Marquis blog, where I found the text.  Want to know more about the strange case of Archy the cockroach poet?  See this post.

freddy the rat perishes

By Don Marquis, in “archy and mehitabel,” 1927

listen to me there have
been some doings here since last
i wrote there has been a battle
behind that rusty typewriter cover
in the corner
you remember freddy the rat well
freddy is no more but
he died game the other
day a stranger with a lot of
legs came into our
little circle a tough looking kid
he was with a bad eye

who are you said a thousand legs
if i bite you once
said the stranger you won t ask
again he he little poison tongue said
the thousand legs who gave you hydrophobia
i got it by biting myself said
the stranger i m bad keep away
from me where i step a weed dies
if i was to walk on your forehead it would
raise measles and if
you give me any lip i ll do it

they mixed it then
and the thousand legs succumbed
well we found out this fellow
was a tarantula he had come up from
south america in a bunch of bananas
for days he bossed us life
was not worth living he would stand in
the middle of the floor and taunt
us ha ha he would say where i
step a weed dies do
you want any of my game i was
raised on red pepper and blood i am
so hot if you scratch me i will light
like a match you better
dodge me when i m feeling mean and
i don t feel any other way i was nursed
on a tabasco bottle if i was to slap
your wrist in kindness you
would boil over like job and heaven
help you if i get angry give me
room i feel a wicked spell coming on

last night he made a break at freddy
the rat keep your distance
little one said freddy i m not
feeling well myself somebody poisoned some
cheese for me i m as full of
death as a drug store i
feel that i am going to die anyhow
come on little torpedo don t stop
to visit and search then they
went at it and both are no more please
throw a late edition on the floor i want to
keep up with china we dropped freddy
off the fire escape into the alley with
military honors


Picture source: https://gritinthegears.blogspot.com/2010/11/freddy-rat-perishes-revisited.html
