What computational linguists actually do all day during lockdown

For starters: how about if we fight to get the files open?

I know, I know: computational linguistics sounds like it would be the most glamorous job in the world, right? We have a dirty little secret, though: 85% of our time is spent just trying to read in, and clean up, data files.

The novel coronavirus covid-19 has me on lockdown just like anybody else. What to do all day? Well, the National Library of Medicine, the Allen Institute for Artificial Intelligence, and some other folks whose names escape me recently released a corpus (set of linguistic data) of scientific journal articles that might or might not be relevant to covid-19 research, named CORD-19, and asked the artificial intelligence people of the world to see what they could do with it. Great—what else would I do all day?

For starters, how about we fight to get the fucking files to open?

After that, we could clean some of the useless stuff out of the data—section headings (Introduction, Methods, Results…), puffery (important, significant), stuff like that…

Get rid of spaces and that sort of nonsense…

…and then make a TermDocumentMatrix, a “data structure” that lists all of the words in the corpus and the documents in which they occur (or all of the documents in the corpus and which words occur in them, depending on how you flip it).
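For the curious, here is a minimal sketch of the idea behind a term-document matrix. It is written in plain Perl, not with whatever toolkit was actually pointed at CORD-19. The three toy "documents" and the names %docs and build_tdm are made up for illustration. A real corpus would be read from files and cleaned first.

```perl
use strict;
use warnings;

# Toy corpus: document ID => text. These documents are invented for illustration.
my %docs = (
    doc1 => 'coronavirus transmission in deer',
    doc2 => 'coronavirus transmission between people',
    doc3 => 'respiratory failure in people',
);

# Build a term-document matrix as a hash of hashes:
# $tdm{term}{doc} = number of times the term occurs in that document.
sub build_tdm {
    my (%texts) = @_;
    my %tdm;
    for my $doc (keys %texts) {
        for my $term (split ' ', lc $texts{$doc}) {
            $tdm{$term}{$doc}++;
        }
    }
    return %tdm;
}

my %tdm = build_tdm(%docs);

# Print one row per term, listing the documents in which it occurs.
for my $term (sort keys %tdm) {
    my @where = map { "$_ ($tdm{$term}{$_})" } sort keys %{ $tdm{$term} };
    print "$term: @where\n";
}
```

Flipping the matrix (documents first, then terms) is just a matter of swapping the two hash keys.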

We’ll try to make a word cloud, which will result in us watching several pages’ worth of error messages fly by (most of them removed for your viewing pleasure):

…and then we can finally see what’s in that data set, at which point we notice that there are some more words that we should be removing:

…and then make a little graph of the most frequent words, at which point we’ll realize that we should probably be removing things like plural markers on nouns, past-tense markers on verbs, and stuff like that:
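To make the "most frequent words" step concrete, here is a rough Perl sketch that counts words after crudely stripping a final "s" or "ed". The names naive_stem and top_words and the sample text are invented for this example, and the suffix stripping is deliberately simplistic. It will mangle plenty of words, and real work would use a proper stemmer.

```perl
use strict;
use warnings;

# Very naive normalizer: strips a final "ed" or "s" from words longer than
# three letters. Illustration only; this is not a real stemming algorithm.
sub naive_stem {
    my ($word) = @_;
    $word =~ s/(?:ed|s)$// if length($word) > 3;
    return $word;
}

# Count normalized word forms and return the $n most frequent ones
# as [word, count] pairs, most frequent first (ties broken alphabetically).
sub top_words {
    my ($text, $n) = @_;
    my %count;
    $count{ naive_stem($_) }++ for grep { length } split /[^a-z]+/, lc $text;
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    $n = @sorted if $n > @sorted;
    return map { [ $_, $count{$_} ] } @sorted[ 0 .. $n - 1 ];
}

my $text = 'Patients infected. The patient infects other patients.';
printf "%-10s %d\n", @$_ for top_words($text, 3);
```

With the plural and past-tense endings folded together, "patients" and "patient" count as one word, which is the whole point of the exercise.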

…and finally, we will record exactly what we did, in hopes of actually being able to do it on demand, which we obviously need to do immediately, or as soon as we fix the problems that we just identified, at any rate.

And there’s another lesson in this, too: 85% of any programmer’s time, be they computational linguists or not, is spent fixing the problems with code (computer instructions) that they have already written.

As is often the case with my articles about what I do for a living, this one sounds whiny. Don’t be fooled: I love my job, and consider myself the luckiest guy in the world to be able to do things that I love for a living. Now to fix those fucking problems…

Computational linguistics and the covid-19 coronavirus

I know, I know–you think that computational linguists spend their time sitting around discussing morphological typology and its implications for the monogenetic versus polygenetic hypotheses of the origin of language. And we do–after work, over beers.

For most of us, though, our professional life consists of trying to get computers to do things that involve language in some way.  One of the biggies is helping people find information. To teach a computer to do something like that, you need to have access to a lot of data to test the system that you are building.

Enter the covid-19 coronavirus.  There is suddenly an enormous amount of research being done on something that we did not even know about just 6 months ago. At the same time, there is a lot of research already published on other coronaviruses, and it would be idiotic to try to do research on a novel coronavirus without taking advantage of what we already know about the others. But, how can anyone go through the 15,000+ papers on coronavirii (spelling?) that are already in the US National Library of Medicine’s PubMed/MEDLINE database?

Enter computational linguists. Our field is sometimes considered a branch of artificial intelligence, and we work on computer programs to do things like summarize large sets of publications.  There are lots of things that you have to be able to do in order to do that: figure out what’s being talked about (coronavirus and medications? Coronavirus and transmissibility? Coronavirus and respiratory failure?); tell the difference between a positive statement, a negated statement, a speculative statement, and a negated speculative statement:

  1. Positive: The person-to-person transmission routes of 2019-nCoV included direct transmission, such as cough, sneeze, droplet inhalation transmission, and contact transmission, such as the contact with oral, nasal, and eye mucous membranes. (Source: this paper)
  2. Negated: The person-to-person transmission routes of 2019-nCoV did not include indirect transmission over the Internet. (I made this sentence up)
  3. Speculative: This other coronavirus might be specific to deer species. (From this paper published in 1995 about a different coronavirus)
  4. Negated speculative: This other coronavirus might not be specific to deer species. (I made this sentence up)

…and many other tasks that all have to be handled in order to solve the problem of summarizing those 15,000+ papers–and many other problems in getting computers to understand human language, too.
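As a toy illustration of the positive / negated / speculative distinction, here is a Perl sketch that labels a sentence by looking for cue words. The cue lists and the classify function are assumptions made up for this example. Real systems use much larger lexicons and have to work out what each cue actually applies to, which this sketch ignores entirely.

```perl
use strict;
use warnings;

# Toy cue lists, invented for illustration; nowhere near complete.
my $NEGATION    = qr/\b(?:not|no|never|without)\b/i;
my $SPECULATION = qr/\b(?:might|may|could|possibly)\b/i;

# Label a sentence according to which kinds of cue words it contains.
sub classify {
    my ($sentence) = @_;
    my $negated     = $sentence =~ $NEGATION;
    my $speculative = $sentence =~ $SPECULATION;
    return 'negated speculative' if $negated && $speculative;
    return 'negated'             if $negated;
    return 'speculative'         if $speculative;
    return 'positive';
}

my @examples = (
    'The transmission routes included direct transmission.',
    'The transmission routes did not include indirect transmission over the Internet.',
    'This other coronavirus might be specific to deer species.',
    'This other coronavirus might not be specific to deer species.',
);
printf "%s: %s\n", classify($_), $_ for @examples;
```

Cue spotting like this breaks quickly on real text ("no doubt", "may" as a month), which is a good part of why the task is still research.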

Like I said, though: in order to test our systems, we need data.  Enter a number of the big players in computational linguistics, who have created, and made freely available to the public, a large dataset of relevant papers. Their hope? That computational linguists around the world will dive into it, using it to develop and test tools for dealing with all of those papers.  Here’s an excerpt from the White House’s web site describing the effort to create and release the data, followed by a French-language appeal to the francophone computational linguistics community to work on it, sent out by my colleague Pierre Zweigenbaum. C’est parti…

“One of the most immediate and impactful applications of AI is in the ability to help scientists, academics, and technologists find the right information in a sea of scientific papers to move research faster. We applaud the OSTP, WHO, NIH and all organizations that are taking a proactive approach to use the most advanced technology in the fight against COVID-19,” said Dr. Oren Etzioni, Chief Executive Officer of the Allen Institute for AI. “The Allen Institute for AI, and particularly the Semantic Scholar team, is committed to updating and improving this important resource and the associated AI methods the community will be using to tackle this crucial problem.”

“It’s difficult for people to manually go through more than 20,000 articles and synthesize their findings. Recent advances in technology can be helpful here. We’re putting machine readable versions of these articles in front of our community of more than 4 million data scientists. Our hope is that AI can be used to help find answers to a key set of questions about COVID-19,” said Anthony Goldbloom, Co-Founder and Chief Executive Officer at Kaggle.

Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset

 

(Translated from the French.)

Hello,

Here is a series of information retrieval / information extraction / text mining / question answering tasks launched on March 16 [1] on a topical subject:

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks

over a collection of 29,000 articles (13,000 of them in full text) about coronaviruses (not only the “new” one, of course). The questions are listed under the “Tasks” heading, and each general question is broken down into specific questions. See for example “What is known about transmission, incubation, and environmental stability?”
(https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=568).

In addition, a corpus (LitCovid) on Covid-19 is being continuously updated at the National Library of Medicine:
https://www.ncbi.nlm.nih.gov/research/coronavirus/ (1263 articles as I write this message, versus 1120 two days earlier).

The DBCLS in Tokyo has set up a space on its annotation management platform to centralize the information extracted from the LitCovid corpus in the form of annotations:

http://pubannotation.org/collections/LitCovid

All NLP specialists are therefore encouraged to apply their methods to these data and run them on Kaggle (CORD-19), to apply them to the LitCovid corpus, and to deposit the annotations on PubAnnotation.

Best regards,

Pierre Zweigenbaum.

[1] “Today, researchers and leaders from the Allen Institute for AI,
Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for
Security and Emerging Technology (CSET), Microsoft, and the National
Library of Medicine (NLM) at the National Institutes of Health released
the COVID-19 Open Research Dataset (CORD-19) of scholarly literature
about COVID-19, SARS-CoV-2, and the Coronavirus group.”

Coup de grâce: Don Marquis’s “freddy the rat perishes”

Some of the most traumatic experiences of my life have involved me killing a mouse.  Tonight was no exception.  Delivering the coup de grâce was miserable for me–besides being soft-hearted, I have an incredible phobia about dead animals–but I felt like after a long and hard-fought battle, the furry little warrior deserved it.

The experience brought to mind the end of this poem by Don Marquis.  It is all in lower-case because it has been typed by a cockroach.  Archy (yes, that’s the cockroach’s name) depresses the keys by jumping on them; this precludes ever hitting the shift key to make upper-case letters.  Many thanks to the Don Marquis blog, where I found the text.  Want to know more about the strange case of Archy the cockroach poet?  See this post.

freddy the rat perishes

By Don Marquis, in “archy and mehitabel,” 1927

listen to me there have
been some doings here since last
i wrote there has been a battle
behind that rusty typewriter cover
in the corner
you remember freddy the rat well
freddy is no more but
he died game the other
day a stranger with a lot of
legs came into our
little circle a tough looking kid
he was with a bad eye

who are you said a thousand legs
if i bite you once
said the stranger you won t ask
again he he little poison tongue said
the thousand legs who gave you hydrophobia
i got it by biting myself said
the stranger i m bad keep away
from me where i step a weed dies
if i was to walk on your forehead it would
raise measles and if
you give me any lip i ll do it

they mixed it then
and the thousand legs succumbed
well we found out this fellow
was a tarantula he had come up from
south america in a bunch of bananas
for days he bossed us life
was not worth living he would stand in
the middle of the floor and taunt
us ha ha he would say where i
step a weed dies do
you want any of my game i was
raised on red pepper and blood i am
so hot if you scratch me i will light
like a match you better
dodge me when i m feeling mean and
i don t feel any other way i was nursed
on a tabasco bottle if i was to slap
your wrist in kindness you
would boil over like job and heaven
help you if i get angry give me
room i feel a wicked spell coming on

last night he made a break at freddy
the rat keep your distance
little one said freddy i m not
feeling well myself somebody poisoned some
cheese for me im as full of
death as a drug store i
feel that i am going to die anyhow
come on little torpedo don t stop
to visit and search then they
went at it and both are no more please
throw a late edition on the floor i want to
keep up with china we dropped freddy
off the fire escape into the alley with
military honors

archy

Picture source: https://gritinthegears.blogspot.com/2010/11/freddy-rat-perishes-revisited.html

Seeing the complexity of the simple: Comparative anatomy of the scapula

Being a scientist means finding delight in things that look complicated but are actually governed by pretty simple principles, as well as in things that look pretty simple but are actually pretty complex.  Case in point: the scapula.

Common English: shoulder blade.

Technical term: scapula, plural scapulae or scapulas.

French: l’omoplate (nf), la scapula (WordReference)

A scapula looks simple: it’s mostly flat, with a protuberance here and there.  Unlike closely associated bones, it doesn’t get broken very often: a Swedish study found a rate of scapular fractures of 10 per 100,000 people, while another Swedish study found 50 clavicular fractures per 100,000 people.

And yet: comparing the scapula across species, you see all kinds of interesting shit.  The point of comparative anatomy is that you can understand something better if you compare it to other ways that it could be, but isn’t.  So: let’s compare some scapulae.

The most obvious thing about the scapula is that it is positioned differently in different species.  The basic situation for most living things with limbs is this: you’re a quadruped (i.e. have four legs), and the scapula is located on the side of the trunk.  In contrast: look at a human, and the scapula is on its back. Compare the position of the scapula in this lovely picture of a horse:

horse skeleton
The axis of the scapula is on the line between points A and B. Picture source: http://www.horsecoursesonline.com/college/conformation/lesson_two_893.htm

…with the position of the scapula in this lovely picture of a human:

human skeleton
Picture source: https://www.pinterest.com/pin/547891110909216204/?lp=true

…and you see the difference.  It’s even more striking when you look at our closer relatives.  We are primates, and specifically, members of a group of primates known as apes, and even more specifically, of the great apes.  One of the biggest differences between us and our various and sundry primate relatives is that we are full-time bipeds.  Autrement dit: we walk upright, all the time.  In contrast, monkeys–which are primates, but not apes–are full-time quadrupeds.  Going along with this difference in locomotion is a difference in the position of the scapula: it’s on our back, but on a monkey’s side.  

Here’s a really nice view of a monkey (left) and human (right) trunk.  Looking at the left side, labelled “monkey,” you see the typical quadruped architecture: the scapula is on the side of the chest cage.  On the right side, labelled “human,” it’s a different story: the scapula is on the back.

The arrows in this illustration make an important point: monkeys also have the typical quadruped chest cage, which is relatively narrow in comparison to its depth. In contrast, the human chest cage is sorta flat: relatively wide in comparison to its depth.  (Remember that the human skeleton in the picture is viewed from above, as if you had ripped someone’s head off in order to shit down their neck. (Sorry, a little sailor-talk there.  Unlike Trump, I served my country.)  In contrast, the monkey is being viewed from the front; I have no great analogy for you here.)

chest cage comparison, monkey versus human
Picture source: https://evolution.berkeley.edu/evolibrary/article/0_0_0/lines_05

Amongst primates in general, there is quite a bit of variability.  Why? Well, there’s quite a bit of variability in the extent to which they are quadrupedal versus bipedal. There’s quite a bit of variability in the capacity of the creature to do stuff with its hands over its head.  Here’s a nice layout that shows aspects of the shoulder anatomy across a range from true monkeys, to great apes, ending up with the ape-iest of us all: the anatomically modern human.  Start with the sacred monkey in the upper-left corner, and the scapula is clearly on the side of the thoracic cage.  End with the human in the bottom-right corner, and it’s clearly on the back.  In between…well, gibbons are (lesser) apes, while chimps and gorillas are great apes, like us.

scapula position in monkeys versus apes
Relative positions of the scapula in monkeys versus apes: the scapula is on the side in monkeys, on the back in apes.  Note the angle of the clavicle, too: the more dorsal the scapula is, the more perpendicular the clavicle is to the midline. Picture source: https://link.springer.com/chapter/10.1007/978-3-662-45719-1_1

What functional difference goes along with this structural difference? Well: the quadrupeds are really good at locomotion–it’s difficult to think of a quadruped that can’t outrun a human.  Try to catch your dog or your cat for a trip to the vet–good fucking luck, buddy.  But, quadrupeds also tend to have a big limitation: although their front limbs are very good at moving back and forth–see above about moving fast–they suck at anything else.  For example, we can make big circles with our arms; we can spread them.  We don’t have the speed of a quadruped, but we can do a lot of things with our arms that a quadruped can’t do with theirs. Throw spears at edible quadrupeds. Throw tomatoes at sopranos. Throw bums out of office.

There are two broad families of quadrupeds that have their scapula on their back, and they’re pretty fucking interesting. Come back next time for the details.

dog skeleton
Position of the scapula in the dog. Picture source: https://janedogs.com/dog-anatomy-terminology/


English notes

There are a lot of expressions that involve the shoulder.  For example, to stand shoulder to shoulder with someone has a literal meaning of standing close to someone (Merriam-Webster), and a figurative one of being united with someone, sharing a goal with them (Merriam-Webster). Example:

  • This resolution was offered in response to President Trump standing shoulder to shoulder with Putin while the Russian President offered the Special Counsel a chance to interview twelve Russian Military Intelligence Officers who’ve been indicted for conducting “large scale cyber operations to interfere with the 2016 presidential election” in exchange for politically-motivated Russian interrogations of U.S. citizens.  President Trump initially endorsed Putin’s cynical ploy as “an incredible offer” and during yesterday’s White House press briefing President Trump’s spokesperson said he was still considering it.  https://www.reed.senate.gov/news/releases/us-senate-votes-98-0-to-tell-trump-not-to-hand-over-american-citizens-to-putin

To give someone the cold shoulder is to intentionally treat them in a way that is cold or unsympathetic (Merriam-Webster); I would add the meaning of intentionally avoiding or ignoring them.  Example:

  • Most Twitter reactions seem to compare the president’s behavior to that of a child—which is pretty much on the money with what we’ve been saying since the start of this tale. Sure, it’s an improvement over the cold shoulder he gave German Chancellor Angela Merkel when she extended her hand for a handshake during her March White House visit. If the president doesn’t pout does he still earn, like, a half-star on the behavior chart? Even still, pushing is not encouraged, young man. https://www.vanityfair.com/style/2017/05/trump-nato-shove

If you are the kind of person who would watch and carefully re-watch Dubstep Cat videos to see just how far back his arms do and don’t go: we should probably be planning our wedding.

https://youtu.be/aQeIDhz-_eg

 

 

These are all true stories: or, Language Generation For Dummies

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.


Over the course of the past year or two, I’ve been writing a book about writing about what I do for a living.  Call it data science, call it natural language processing, call it machine learning–any way you slice it, the structure of the papers that we write in order to spread our little discoveries around is always pretty much the same.

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

For people like me, that always inspires the same question: could I write a REALLY simple computer program to do this for me?

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. By now it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

One of my major obsessions in life is this: how do you START a research paper?  More precisely: how do you start a research paper in a way that isn’t BORING?

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

Well, I do what I do in the medical field, specifically, and in the medical field, we have a saying: if no one dies in the first two sentences, your work is going to be ignored.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. It is now an old technology, to the point that one might think that there are no open research questions left in the field.

I’m a little bit less pessimistic than that.  Inspired by openings like this favorite from a paper by Daniel Gildea and Daniel Jurafsky,

Recent years have been exhilarating ones for natural language understanding. The excitement and rapid advances that had characterized other language-processing tasks … have finally begun to appear in tasks in which understanding and semantics play a greater role. For example, …

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…I wonder if it couldn’t work as well to save someone’s life in the first two sentences?

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

So, when I set out to write a simple computer program to generate the openings of research papers on a topic called information retrieval, I went looking for stories where someone landed in a doctor’s office–and came out of it better than one might have expected.

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

Finding those happy endings was the hardest part of this whole little after-dinner project.

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

From there, it was a simple matter of putting together a set of reasonable first sentences:

my @firstSentences = ("In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved.",
    "In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved.",
    "In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted.");

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. By now it is a familiar technology, and one might think that there was nothing more to be learned about it.

…and a set of reasonable second sentences…

my @secondSentences = ("The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",
    "The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.");

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

…and so on.  I use a command called rand to pick a random sentence from the sets of possible first sentences, possible second sentences, and so on…

my $first_sentence   = rand @firstSentences;

my $second_sentence = rand @secondSentences;

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…and then I just glue my randomly selected first, second, third, fourth, and fifth sentences together…

$beginning_of_article = $first_sentence . $second_sentence . $third_sentence . $fourth_sentence . $fifth_sentence;

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. By now it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…et voilà!  With only two options at each position for five different “sentence positions” (first sentence, second sentence, etc.), I have 2 to the 5th power (or 5 to the second power–I can never remember) possible openings that will work for any paper on information retrieval, ever.  That’s more papers on information retrieval than I will write between this evening and the day that I retire or die!
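For the record, the arithmetic can be checked in a few lines of Perl. With two options at each of five positions, the count is 2 to the 5th power, i.e. 32 (5 squared would be 25). More generally, the number of openings is the product of the number of options at each position. The array name @options_per_position is made up for this sketch.

```perl
use strict;
use warnings;

# Number of options available at each of the five sentence positions
# (two at each position, as described in the text).
my @options_per_position = (2, 2, 2, 2, 2);

# The number of distinct openings is the product of the option counts.
my $total = 1;
$total *= $_ for @options_per_position;

print "$total possible openings\n";    # 2 * 2 * 2 * 2 * 2 = 32
```

Since the list of first sentences shown above actually has three entries, the first number would be 3 and the total would be larger. The product rule stays the same.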

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

My VERY simple little program does something called language generation.  That means that it produces output in “natural” (i.e., human) language.  You can do REALLY fancy things with it–Google can now use its super-sophisticated language generation technology to produce entirely bogus news stories, novels, letters–or scientific articles, for that matter.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. It is now an old technology, to the point that one might think that there are no open research questions left in the field.

So, two differences:

  1. Google’s shit is super-complicated, and mine is super-simple.
  2. Google’s shit is completely made up, and mine is completely true.

Am I fucking kidding?  Is Donald Trump beholden to Vladimir Putin???


Technical geekery

  1. Yes, I omitted the detail that rand() returns a number (a fractional one, which you truncate with int()) that you then use as an index into the array of sentences, not a random sentence.
  2. Yes, I omitted the whitespace that you have to put between the sentences.
  3. Yes, I omitted the my in front of the output variable.
  4. Yes, I will submit one of those as the opening of a paper on information retrieval–how the fuck could I not????
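Here is a minimal sketch of how the three omitted details might be put back in. It is not the original program: the placeholder sentences, the @candidates array, and the helper names random_opening and count_openings are all made up for illustration, and it assumes two options at each of five positions.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real candidates: one array per sentence
# position, two options at each of the five positions.
my @candidates = (
    [ 'First sentence, option A.',  'First sentence, option B.'  ],
    [ 'Second sentence, option A.', 'Second sentence, option B.' ],
    [ 'Third sentence, option A.',  'Third sentence, option B.'  ],
    [ 'Fourth sentence, option A.', 'Fourth sentence, option B.' ],
    [ 'Fifth sentence, option A.',  'Fifth sentence, option B.'  ],
);

# Pick one candidate per position and glue them together with spaces.
sub random_opening {
    my @positions = @_;
    my @chosen;
    for my $options (@positions) {
        # rand(N) returns a fractional number >= 0 and < N;
        # int() truncates it to a valid array index.
        my $index = int( rand( scalar @$options ) );
        push @chosen, $options->[$index];
    }
    return join ' ', @chosen;
}

# The number of distinct openings is the product of the option counts.
sub count_openings {
    my @positions = @_;
    my $count = 1;
    $count *= scalar @$_ for @positions;
    return $count;
}

my $beginning_of_article = random_opening(@candidates);
print "$beginning_of_article\n";

my $total = count_openings(@candidates);
print "$total possible openings\n";   # 2 * 2 * 2 * 2 * 2 = 32
```

With two options at each of five positions, that settles the arithmetic: it is 2 to the 5th power, or 32 openings.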

Academic writing and how not to start a paper: Episode 2

Nobody gives a shit about medical terminology. Rethink your opening sentence.

This post is part of an occasional series on writing about academic research. Writing about writing about academic research on my blog allows me to avoid writing my book about writing about academic research. Techniques that other authors have used to avoid actually working on what they’re supposed to be working on include doing the laundry, abusing alcohol, and committing suicide. Personally, I think that writing about academic research on my blog is more adaptive than those techniques.  Well, obviously it’s not more adaptive than doing the laundry…

Introduction

Medical terminology is one of the best-studied aspects of the English language. This is important because… But, the complicated structure of its words makes it difficult to translate into other languages.  This is a problem because… To address this issue of decomposition, this paper describes a simple parser for biomedical terminology. This would allow…

To: Zipf

Subject: Comments on draft 

Zipf, nobody gives a shit about medical terminology. Rethink your opening sentence.

Introduction

When the first author’s grandmother had to visit her doctor–an increasingly common occurrence as she grew older–she understood more or less nothing that she was told.  But, she would take careful notes–and then call the first author to find out what it all meant. 

What was the problem here?  The first author’s grandmother was not an educated woman, but she was no dummy–a quick-witted and articulate woman who loved jokes of considerable linguistic sophistication. The issue here was not the first author’s grandmother.  Rather, it was the language used by her doctor–specifically, the highly specialized vocabulary of medicine. The first author was a medic in the military, and subsequently was awarded a doctoral degree in linguistics, writing a dissertation on biomedical language. He has no problem understanding medical terminology. But, for a normal person, the language with which their physician communicates with them can be every bit as much of an obstacle to their treatment as the rationing of care that characterizes the American health care system.

Medical terminology is one of the best-studied aspects of the English language. This is important because… But, the complicated structure of its words makes it difficult to translate into other languages.  This is a problem because… To address this issue of decomposition, this paper describes a simple parser for biomedical terminology. This would allow…

To: Zipf

Subject: Comments on draft

I wish I’d known your grandmother.

The oldest paper on medical terminology in PubMed/MEDLINE, the National Library of Medicine’s database of 27 million scientific publications. The piece was published in 1911–well over a hundred years ago as I write this–and the problems that it raises still have not been solved.  No author is listed.


For more Zipfian ravings on the topic of writing about academic research, see here. Or, buy my book.  Oh, wait–I’m writing this blog instead of working on the book… Damn it…

 

 

How to smile your way through the Parisian transit strike: Citymapper

The Internet has given us Trump, revenge porn, and catfishing; in recompense, it has also given us free on-line versions of a number of historical French dictionaries, and a way to weather public transportation strikes with a smile.

Executive summary: there’s an app called Citymapper available on the iPhone and Android that does an excellent job of staying on top of metro, train, and bus line operating hours.  Want to know about (1) linguistic trivia associated with strikes in French, and (2) public attitudes about the current action sociale?  Read on.

One of the things that I find very striking about Paris is that although the building located at any particular spot might change, the function carried out there can remain constant over centuries.  Millennia, even.  For example: the spot where Notre Dame de Paris is located has been a place of worship since the Druids were there.  The Palais de justice was the residence of the Roman administrator, and then the palace of the early French kings, before becoming the center of the French court system.  And, most relevant to today’s ravings: the site of the Parisian City Hall has been the place from which the city is run for as long as Paris has been run by its bourgeois.

City Hall–in French, L’Hôtel de ville–is located on the Right Bank of Paris.  Although the Right Bank is very much the seat of Parisian power today, it started as mostly swampland.  (That fact figures into how the city was taken by the Romans–a story for another time.)  The expansion of Paris from the Left Bank to the Right in the early Middle Ages started with the area where the Hôtel de Ville is located today.  It was an early area of business, and the riverbank–la grève–in front of its current location was a gathering spot for laborers looking for work.  As the story goes (and I’m sorry that I can’t give you a citation for this, but I think that I ran across it in Metronome), over time the word for the place where laborers gathered became associated with strikes by laborers.

There’s some documentary evidence for this association.  Let’s work our way backward.  The Internet has given us Trump, revenge porn, and catfishing; in recompense, it has also given us free on-line versions of a number of historical French dictionaries.  Les voilà.  Starting with the 8th edition of the Dictionnaire de l’Académie française, published 1932-1935, we have the following.  The first sentence translates as “A level, flat surface covered with gravel or sand, going along the edge of a sea or a large river”:

Screen shot from TheFreeDictionary.com. In the second paragraph (which I did not translate), they’re not shitting about the executions.  Notable ones that took place there include that of Jacques de Molay, the last grandmaster of the Knights Templar, who was burnt at the stake there on March 18th, 1314; and that of Robert-François Damiens, who was drawn and quartered there on March 28th, 1757.  (The event was extensively documented.  If you have a copy of Michel Foucault’s Discipline and Punish: The Birth of the Prison on your bookshelf, you’ll find an accurate description of the event in the first chapter.  The savagery was difficult to imagine–one of the professional executioners went into retirement after participating.)

Continuing back in time to the 18th century, we have this from Jean-François Féraud’s Dictionnaire critique de la langue française, published 1787-1788.  It contains the definition “level and sandy beach”:

Linguists will notice the prescriptiveness of the entry, which includes the observation that the verbal form of the word, which means “to harm,” is “not often used outside of the Palace, and in ordinary language is not good style,” as well as the facts that (1) Richelet found it a bit old (Phil dAnge, who was Richelet?), (2) Trév says that it was becoming a bit outdated (Phil dAnge: Trév.??), and (3) the Academy includes it without comment. (Do note that he is talking about a verb, not about the “(river) bank” sense of grève.) Source: screen shot from https://fr.thefreedictionary.com/gr%c3%a8ve

Finally, going back to Jean Nicot’s Thresor de la langue française, published in 1606, we have the following, which includes words that I believe to mean “gravel, sand” (gravier and arena):

Nicot’s entry includes another meaning of the noun, which I think is a part of a suit of armor that goes on the legs. Source: https://fr.thefreedictionary.com/gr%c3%a8ve


If you haven’t been reading the news from France lately: public transport workers in and around Paris have been on strike for the past six weeks.  A public transport strike in these parts does not mean a complete cessation, but rather a diminution, of service.  A given metro line might be operating at half capacity, or maybe only 1 out of 3 trains on the line are running; those services might be only available during the morning and evening rush hours (en heures de pointe), or just in the evening.  Trains are packed to bursting, electric scooter rentals are maxed out; Uber is running, but the automotive traffic is so heavy that a 30-minute ride can easily take an hour.  As I write this in mid-January of 2020, the exceptionally convenient low-cost mobility that is such a delight of normal life in the City of Light is only a fond memory.

Are Parisians frustrated by the disruptions caused by the strike?  Of course.  Are they complaining about it a lot?  Not really.  Here are typical comments from my friends about the motivation for the strikes–a proposed reorganization of the admittedly convoluted French retirement system:

  1. The reforms won’t hurt me, personally–but, I’m worried for my child.
  2. The transportation workers are striking for all of us.
  3. The strike has to screw up Paris, or it won’t have any effect.

The comments reflect some underlying widespread French attitudes about their famous work stoppages: (1) Everybody has to earn a living, and (2) Your strike may be screwing up my life today, but my strike will be screwing up yours tomorrow.  So: in general, people are pretty tolerant of this kind of thing.

…and with that, I’m off to check Citymapper to find the best way to get to the Musée de la paléontologie et de l’anatomie comparée, one of the three best museums in the world, in my humble but reasonably informed opinion.

The picture of an écartèlement (“drawing and quartering” in English) at the top of this page is of a bas relief from northeastern Spain. I found it at https://fr.vikidia.org/wiki/%C3%89cart%C3%A8lement.

Conflict of interest statement: I don’t have any.  Citymapper does not pay me, nor do they offer me free services.

What computational linguists actually do all day: The read-between-the-lines edition

Watch a movie like Arrival and you’ll get the impression that linguists spend their professional lives sitting around speculating about Sanskrit etymologies and the nature of the relationship between language and reality.  I’m not saying that we never do such things, but, no: that’s not what we do with our typical workdays.  I’m a computational linguist, which among other things means that what I do involves computers, which among other things means that I spend a certain amount of my time sitting around writing computer programs that do things with language.  Often, those programs are doing things that do not look very…exciting.  Not to the untrained eye, at any rate.

For other glimpses into the daily life of computational linguists, click here.

Case in point: yesterday I wanted to see how the statistical characteristics of language are affected by different decisions about what you consider a “word.”  You would think that the word “word” would be easy to define–in fact, not only do linguists not agree on what a word is, but you would have a hard time getting all linguists to agree that words even exist.  (One of the French-language linguistics books that I have my nose stuck in the most is Maurice Pergnier’s Le mot, “The Word.”  The first 50 pages (literally) are devoted to theoretical controversies around the question of whether words actually exist–or not. Want a good English-language discussion of the issues?  See Elisabetta Jezek’s The lexicon: An introduction.)

So, yesterday I got to thinking about one of the questionable cases in English: contracted negatives of auxiliary verbs (the modals among them).  Here’s what that means.

In English, there is a small number of frequently-occurring verbs that can (and do) get negated not by a separate word like no, but by adding a special ending, spelled -n’t:

  • is/isn’t
  • did/didn’t
  • have/haven’t
  • could/couldn’t
  • would/wouldn’t
  • does/doesn’t

Note that British English has another form:

I’ve not

…which means I haven’t.

Now, if you care about statistics, you care about counting things.  Think about how you would count the numbers of words in these examples:

  1. I want to go.
  2. I do want to go.
  3. I do not want to go.
  4. I don’t want to go.

(3) and (4) are both perfectly acceptable ways of negating (1) and (2).  How would they affect a program that counts the number of words?  It depends.  Here are the straightforward cases: if (1) has four words (I, want, to, and go), then (2) has five (add do to the previous four), and (3) has six (add not to the previous five).

The questionable case is (4).  You could make a reasonable argument that don’t is a single word.  You also could make a reasonable argument that don’t should be counted as two words.  But, which two words?  A reasonable person could propose do and n’t–just split the “stem” do from the negative n’t.  
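The two ways of counting (4) can be sketched in a few lines of Perl. This is only an illustration: count_words is a made-up helper, it treats anything separated by whitespace as a word, and it assumes the apostrophe is the plain ASCII one.

```perl
use strict;
use warnings;

# Count whitespace-separated words; optionally split n't off as its own word.
sub count_words {
    my ($sentence, $split_negatives) = @_;
    $sentence =~ s/[.?!]//g;    # sentence-final punctuation is not a word
    $sentence =~ s/(\w+)(n't)\b/$1 $2/gi if $split_negatives;
    my @words = split ' ', $sentence;
    return scalar @words;
}

for my $sentence ( "I want to go.", "I do want to go.",
                   "I do not want to go.", "I don't want to go." ) {
    printf "%-22s one word: %d   split: %d\n",
        $sentence, count_words($sentence, 0), count_words($sentence, 1);
}
```

The first three sentences come out as 4, 5, and 6 words either way; only (4) changes, from 5 words to 6, depending on the decision.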

Fine.  But, let’s look at a little more data:

  1. I will go.
  2. I will not go.
  3. I won’t go.
  4. I can go.
  5. I cannot go.
  6. I can’t go.

Clearly (1) has three words–I, will, and go.  …  (2) adds one more, with not.  What about (3), though?  Is it inconsistent to count will not as two words, but won’t as one?  Maybe.  If you’re going to split it into two “words,” what are they?  Presumably wo and n’t?  But, what the hell is wo?  Is it the same “word” as will?  Notice that we’ve now had to start putting “word” in “scare quotes,” which should tell you that knowing what, exactly, a “word” is isn’t quite as simple as it might appear at first glance.  Think about this: in science you need to know exactly what the thing that you’re studying is, which implies that you can recognize the boundary between one of those things and another.
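To make the wo question concrete, here is a rough sketch of two possible answers: split at the surface and live with wo, or map the odd stems back to the full verb. The %full_form table and the three helper names are hypothetical, the table covers only the two irregular stems discussed here, and the apostrophe is assumed to be plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical lookup table: stems that differ from the full verb.
my %full_form = ( wo => 'will', ca => 'can' );

# Option 1: split at the surface, so won't becomes "wo n't".
sub split_surface {
    my ($text) = @_;
    $text =~ s/\b(\w+)(n't)\b/$1 $2/gi;
    return $text;
}

# Map a stem to its full verb when the table has one; otherwise keep it.
sub expand_stem {
    my ($stem) = @_;
    my $full = $full_form{ lc $stem };
    return defined $full ? $full : $stem;
}

# Option 2: normalize, so won't becomes "will not".
sub split_normalized {
    my ($text) = @_;
    $text =~ s/\b(\w+)n't\b/expand_stem($1) . ' not'/gie;
    return $text;
}

for my $sentence ("I won't go.", "I can't go.") {
    my $surface    = split_surface($sentence);
    my $normalized = split_normalized($sentence);
    print "$sentence => $surface | $normalized\n";
}
```

Note that the second option turns can’t into can not, which is yet another form to reconcile with cannot in (5).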

What’s the right answer?  Hell, I don’t know.  I do know this, though: if you’re interested in the statistics of language (wait–what’s you’re?  Hell, what’s what’s?), then you have to be able to count things, so you have to make some decisions about where the boundaries between them are.  My issue du moment is actually not choosing between the options, but rather seeing what the consequences of those specific decisions would be for the resulting statistical measures, so I need to be able to test the effects of different ways of splitting things up (or not), so I need to write some code…
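As a toy illustration of those consequences, the sketch below counts tokens (running words) and types (distinct words) for the six example sentences under both decisions. The helper type_token_counts is invented for this example, and lowercasing and stripping punctuation are simplifying assumptions of mine.

```perl
use strict;
use warnings;

# Return (number of tokens, number of distinct types) for a text.
sub type_token_counts {
    my ($text, $split_negatives) = @_;
    $text = lc $text;
    $text =~ s/[.?!,]//g;
    $text =~ s/\b(\w+)(n't)\b/$1 $2/g if $split_negatives;
    my @tokens = split ' ', $text;
    my %seen;
    $seen{$_}++ for @tokens;
    return ( scalar @tokens, scalar( keys %seen ) );
}

my $sample = "I will go. I will not go. I won't go. "
           . "I can go. I cannot go. I can't go.";

for my $policy ( [ 'one word', 0 ], [ 'split', 1 ] ) {
    my ($label, $split) = @$policy;
    my ($tokens, $types) = type_token_counts($sample, $split);
    print "$label: $tokens tokens, $types types\n";
}
# one word: 19 tokens, 8 types
# split: 21 tokens, 9 types
```

Even on six tiny sentences, both the size of the text and the size of the vocabulary move, so any statistic built on them moves too.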


What you see below is me using a computational tool called a “regular expression” to find words that have a negative thing attached at the end (e.g. n’t) and separate the negative thing from the rest of the word.  So, given an input like didn’t, I want my program to (1) recognize that it has a negative thing at the end, and (2) split it into two parts: did, and n’t.  Grok (see the English notes for what grok means) the code (code means “instructions in a programming language”–here I’m using one called Perl), and then scroll down past it for an explanation of how it illustrates a piece of advice that I often give to students…

# this assumes input from a pipe...
while (my $line = <>) {

print "Input: $line";

# this doesn't work--why?
#$line =~ s/\b(wo|ca|did|could|should|might)(n't)\b$/\$1 $\2)/gi;
# works...
#$line =~ s/(a)(n)/a n/gi;
# this does what I want...
#$line =~ s/(a)(n)/$1 $2/gi;
# works...
#$line =~ s/(ca)(n't)/$1 $2/gi;
# works...
#$line =~ s/(ca|wo)(n't)/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did|could)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did|could|should|might)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did|had|could|should|might)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|did|had|have|could|should|might)(n't)\b/$1 $2/gi;
# works...
#$line =~ s/\b(ca|wo|do|did|has|had|have|could|should|might)(n't)\b/$1 $2/gi;
# and finally: this pretty much looks like what I started with, but
# what I started with most definitely does NOT work...  what the fuck??

$line =~ s/\b(ca|wo|do|did|has|had|have|would|could|should|might)(n't)\b/$1 $2/gi;

       print $line;

} # close while-loop through input

The “regular expressions” in this code are the things that look like this:

s/\b(wo|ca|did|could|should|might)(n't)\b$/\$1 $\2)/gi

…or, in the case of a much shorter one, like this:

s/(a)(n)/a n/gi

(Note to other linguists: yes, I know that technically, the regular expression is just the part between the first two slashes, i.e. the (a)(n) in s/(a)(n)/a n/gi in the second example.  Don’t hate on me–I’m trying to make this at least somewhat clear.) The lines that start with # are my notes to myself–the “reading between the lines” that you have to do to see how irritating it can be to troubleshoot this kind of thing.
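As for why the first attempt fails while the last one works, a likely explanation lies in three small differences: the $ at the end of the pattern, the backslash in front of $1, and the stray ) in the replacement (which is just a literal character), with $\2 also not being the $2 that the working versions use. The sketch below isolates the first two on a one-verb pattern; the variable names and the test sentence are illustrative only.

```perl
use strict;
use warnings;

my $input = "I didn't go.";

# Attempt A: the trailing $ only matches at the end of the line, so a
# contraction in the middle of a sentence is never found.
(my $anchored = $input) =~ s/\b(did)(n't)\b$/$1 $2/gi;

# Attempt B: \$1 is an escaped dollar sign, so the replacement contains the
# literal characters "$1" rather than the captured text.
(my $escaped = $input) =~ s/\b(did)(n't)\b/\$1 $2/gi;

# Attempt C: no anchor and no backslash, as in the versions marked "works".
(my $working = $input) =~ s/\b(did)(n't)\b/$1 $2/gi;

print "anchored: $anchored\n";   # anchored: I didn't go.
print "escaped:  $escaped\n";    # escaped:  I $1 n't go.
print "working:  $working\n";    # working:  I did n't go.
```

The final version may look like the first one at a glance, but it differs in exactly those spots.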


A regular expression is a way of describing a set of things.  What makes it “regular”–a mathematical term–is that those things can only occur in a very limited number of relationships.  In particular, that limited number of relationships does not include some phenomena that are very important in language, such as agreement between subjects and verbs–think of Les trois soeurs de ma grand-mère m’ont toujours aimé, “my grandmother’s three sisters have always loved me.”  The issue here is that regular expressions can only describe sequences of things that you might think of as “next to” each other; les trois soeurs is separated from the verb avoir, which must be in the third person plural form ont, by ma grand-mère, which would require the third person singular form a.  (Linguists: I know.)
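The “set of things” idea can be made concrete with a small sketch. The pattern below stands for a set of exactly three strings, and the hypothetical helper in_set simply asks whether a word belongs to that set (plain ASCII apostrophes assumed).

```perl
use strict;
use warnings;

# \A and \z pin the pattern to the whole string, so it stands for a set of
# exactly three strings: can't, won't, and didn't.
my $negated_verb = qr/\A(?:ca|wo|did)n't\z/;

sub in_set {
    my ($word) = @_;
    return $word =~ $negated_verb ? 1 : 0;
}

for my $word ("can't", "won't", "didn't", "shan't", "cant") {
    my $verdict = in_set($word) ? 'in the set' : 'not in the set';
    print "$word: $verdict\n";
}
```

The first three words are in the set; shan’t and cant are not, because the pattern never mentions them.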

Regular expressions, and the “regular languages” that they can describe, became important in linguistics when B.F. Skinner (yes, the famous psychologist) wrote a book about the psychology of language in which he suggested that they can describe human languages from a mathematical perspective.  This claim caught the attention of one Noam Chomsky, who wrote a book review pointing out the inadequacy of the idea of regular expressions as a description of human language.  The review brought him a lot of notice, and he went on to develop the ideas in that review into the most widespread and influential linguistic theory since the Tower of Babel.  Today, if you’ve only heard of one linguist, it is almost certainly Chomsky.

Chomsky’s critique of “regular languages” included the observation that there are perfectly natural things that can be said in any human language that can’t be described by a regular language.  For example:

Me, my brother, and my sister went to William and Mary, Indiana University, and Virginia Tech, respectively.

The problem that this illustrates for regular languages is that they don’t have a mechanism for accounting for the fact that you can have sentences where you have a list of things in an early part of the sentence, and then must have a list of things of the same length in a later part of the sentence.  Don’t believe me? Go read a book on “formal languages,” and then try it.
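A stripped-down version of the same problem, as a sketch: let a string of a’s stand for the first list and a string of b’s for the second. A pattern built from the classic regular operators can require “some a’s, then some b’s,” but it has no way to require that there be equally many of each, so the check below does the counting in ordinary code. (Perl’s own pattern engine has extensions that go beyond regular languages; this sketch deliberately avoids them. The helper names are made up.)

```perl
use strict;
use warnings;

# Classic regular pattern: any number of a's followed by any number of b's.
sub accepted_by_pattern {
    my ($string) = @_;
    return $string =~ /\Aa*b*\z/ ? 1 : 0;
}

# Enforcing equal lengths takes counting, done here in ordinary code.
sub lengths_match {
    my ($string) = @_;
    return 0 unless $string =~ /\A(a*)(b*)\z/;
    return length($1) == length($2) ? 1 : 0;
}

for my $string ('aaabbb', 'aaabb') {
    printf "%-7s pattern: %d  equal lengths: %d\n",
        $string, accepted_by_pattern($string), lengths_match($string);
}
```

The pattern happily accepts aaabb, the equivalent of three siblings and only two universities; only the explicit length comparison catches the mismatch.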


Linguistic geekery

Regular expressions are pretty natural tools for people who work with textual data, and they’re especially natural for linguists.  This is a surprise to a lot of computer scientists, some of whom are masters of regular expressions, but some of whom find them irritatingly bewildering.  It turns out that if you take a course on the “formal foundations” of linguistics, i.e. its groundings in logic and set theory, you will run across regular languages, which fact makes regular expressions pretty easy to learn.  And, for textual data, they are really useful despite their limits–so much so that a programming language (named Perl) was created expressly for the purpose of making it easy to use regular expressions to “parse” textual data.  So, when I found myself wanting to be able to rip through a bunch of textual data and find the negative things like n’t, Perl and its regular expressions were a logical choice.