Coronavirus entertainment: Tattooing yourself

American English oral comprehension practice, with coronavirus and some tattoos

A bit of English listening practice: here’s an NPR story about a tattoo artist who is keeping himself entertained by tattooing…himself.  I’ve put together a list of some vocabulary items that you might not have come across before.  If you would like other aspects of the language of the story explained, please say so in the Comments section!

ink: The basic meaning of this word as a noun is encre. When the announcer says (at 12 seconds into the story) that the tattoo artist has given himself one new piece of ink every day in quarantine, he means that the man is giving himself one tattoo every day. Nowadays you can also use this word as a verb, to ink, which means to give a tattoo, or to tattoo. Examples below.

to sport [something]: to wear proudly. Merriam-Webster defines it this way: “to display or wear usually ostentatiously.” At 15 seconds into the story, the reporter says Woodhead, who already sported hundreds of tattoos…

CNN: A popular cable news channel. It is roughly the BFM-TV of the United States–relatively short stories versus a lot of in-depth programming; the same stories played over and over; and, it should be said, some very good commentators, particularly Fareed Zakaria, a pretty unattractive man who I would nonetheless love to have a cup of coffee with due to his great insights into the world and the news of the day.

 

 

Fiche le camp, Jack: English idiomatic expressions with “to hit”

One of the most delightful books I have ever read in French is named Les Mots et la chose–“Words and The Thing.”  “The thing” is a euphemism for “sex.”  The conceit of the book is that an actress who earns her keep by dubbing pornographic movies has grown weary of the limited vocabulary that her job calls for, so she writes to a retired linguist who specialized in words for la chose to ask for suggestions.  He comes through in spades, with separate chapters for all of the relevant body parts, and of course for l’acte itself.  My favorite: Le détroit des Dardanelles,  the Strait of the Dardanelles, for that part of your body where poo comes out and where, between friends, other things might occasionally go in.


I keep seeing all of these articles in the paper about how to fight coronavirus-quarantine-related boredom.  I don’t get it–I haven’t been this busy in ages.  Telecommuting; reminding my father to eat, to take his medicine, and to let me do his laundry; making masked food runs to the grocery store; eating half of a chocolate babka in a single day (damn it, Zipf); sitting on the front porch smoking cigarettes and petting the dog–I barely have time to learn my 10 words per day of French vocabulary.

Of course, none of that has stopped me from spending inordinate amounts of time looking up French-language covers of classic American songs.  For example, Fiche le camp, Jack is a cover of Hit the Road, Jack, a favorite from before my childhood (and hence, a long fucking time ago).  A cover differs from a dubbed version in that where dubbing involves an original video version whose audio track is replaced, a cover is a de novo production.  So, if there is a video involved, too, then it will be shot anew for the new version.

So, the above-mentioned French actress is dubbing movies so that they have a French-language soundtrack, while the video below shows a version of Hit the Road, Jack, nicely covered by Richard Anthony and some great back-up singers. I hope that it brings a smile to your quarantine day.  Scroll down for the English notes if you are so inclined–today we will talk about some idioms involving the verb to hit, as well as discuss American Evangelical beliefs about what’s going to happen to us sinners.


English notes: idioms involving the verb to hit

In the following examples, note that hit is an irregular verb: its present tense, past tense, and past participle are all hit.

to hit the books: to study.

I can’t go to the party tonight–I gotta hit the books.

Gotta is colloquial language for to have to.

to hit the road: to leave.

This has been a great party, but it’s time for me to hit the road–I gotta go study for my stupid linguistics exam.

to hit bottom: to reach the lowest, most terrible point of your life. It is often used in connection with alcoholics and drug addicts–the belief is that before you can get dry (alcoholics)/clean (drug addicts), you have to “hit bottom.”

God had left her alone with the sinners, so she would sin.  But, she hit bottom after going on a drunken binge with two men she met at a Catholic-sponsored conference on Poverty in the World of Change.  She woke up naked in a hotel bathtub.

The Forsaken, Book Two of The Apocalypse Trilogy.  This is an amusing series of American Protestant fundamentalist fiction about The Rapture, an event in which non-sinners will be whisked up to Heaven, while the rest of us are left on Earth.  (I think that we get damned to eternal Hell at some point.)  The extract is fascinating to me, in that in three short sentences it evokes so many of the tropes of American Protestant fundamentalism: anti-Catholicism, resistance to social services for the poor, and of course loathing of sex.

to hit the sack: to go to bed.

I’m gonna hit the sack–I’ll study for that stupid linguistics test tomorrow.

to hit the hay: to go to bed.

Well, Jack finally hit bottom. He went to the party, but he hit the road early to go home and hit the books.  But, instead, he hit the hay and didn’t study at all.  So, he flunked the test, which dropped his final grade in the course, which dropped his overall GPA, so he lost his badminton scholarship.  He went to his professor and asked him to raise his grade, but his professor said “Surely my course isn’t the only one in which you earned a lower grade than you needed?  Why not go to one of your other professors, and ask them to raise your grade?”  I guess you gotta hit bottom before you get sufficiently motivated as to get your shit together.

I have changed some details to protect the guilty.  But, yeah–I was the professor.

 

 

 

 

Prévert and Les mystères de Paris: Best. Vocabulary. Word. Ever.

Normalcy through vocabulary. And poetry.

The fact that covid-19 has 50% of the world’s population under lockdown orders does not change the fact that in the US, it is National Poetry Month.  The French are getting cats to play tic-tac-toe (le morpion in French, which also means [genital] crab, and I cannot stop giggling like a schoolboy about that), Americans are watching Netflix, and the President of the United States is showing himself more and more to be le roi des cons–and Art goes on.


Jacques Prévert’s poem Pater noster has opening lines as good as any in the world of free verse (translations by me, sorry):

Notre Père qui êtes aux cieux
Restez-y

Our Father who art in heaven
Stay there

Et nous nous resterons sur la terre
Qui est quelquefois si jolie

And we’ll stay here on Earth
Which is sometimes so pretty

Avec ses mystères de New York
Et puis ses mystères de Paris

With its mysteries of New York
And then its mysteries of Paris


So, yeah: the cool neighborhood near me is now empty except for the homeless people living under tarps in the sheltered doorways of now-abandoned shops, Macron is urging the French to support health-care workers, and Trump is urging Americans to support airlines; and I am trying to restore some sense of normalcy to my life by learning my usual 10 words of French vocabulary per day.

So, I’m on a French-language furniture web site the other day trying to find a picture of some obscure item of furniture or another that I ran across while reading Colette’s Chéri, when I came across this: the mystères de Paris.  Literally, that means “the mysteries of Paris”–but it means so, so much more…and thus we have the Best. Vocabulary. Word. Ever.


It turns out that there is such a thing as a mystères de Paris–and it is a commode.  Not a commode in the French sense of the word–what’s called in English a dresser–but a commode in the English sense of the word–a bedside chair with a receptacle for pooping.  A bedside toilet, if you will.  It’s not just any kind of commode, though:

  1. It’s a disguised commode.
  2. It is usually made to look like a stack of books.

From the Meubliz.com web site (translations by me, sorry):

Ce siège d’aisance prend la forme d’une pile de livres simulés. La partie supérieure s’ouvre comme un abattant pour laisser apparaître la cuvette. Ce petit meuble repose sur des pieds bas tournés en balustre ou découpés.

Généralement, ce siège de commodité assez original était décoré de belles et luxueuses couleurs.

This commode takes the form of a pile of fake books. The upper part opens as a lid to access the bowl.  This small piece of furniture sits on low feet that are either turned in the shape of a baluster or cut out.

Typically, this rather unusual commode was decorated with pretty, luxurious colors.

Mystères de Paris bedside toilet. Source: Meubliz.com

If you’ve followed this site, you know that Prévert’s poetry is great for understanding what people mean when they talk about “the impossibility of translation.” This is a great example–I just can’t even imagine a way to render mystères de Paris into English, and forget about maintaining that rhyme:

….sur la terre
Qui est quelquefois si jolie

…on Earth
Which is sometimes so pretty

Avec ses mystères de New York
Et puis ses mystères de Paris

With its mysteries of New York
And then its mysteries of Paris

(Yes, jolie and Paris rhyme in French.)

A Dutch-made mystères de Paris bedside toilet from 1850. Source: Meubliz.com

(Wait, I forgot–more tic-tac-toe-playing cats…)

 

So…let’s all stay in, stay healthy, thank the people working in the grocery stores, thank the people working in the gas stations, thank the doctors, thank the nurses, thank the respiratory therapists–and ignore les maîtres de ce monde, les maîtres avec leurs prêtres, leurs traîtres et leurs reîtres–a line from later in the poem that is more than evocative of the coronavirus-era Trump.  And let’s take care of each other.

See this post for the full poem, as well as for a discussion of the line that I just mentioned.  You can exercise your oral comprehension skills with an English-language video, complete with subtitles, on how to make your own face mask here.

Mystères de Paris bedside toilet. Source: Meubliz.com

What computational linguists actually do all day during lockdown

For starters: how about if we fight to get the files open?

I know, I know–computational linguistics sounds like it would be the most glamorous job in the world, right? We have a dirty little secret, though: 85 percent of our time is spent just trying to read in, and clean up, data files.

The novel coronavirus covid-19 has me on lockdown just like anybody else. What to do all day? Well, the National Library of Medicine, the Allen Institute for Artificial Intelligence, and some other folks whose names escape me recently released a corpus (set of linguistic data) of scientific journal articles that might or might not be relevant to covid-19 research, named CORD-19, and asked the artificial intelligence people of the world to see what they could do with it. Great—what else would I do all day?

For starters, how about fight to get the fucking files to open?
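Here is a minimal sketch of that fight in Python. It is not the code from the original session. The function name read_articles is hypothetical, and the "body_text" and "text" keys are an assumption about how each JSON file is laid out, so check them against the real files before trusting the output. The point is only that unreadable files get skipped and reported instead of crashing the whole run.

```python
import json
from pathlib import Path


def read_articles(folder):
    """Yield (file name, text) for every JSON file in folder that can be parsed.

    Assumption: each file holds a JSON object with a "body_text" list whose
    items carry a "text" string. Adjust the keys to the real layout.
    """
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with open(path, encoding="utf-8") as handle:
                record = json.load(handle)
        except (OSError, UnicodeDecodeError, json.JSONDecodeError) as error:
            # Report the broken file and move on to the next one.
            print(f"skipping {path.name}: {error}")
            continue
        if not isinstance(record, dict):
            print(f"skipping {path.name}: top level is not a JSON object")
            continue
        paragraphs = record.get("body_text") or []
        text = " ".join(
            p.get("text", "") for p in paragraphs if isinstance(p, dict)
        )
        yield path.name, text
```

Calling list(read_articles("some_folder")) gives you whatever could be opened, plus one printed complaint per file that could not.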

After that, we could clean some of the useless stuff out of the data—section headings (Introduction, Methods, Results…), puffery (important, significant), stuff like that…

Get rid of spaces and that sort of nonsense…
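A rough sketch of those two cleaning steps in Python, assuming a simple word-by-word filter. The word lists are just the examples mentioned above plus a couple of guesses, and clean_text is a made-up name, not anything from a library.

```python
import re

# Illustrative lists only: extend them as you find more junk in the data.
SECTION_HEADINGS = {"introduction", "methods", "results", "discussion", "conclusion"}
PUFFERY = {"important", "significant"}
UNWANTED = SECTION_HEADINGS | PUFFERY


def clean_text(text):
    """Lower-case the text, drop unwanted words, and collapse whitespace."""
    # Keep runs of letters/digits, allowing internal hyphens (e.g. "covid-19").
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    kept = [word for word in words if word not in UNWANTED]
    # Joining with a single space gets rid of stray spaces and line breaks.
    return " ".join(kept)
```

Note that this throws away punctuation as well, which is fine for counting words but not for much else.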

…and then make a TermDocumentMatrix, a “data structure” that lists all of the words in the corpus and the documents in which they occur (or all of the documents in the corpus and which words occur in them, depending on how you flip it).
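To make the idea concrete, here is a small sketch of such a structure in plain Python, using dictionaries of counts. The function names are hypothetical, and this is a toy stand-in for the real thing, not the TermDocumentMatrix mentioned above.

```python
from collections import Counter


def term_document_matrix(documents):
    """Map each term to a Counter of {document name: number of occurrences}.

    documents is a dict of {document name: already-cleaned text}.
    """
    matrix = {}
    for name, text in documents.items():
        for term, count in Counter(text.split()).items():
            matrix.setdefault(term, Counter())[name] = count
    return matrix


def flip(matrix):
    """Turn term -> documents into document -> terms."""
    flipped = {}
    for term, row in matrix.items():
        for name, count in row.items():
            flipped.setdefault(name, Counter())[term] = count
    return flipped
```

For two tiny documents, {"d1": "virus spreads virus", "d2": "virus mutates"}, the row for "virus" records two occurrences in d1 and one in d2, and flip gives you the same numbers organized by document instead.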

We’ll try to make a word cloud, which will result in us watching several pages’-worth of error messages fly by (most of them removed for your viewing pleasure):

…and then we can finally see what’s in that data set, at which point we notice that there are some more words that we should be removing:

…and then make a little graph of the most frequent words, at which point we’ll realize that we should probably be removing things like plural markers on nouns, past-tense markers on verbs, and stuff like that:
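A sketch of counting the most frequent words after stripping plural and past-tense endings, in Python. The crude_stem function is a deliberately naive, made-up stand-in for a real stemmer: it only knows about "-ed" and "-s", and it will happily mangle a word like "speed".

```python
from collections import Counter


def crude_stem(word):
    """Strip a past-tense "-ed" or a plural "-s"; nothing cleverer than that."""
    if len(word) > 4 and word.endswith("ed"):
        return word[:-2]
    # Leave words like "virus", "class", and "analysis" alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def top_terms(words, n=10):
    """Return the n most frequent stems with their counts."""
    return Counter(crude_stem(word) for word in words).most_common(n)
```

With this, "patients" and "patient" land in the same bucket, as do "infected", "infects", and "infect", which is the whole point of removing those markers before counting.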

…and finally, we will record exactly what we did, in hopes of actually being able to do it on demand, which we obviously need to do immediately, or as soon as we fix the problems that we just identified, at any rate.

And there’s another lesson in this, too: 85% of any programmer’s time, be they computational linguists or not, is spent fixing the problems with code (computer instructions) that they have already written.

As is often the case with my articles about what I do for a living, this one sounds whiny. Don’t be fooled: I love my job, and consider myself the luckiest guy in the world to be able to do things that I love for a living. Now to fix those fucking problems…

Computational linguistics and the covid-19 coronavirus

I know, I know–you think that computational linguists spend their time sitting around discussing morphological typology and its implications for the monogenetic versus polygenetic hypotheses of the origin of language. And we do–after work, over beers.

For most of us, though, our professional life consists of trying to get computers to do things that involve language in some way.  One of the biggies is helping people find information. To teach a computer to do something like that, you need to have access to a lot of data to test the system that you are building.

Enter the covid-19 coronavirus.  There is suddenly an enormous amount of research being done on something that we did not even know about just 6 months ago. At the same time, there is a lot of research already published on other coronaviruses, and it would be idiotic to try to do research on a novel coronavirus without taking advantage of what we already know about the others. But, how can anyone go through the 15,000+ papers on coronaviruses that are already in the US National Library of Medicine’s PubMed/MEDLINE database?

Enter computational linguists. Sometimes considered a branch of artificial intelligence, we work on computer programs to do things like summarize large sets of publications.  There are lots of things that you have to be able to do in order to do that–figure out what’s being talked about (coronavirus and medications? Coronavirus and transmissibility? Coronavirus and respiratory failure?); tell the difference between a positive statement, a negated statement, a speculative statement, and a negated speculative statement:

  1. Positive: The person-to-person transmission routes of 2019-nCoV included direct transmission, such as cough, sneeze, droplet inhalation transmission, and contact transmission, such as the contact with oral, nasal, and eye mucous membranes. (Source: this paper)
  2. Negated: The person-to-person transmission routes of 2019-nCoV did not include indirect transmission over the Internet. (I made this sentence up)
  3. Speculative: This other coronavirus might be specific to deer species. (From this paper published in 1995 about a different coronavirus)
  4. Negated speculative: This other coronavirus might not be specific to deer species. (I made this sentence up)

…and many other tasks that all have to be handled in order to solve the problem of summarizing those 15,000+ papers–and many other problems in getting computers to understand human language, too.
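To give a feel for the positive/negated/speculative distinction, here is a toy Python sketch that labels a sentence by looking for cue words. The cue lists and the classify function are invented for illustration. Real systems also have to work out what a cue applies to, handle contractions like "didn't", and much more, so treat this as a caricature of the task rather than a solution.

```python
import re

# Tiny, illustrative cue lists; real ones are much longer.
NEGATION_CUES = {"not", "no", "never", "without"}
SPECULATION_CUES = {"might", "may", "could", "possibly", "perhaps"}


def classify(sentence):
    """Label a sentence by which kinds of cue words it contains."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    negated = bool(words & NEGATION_CUES)
    speculative = bool(words & SPECULATION_CUES)
    if negated and speculative:
        return "negated speculative"
    if negated:
        return "negated"
    if speculative:
        return "speculative"
    return "positive"
```

Run on the four example sentences above, it gives the four labels in order. It is easy to fool, though: "It is not impossible that…" would come out as plain "negated".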

Like I said, though: in order to test our systems, we need data.  Enter a number of the big players in computational linguistics, who have created, and made freely available to the public, a large dataset of relevant papers. Their hope? That computational linguists around the world will dive into them, using them to develop and test tools for dealing with all of those papers.  Here’s an excerpt from the White House’s web site describing the effort to create and release the data, followed by a French-language appeal to the francophone computational linguistics community to work on it sent out by my colleague Pierre Zweigenbaum.  C’est parti…

“One of the most immediate and impactful applications of AI is in the ability to help scientists, academics, and technologists find the right information in a sea of scientific papers to move research faster. We applaud the OSTP, WHO, NIH and all organizations that are taking a proactive approach to use the most advanced technology in the fight against COVID-19,” said Dr. Oren Etzioni, Chief Executive Officer of the Allen Institute for AI. “The Allen Institute for AI, and particularly the Semantic Scholar team, is committed to updating and improving this important resource and the associated AI methods the community will be using to tackle this crucial problem.”

“It’s difficult for people to manually go through more than 20,000 articles and synthesize their findings. Recent advances in technology can be helpful here. We’re putting machine readable versions of these articles in front of our community of more than 4 million data scientists. Our hope is that AI can be used to help find answers to a key set of questions about COVID-19,” said Anthony Goldbloom, Co-Founder and Chief Executive Officer at Kaggle.

Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset

 

Bonjour,

Voici une série de tâches de recherche d’information / extraction d’information / fouille de textes / recherche de réponses à des questions lancées le 16 mars [1] sur un sujet d’actualité :

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks

dans une base de 29000 articles (dont 13000 en texte intégral) concernant le coronavirus (bien sûr pas seulement le « nouveau »). Les questions sont listées sous la rubrique “Tasks”, et chaque question générique est déclinée en questions spécifiques. Voir par exemple “What is known about transmission, incubation, and environmental stability?”
(https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=568).

Par ailleurs, un corpus (LitCovid) sur le Covid-19 est mis à jour en continu à la National Library of Medicine :
https://www.ncbi.nlm.nih.gov/research/coronavirus/ (1263 articles à
l’heure où j’écris ce message contre 1120 deux jours avant).

Le DBCLS à Tokyo a mis en place dans sa plateforme de gestion d’annotations un espace pour centraliser les informations extraites sur le corpus LitCovid sous forme d’annotations :

http://pubannotation.org/collections/LitCovid

Tous les spécialistes de TAL sont donc encouragés à appliquer leurs méthodes sur ces données et à les faire tourner sur Kaggle (CORD-19), à les appliquer au corpus LitCovid et à déposer les annotations sur PubAnnotation.

Bien cordialement,

Pierre Zweigenbaum.

[1] “Today, researchers and leaders from the Allen Institute for AI,
Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for
Security and Emerging Technology (CSET), Microsoft, and the National
Library of Medicine (NLM) at the National Institutes of Health released
the COVID-19 Open Research Dataset (CORD-19) of scholarly literature
about COVID-19, SARS-CoV-2, and the Coronavirus group.”

Coup de grâce: Don Marquis’s “freddy the rat perishes”

Some of the most traumatic experiences of my life have involved me killing a mouse.  Tonight was no exception.  Delivering the coup de grâce was miserable for me–besides being soft-hearted, I have an incredible phobia about dead animals–but I felt like after a long and hard-fought battle, the furry little warrior deserved it.

The experience brought to mind the end of this poem by Don Marquis.  It is all in lower-case because it has been typed by a cockroach.  Archy (yes, that’s the cockroach’s name) depresses the keys by jumping on them; this precludes ever hitting the shift key to make upper-case letters.  Many thanks to the Don Marquis blog, where I found the text.  Want to know more about the strange case of Archy the cockroach poet?  See this post.

freddy the rat perishes

By Don Marquis, in “archy and mehitabel,” 1927

listen to me there have
been some doings here since last
i wrote there has been a battle
behind that rusty typewriter cover
in the corner
you remember freddy the rat well
freddy is no more but
he died game the other
day a stranger with a lot of
legs came into our
little circle a tough looking kid
he was with a bad eye

who are you said a thousand legs
if i bite you once
said the stranger you won t ask
again he he little poison tongue said
the thousand legs who gave you hydrophobia
i got it by biting myself said
the stranger i m bad keep away
from me where i step a weed dies
if i was to walk on your forehead it would
raise measles and if
you give me any lip i ll do it

they mixed it then
and the thousand legs succumbed
well we found out this fellow
was a tarantula he had come up from
south america in a bunch of bananas
for days he bossed us life
was not worth living he would stand in
the middle of the floor and taunt
us ha ha he would say where i
step a weed dies do
you want any of my game i was
raised on red pepper and blood i am
so hot if you scratch me i will light
like a match you better
dodge me when i m feeling mean and
i don t feel any other way i was nursed
on a tabasco bottle if i was to slap
your wrist in kindness you
would boil over like job and heaven
help you if i get angry give me
room i feel a wicked spell coming on

last night he made a break at freddy
the rat keep your distance
little one said freddy i m not
feeling well myself somebody poisoned some
cheese for me im as full of
death as a drug store i
feel that i am going to die anyhow
come on little torpedo don t stop
to visit and search then they
went at it and both are no more please
throw a late edition on the floor i want to
keep up with china we dropped freddy
off the fire escape into the alley with
military honors

archy

Picture source: https://gritinthegears.blogspot.com/2010/11/freddy-rat-perishes-revisited.html

Seeing the complexity of the simple: Comparative anatomy of the scapula

We can do a lot of things with our arms that a quadruped can’t do with theirs. Throw spears at edible quadrupeds. Throw tomatoes at sopranos. Throw bums out of office.

Being a scientist means finding delight in things that look complicated but are actually governed by pretty simple principles, as well as in things that look pretty simple but are actually pretty complex.  Case in point: the scapula.

Common English: shoulder blade.

Technical term: scapula, plural scapulae or scapulas.

French: l’omoplate (nf), la scapula (WordReference)

A scapula looks simple: it’s mostly flat, with a protuberance here and there.  Unlike closely associated bones, scapulae don’t get broken very often–a Swedish study found a rate of scapular fractures of 10 per 100,000 people, while another Swedish study found 50 clavicular fractures per 100,000 people.

And yet: comparing the scapula across species, you see all kinds of interesting shit.  The point of comparative anatomy is that you can understand something better if you compare it to other ways that it could be, but isn’t.  So: let’s compare some scapulae.

The most obvious thing about the scapula is that it is positioned differently in different species.  The basic situation for most living things with limbs is this: you’re a quadruped (i.e. have four legs), and the scapula is located on the side of the trunk.  In contrast: look at a human, and the scapula is on its back. Compare the position of the scapula in this lovely picture of a horse:

The axis of the scapula is on the line between points A and B. Picture source: http://www.horsecoursesonline.com/college/conformation/lesson_two_893.htm

…with the position of the scapula in this lovely picture of a human:

Picture source: https://www.pinterest.com/pin/547891110909216204/?lp=true

…and you see the difference.  It’s even more striking when you look at our closer relatives.  We are primates, and specifically, members of a group of primates known as apes, and even more specifically, of the great apes.  One of the biggest differences between us and our various and sundry primate relatives is that we are full-time bipeds.  Autrement dit: we walk upright, all the time.  In contrast, monkeys–which are primates, but not apes–are full-time quadrupeds.  Going along with this difference in locomotion is a difference in the position of the scapula: it’s on our back, but on a monkey’s side.  

Here’s a really nice view of a monkey (left) and human (right) trunk.  Looking at the left side, labelled “monkey,” you see the typical quadruped architecture: the scapula is on the side of the chest cage.  On the right side, labelled “human,” it’s a different story: the scapula is on the back.

The arrows in this illustration make an important point: monkeys also have the typical quadruped chest cage, which is relatively narrow in comparison to its depth. In contrast, the human chest cage is sorta flat–relatively wide in comparison to its depth.  (Remember that the human skeleton in the picture is viewed from above, as if you had ripped someone’s head off in order to shit down their neck.  Sorry–a little sailor-talk there.  Unlike Trump, I served my country.  In contrast, the monkey is being viewed from the front–I have no great analogy for you here.)

Picture source: https://evolution.berkeley.edu/evolibrary/article/0_0_0/lines_05

Amongst primates in general, there is quite a bit of variability.  Why? Well, there’s quite a bit of variability in the extent to which they are quadrupedal versus bipedal. There’s quite a bit of variability in the capacity of the creature to do stuff with its hands over its head.  Here’s a nice layout that shows aspects of the shoulder anatomy across a range from true monkeys, to great apes, ending up with the ape-iest of us all: the anatomically modern human.  Start with the sacred monkey in the upper-left corner, and the scapula is clearly on the side of the thoracic cage.  End with the human in the bottom-right corner, and it’s clearly on the back.  In between…well, gibbons are (lesser) apes, while chimps and gorillas are great apes, like us.

Relative positions of the scapula in monkeys versus apes: the scapula is on the side in monkeys, on the back in apes.  Note the angle of the clavicle, too: the more dorsal the scapula is, the more perpendicular the clavicle is to the midline. Picture source: https://link.springer.com/chapter/10.1007/978-3-662-45719-1_1

What functional difference goes along with this structural difference? Well: the quadrupeds are really good at locomotion–it’s difficult to think of a quadruped that can’t outrun a human.  Try to catch your dog or your cat for a trip to the vet–good fucking luck, buddy.  But, quadrupeds also tend to have a big limitation: although their front limbs are very good at moving back and forth–see above about moving fast–they suck at anything else.  For example, we can make big circles with our arms; we can spread them.  We don’t have the speed of a quadruped, but we can do a lot of things with our arms that a quadruped can’t do with theirs. Throw spears at edible quadrupeds. Throw tomatoes at sopranos. Throw bums out of office.

There are two broad families of quadrupeds that have their scapula on their back, and they’re pretty fucking interesting. Come back next time for the details.

Position of the scapula in the dog. Picture source: https://janedogs.com/dog-anatomy-terminology/


English notes

There are a lot of expressions that involve the shoulder.  For example, to stand shoulder to shoulder with someone has a literal meaning of standing close to someone (Merriam-Webster), and a figurative one of being united with someone, sharing a goal with them (Merriam-Webster). Example:

  • This resolution was offered in response to President Trump standing shoulder to shoulder with Putin while the Russian President offered the Special Counsel a chance to interview twelve Russian Military Intelligence Officers who’ve been indicted for conducting “large scale cyber operations to interfere with the 2016 presidential election” in exchange for politically-motivated Russian interrogations of U.S. citizens.  President Trump initially endorsed Putin’s cynical ploy as “an incredible offer” and during yesterday’s White House press briefing President Trump’s spokesperson said he was still considering it.  https://www.reed.senate.gov/news/releases/us-senate-votes-98-0-to-tell-trump-not-to-hand-over-american-citizens-to-putin

To give someone the cold shoulder is to intentionally treat them in a way that is cold or unsympathetic (Merriam-Webster); I would add the meaning of intentionally avoiding or ignoring them.  Example:

  • Most Twitter reactions seem to compare the president’s behavior to that of a child—which is pretty much on the money with what we’ve been saying since the start of this tale. Sure, it’s an improvement over the cold shoulder he gave German Chancellor Angela Merkel when she extended her hand for a handshake during her March White House visit. If the president doesn’t pout does he still earn, like, a half-star on the behavior chart? Even still, pushing is not encouraged, young man. https://www.vanityfair.com/style/2017/05/trump-nato-shove

If you are the kind of person who would watch and carefully re-watch Dubstep Cat videos to see just how far back his arms do and don’t go: we should probably be planning our wedding.

https://youtu.be/aQeIDhz-_eg

 

 

These are all true stories: or, Language Generation For Dummies

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.


Over the course of the past year or two, I’ve been writing a book about writing about what I do for a living.  Call it data science, call it natural language processing, call it machine learning–any way you slice it, the structure of the papers that we write in order to spread our little discoveries around is always pretty much the same.

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

For people like me, that always inspires the same question: could I write a REALLY simple computer program to do this for me?

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. By now it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

One of my major obsessions in life is this: how do you START a research paper?  More precisely: how do you start a research paper in a way that isn’t BORING?

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

Well, I do what I do in the medical field specifically, and there we have a saying: if no one dies in the first two sentences, your work is going to be ignored.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. It is now an old technology, to the point that one might think that there are no open research questions left in the field.

I’m a little bit less pessimistic than that.  Inspired by openings like this favorite from a paper by Daniel Gildea and Daniel Jurafsky,

“Recent years have been exhilarating ones for natural language understanding. The excitement and rapid advances that had characterized other language-processing tasks … have finally begun to appear in tasks in which understanding and semantics play a greater role. For example, …”

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…I wonder if it couldn’t work as well to save someone’s life in the first two sentences?

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

So, when I set out to write a simple computer program to generate the openings of research papers on a topic called information retrieval, I went looking for stories where someone landed in a doctor’s office–and came out of it better than one might have expected.

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

Finding those happy endings was the hardest part of this whole little after-dinner project.

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

From there, it was a simple matter of putting together a set of reasonable first sentences:

my @firstSentences = ("In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved.",

    "In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved.",

    "In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted.");

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. By now it is a familiar technology, and one might think that there was nothing more to be learned about it.

…and a set of reasonable second sentences…

my @secondSentences = ("The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",

    "The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.");

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

…and so on.  I use a command called rand to pick a random sentence from the sets of possible first sentences, possible second sentences, and so on…

my $first_sentence   = rand @firstSentences;

my $second_sentence = rand @secondSentences;

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…and then I just glue my randomly selected first, second, third, fourth, and fifth sentences together…

$beginning_of_article = $first_sentence . $second_sentence . $third_sentence . $fourth_sentence . $fifth_sentence;

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. By now it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…et voilà!  With only two options at each position for five different “sentence positions” (first sentence, second sentence, etc.), I have 2 to the 5th power (or 5 to the second power–I can never remember) possible openings that will work for any paper on information retrieval, ever.  That’s more papers on information retrieval than I will write between this evening and the day that I retire or die!
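
For the record, it's 2 to the 5th: when the choices are independent, you multiply the number of options at each position. Here is a minimal sketch of that count in Perl. The name count_openings is made up for this illustration and is not part of the original program, and the list of two options per position is just the simplifying assumption from the paragraph above.

```perl
use strict;
use warnings;

# Hypothetical helper: multiply together the number of options
# available at each sentence position.
sub count_openings {
    my @options_per_position = @_;
    my $total = 1;
    $total *= $_ for @options_per_position;
    return $total;
}

# Five positions with two options each (the assumption in the text).
my $openings = count_openings(2, 2, 2, 2, 2);
print "$openings possible openings\n";    # 2 * 2 * 2 * 2 * 2 = 32
```

With the three first sentences actually listed earlier, and assuming two options at each of the other four positions, the same function gives 3 * 2 * 2 * 2 * 2 = 48.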

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

My VERY simple little program does something called language generation.  That means that it produces output in “natural” (i.e., human) language.  You can do REALLY fancy things with it–Google can now use its super-sophisticated language generation technology to produce entirely bogus news stories, novels, letters, or scientific articles, for that matter.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. It is now an old technology, to the point that one might think that there are no open research questions left in the field.

So, two differences:

  1. Google’s shit is super-complicated, and mine is super-simple.
  2. Google’s shit is completely made up, and mine is completely true.

Am I fucking kidding?  Is Donald Trump beholden to Vladimir Putin???


Technical geekery

  1. Yes, I omitted the detail that rand() returns a random fractional number, not a random sentence; it gets truncated to an integer and used as an index into the array of sentences.
  2. Yes, I omitted the whitespace that you have to put between the sentences.
  3. Yes, I omitted the my in front of the output variable.
  4. Yes, I will submit one of those as the opening of a paper on information retrieval–how the fuck could I not????
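
For anyone who wants to run it, here is a minimal sketch with the first three omissions put back in. Treat it as one way to do it, not as the original program: pick_one and generate_opening are names invented for this illustration, the third set of sentences is reconstructed from the sample paragraphs in this post (an assumption, since that array was never shown), and the fourth and fifth positions are left out because they would work exactly the same way.

```perl
use strict;
use warnings;

# The first two sets are copied from the post.
my @first_sentences = (
    "In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient's life was saved.",
    "In the 1990s, a surgeon was about to amputate a woman's foot. To his surprise, he found that there were other ways to treat her condition, and the woman's foot was saved.",
    "In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted.",
);

my @second_sentences = (
    "The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",
    "The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",
);

# ASSUMPTION: reconstructed from the sample paragraphs, not shown as code in the post.
my @third_sentences = (
    "We know that technology as information retrieval.",
    "Known today as information retrieval, that technology is arguably the \"killer app\" that makes the Internet as we know it today useful in the daily life of much of the world.",
);

# Hypothetical helper: return one randomly chosen element of a list.
sub pick_one {
    my @options = @_;
    # rand(@options) yields a fractional number from 0 up to (but not
    # including) the number of options; int() truncates it to a valid index.
    return $options[ int( rand(@options) ) ];
}

# Hypothetical helper: take one array reference per sentence position,
# pick one sentence from each, and glue them together with single spaces.
sub generate_opening {
    my @positions = @_;
    return join ' ', map { pick_one( @{$_} ) } @positions;
}

my $beginning_of_article =
    generate_opening( \@first_sentences, \@second_sentences, \@third_sentences );
print "$beginning_of_article\n";
```

Passing the sets as a list of array references means that adding a fourth or fifth position is just one more argument to generate_opening.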