These are all true stories: or, Language Generation For Dummies

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.


Over the course of the past year or two, I’ve been writing a book about writing about what I do for a living.  Call it data science, call it natural language processing, call it machine learning–any way you slice it, the structure of the papers that we write in order to spread our little discoveries around is always pretty much the same.

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

For people like me, that always inspires the same question: could I write a REALLY simple computer program to do this for me?

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. By now it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

One of my major obsessions in life is this: how do you START a research paper?  More precisely: how do you start a research paper in a way that isn’t BORING?

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

Well, I do what I do in the medical field, specifically, and in the medical field, we have a saying: if no one dies in the first two sentences, your work is going to be ignored.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. It is now an old technology, to the point that one might think that there are no open research questions left in the field.

I’m a little bit less pessimistic than that.  Inspired by openings like this favorite from a paper by Daniel Gildea and Daniel Jurafsky,

Recent years have been exhilarating ones for natural language understanding. The excitement and rapid advances that had characterized other language-processing tasks … have finally begun to appear in tasks in which understanding and semantics play a greater role. For example, …

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…I wonder if it couldn’t work as well to save someone’s life in the first two sentences?

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

So, when I set out to write a simple computer program to generate the openings of research papers on a topic called information retrieval, I went looking for stories where someone landed in a doctor’s office–and came out of it better than one might have expected.

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

Finding those happy endings was the hardest part of this whole little after-dinner project.

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

From there, it was a simple matter of putting together a set of reasonable first sentences:

my @firstSentences = (
    "In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved.",
    "In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved.",
    "In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted.");

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. By now it is a familiar technology, and one might think that there was nothing more to be learned about it.

…and a set of reasonable second sentences…

my @secondSentences = (
    "The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",
    "The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.");

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

…and so on.  I use a command called rand to pick a random sentence from the sets of possible first sentences, possible second sentences, and so on…

my $first_sentence   = rand @firstSentences;

my $second_sentence = rand @secondSentences;

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…and then I just glue my randomly selected first, second, third, fourth, and fifth sentences together…

$beginning_of_article = $first_sentence . $second_sentence . $third_sentence . $fourth_sentence . $fifth_sentence;

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. By now it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…et voilà!  With only two options at each position for five different “sentence positions” (first sentence, second sentence, etc.), I have 2 to the 5th power (or 5 to the second power–I can never remember) possible openings that will work for any paper on information retrieval, ever.  That’s more papers on information retrieval than I will write between this evening and the day that I retire or die!
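For anyone else who can never remember: the number of possible openings is the product of the number of options at each position, so five positions with two options each gives 2 to the 5th power. A minimal sketch of the arithmetic, where count_openings is a hypothetical helper name of my own and not part of the original program:

```perl
use strict;
use warnings;

# Hypothetical helper: multiply together the number of options
# available at each sentence position.
sub count_openings {
    my @options_per_position = @_;
    my $count = 1;
    $count *= $_ for @options_per_position;
    return $count;
}

# Five positions, two options each: 2 * 2 * 2 * 2 * 2
print count_openings(2, 2, 2, 2, 2), "\n";   # prints 32

# Three first sentences (as in @firstSentences above), two options elsewhere
print count_openings(3, 2, 2, 2, 2), "\n";   # prints 48
```

Either way, it is comfortably more than 5 to the second power, which is only 25.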

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

My VERY simple little program does something called language generation.  That means that it produces output in “natural” (i.e., human) language.  You can do REALLY fancy things with it: Google can now use its super-sophisticated language generation technology to produce entirely bogus news stories, novels, letters, or scientific articles, for that matter.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. It is now an old technology, to the point that one might think that there are no open research questions left in the field.

So, two differences:

  1. Google’s shit is super-complicated, and mine is super-simple.
  2. Google’s shit is completely made up, and mine is completely true.

Am I fucking kidding?  Is Donald Trump beholden to Vladimir Putin???


Technical geekery

  1. Yes, I omitted the detail that rand() returns a random fractional number, which you then truncate to an integer and use as an index into the array of sentences; it does not return a random sentence.
  2. Yes, I omitted the whitespace that you have to put between the sentences.
  3. Yes, I omitted the my in front of the output variable.
  4. Yes, I will submit one of those as the opening of a paper on information retrieval–how the fuck could I not????
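For the curious, here is a rough sketch of what the program looks like with those omissions put back in. Treat it as one way to do it under stated assumptions, not as the exact script: the first two sentence pools are copied from the post, the other three are my reconstruction from the sample paragraphs, and pick_random and generate_opening are hypothetical helper names.

```perl
use strict;
use warnings;

# Pools one and two are copied from the post. Pools three to five are
# reconstructed from the sample paragraphs (an assumption, not the original arrays).
my @first_sentences = (
    "In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient's life was saved.",
    "In the 1990s, a surgeon was about to amputate a woman's foot. To his surprise, he found that there were other ways to treat her condition, and the woman's foot was saved.",
    "In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted.",
);
my @second_sentences = (
    "The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",
    "The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",
);
my @third_sentences = (
    "Known today as information retrieval, that technology is arguably the 'killer app' that makes the Internet as we know it today useful in the daily life of much of the world.",
    "We know that technology as information retrieval.",
);
my @fourth_sentences = (
    "Today, it is a familiar technology,",
    "By now it is a familiar technology,",
);
my @fifth_sentences = (
    "to the point that one might think that there are no open research questions left in the field.",
    "and one could be forgiven for assuming that there is nothing left to be learned about it.",
);

# Hypothetical helper: rand(N) returns a fractional number in [0, N),
# so int() turns it into a valid array index.
sub pick_random {
    my @options = @_;
    return $options[ int( rand(@options) ) ];
}

# Hypothetical helper: pick one option per position and join them
# with the whitespace that the original snippet left out.
sub generate_opening {
    return join ' ',
        pick_random(@first_sentences),
        pick_random(@second_sentences),
        pick_random(@third_sentences),
        pick_random(@fourth_sentences),
        pick_random(@fifth_sentences);
}

my $beginning_of_article = generate_opening();
print "$beginning_of_article\n";
```

Run it a few times and you get a different opening on most runs, which is the whole point.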
