These are all true stories: or, Language Generation For Dummies

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.


Over the course of the past year or two, I’ve been writing a book about writing about what I do for a living.  Call it data science, call it natural language processing, call it machine learning–any way you slice it, the structure of the papers that we write in order to spread our little discoveries around is always pretty much the same.

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

For people like me, that always inspires the same question: could I write a REALLY simple computer program to do this for me?

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. By now it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

One of my major obsessions in life is this: how do you START a research paper?  More precisely: how do you start a research paper in a way that isn’t BORING?

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

Well, I do what I do in the medical field, specifically, and in the medical field, we have a saying: if no one dies in the first two sentences, your work is going to be ignored.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. It is now an old technology, to the point that one might think that there are no open research questions left in the field.

I’m a little bit less pessimistic than that.  Inspired by openings like this favorite from a paper by Daniel Gildea and Daniel Jurafsky,

Recent years have been exhilarating ones for natural language understanding. The excitement and rapid advances that had characterized other language-processing tasks … have finally begun to appear in tasks in which understanding and semantics play a greater role. For example, …

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…I wonder if it couldn’t work as well to save someone’s life in the first two sentences?

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

So, when I set out to write a simple computer program to generate the openings of research papers on a topic called information retrieval, I went looking for stories where someone landed in a doctor’s office–and came out of it better than one might have expected.

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

Finding those happy endings was the hardest part of this whole little after-dinner project.

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

From there, it was a simple matter of putting together a set of reasonable first sentences:

my @firstSentences = ("In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved.",
                      "In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved.",
                      "In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted.");

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. By now it is a familiar technology, and one might think that there was nothing more to be learned about it.

…and a set of reasonable second sentences…

my @secondSentences = ("The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",
                       "The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.");

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

…and so on.  I use a command called rand to pick a random sentence from the sets of possible first sentences, possible second sentences, and so on…

my $first_sentence   = rand @firstSentences;

my $second_sentence = rand @secondSentences;

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…and then I just glue my randomly selected first, second, third, fourth, and fifth sentences together…

$beginning_of_article = $first_sentence . $second_sentence . $third_sentence . $fourth_sentence . $fifth_sentence;

In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved. The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. We know that technology as information retrieval. By now it is a familiar technology, to the point that one might think that there are no open research questions left in the field.

…et voilà!  With only two options at each position for five different “sentence positions” (first sentence, second sentence, etc.), I have 2 to the 5th power (or 5 to the second power–I can never remember) possible openings that will work for any paper on information retrieval, ever.  That’s more papers on information retrieval than I will write between this evening and the day that I retire or die!
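
For the record, the count is the product of the number of options at each position, so five positions with two options each gives 2 to the 5th power. Here is a minimal sketch of that arithmetic. The option counts are an assumption taken from the estimate above (the list of first sentences shown earlier actually has three entries, and the lists for positions three through five are not shown), and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical option counts, one per sentence position: two options at
# each of five positions, as the estimate in the post assumes.
my @options_per_position = (2, 2, 2, 2, 2);

# The number of distinct openings is the product of the per-position counts.
my $total_openings = 1;
$total_openings *= $_ for @options_per_position;

print "Possible openings: $total_openings\n";   # prints "Possible openings: 32"
```

Swapping the first count for 3 would give 48 instead, since each extra option multiplies the total.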

In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. Today, it is a familiar technology, and one could be forgiven for assuming that there is nothing left to be learned about it.

My VERY simple little program does something called language generation.  That means that it produces output in “natural” (i.e., human) language.  You can do REALLY fancy things with it–Google can now use its super-sophisticated language generation technology to produce entirely bogus news stories, novels, letters–or scientific articles, for that matter.

In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved. The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer. Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world. It is now an old technology, to the point that one might think that there are no open research questions left in the field.

So, two differences:

  1. Google’s shit is super-complicated, and mine is super-simple.
  2. Google’s shit is completely made up, and mine is completely true.

Am I fucking kidding?  Is Donald Trump beholden to Vladimir Putin???


Technical geekery

  1. Yes, I omitted the detail that rand() returns a random fractional number, which you then truncate to an integer and use as an index into the array of sentences, not a random sentence.
  2. Yes, I omitted the whitespace that you have to put between the sentences.
  3. Yes, I omitted the my in front of the output variable.
  4. Yes, I will submit one of those as the opening of a paper on information retrieval–how the fuck could I not????
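
For anyone who wants to actually run it, here is a minimal sketch with the three omissions above filled in. It is not the original program: the pick helper and the variable names are made up for illustration, and the third, fourth, and fifth lists are reconstructed from the sample openings in this post rather than copied from the real thing.

```perl
use strict;
use warnings;
use utf8;                                # the sentences contain curly quotes
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: return one randomly chosen element of a list.
# rand(N) yields a fractional number in [0, N); int() truncates it to an index.
sub pick {
    my @choices = @_;
    return $choices[ int(rand(@choices)) ];
}

my @first_sentences = (
    "In the 1990s, a physician examining an odd wart realized that it was, in fact, cancer. Surgery was done, and the patient’s life was saved.",
    "In the 1990s, a surgeon was about to amputate a woman’s foot. To his surprise, he found that there were other ways to treat her condition, and the woman’s foot was saved.",
    "In the 1990s, an emergency room physician noticed the outbreak of what became an epidemic of venereal disease in his American city. He was able to find a new treatment for the disorder, a bacterial infection known as chancroid, and the spread of the painful genital sores was halted.",
);

my @second_sentences = (
    "The thing that made the difference was a new technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",
    "The new information came from a novel technology that allowed the health care provider to search all of the articles in the National Library of Medicine via a computer.",
);

# Assumption: the three lists below are reconstructed from the sample output.
my @third_sentences = (
    "Known today as information retrieval, that technology is arguably the “killer app” that makes the Internet as we know it today useful in the daily life of much of the world.",
    "We know that technology as information retrieval.",
);

my @fourth_parts = (
    "Today, it is a familiar technology,",
    "By now it is a familiar technology,",
    "It is now an old technology,",
);

my @fifth_parts = (
    "to the point that one might think that there are no open research questions left in the field.",
    "and one could be forgiven for assuming that there is nothing left to be learned about it.",
);

# Glue the pieces together with the whitespace that the post left out.
my $beginning_of_article = join ' ',
    pick(@first_sentences), pick(@second_sentences), pick(@third_sentences),
    pick(@fourth_parts), pick(@fifth_parts);

print "$beginning_of_article\n";
```

Using join with a single space takes care of the whitespace between sentences in one place, and declaring the output variable with my keeps use strict happy.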

Academic writing and how not to start a paper: Episode 2

Nobody gives a shit about medical terminology. Rethink your opening sentence.

This post is part of an occasional series on writing about academic research. Writing about writing about academic research on my blog allows me to avoid writing my book about writing about academic research. Techniques that other authors have used to avoid actually working on what they’re supposed to be working on include doing the laundry, abusing alcohol, and committing suicide. Personally, I think that writing about academic research on my blog is more adaptive than those techniques.  Well, obviously it’s not more adaptive than doing the laundry…

Introduction

Medical terminology is one of the best-studied aspects of the English language. This is important because… But, the complicated structure of its words makes it difficult to translate into other languages.  This is a problem because… To address this issue of decomposition, this paper describes a simple parser for biomedical terminology. This would allow…

To: Zipf

Subject: Comments on draft 

Zipf, nobody gives a shit about medical terminology. Rethink your opening sentence.

Introduction

When the first author’s grandmother had to visit her doctor–an increasingly common occurrence as she grew older–she understood more or less nothing that she was told.  But, she would take careful notes–and then call the first author to find out what it all meant. 

What was the problem here?  The first author’s grandmother was not an educated woman, but she was no dummy–she was a quick-witted and articulate woman who loved jokes of considerable linguistic sophistication. The issue here was not the first author’s grandmother.  Rather, it was the language used by her doctor–specifically, the highly specialized vocabulary of medicine. The first author was a medic in the military, and subsequently was awarded a doctoral degree in linguistics, writing a dissertation on biomedical language. He has no problem understanding medical terminology. But, for a normal person, the language in which their physician communicates with them can be every bit as much of an obstacle to their treatment as the rationing of care that characterizes the American health care system.

Medical terminology is one of the best-studied aspects of the English language. This is important because… But, the complicated structure of its words makes it difficult to translate into other languages.  This is a problem because… To address this issue of decomposition, this paper describes a simple parser for biomedical terminology. This would allow…

To: Zipf

Subject: Comments on draft

I wish I’d known your grandmother.

The oldest paper on medical terminology in PubMed/MEDLINE, the National Library of Medicine’s database of 27 million scientific publications. The piece was published in 1911–well over a hundred years ago as I write this–and the problems that it raises still have not been solved.  No author is listed.

For more Zipfian ravings on the topic of writing about academic research, see here. Or, buy my book.  Oh, wait–I’m writing this blog instead of working on the book… Damn it…