Pubmed.mineR: text mining from the biomedical literature with the R programming language

Update, 1 March 2016: pubmed.mineR has recently been updated a couple times.  Since the most recent update (1.0.5), the old API works again, so the code on this page will work.  However, I have not been able to reproduce the (wonderful) results that I had before the recent updates to pubmed.mineR.  Use with caution.

Something a bit different today: a little manual for using a package for the R programming language for text mining.

Pubmed.mineR is a “library” for doing text mining from the PubMed/MEDLINE collection of documents.  PubMed/MEDLINE contains references to about 23 million articles in the domain of biomedical science, broadly construed.  The package was released with documentation for the individual functions that it provides, but no manual.  This blog post is an attempt to put together a basic manual for using it, with some code examples.  Pubmed.mineR was written by Jyoti Rani, Ab Rauf Shah, and Srinivasan Ramachandran.  You can find an article about it here, and some documentation here.

First, you need to have the input data in the right format.  Here’s a screenshot from one of the authors, Ramachandran, showing how to do a manual query and then save the results in the proper format:

Downloading PubMed/MEDLINE abstracts in a format that Pubmed.mineR can deal with. Photo source: Ramachandran.

Here is some sample R code for the library:

# import the library
library(pubmed.mineR)

# read in the abstracts
abstracts <- readabs("pubmed_result.txt")

abstracts is an S4 object.  S4 is one of R’s class systems for object-oriented programming.  You can print the object with print(abstracts).  An S4 object stores its data in slots, which you read with the @ operator.  To understand slots in R, try this web page: http://stackoverflow.com/questions/4713968/r-what-are-slots.

An object of this class has the following slots:

  • Journal: A vector of the names of the journals for each publication in the whole collection.
  • Abstract: A vector of the abstracts for the whole collection.
  • PMID: A vector of the PMIDs for the whole collection.

(It’s worth noting that the elements of some of these vectors have some oddities.  For example, when you get the vector of titles, you’ll notice that each one is prefaced with the number of the element of the vector.  I suggest looking at these outputs closely, as I’m sure that I haven’t picked up on all of these oddities.)

So, this line of code will get you a vector of the PMIDs (some columns trimmed from the output for readability):

[Screenshot: R console showing the vector of PMIDs]
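The slot syntax can be tried without any downloaded data. The sketch below uses a stand-in class (ToyAbstracts is a made-up name, the slot types are guesses, and the values are invented) that mimics the three slots listed above:

```r
library(methods)

# Stand-in class, for illustration only: it mimics the three slots listed
# above. The slot types are assumptions, not taken from the package.
setClass("ToyAbstracts",
         representation(Journal  = "character",
                        Abstract = "character",
                        PMID     = "numeric"))

# Invented values, just to have something to look at.
toy <- new("ToyAbstracts",
           Journal  = c("Journal A", "Journal B"),
           Abstract = c("First abstract text.", "Second abstract text."),
           PMID     = c(11111111, 22222222))

slotNames(toy)        # names of the slots
toy@PMID              # the @ operator reads a slot directly
slot(toy, "Journal")  # slot() does the same, with the name as a string
```

With the real object created by readabs(), the equivalent call would presumably be abstracts@PMID.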

Once we’ve got a PMID for an abstract, one thing that we can do with it is send it to PubTator.  That gives us access to lists of the genes, mutations, diseases, chemicals, and species that are mentioned in the abstract.  (Some columns of output omitted for readability.)

[Screenshot: R console showing the PubTator call and its output]

These lines of code will get you access to the rest of the stuff in the PubTator results:

pubtator_output$Genes
pubtator_output$Mutations
pubtator_output$Diseases
pubtator_output$Chemicals
pubtator_output$Species
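For reference, here is a minimal sketch of the step that produces pubtator_output. It assumes that pubtator_function() takes a single PMID and returns a list with the fields shown above, and that a lookup with nothing to report comes back as something other than a list. get_annotations and its fetch argument are hypothetical names introduced only for this sketch:

```r
# Hypothetical helper: fetch PubTator annotations for one PMID and keep
# only the annotation fields. `fetch` defaults to pubmed.mineR's
# pubtator_function(); it is an argument only so that the helper can be
# tried offline with a stand-in function.
get_annotations <- function(pmid, fetch = pubmed.mineR::pubtator_function) {
  result <- fetch(pmid)
  # Assumption: when there is nothing to report, the result is not a list
  # (for example a "No data" string), so treat any non-list as empty.
  if (!is.list(result)) {
    return(list())
  }
  fields <- c("Genes", "Mutations", "Diseases", "Chemicals", "Species")
  result[intersect(fields, names(result))]
}

# Usage, assuming `abstracts` was created with readabs() as above and
# that the machine has network access:
# pubtator_output <- get_annotations(abstracts@PMID[1])
# pubtator_output$Genes
```

Returning an empty list in the no-data case means that pubtator_output$Genes is simply NULL instead of raising an error, which makes it easier to loop over many PMIDs.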

It’s pretty common to want to iterate over all of the sentences in an abstract.  You can do that by getting a vector of the sentences with the SentenceToken() function.  It has to be passed a single character string, so you’ll want to pass it one element of the vector of abstract bodies that you get from abstracts@Abstract:

[Screenshot: R console showing the output of SentenceToken()]

A question that immediately arises is whether you can pass individual sentences to the PubTator function. I haven’t had good luck with that: it always seems to return “No data.” So, I would try running pubtator_function() on the PMID for the whole abstract, and then search the individual sentences for the things that pubtator_function() returns, using a regular expression or a plain substring match.
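Here is a rough sketch of that second step, using a plain case-insensitive substring match from base R. find_mentions is a hypothetical helper name, and the sentences and terms below are made up for illustration; in practice the sentences would come from SentenceToken() and the terms from the PubTator output:

```r
# Hypothetical helper: for each term, report which sentences mention it.
# `sentences` is a character vector (for example the result of
# SentenceToken() on one abstract); `terms` is a character vector of
# names (for example pubtator_output$Genes).
find_mentions <- function(sentences, terms) {
  terms <- unique(terms[nzchar(terms)])
  hits <- lapply(terms, function(term) {
    # fixed = TRUE treats the term as a literal string, so characters
    # such as "(" or "." in a name are not read as regex syntax.
    which(grepl(tolower(term), tolower(sentences), fixed = TRUE))
  })
  names(hits) <- terms
  hits
}

# Made-up sentences and terms, for illustration only.
sentences <- c("BRCA1 mutations raise the risk of breast cancer.",
               "Tamoxifen is one treatment option.",
               "Carriers of brca1 variants were followed for ten years.")
find_mentions(sentences, c("BRCA1", "tamoxifen"))
```

The result is a named list with one entry per term, holding the positions of the sentences that contain it. A plain substring match will also fire when a short name happens to occur inside a longer word, so very short gene symbols probably need a stricter pattern with word boundaries.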

This doesn’t exhaust everything that you can do with pubmed.mineR, but it should be enough to get you started. Good luck, and if you figure out how to do something cool with it that I haven’t talked about, please tell us in the comments!

12 thoughts on “Pubmed.mineR: text mining from the biomedical literature with the R programming language”

  1. I have downloaded the pubmed_result.txt file into R using the File/open file option but when I use searchabs, to search for objects in file, it does not recognize the file. Any recommendations?

    1. Go to the directory where you have downloaded the “pubmed_result.txt” file, then use the following command to bring in the file in the R environment: abstracts <- readabs("pubmed_result.txt")

  2. It would be better to work with the XML file that PubMed offers instead of the abstract text file.
    xmlreadabs() is the corresponding function to read the XML file.
    This should reduce inconsistencies mentioned ….
