What computational linguists actually do all day: the relative frequencies edition

Scroll down past the picture of the mean-looking warthog.

Hi Zipf,

I spent my first hour this morning looking for papers that describe any tools that do any kind of enrichment analysis over terms found in text, but was generally unsuccessful. Searches containing the terms “concept,” “term,” “enrichment analysis,” “text,” and “natural language processing” have mainly pointed me towards GSEA and GSEA-like tools such as Ontologizer that focus on gene sets. Tools that determine what a document is “about” might also be useful.

Do you know of any tools or papers you could point me towards?


Hey there, Zellig,

I may be misunderstanding the question, so let me check. Do you want to know about terms enriched in a document, or in a set of documents? Gimme an idea about what the input looks like, and I think I’ll have an answer.

Hi Zipf,

I think I am interested in looking at each document individually. And I’ll also clarify that the point of the task is not to find concepts, but to determine what effect a concept’s presence or absence in a document has on what it is “about.”


OK, so in that case, the easiest thing to do would be… hm… relative frequency versus a background set of documents, or else tf*idf.  Explaining relative frequencies first:

  • your document has 100 words in total
  • mouse occurs 45 times in your document, or frequency = 45/100
  • the occurs 50 times in your document, or frequency = 50/100
  • warthog (I just learned how to say it in French, so warthogs are on my mind–“le phacochère”, if you were wondering, which sounds like a lot to scream if one of those nasty things charges you) occurs 5 times in your document, or frequency = 5/100.  Scroll down past the picture of the mean-looking warthog.
A male southern warthog. Picture source: By Charlesjsharp – Own work, from Sharp Photography, sharpphotography, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=37065293
  • your background data has 1000 words in total
  • mouse occurs 10 times, so frequency = 10/1000
  • the occurs 500 times, so frequency = 500/1000
  • warthog occurs 490 times, so frequency = 490/1000
Relative frequencies, yours : background:
  • mouse = (45/100) : (10/1000), i.e. 45.0
  • the = (50/100) : (500/1000), i.e. 1.0
  • warthog = (5/100) : (490/1000), i.e. about 0.1
…from which you conclude that your document is about mice, or at least it’s more about mice than the background data set is (’cause the word mouse occurs in your data at a ratio of 45:1 as compared to how often it occurs in the background data set). You conclude that “the” tells you nothing about either data set (the ratio is 1.0, meaning that the frequency of the word is about the same in both), and that “warthog” tells you nothing about your document, but it does tell you something about the background data (because it only occurs in your data at a ratio of once to every 10 times that it occurs in the background data set).
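The arithmetic above can be checked in a few lines of Python. This is only a minimal sketch: the counts are the toy numbers from this example, and the function name `relative_frequencies` is made up for illustration. It does no smoothing, so it simply skips words that never occur in the background data.

```python
from collections import Counter


def relative_frequencies(doc_counts, background_counts):
    """Ratio of each word's frequency in the document to its frequency
    in the background data. Illustrative helper, not a library function."""
    doc_total = sum(doc_counts.values())
    bg_total = sum(background_counts.values())
    ratios = {}
    for word, count in doc_counts.items():
        bg_count = background_counts.get(word, 0)
        if bg_count == 0:
            # Without smoothing the ratio is undefined, so skip the word.
            continue
        ratios[word] = (count / doc_total) / (bg_count / bg_total)
    return ratios


# Toy counts from the example: 100 words in the document, 1000 in the background.
doc = Counter({"mouse": 45, "the": 50, "warthog": 5})
background = Counter({"mouse": 10, "the": 500, "warthog": 490})

ratios = relative_frequencies(doc, background)

# Print the words from most to least characteristic of the document.
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {ratio:.2f}")
```

Sorting by the ratio puts mouse at the top, the in the middle at 1.0, and warthog at the bottom, which is the same ranking as the hand calculation.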
The other easy approach: term frequency (count of occurrences of a word in a document), weighted by inverse document frequency (in its simplest form, 1 over the number of documents in which the word occurs; in practice it is usually the log of the total number of documents divided by that number). This is known as tf*idf (term frequency * inverse document frequency).
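Here is a rough sketch of tf*idf, for comparison. By default it uses the 1-over-document-count weighting described above. The `use_log` switch is an addition of mine that swaps in the log-scaled weighting you will more often see in practice. The three tiny "documents" and the function name `tf_idf` are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents, use_log=False):
    """documents is a list of token lists; returns one {word: score} dict
    per document. Illustrative helper, not a library function."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    scores = []
    for tokens in documents:
        tf = Counter(tokens)  # term frequency: raw counts in this document
        doc_scores = {}
        for word, count in tf.items():
            if use_log:
                idf = math.log(n_docs / df[word])  # common log-scaled variant
            else:
                idf = 1 / df[word]  # the simple form described in the text
            doc_scores[word] = count * idf
        scores.append(doc_scores)
    return scores


# Made-up toy documents.
docs = [
    "the mouse saw the mouse".split(),
    "the warthog charged".split(),
    "the cat slept".split(),
]

scores = tf_idf(docs)
log_scores = tf_idf(docs, use_log=True)
print(scores[0])
print(log_scores[0])
```

In the first document, "the" and "mouse" both occur twice, but "the" is in all three documents and "mouse" is in only one, so "mouse" ends up with the higher score. With the log variant, a word that occurs in every document scores zero.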
Back to relative frequencies: that analysis is due to the late Adam Kilgarriff.  (I’m proud to say that we wrote a paper together before his untimely death, and lemme tell you: he really participated!)  Here’s a link to his paper about it.  He gives details about smoothing and the like that you’ll want to know about if you pursue this approach.  I’ll say that people are more familiar with the tf*idf approach, but personally, I think that relative frequency is a lot more intuitively comprehensible.
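To give an idea of why smoothing matters: with plain ratios, a word that never occurs in the background data means dividing by zero. One simple way around that, sketched below, is to add a small constant to both frequencies before dividing. This is my own illustrative version, and I'm not claiming it matches the paper in detail. The function name `smoothed_ratio`, the constant `k`, and the per-million scaling are all assumptions.

```python
def smoothed_ratio(doc_count, doc_total, bg_count, bg_total, k=1.0, per=1_000_000):
    """Relative frequency ratio with a constant k added to both sides.

    Frequencies are first scaled to occurrences per `per` words, so that k
    means the same thing whatever the sizes of the two data sets.
    Illustrative helper; k and per are assumptions, not values from the paper.
    """
    doc_rate = doc_count / doc_total * per
    bg_rate = bg_count / bg_total * per
    return (doc_rate + k) / (bg_rate + k)


# A word that is in your document but absent from the background data:
# the unsmoothed ratio would divide by zero, the smoothed one is just large.
unseen = smoothed_ratio(5, 100, 0, 1000)

# The mouse example from above barely changes.
mouse = smoothed_ratio(45, 100, 10, 1000)

print(unseen, mouse)
```

A bigger `k` pulls every ratio towards 1.0, and it pulls the rare words harder than the common ones, so it gives you a knob for how much you trust small counts.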

3 thoughts on “What computational linguists actually do all day: the relative frequencies edition”

  1. I can’t get beyond the word for Warthog … I had no idea. I add it to my list of tenuous words that one day I may be able to use in everyday conversation ‘je viens de rencontrer un phacochère et il été vraiment très poli …’ This stands with ‘tatou’ in importance in my really useful vocabulary!
