What computational linguists actually do all day: the relative frequencies edition

Hi Zipf,

I spent my first hour this morning looking for papers that describe tools that do any kind of enrichment analysis over terms found in text, but was generally unsuccessful. Searches containing the terms “concept,” “term,” “enrichment analysis,” “text,” and “natural language processing” have mainly pointed me towards GSEA and GSEA-like tools such as Ontologizer that focus on gene sets. Tools that determine what a document is “about” might also be useful.

Do you know of any tools or papers you could point me towards?

Zellig

Hey there, Zellig,

I may be misunderstanding the question, so let me check. Do you want to know about terms enriched in a document, or in a set of documents? Gimme an idea about what the input looks like, and I think I’ll have an answer.

Zipf

Hi Zipf,

I think I am interested in looking at each document individually. And I’ll also clarify that the point of the task is not to find concepts, but to determine what effect a concept’s presence or absence in a document has on what it is “about.”

Zellig

OK, so in that case, the easiest thing to do would be… hm… relative frequency versus a background set of documents, or else tf*idf. Explaining relative frequencies first:

your document has 100 words in total
mouse occurs 45 times in your document, or frequency = 45/100
the occurs 50 times in your document, or frequency = 50/100
warthog (I just learned how to say it in French, so warthogs are on my mind–“le phacochère”, if you were wondering, which sounds like a lot to scream if one of those nasty things charges you) occurs 5 times in your document, or frequency = 5/100. Scroll down past the picture of the mean-looking warthog.

[Picture: a male southern warthog (Phacochoerus africanus sundevallii). Picture source: By Charlesjsharp, own work, from Sharp Photography, sharpphotography, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=37065293]

your background data has 1000 words in total
mouse occurs 10 times, so frequency = 10/1000
the occurs 500 times, so frequency = 500/1000
warthog occurs 490 times, so frequency = 490/1000

relative frequencies, yours : background:

mouse = (45/100) : (10/1000), or 45.0
the = (50/100) : (500/1000), or 1.0
warthog = (5/100) : (490/1000), or about 0.1

…from which you conclude that your corpus is about mice, or at least it’s more about mice than the background data set is (’cause the word mouse occurs in your data at a ratio of 45:1 as compared to how often it occurs in the background data set). You conclude that “the” tells you nothing about either corpus (the ratio is 1.0, meaning that the frequency of the word is about the same in both data sets), and that “warthog” tells you nothing about your corpus, but it does tell you something about the background data (because it only occurs in your data at a ratio of once to every 10 times that it occurs in the background data set).
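If you'd rather see that arithmetic as code, here is a minimal Python sketch using the counts from the example above. The function name `relative_frequencies` is made up for illustration, and skipping words that never occur in the background data is a simplifying assumption of this sketch, not part of any published method.

```python
from collections import Counter


def relative_frequencies(doc_counts, background_counts):
    """Return {word: document frequency / background frequency}."""
    doc_total = sum(doc_counts.values())
    bg_total = sum(background_counts.values())
    ratios = {}
    for word, count in doc_counts.items():
        bg_count = background_counts.get(word, 0)
        if bg_count == 0:
            # The ratio is undefined without smoothing; this sketch skips the word.
            continue
        ratios[word] = (count / doc_total) / (bg_count / bg_total)
    return ratios


# Counts taken from the worked example: 100 words in the document,
# 1000 words in the background data.
doc = Counter({"mouse": 45, "the": 50, "warthog": 5})
background = Counter({"mouse": 10, "the": 500, "warthog": 490})

ratios = relative_frequencies(doc, background)
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}\t{ratio:.1f}")
```

Sorting by ratio puts the words that characterize your document at the top and the words that characterize the background data at the bottom.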

The other easy approach: term frequency (count of occurrences of a word in a document), weighted by inverse document frequency (in its simplest form, 1 over the number of documents in which the word occurs; in practice it is usually computed as the log of the total number of documents divided by that number). This is known as tf*idf (term frequency * inverse document frequency).
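A rough sketch of tf*idf in Python, using the plain "1 over the number of documents" form of idf described above. The three toy documents and the function name `tf_idf` are invented for illustration, and the whitespace tokenization is an assumption; real text needs a proper tokenizer.

```python
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {word: score} dict per document."""
    # Document frequency: the number of documents each word appears in.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    scores = []
    for tokens in documents:
        tf = Counter(tokens)  # term frequency: raw count within this document
        scores.append({word: count * (1 / df[word]) for word, count in tf.items()})
    return scores


# Hypothetical toy collection, one token list per document.
docs = [
    "the mouse ate the cheese".split(),
    "the warthog charged".split(),
    "the mouse ran".split(),
]

scores = tf_idf(docs)
for word, score in sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}\t{score:.2f}")
```

Unlike relative frequency, this needs the background data split into separate documents, because idf counts documents rather than words.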

Back to relative frequencies: that analysis is due to the late Adam Kilgarriff. (I’m proud to say that we wrote a paper together before his untimely death, and lemme tell you: he really participated!) Here’s a link to his paper about it. He gives details about smoothing and the like that you’ll want to know about if you pursue this approach. I’ll say that people are more familiar with the tf*idf approach, but personally, I think that relative frequency is a lot more intuitively comprehensible.
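To give a feel for why smoothing matters: a word that never occurs in the background data makes the plain ratio divide by zero. One simple fix is to scale both frequencies to a common rate and add a small constant to each before dividing. The sketch below is only an illustration of that general idea; the function name, the constant `k`, and the per-1000-words scaling are my assumptions, so check the paper for the details he actually recommends.

```python
def smoothed_ratio(doc_count, doc_total, bg_count, bg_total, k=1.0, per=1000):
    """Ratio of occurrences per `per` words, with the constant k added to both sides."""
    doc_rate = doc_count * per / doc_total
    bg_rate = bg_count * per / bg_total
    return (doc_rate + k) / (bg_rate + k)


# "mouse" from the worked example: 45 of 100 words vs. 10 of 1000 words.
mouse = smoothed_ratio(45, 100, 10, 1000)

# A hypothetical word that occurs twice in your document and never in the
# background data: the unsmoothed ratio would be a division by zero.
unseen = smoothed_ratio(2, 100, 0, 1000)

print(f"mouse\t{mouse:.1f}")
print(f"unseen\t{unseen:.1f}")
```

A larger `k` damps the scores of rare words, so the choice of constant changes which words float to the top.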

Zipf

3 thoughts on “What computational linguists actually do all day: the relative frequencies edition”

I can’t get beyond the word for Warthog … I had no idea. I add it to my list of tenuous words that one day I may be able to use in everyday conversation ‘je viens de rencontrer un phacochère et il été vraiment très poli …’ This stands with ‘tatou’ in importance in my really useful vocabulary!


zipfslaw1 says:

March 28, 2018 at 8:47 am

Ah, le tatou ! You must check out “Tex’s French Grammar,” if you haven’t already… The nexus of all tatou-related French-language activity in the US… https://www.laits.utexas.edu/tex/

Osyth says:

March 28, 2018 at 9:45 am

Thanks for the tip – it’s just the distraction I need as I pack up my life for the big move which I don’t want to make!
  
