I spent my first hour this morning looking for papers that describe tools that do any kind of enrichment analysis over terms found in text, but was generally unsuccessful. Searches containing the terms “concept,” “term,” “enrichment analysis,” “text,” and “natural language processing” have mainly pointed me towards GSEA and GSEA-like tools like Ontologizer that focus on gene sets. Tools that determine what a document is “about” might also be useful.
Do you know of any tools or papers you could point me towards?
Hey there, Zellig,
I may be misunderstanding the question, so let me clarify. Do you want to know about terms enriched in a document, or in a set of documents? Gimme an idea about what the input looks like, and I think I’ll have an answer.
I think I am interested in looking at each document individually. And I’ll also clarify that the point of the task is not to find concepts, but to determine what effect a concept’s presence or absence in a document has on what it is “about.”
OK, so in that case, the easiest thing to do would be… hm… relative frequency versus a background set of documents, or else tf*idf. Explaining relative frequencies first:
- your document has 100 words in total
- mouse occurs 45 times in your document, or frequency = 45/100
- the occurs 50 times in your document, or frequency = 50/100
- warthog (I just learned how to say it in French, so warthogs are on my mind–“le phacochère”, if you were wondering, which sounds like a lot to scream if one of those nasty things charges you) occurs 5 times in your document, or frequency = 5/100. Scroll down past the picture of the mean-looking warthog.
- your background data has 1000 words in total
- mouse occurs 10 times, so frequency = 10/1000
- the occurs 500 times, so frequency = 500/1000
- warthog occurs 490 times, so frequency = 490/1000
relative frequencies, yours : background:
- mouse = (45/100) : (10/1000), i.e. 45.0
- the = (50/100) : (500/1000), i.e. 1.0
- warthog = (5/100) : (490/1000), i.e. 0.1
…from which you conclude that your corpus is about mice, or at least it’s more about mice than the background data set is (’cause the word mouse occurs in your data at a ratio of 45:1 as compared to how often it occurs in the background data set). You conclude that “the” tells you nothing about either corpus (the ratio is 1.0, meaning that the frequency of the word is about the same in both data sets), and that “warthog” tells you nothing about your corpus, but it does tell you something about the background data (because it only occurs in your data at a ratio of once to every 10 times that it occurs in the background data set).
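If it helps, the recipe above can be sketched in a few lines of Python. (`relative_frequencies` is just an illustrative name; smoothing is left out here, so words absent from the background are simply skipped.)

```python
from collections import Counter

def relative_frequencies(doc_tokens, background_tokens):
    """Ratio of each word's frequency in the document to its
    frequency in the background corpus."""
    doc_counts = Counter(doc_tokens)
    bg_counts = Counter(background_tokens)
    doc_total = len(doc_tokens)
    bg_total = len(background_tokens)
    ratios = {}
    for word, count in doc_counts.items():
        bg_freq = bg_counts.get(word, 0) / bg_total
        if bg_freq == 0:
            # in practice you'd smooth instead of skipping --
            # Kilgarriff's paper covers this
            continue
        ratios[word] = (count / doc_total) / bg_freq
    return ratios
```

Run on the toy counts above, it gives mouse = 45.0, the = 1.0, and warthog ≈ 0.1, matching the hand calculation.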
The other easy approach: term frequency (count of occurrences of a word in a document), normalized by inverse document frequency (1 over the number of documents in which the word occurs). This is known as tf*idf (term frequency * inverse document frequency).
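A sketch of tf*idf in the same spirit, using the plain 1/df form just described (real implementations usually log-scale the idf; `tf_idf` is an illustrative name):

```python
from collections import Counter

def tf_idf(documents):
    """Score each word in each document as term frequency times
    inverse document frequency (plain 1/df, no log scaling)."""
    # document frequency: in how many documents does each word occur?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({word: count / df[word] for word, count in tf.items()})
    return scores
```

A word that appears often in one document but in few documents overall gets a high score, which is why tf*idf is the more familiar go-to for “aboutness.”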
Back to relative frequencies: that analysis is due to the late Adam Kilgarriff. (I’m proud to say that we wrote a paper together before his untimely death, and lemme tell you: he really participated!) Here’s a link to his paper about it. He gives details about smoothing and the like that you’ll want to know about if you pursue this approach. I’ll say that people are more familiar with the tf*idf approach, but personally, I think that relative frequency is a lot more intuitively comprehensible.