Every once in a while an innocuous technical term suddenly enters public discourse with a bizarrely negative connotation. I first noticed the phenomenon some years ago, when I saw a Republican politician accusing Hillary Clinton of “parsing.” From the disgust with which he said it, he clearly seemed to feel that parsing was morally equivalent to puppy-drowning. It seemed quite odd to me, since I’d only ever heard the word “parse” used to refer to the computer analysis of sentence structures. The most recent word to suddenly find itself stigmatized by Republicans (yes, it does somehow always seem to be Republican politicians who are involved in this particular kind of linguistic bullshittery) is “encryption.” Apparently encryption is now right up there with dirty bombs in terms of things that terrorists are about to use to kill us all. (“All” might be an exaggeration. I find it interesting that the United States had 33,169 firearm deaths in 2013–roughly 11 times as many deaths as on 9/11–and yet, Republicans seem to think that it’s important that we make firearms as widely available as possible. I guess they just don’t like people very much.) As a moderately technical person, this strikes me as odd, since I’ve always thought of encryption as that nifty mathematical technique (I was about to say “algorithm,” but I think the Republicans are down on that one now, too) that keeps you from intercepting my text messages, me from reading your Ashley Madison profile, and so on.
In between the Republican outrage over parsing and the current panic over encryption, we had the sudden appearance in the public consciousness of data mining. As far as I knew up to that point, data mining was a bunch of statistical techniques for finding relationships between things. Suddenly it was showing up in scary news stories–Google the phrase “data mining is evil” (you have to put the quotes around it to search for the phrase, as opposed to the individual words) and you will get 1,400 hits as of the time of writing (May 2016).
Besides being bemused by this intrusion of American know-nothingness into public discourse, I have a personal stake in the issue, because people often refer to what I do for a living as text data mining. This is a misnomer–by its nature, data mining is not something that you can do with texts. Bear with me and I’ll explain why, and then we’ll look at some French vocabulary for talking about all of this.
Data mining is basically about databases. In a database, the statistical techniques of data mining can help you do things like discover that Republicans with HBO subscriptions are more likely to consider voting for Romney in a primary than Republicans who don’t have HBO subscriptions. (Real one, if I remember the facts correctly.) You can do that because you have a table in the database that tells who’s a Republican, a table that tells who has HBO subscriptions, and a table that tells you which members of a random sample told the interviewer that they would/wouldn’t consider voting for Romney in a primary. Data mining is the science/art of figuring out what things are related (HBO subscription/willingness to vote for Romney) and what things aren’t related (making one up here: having bought an Escalade and being willing/unwilling to vote for Romney in a primary)–this among probably thousands and thousands of variables. Doing data mining research requires things like knowing particular kinds of math, understanding how to sample a population, getting computers to do complicated calculations in a way that is time-efficient—stuff like that.
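To make the "figuring out what things are related" step a little more concrete, here is a minimal sketch of one classic test of association, the Pearson chi-square statistic for two yes/no variables. The survey rows are invented for illustration (they are not the real HBO/Romney numbers), and the sketch assumes the separate database tables have already been joined into one row per respondent.

```python
from collections import Counter

# Hypothetical joined survey rows: (has_hbo, would_consider_romney).
# The counts are invented purely for illustration.
rows = (
    [(True, True)] * 40 + [(True, False)] * 20
    + [(False, True)] * 30 + [(False, False)] * 50
)

def chi_square_2x2(rows):
    """Pearson chi-square statistic for two boolean variables."""
    counts = Counter(rows)
    n = len(rows)
    stat = 0.0
    for a in (True, False):
        for b in (True, False):
            row_total = sum(counts[(a, x)] for x in (True, False))
            col_total = sum(counts[(x, b)] for x in (True, False))
            # Count we would expect if the two variables were unrelated
            expected = row_total * col_total / n
            stat += (counts[(a, b)] - expected) ** 2 / expected
    return stat

stat = chi_square_2x2(rows)
# 3.841 is the 5% critical value for chi-square with 1 degree of freedom
print(f"chi-square = {stat:.2f}; related at the 5% level: {stat > 3.841}")
```

Real data mining work repeats this kind of question across thousands of variables, which is where the sampling and efficiency concerns come in (and where you have to start worrying about finding "relationships" by pure chance).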
With data mining, you have that database, and you know what everything is. With “text mining,” or “text data mining,” as some people call it, you have texts, and you don’t know what anything is. (By “you,” I mean a computer program.) This is usually talked about as a difference between “structured” and “unstructured” data. With structured data (i.e., the database), you know what everything “is,” what it “means”: in some sense, its semantics. “Unstructured” data is how we would typically describe text. With text, you know what nothing is–you don’t know what anything means–in a very literal sense, you don’t know its semantics.
“Text mining” could be thought of as turning unstructured data into structured data. You’ve got a bunch of texts, and you want to use them to populate a database, perhaps. Maybe you have 23 million journal articles in the National Library of Medicine, and you want to find every statement that those 23 million articles make about which genes are affected by which drugs. Maybe you have a huge collection of French fairy tales, and you want (the computer) to find every time that a stepmother is mentioned and whether the portrayal of the stepmother is positive or negative. You could think of both of those as turning unstructured data into structured data–you’re taking that unstructured data and using it to build a database about drugs and genes, or a database about stepmothers. You can see now why we tend to prefer the term “text mining” to “text data mining”–to the extent that “data mining” is about structured data, it doesn’t really make sense to talk about “data mining” with respect to language. Where the data mining person basically just needs to know math, the text mining person needs to know something about how people write about whatever it is that you’re interested in. I do a bit of text mining. People will have really specific requests–tell me whether or not the genes from some experiment show up in the cancer literature, say; tell me if this is a suicide note or not; read this doctor’s note and tell me if this kid is a candidate for epilepsy surgery; stuff like that. It’s not really linguistics, but it pays the bills, and it suits my need to do something that might actually make the world a better place.
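As a toy illustration of "turning unstructured data into structured data," the sketch below pulls (drug, verb, gene) rows out of free text using tiny hand-made word lists. Everything here is an assumption made for the example: the lexicons, the sentences, and the naive "drug verb gene" pattern are all invented, and the sentences are not meant as biomedical claims. Real systems have to cope with the endless variety of ways that people actually write about these things.

```python
import re

# Hypothetical mini-lexicons; real systems rely on far larger resources.
DRUGS = {"aspirin", "tamoxifen"}
GENES = {"BRCA1", "TP53", "ESR1"}
VERBS = {"inhibits", "activates", "upregulates", "downregulates"}

def extract_relations(text):
    """Return (drug, verb, gene) rows for simple 'drug verb gene' word triples."""
    found = []
    # Naive sentence splitter: break after ., ! or ? followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        for i, tok in enumerate(tokens[:-2]):
            verb, gene = tokens[i + 1], tokens[i + 2]
            if tok.lower() in DRUGS and verb.lower() in VERBS and gene in GENES:
                found.append((tok.lower(), verb.lower(), gene))
    return found

# Invented example text
text = ("Tamoxifen inhibits ESR1. The weather was nice. "
        "Aspirin activates TP53 in some cells.")
rows = extract_relations(text)
for row in rows:
    print(row)
```

Each tuple that comes out is a row you could insert into a database table, at which point the data mining techniques described earlier become applicable.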
A related field is natural language processing. Natural language means human language, as opposed to computer languages. Natural language processing is about building tools to handle specific linguistic tasks–parse a sentence, figure out parts of speech, stuff like that. You might use a combination of different language processing programs to do a text mining task. I find this more interesting, since the questions are less about some set of facts than they are about the language itself. Where the data mining person needs to know math and the text mining person needs to know how people write about genes and drugs, or stepmothers, or whatever, the natural language processing person needs to know something about language itself–what kinds of structures sentences can have, how word frequencies are distributed, how to build linguistic resources for letting a computer process things that can’t be directly observed (e.g. semantics). I do a lot of this kind of stuff. Recently I’ve been working on coreference resolution–how to get a computer to recognize that Obama, President Obama, and Barack Obama are all referring to the same thing in the world, while Mrs. Obama and Michelle Obama are referring to something else in the world. (Recognizing that those “things” in the world are people, as opposed to, say, locations, or the names of companies, is a whole different story.)
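To give a feel for why coreference resolution is hard, here is a deliberately naive sketch that groups name mentions by checking whether their surname, first name, and gender cues conflict. The lookup tables and the greedy, order-dependent clustering are assumptions invented for this example; this is not how real coreference systems work, and a bare "Obama" is genuinely ambiguous without context.

```python
# Hypothetical lookup tables, invented for this illustration.
TITLE_GENDER = {"President": None, "Mr.": "m", "Mrs.": "f"}
NAME_GENDER = {"Barack": "m", "Michelle": "f"}

def describe(mention):
    """Split a mention into surname, first name, and gender cue (None = unknown)."""
    tokens = mention.split()
    first, gender = None, None
    for tok in tokens[:-1]:
        if tok in TITLE_GENDER:
            gender = gender or TITLE_GENDER[tok]
        else:
            first = tok
            gender = gender or NAME_GENDER.get(tok)
    return {"surname": tokens[-1], "first": first, "gender": gender}

def compatible(a, b):
    """Two descriptions may corefer if no known feature conflicts."""
    if a["surname"] != b["surname"]:
        return False
    return all(
        not (a[key] and b[key] and a[key] != b[key])
        for key in ("first", "gender")
    )

def cluster(mentions):
    """Greedily add each mention to the first compatible cluster."""
    clusters = []  # each entry: {"info": merged description, "mentions": [...]}
    for mention in mentions:
        desc = describe(mention)
        for c in clusters:
            if compatible(c["info"], desc):
                c["mentions"].append(mention)
                # Remember any newly learned first name or gender cue
                for key in ("first", "gender"):
                    c["info"][key] = c["info"][key] or desc[key]
                break
        else:
            clusters.append({"info": desc, "mentions": [mention]})
    return [c["mentions"] for c in clusters]

mentions = ["Obama", "President Obama", "Barack Obama",
            "Mrs. Obama", "Michelle Obama"]
groups = cluster(mentions)
for group in groups:
    print(group)
```

With the mentions in this order, the sketch puts the first three in one group and the last two in another. Shuffle the order, or add a second President, and it falls apart quickly, which is a fair summary of why this remains a research problem.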
Yet another field is computational linguistics. This is about using computational models to test theories about language. This is my favorite, but it’s the hardest to pay the bills with. I do some of this, too. Nowadays a lot of my time goes into large-scale attempts to model the semantics of biomedical language. I’m trying to investigate differences in the semantic primitives of biomedical language versus “general” English by building a large set of data-driven semantic representations of predicates found in journal articles; I’ll then compare that resource to a similar resource built for general English and look for things like whether or not the semantic primitives seem to come from the same set, whether or not given verbs have different representations in the two types of language, etc. My hope is to get a sense of the range of types of semantic variability from this particular project. You could imagine using computational linguistics work to build natural language processing tools, and then using those to carry out practical text mining tasks. You could use the text “data” mining results to do actual data mining.
As you can tell from my examples, I’m very much in the world of biomedical language. There’s also a lot that you can do in the humanities with this kind of stuff. A hot topic in the future might be using mathematical representations of semantics to study things that are/are not thought of as binaries–gender, sexuality, race, political economy, whatever. However, I would not claim to do ANY of that–I can just barely explain it. For more on that kind of stuff, see this excellent post by Ben Schmidt.
In practice, even people in the field don’t always differentiate between these terms, or at least don’t draw sharp boundaries between them. My business card says that I’m the director of a text mining group, but I identify most strongly as a computational linguist. We figured that “text mining” makes more sense as a practical field of inquiry to have within a medical school (which is where I work), so that’s what we called the group when we formed it. If you go to the annual conference of the Association for Computational Linguistics, you will see almost no computational linguistics, but rather a ton of natural language processing. If you go to the annual Biomedical Natural Language Processing meeting, you’ll see a mix of text mining, natural language processing, and a bit of computational linguistics. Sometimes the distinctions really matter, though. This post started its life as a response to someone who asked me to be on a panel about data mining, to talk specifically about text data mining. When I responded that I don’t do data mining, they asked what the difference is–this blog post started out as my response.
As far as I can tell, the relevant community in France doesn’t make these distinctions in any kind of rigid fashion, either, despite the much-vaunted French penchant for categorization (see Nadeau and Barlow’s excellent book for a discussion of where it comes from). However, French does have technical vocabulary for all of these fields. Here it is:
- fouiller: to excavate; to rummage through, to search (see also here)
- la fouille de données: data mining
- la fouille de texte(s): text mining
- le traitement automatique des langues naturelles: natural language processing
- la linguistique informatique: computational linguistics
21 thoughts on “Data mining, text mining, natural language processing, and computational linguistics: some definitions”
The “French penchant for categorization” is real, but it only applies to certain domains of thought or structures, and is completely absent in others. For instance, the formal rituals of “dating” in the USA are simply incomprehensible to someone French-bred.
BTW I love your humour and your ideas on Republicans and firearms .
Pardon me my comments concern marginal points of your texts but our interests in language proceed from very different approaches .
I very much appreciate your comments! 🙂 There’s an excellent discussion (although of course I have no idea how accurate it is) of the disparity between American definitions of relationship status and French definitions of relationship status (and the lack of correspondences between the two) in Debra Ollivier’s book “What French Women Know: About Love, Sex, and Other Matters of the Heart and Mind.”
Good contribution towards enlightening a lot of us “know nothings” 🙂 I actually did an online course on data mining from a general and ethical point of view a couple of years ago, and do follow encryption issues. But as I’m not a techie at all, I just scanned through parts of this post. You also made me smile with your comments on Republicans – though some of these people can be downright scary.
So, you missed the stuff about French stepmothers! 🙂
No I didn’t, but all three of mine were American anyway 🙂
Yes, very true and well put.
In a country where post offices bear signs like “no weapons inside, please” you wonder what people are carrying with them OUTSIDE post offices. Having lived in California as a European among Americans (yes, California was the relaxed type – go check the East Coast for a comparison), I can also very well relate to your comment about dating and relationships. Europeans are generally just too relaxed about such things – in the US view. Hmm… Republicans bring to my mind the book “Reagan’s Reign of Error”, but Trump and some figures from any run for presidency (remember from last time 9-9-9, Alaska and counting to three) are outright hilarious – although France also does have politicians with stand-up comedy qualities.
When politicians and some journalists start using evil words like “encryption”, “parsing” and the like, I get cold shudders down my spine and know why there won’t be a shortage in the demand for consulting services in this field. However, when (at least for IT professionals seemingly) harmless and general words like “algorithm” are placed next to weapons of mass-distraction – pardon me – mass-destruction, this is not just ridiculous, it is quite sad. Headlines like “Algorithms will take 2.873 million jobs by 2020” have probably been generated by Tay or a similar bot in the training phase of its (his? her?) artificial intelligence. Ooops. I used another one: Artificial Intelligence. Not only has it been around for quite a while and we had this immense “expert system” hype in the 1980s, but it seems most politicians and journalists don’t understand much of that, either.
Text mining can be easily illustrated by what it does. However, drawing the lines (to the extent they exist) between “natural language processing”, “text mining”, “text analytics”, “computational linguistics” and a few others becomes simply impossible for the fainthearted who just want to continue doing their job and have to admit they use all of that, even more – and now have to face even being placed into the fuzzy domain of Big Data – which just happens to be something I fervently sell to my customers but at some point have to break down into what they will really get.
It all comes back to Saussure – I am sure he would love this discussion and feel proven like never before.
Best regards from a European Computer Scientist and Computational Linguist 🙂
Highly pleasant and intelligent input from a “cousin germain”, as we say in France for “first cousin” . Hallo !
I have to thank you for this comment. I saw Ubu Roi the other night. One of the last lines of the play is “Mer farouche et inhospitalière qui baigne le pays appelé Germanie, ainsi nommé parce que les habitants de ce pays sont tous cousins germains.” Without your comment, I would not have gotten it at all (to the extent that I got anything in Ubu Roi)!
Glad to help . You’re a little adventurer, nowadays only a limited Happy Few knows Jarry, and for a stranger it’s even rarer .
This “cousin germain” for first cousin is related to the fact that in Spanish brother is “hermano”, so our fathers are brothers etc… The question is why did we call these hordes of shaggy Huns “brothers” when they barged in .
Well, at this point, I have to jump in with a bit of etymology. What sounds similar need not necessarily have the same roots 🙂
The “cousin germain” uses the word “germain” – from Latin “germanus” meaning “full, own (of brothers and sisters); one’s own brother; genuine, real, actual, true”.
The term “German” comes from Germanic roots and refers to “ger” = spear and “man” = man, i.e., the spear-bearing men.
That’s why this coincidence may be a nice pun, but the roots are quite different.
After all, we’re all brothers and sisters – only some politicians don’t seem to get that.
My father often told me: “Always remember that all men are your brothers and all women your sisters–some more so than others.” (Nice verbal ellipsis BTW, eh?)
> Algorithms will take 2.873 million jobs by 2020
Yes, that’s exactly the kind of garbage that makes me crazy.
> drawing the lines (to the extent they exist) between “natural language processing”, “text mining”, “text analytics”, “computational linguistics” and a few others becomes simply impossible
Yes, I totally agree–I tried to make the point that in practice the lines are quite blurred, e.g. with the Association for Computational Linguistics annual meeting being filled with natural language processing papers. Ultimately, I suspect that there is some value in thinking in these kinds of terms if it will help you to understand what kinds of expertise you need to have or to hire, how to evaluate your work, and stuff like that.
> It all comes back to Saussure
Always. 🙂 I love to cite Saussure, preferably in French. If I can manage to cite Panini: even better!
Phildange, in these days when what we now call Trumpism has become a significant enough force in America to affect a presidential election, Ubu Roi seems more germane (in the English sense of the word) than ever. If Ubu Roi isn’t Trump, I don’t know who he is, and vice versa…
The Italians had Berlusconi, we had Sarkozy, but I admit Americans are the best once more . Even the cold blooded bitch Le Pen wouldn’t dare saying the 10th of what Donald says .
Jürgen, there’s something I’m missing. Where does the French meaning of germain, as in cousin germain from brother like the Spanish hermano, come from? Is it a completely different origin from your Germanic root, so that it is a coincidence? Or is there a connection between German and hermano/germain?
Well, the French “germain”, Spanish “hermano” (Old Spanish “ermano”), Catalan “germà”, Galician “irmán”, Portuguese “irmão” all clearly come from the Latin “germanus”.
The origin of “German” referring to the people is actually younger than the use of “germanus”, so one can now speculate about the precise etymology. In Germanic languages themselves, you’d find “ger mann” as a term for a spear-bearing man, something the rather militaristic Romans may have encountered quite often. In fact, there is the German name “Gerhard” until today (French “Gerard”, Irish [from Celtic] “Gearóid”, Italian “Gerardo”, Dutch “Gerrit”) from “ger” = Engl. “spear” and “hart” = Engl. “brave”.
Alternate explanations take it to the root of “germanus” indeed, meaning something like “brothers”, “neighbours” in the North of the Roman Empire.
There are also explanations going to Celtic languages, i.e., Irish “gair” = “neighbour”.
In Caesar’s times, there was no German nation as such – this was a large number of heterogeneous tribes (some say this still holds today), so both approaches to the etymology of “German” seem fit: the Romans using a collective term for the (from their point of view) hostile folks in the North (actually, covering decent parts of today’s France as well), or the (from trading long-known) neighbours in the North.
There may be linguistic evidence favoring one or the other approach, but I am not aware of a finally conclusive single opinion on this topic. The name for the spear-bearing neighbours may even have been inspired by both approaches 🙂
All right . There still is an interesting remaining question but your thorough approach is quite useful . Thank you for your effort .