Social media, linguistics, suicide, and you

All of that time you spend on Facebook isn’t wasted if you donate your social media data to suicide research.

Until the 1960s or so, there were basically two ways to do linguistic research.

  1. If you were into historical linguistics and/or dead languages, you looked at ancient texts.
  2. If you were into living languages, you went and camped out on a reservation, in a village, or whatever, and you sat with native speakers and your notebook and you collected data.  You transcribed things, and then went home and copied out your notes, and then you thought about them a lot.

In either case, the underlying philosophy was that there was some body of data in your hands, and your task as a linguist was to come up with a description/explanation of what was in that body of data.  Seems straightforward enough.

In the 1960s or so, the American linguist Noam Chomsky turned the world of linguistics upside down with the idea that what you should be doing is describing/explaining native speakers’ intuitions about their language.  Intuition is a technical term here–it refers not to “what Kevin happens to think is the case about his native language,” but to native speaker judgements about questions like Is the sentence “I saw the man on the hill with a telescope” ambiguous?  This changed the conception of what constitutes “data” enormously.  On this view of linguistics, there’s no need to go freeze on the Siberian tundra to get your data–you can do it in your living room.  Les données, c’est moi !  

Today linguists are less likely to talk about binary “yes it is/no it isn’t” questions than they are about gradient judgments–“Sentence X is more acceptable than sentence Y”. For a really good discussion about the issues from a perspective I think you’ll like, see the work of John Sprouse under the general heading of “experimental syntax.”

–Philip Resnik

From a philosophical perspective, this was a radical shift–from empiricism (sometimes a very extreme empiricism, as for example in the case of Leonard Bloomfield (leading American linguist of the first half of the 20th century, author of the first article published in the journal Language, and Yale professor who was refused membership in the Faculty Club because they didn’t let Jews join in those days), who was of the opinion that mental states are not observable, and therefore semantics is not a fit topic for science) to rationalism, and a rather extreme rationalism at that.  Not everyone was happy about this, and in fact most older linguists weren’t–from a methodological point of view, it’s hard to see how you could falsify a hypothesis when the evidence that’s being presented is some version of yes it IS ambiguous–but Chomsky took the grad students by storm, and linguistics underwent a radical change.  It swept the world of academia in a way that it never had before, too.  (Check out Randy Allen Harris’s The linguistics wars for details.)

Meanwhile, Henry Kučera and W. Nelson Francis were thinking about the potential for computerized analysis of language, and in 1967 they published Computational Analysis of Present-Day American English, based on the study of a bit over one million words of American English that they had had typed up by keypunch operators.  They were at Brown, and they called their data set, which they made available to the public–now you could check someone else’s data–the Brown Corpus.  As far as I’m aware, that data itself didn’t lead to any earthshaking discoveries, but it did make people’s ears perk up: it was clear that there were possibilities for doing defensible studies of language when you could search a large body of text with a few keystrokes that just weren’t there if your data were whatever you happened to need to intuit that morning.  Or whatever your grad student happened to need to intuit that morning.  Or, in a pinch, whatever you happened to need to intuit in the heat of your dissertation defense.

There were some issues with the Brown Corpus, or at any rate, with trying to make similar corpora (the plural of corpus) on your own. One was copyrights.  The Brown Corpus was what is called a stratified sample: it deliberately tried to structure its contents.  Those contents included fiction, non-fiction, personal correspondence, books, newspaper articles–all sorts of stuff, much of which required getting permission from someone or other.  Then there was the matter of those keypunch machines–all 1,014,312 words had to be entered by hand.  People continued to pursue the construction of corpora, and cool things came out of that work, both in terms of linguistic theory and in terms of designing computer programs that could do things with language.  But, it was slow going–people realized that bigger and bigger corpora would let them do cooler and cooler things, but typing is neither fast, nor inexpensive.

Then a miracle happened: the Internet.  All of a sudden random people around the world were vomiting forth massive quantities of linguistic data, and it was mostly copyright-free, and they were typing that shit themselves.  Nectar!  Now you can get access to billions of words of text in an amazing variety of languages.  Is it necessarily clean, pretty, or legible?  No.  Is it real?  Yes, and that’s what matters, at least to linguists.  My colleague Graciela Gonzalez at Penn has done amazing things with social media data, ranging from monitoring medications for previously unknown adverse effects to monitoring prescription medication abuse.

Until recently, there was remarkably little data available on the language of suicidal people, or even on the language of people with psychiatric disorders in general.  This is surprising, because with so many mental illnesses, the symptoms are, for the most part, expressed via language.  As Philip Resnik, Rebecca Resnik, and Margaret Mitchell put it in the introduction to the proceedings of the first Association for Computational Linguistics workshop on computational linguistics and clinical psychology in 2014,

For clinical psychologists, language plays a central role in diagnosis. Indeed, many clinical instruments fundamentally rely on what is, in effect, manual annotation of patient language. Applying language technology in this domain, e.g. in language-based assessment, could potentially have an enormous impact, because many individuals are motivated to underreport psychiatric symptoms (consider active duty soldiers, for example) or lack the self-awareness to report accurately (consider individuals involved in substance abuse who do not recognize their own addiction), and because many people — e.g. those without adequate insurance or in rural areas — cannot even obtain access to a clinician who is qualified to perform a psychological evaluation.

Suppose you’re interested in the language of suicidal people.  Until recently, if you wanted to get your hands on actual data, you could get your hands on a set of suicide notes collected and annotated by my colleague John Pestian at Cincinnati Children’s Hospital Medical Center (and me and a bunch of other people). That data has been revealing, and we’ve learnt things about suicide from that data that we didn’t know.  But, that data was hard to come by.  Putting that data set together took years (if you can read French, you can find a paper here on some of the issues), and if you want to get your hands on it, you need to go through some hoops to demonstrate that you have a legitimate research interest, that you will not be posting people’s suicide notes on Facebook or Pinterest, and so on.

Social media has changed all of that.  In fact, the past couple years have seen an explosion of work on the linguistic characteristics of mental states associated with mental illness, including suicidality.  Much of it has appeared in the proceedings of CLPsych, the above-mentioned Association for Computational Linguistics workshop.   To give you some examples:

  • Glen Coppersmith and his colleagues at Johns Hopkins worked with tweets from people with post-traumatic stress disorder, seasonal affective disorder, depression, bipolar disorder, and psychiatrically healthy controls.  They found that based on the contents of the tweets, they could do a pretty good job of classifying which of those categories the poster belonged to.  They tried various methods of representing the contents of the tweets, and found that they got the best results with what are called statistical language models.  In a later paper, they looked at six more conditions, and added exploratory analysis on the distributional characteristics of emotionally relevant language.
  • Margaret Mitchell of Microsoft and her colleagues worked with tweets from schizophrenics and healthy controls, and found some unexpected signals in the language of the schizophrenics.  For example, the schizophrenic social media users were statistically more likely to use what linguists call hedging expressions, like I think, I believe, or I guess.
  • In one of my favorite papers of this ilk, Munmun De Choudhury and their colleagues looked at the language of people who moved from a Reddit for people with mental health related diagnoses to a suicide watch Reddit.  They found a number of differences in the language of those Reddit users who moved to the suicide watch group and those who didn’t, including differences in what is called accommodation: the ways that we (can) adjust our language to that of the people with whom we’re communicating.  (A post on the subject is in the works.)

Now, there’s an issue here: just because people post their lives on social media doesn’t mean that it’s OK for you to use that stuff for your own purposes.  Ethical questions abound, and that’s just as true for the tweets, posts, or whatever of the psychiatrically healthy controls as it is for those with mental illness, suicidal behavior, or whatever.  And that’s where you come in. is a group that collects social media data, particularly linguistic data, for use in doing research like the stuff that I’ve described here with the goal of suicide prevention.  They want your data if you have ever flirted with suicide, but they want your data if you haven’t, too–you always need something to compare to, and people like me need data from non-suicidal people to compare to the data from suicidal people.  That could be you!  Check it out:

Work in this space is definitely emotionally taxing. I find myself with a rule similar to John’s “no more than 10 a day” rule — enough to constantly remind me of the importance of this work, without becoming emotionally oppressive. The emotional response to spot-checking the data is qualitatively different and far more visceral than something like sentiment analysis of beer reviews.

–Glen Coppersmith

One day I needed to read through some suicide notes.  I set an afternoon aside to do it.  I made it through about 150–an hour, maybe–before I read one that was like being punched in the stomach.  I went out and bought a pack of cigarettes, and I didn’t even smoke (at the time!).  I spent a lot of time over the course of the next couple weeks trying to forget it.  I mentioned it to John.  All afternoon?  Man, you can’t do that–10 a day, max.  He was, of course, right.  In this kind of work, ethical issues come up with the researchers, too.  We now provide free therapy for the people who transcribe data for us on suicide-related projects, our researchers who work directly with the data are required to visit a therapist or a clergyman once a month, and we rotate research assistants off of the project every quarter.  Moral of the story: I don’t recommend that you go digging around in this data out of curiosity. But, you can be the data–suicidal or not, why not donate your social media data to and maybe keep someone else from writing one of those notes?

Thanks to John Pestian, Philip Resnik, and Glen Coppersmith for their comments and contributions.  French notes follow.

French notes

le suicide: suicide

le/la suicidé(e): person who kills themself

suicider: to drive someone to suicide; to make someone’s death look like suicide

se suicider: to kill yourself

se donner la mort: to kill yourself

la tentative de suicide: suicide attempt

maquiller un meurtre en suicide: to make a murder look like a suicide

l’attentat-suicide: suicide attack

la mission suicide: suicide mission

la lettre d’adieu: suicide note

Derivational morphology, pragmatics, and the Great Parisian Rat Crisis

Here in France, our major worries are that we’ll do the same idiotic thing in our next election that America just did in hers. Meanwhile, all the anglophone press can find to talk about is our little rat problem, while ignoring everything linguistically interesting about it.

The French 2017 presidential race is quickly coming down to a match between the far-ish right and the extreme right, it’s not clear how much longer Europe as we know it will continue to exist, and Marine Le Pen was just voted the most admired politician in France,  but the main story about France in the anglophone press right now is… an explosion of the Parisian rat population.


That store window in Ratatouille: it’s for real.  (There’s a cool bar nearby, Le baiser salé (“The Salty Kiss”), that I stop into once in a while.  I’m sparing you a photograph of the real rat window because it really is quite disgusting, and I say that as someone who once posted a picture of a grilled guinea pig here.)  Friends tell me that the story has it that there is one rat for every person in Paris, but current estimates are quite a bit higher.

How would you know the size of the rat population, one way or the other?  There’s a specific sampling technique that’s used to estimate the size of a population that can’t be directly observed–think about fish in a pond, or arctic ground squirrels in their little burrows, or–rats.  Charming video involving goldfish crackers to be seen here.
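Assuming the technique in question is what ecologists call mark-recapture (the post doesn't name it, so this is a guess from the description), the arithmetic is simple: catch some animals, mark them, let them go, catch a second sample later, and see what fraction of it is marked. Here is a minimal Python sketch of the simplest version, the Lincoln-Petersen estimate, plus Chapman's small-sample variant. The function names and the numbers are made up for illustration; they don't come from any real rat survey.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Estimate population size from a two-sample mark-recapture study.

    marked_first:  animals caught, marked, and released in the first sample
    caught_second: animals caught in the second sample
    recaptured:    animals in the second sample that carry a mark
    """
    if recaptured == 0:
        raise ValueError("no marked animals recaptured; estimate is undefined")
    # If the marked fraction of the second sample mirrors the marked fraction
    # of the whole population, then recaptured / caught_second is roughly
    # marked_first / N, so N is roughly marked_first * caught_second / recaptured.
    return marked_first * caught_second / recaptured


def chapman(marked_first, caught_second, recaptured):
    """Chapman's variant: less biased for small samples, defined with zero recaptures."""
    return (marked_first + 1) * (caught_second + 1) / (recaptured + 1) - 1


if __name__ == "__main__":
    # Hypothetical counts: 50 marked, 40 caught later, 10 of those marked.
    print(lincoln_petersen(50, 40, 10))  # 200.0
    print(chapman(50, 40, 10))
```

The estimate leans on assumptions that are hard to guarantee for city rats: nobody is born, dies, or moves in or out between the two samples, and marked animals are no easier or harder to catch than unmarked ones.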

Zipf’s Law being what it is, this brings up a linguistic oddity that I find interesting.  It has to do with what’s called derivational morphology: the things that we can add to words that change their meaning or their part of speech, like the un in unlock or the ic in anemic.

French has a prefix, dé, that you can add to verbs to make them mean something like a reversal of the normal action of the verb.  Alain Bentolila, in his La langue française pour les nuls (don’t mock it–it may be the best book on the linguistics of any single language that I’ve ever read) defines it and its close relatives, dés- and dis-, as contributing a meaning something like séparé de, qui a cessé de, différent.  Some examples:

visser: to screw / dévisser: to unscrew
voiler: to veil / dévoiler: to unveil
verrouiller: to bolt; to lock; to close (a brèche, in a military context) / déverrouiller: to unbolt; to unlock (a phone, a keyboard, the caps lock)
valoriser: to add value to, to increase the value of / dévaloriser: to devalue
vêtir: to dress (transitive) / dévêtir: to undress (transitive)

This is relevant to current events because there is a set of words that have to do with removing things–mostly pestilential things, except for the last one–that have an interesting pattern with respect to this derivational prefix.  To wit, I give you these examples from Bernard Fradin’s Nouvelles approches en morphologie (definitions in French when necessary, because these don’t typically show up in bilingual dictionaries):

dératiser: to exterminate the rats in [something] (WordReference)
désinsectiser: to spray [something] with insecticide (WordReference).  (I will mention here that some of the definitions of désinsectiser that I’ve come across have specified that this means to get rid of insects by using gas.  I can’t find any at the moment, though.)
décafardiser: (not in WordReference) détruire les cafards dans un lieu, spécialement par fumigation (Cordial).  “To destroy the cockroaches in a place, especially by fumigation.”
dénicotiniser: to remove the nicotine from [something] (WordReference)
désodoriser: to deodorize (WordReference)
dévirginiser: to deflower (WordReference)

What’s interesting about this?  A lot, actually.  To wit:

  1. There are no corresponding forms without dé.  Unlike visser/dévisser or vêtir/dévêtir, we have no form of dératiser/désinsectiser/décafardiser without dé.
  2. These verbs seem to have both a prefix (dé-) and a suffix–where does the -is- come from?
  3. As we will see, this gets us to an interaction that is not supposed to happen in language: between pragmatics and morphology.

Fradin explains the pattern like this (in translation):


The second case to consider is that of the verbs like dératiser (décafardiser, désinsectiser, dénicotiniser, désodoriser, dévirginiser) which display at the same time a derivational prefix and a derivational suffix….[T]he only analysis worth considering for these verbs is to say that here dé- is affixed to a verb that is not present in the language, but is possible.  The solution appealing to an unattested verb is especially plausible since we can show that the verb is missing due to reasons of pragmatics.

Fradin goes on to make the case that what we have here is a set of verbs that describe the reversal of a state that you do not create.  You don’t infest something with rats, or insects, or nicotine.  (Note that Molière’s Sganarelle would disagree with the notion that nicotine is something that one is infested with.)  His story is that we see this bizarre combination of patterns:

  1. No corresponding version of the verb without dé
  2. There’s an -is- that doesn’t seem to have anything obvious to do with the meaning of dé

…just in the case of these verbs, in which you didn’t create the initial state of infestation.

As one of my coworkers pointed out over lunch one day: that’s not to say that you couldn’t create the initial state of infestation.  He’s right: you certainly could put rats in something, or insects, or a cockroach.  (In fact, that’s a famous scam, right?)  It’s a nice point, because it doesn’t change the essentially pragmatic nature of the explanation for this bizarre little grouping of verbs–in fact, it highlights the involvement of pragmatics, because it argues against the possibility of an ontological explanation for this.  On an ontologically-based approach, you have to have a model of reality in which it simply isn’t possible to cause something to have rats, or cockroaches, or insects, and that clearly is not the case.  Rather, this is more about what’s plausible than about what’s possible.  It’s not about what “is” (i.e., ontology)–it’s about what people expect to be the case.  (This is a big deal (to me) because you run into people who think that the answer to every question in the world is an ontology.  That doesn’t seem to be the case here.  It’s also a big deal (again, to me) because the dominant school of thought in 20th-century linguistics was heavily into denying the effects of pragmatics on language.  However, pragmatics appears to have a role here, if we buy Fradin’s story.)

My coworker also raised a counterargument.  It’s a kind of counterargument that we really like in my line of work: positing that there is a simpler explanation for the phenomenon in question.  His suggestion was that the -is- thing comes from what we call denominalization, or turning nouns into something else–in this case, a verb.  (You can find a discussion of nominalization–turning a verb into a noun–here.)  I don’t buy the adequacy of this hypothesis, because we can find so many French verbs that are pretty clearly denominalized–that is, derived from a noun–but don’t have the -is-.  Some examples:

dérater: Débarrasser une personne ou un animal de l’organe appelé Rate.  Il se disait des Chiens à qui l’on faisait cette opération pour les rendre, croyait-on, plus agiles à la course. (L’appli Larousse Dict-français-français) “To remove from a person or an animal the organ called Rate (spleen).  It was said of Dogs to whom this operation was done in order to make them, it was thought, faster at racing.”
dévisser: to unscrew.  From visser, to screw, from la vis (screw, and you pronounce the s)
déclouer: Détacher, défaire ce qui est fixé par des clous.  (L’appli Trouve-mot) “To detach or undo what is held in place by nails.”  From clouer, to nail, from le clou, nail

I especially like the contrast between dérater and dératiser.  The semantics of both of them involves changing the state of something (linguists are heavily into the changing of states), and they both involve changing a state that you didn’t create.  So, why no -is- in dérater?  If we asked Fradin, he would be likely to point out that the verbs that he mentions–that is, the ones with dé and -is–all make reference to changing a state that is in some sense noxious.  In contrast, having a spleen is not something that you would think of as noxious, and so dérater–the removal of the spleen–doesn’t get the -is- part.  (The technical term is morpheme.)

Now, I’ve been sorta defending Fradin here, but: I hate this kind of argument in linguistics, where you’re basically arguing on the basis of examples and counterexamples.  I’m aware of the venerable history of this form of rhetoric in theoretical linguistics, but I also am more and more aware–as is much of the field–that science in general, and linguistics in particular, is less often about always and never than it is about tendencies in populations.  If you look at tendencies in the population of French verbs about changing states, you can notice a group of verbs that shares a particular “behavior” (mucking about with both dé and -is-) and a particular meaning (changing a noxious state that you didn’t create).  But, there are other verbs that have the dé-is- pattern that involve a change of state, but don’t involve a noxious condition–Fradin himself gave us the example of dévirginiser, which I passed on to you in the second table above–and as far as I know, there’s nothing noxious about virginity in the Francophone world.  Furthermore, there are:

  • …verbs that have to do with changing a noxious state that you didn’t create, but have a different morphological structure that doesn’t involve dé or -is-: to delouse, which is épouiller, and likewise for to de-flea: épucer or, again, épouiller.
  • …verbs with pretty much the same semantics that do take dé, but don’t take -is-.  In particular, dévirginiser has another form, dépuceler, which led to a very embarrassing moment for me over lunch one day, but that’s a story for another time…

…and beyond that: who says that there are no corresponding verbs without dé, which you will recall is crucial to his pragmatically-based analysis?  There are hundreds of millions of easily searchable words of naturally-occurring French-language data on the web, and I would like to see a solid effort to find those words before I bought the idea that they don’t occur in the language.

So, from my point of view, I’d want to see quantitative data.  Being a minor phenomenon in a language does not by any stretch of the imagination mean that you’re not an interesting phenomenon–but, from my point of view, part of understanding anything linguistic is understanding the distribution of the phenomenon.
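A first pass at that kind of search might look like the following Python sketch: strip the dé-/dés- from each of Fradin's verbs to get the hypothetical unprefixed form, then count how often any inflected form of it turns up in a body of French text. Everything here is an assumption made for illustration: the two-sentence "corpus" is invented, the list of endings only approximates the conjugation of regular -er verbs, and the function names are made up.

```python
import re
from collections import Counter

# The dé- + -is- verbs from Fradin's list.
PREFIXED = ["dératiser", "désinsectiser", "décafardiser",
            "dénicotiniser", "désodoriser", "dévirginiser"]

# A rough, incomplete set of endings for regular -er verbs.
ENDINGS = r"(?:er|e|es|ons|ez|ent|é|ée|és|ées|ant|ait|ais|aient|era|eront)"


def base_form(verb):
    """Strip dé-/dés- to get the hypothetical unprefixed verb."""
    # dés- shows up before a vowel (désinsectiser, désodoriser); dé- elsewhere.
    if verb.startswith("dés") and verb[3] in "aeiouyéèêh":
        return verb[3:]
    if verb.startswith("dé"):
        return verb[2:]
    raise ValueError(f"{verb!r} does not start with dé-")


def count_base_forms(text, prefixed_verbs=PREFIXED):
    """Count whole-word occurrences of each unprefixed verb in text."""
    counts = Counter()
    for verb in prefixed_verbs:
        base = base_form(verb)
        stem = base[:-2]  # drop the infinitive ending -er
        # \b keeps "ratiser" from matching inside "dératiser", and the
        # explicit endings keep it from matching "ratisser" (to rake).
        pattern = re.compile(r"\b" + re.escape(stem) + ENDINGS + r"\b",
                             re.IGNORECASE)
        counts[base] = len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    # Invented placeholder text, not real corpus data.
    sample = ("La mairie veut dératiser les parcs. Personne ne songe à "
              "ratiser un jardin, mais on peut ratisser les feuilles.")
    for base, n in count_base_forms(sample).items():
        print(base, n)
```

On real data you would also want the counts for the dé- forms themselves, so that the numbers are relative rather than absolute, and you would want to read the hits: a handful of jokes and typos is not the same thing as a verb in general use.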

The mayor’s office launched a deratization campaign last month, and the story seems to have fallen out of the news.  My strolls across the city haven’t run into any of the closed-off parks that you might have read about.  I still stick my bread in the microwave before I go to bed at night–but, I always have.  I hate rats. 

English notes

rats! is a very mild way of expressing unhappy surprise.  When I say “very mild,” I mean that you could say this in front of your grandmother.

  • “Oh, rats!” I couldn’t find it. I had copies of other stories and poems that I’d written in the past, but couldn’t find this particular one. (Marcus Mebes, Rats! And other frustrations)

rat: an informer.  This is slang.

  • That Richard’s been badmouthing me to the boss behind my back; he’s a rat. Ce Richard dit du mal de moi au patron derrière mon dos ; c’est une ordure.
  • We’ve used the term “rat” to refer to an informer since approximately 1910.

to not give a rat’s ass: to not care (about some fact).

  • I don’t give rats ass, my niece and her boyfriend met in church but she a hoe.  (Twitter, in response to a tweet asking Guys!! Can you marry a girl you met at a Club? Not standard English, obviously (I don’t give rats ass, she a hoe).)