Keeping your…together: Reproducibility in computational research

You’ve probably heard: there is a crisis in science. You don’t have to be on top of the literature to be aware of this–it’s covered in the popular press, too. This piece in Forbes is representative: How the reproducibility crisis in academia is affecting scientific research. The term “crisis” might be a bit overblown, but certainly researchers in many fields have recently been paying a lot more attention to planning their analyses for reproducibility, which can sometimes mean planning the experiments that precede the analysis for replicability (also known as repeatability). The contrast between the two is this: you can think of reproduction as arriving at the same values, the same findings, or the same conclusion as an earlier study; replication, on the other hand, refers to the ability to repeat the initial experiment. Replicability is important for a number of reasons, one of them being that as a first attempt to assess the reproducibility of a study, you might want to see if you get the same results when you repeat the original experiment.

Lately I’ve been talking and writing about this kind of thing a lot. When I do that, I’ve found that what audiences and reviewers seem to enjoy the most is when I give details on my own failures to repeat my own studies. The irony is lost on exactly no one, including (obviously) me: in theory, I have some expertise on the relevant issues, and yet I struggle just to keep my own shit together in this regard. (To keep one’s shit together is explained in the English notes below.)

So: for your amusement, I present today’s reproducibility fail. To wit: I just had a paper accepted that involved manually examining hundreds and hundreds of words, all of which started with letters that could be one of the negating morphemes of English. (A morpheme is a part of a word. For example, cat has one morpheme, while cats has two: cat, and the plural -s.) When I say negating morpheme, I mean things like the prefix de- in deoxygenate, or the prefix in- in inefficient.

Now, I said that we were examining words that start with letters that could be one of the negating morphemes of the English language. Those strings are not always negative–think of examples like these:

ineffective (not effective) versus intuitive (nothing negative in there)
unclear (not clear) versus uncle (nothing negative in there)
deactivate (cause to not be active) versus deal (nothing negative in there, although the word’s current association with Donald Trump–the molesting, draft-dodging, tax-dodging, race-baiting, disabled-mocking, religiously bigoted, lying assclown that is now the president of my fatherland–makes it somewhat nauseating for me to type it)

…the moral of which is that you can’t find all of the words with negative prefixes in a text just by starting with a list of negative prefixes and looking for all words that start with them. Doing this would lead you to count intuitive, uncle, and deal as words that start with negatives, which they are not.

affix: something that cannot be a word, but can be added to one. English examples: un-, pre-, -’s.

prefix: an affix that is added to the beginning of a word. English examples: un-, pre-, pro-.

suffix: an affix that is added to the end of a word. English examples: -’s, -ing, -ed.

So, when I wanted to find out how the incidence of affixal negation compares between different kinds of biomedical texts–I care about that kind of thing because my job involves researching computer programs that do things with biomedical texts, and I need to know things like how much does negation add to the burden of understanding medical texts by patients’ family members?–I knew that I could write a program to pull out all of the words that start with things like de-, un-, in-, and anti-, but I also knew that I would have to have actual human beings look at those lists and mark which ones actually started with negative prefixes, and which didn’t.
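Here is a minimal sketch of that first, purely mechanical step, just to show why the human pass is unavoidable. It is not the program from the actual project: the prefix list contains only the four prefixes named above, and the function name and sample sentence are made up for illustration.

```python
import re

# Only the prefixes mentioned in this post; a real study would probably use a longer list.
PREFIXES = ("anti", "de", "in", "un")

def candidate_words(text, prefixes=PREFIXES):
    """Return the sorted set of words that merely *start with* a candidate prefix."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(
        {w for w in words
         if any(w.startswith(p) and len(w) > len(p) for p in prefixes)}
    )

# Made-up sample sentence, built from the examples earlier in the post.
sample = "My uncle found the deal unclear, and the ineffective drug was intuitive to deactivate."
candidates = candidate_words(sample)
print(candidates)
```

The list that comes back mixes real negations (unclear, ineffective, deactivate) with impostors (uncle, deal, intuitive). String matching alone cannot tell them apart, which is exactly why the candidates have to go to human judges.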

Now, when you do something like this–that is to say, when you have humans look at data (linguistic or otherwise) and make judgments about it–you typically want to have more than one person do it. Then you calculate how often they agree with each other. If they agree with each other, say, 90% of the time, then you probably have pretty good judgments in hand. On the other hand, if they agree with each other only 60% of the time, then you’ve got a problem. Maybe you’ve defined a task that’s just too difficult for humans to do consistently, in which case you want to redefine it in a way that makes more sense. Maybe you wrote crappy instructions, in which case you want to improve them. Maybe one of your humans is smoking what we call in France shit (marijuana–no, I do not indulge). In any case, it’s that calculation of agreement between the humans that lets you decide whether or not you have a problem that needs to be dealt with.
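For the curious, the agreement calculation can be sketched in a few lines. The first function is the plain "how often do they agree" number described above. The second, Cohen's kappa, is a commonly used chance-corrected measure that I am adding for illustration; I am not claiming it is the statistic reported in the paper. The labels and the ten judgments are invented.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Invented judgments: "neg" = really a negative prefix, "not" = just looks like one.
analyst_1 = ["neg", "neg", "neg", "not", "not", "not", "neg", "not", "neg", "not"]
analyst_2 = ["neg", "neg", "neg", "not", "not", "not", "neg", "not", "neg", "neg"]

agreement = percent_agreement(analyst_1, analyst_2)
kappa = cohens_kappa(analyst_1, analyst_2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this toy data the two analysts disagree on one item out of ten, so raw agreement is 90%. Kappa comes out lower, because with only two labels a fair amount of agreement would happen by chance anyway.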

Coincidentally, at the moment I’m teaching a course on what I do for a living, and I wanted to give my students the opportunity to get some hands-on practice with the process of making the human judgments that provide the data that we use to do our research. This little project seemed like a good one to offer them, for a number of reasons:

It’s relatively straightforward (we got good agreement on the original project)…
…while still difficult enough to be challenging (we had to take a couple passes at developing the instructions, and even then, we didn’t have complete agreement on everything)…
…plus, you don’t need a special program to record the judgments, while more complicated tasks frequently do require that the human learn a complicated program in order to record their analyses (did you notice that little subjunctive? …that the human learn… versus …that the human learns…?)…
…and I actually need the data for future research, which means that I’ll use it to write papers, which means that the students will have the opportunity to participate in writing the papers, and for students, published papers are the key to getting your doctorate and getting the hell outta Dodge.

Now, because I care about reproducibility of research, I use a publicly available web site to archive the code (computer programs) that I use to do my analyses. (You can find the stuff for the project that I’m talking about here.) So, getting my students started seemed like it would be straightforward: send them to the web site, and tell them to download the instructions, the list of words that needed judgments, and the actual judgments of the two analysts so that they could use those to figure out how to use the analysis program and to evaluate their own judgments.

It happens that I was one of those analysts, and that a colleague who happens to be a practicing emergency room physician was the other. It also happens that we annotated a randomized mixture of text from two sources: from scientific journal articles, and from the clinical records (totally anonymized, and available free to researchers) of actual patients. In the case of the clinical records, I found when doing the analysis of our agreements and disagreements that when we disagreed, I mostly thought that he was right and I was wrong. (Not surprising, since he is currently practicing, and I haven’t touched a patient since 1991.) In contrast, I tended to be right when we disagreed on the scientific journal articles–not surprising, either, since I spend all day, every day with my nose stuck deep in them. So: it was super-important to me that my students have access to both of our data, both so that they could compare their own judgments to it, and so that they could see what kinds of things we had disagreed on. (It’s usually the differences in the world that are the most interesting, right?)

Seulement voilà, the thing is: when I went to the web site where I had archived all of the code and the data on which the analysis was based, I saw that I had totally forgotten to put the other analyst’s data there. Think about the context of this:

In theory, I have some expertise on issues of reproducibility in computational science.
I was very deliberately making an effort to make this experiment as repeatable as possible.

…and yet, I still screwed it up. This is important in that when you read about reproducibility problems in science, sometimes you’ll see–often implicitly, and sometimes even explicitly–the view that reproducibility problems come from deliberately deceptive actions on the part of the researcher. Now, I know that a certain amount of self-deception can take place pretty easily in research, typically taking the form of screwing around with statistical tests of significance. But that’s a pretty different thing from deliberately publishing crap research. When you consider that someone who is pretty deeply invested in doing, and in promoting, reproducible research–that is: me–can still fail to archive everything that would be needed to repeat one of his own experiments, it gives you an object lesson in how difficult it can be to ensure even the less-ambitious goal of repeatability of one’s work…and a fortiori, reproducibility of one’s results.
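One cheap guard against this particular failure is to write down what the archive is supposed to contain and let a script compare that list against what is actually there. The sketch below is only an illustration: the file names are hypothetical, not the ones in my repository, and the demo builds a throwaway directory that is deliberately missing the second analyst's judgments.

```python
import tempfile
from pathlib import Path

# Hypothetical manifest: everything someone would need to repeat the experiment.
REQUIRED = [
    "instructions.txt",
    "words.txt",
    "judgments_analyst1.tsv",
    "judgments_analyst2.tsv",
]

def missing_files(archive_dir, required=REQUIRED):
    """Return the manifest entries that are not present as files in archive_dir."""
    root = Path(archive_dir)
    return [name for name in required if not (root / name).is_file()]

# Demo: an "archive" where the last file in the manifest was never added.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED[:-1]:
        (Path(tmp) / name).write_text("placeholder\n")
    missing = missing_files(tmp)

print(missing)
```

A check like this only helps if the manifest itself is complete, of course, but writing the manifest is at least a moment where you are forced to think about what a stranger would need.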

In French, there are some very interesting things associated with affixal negation, including the phenomenon of verbs like dératiser and décafardiser that we talked about in the post that you can find here. Several of the English-language examples in this post come from this paper on affixal negation by Chantal van Son, Emiel van Miltenburg, and Roser Morante, all of the Vrije Universiteit Amsterdam.

English notes

One of the expressions in this post, along with its many relatives, strikes me as interesting because it contains the word shit, which is almost always an “inherently negative” word, and yet it describes a desirable state. The expression in question: In theory, I have some expertise on the relevant issues, and yet I struggle just to keep my own shit together in this regard. (I should point out that you can only use these expressions in contexts, and with people, such that it would be acceptable to use obscenity. So, I would use this with my siblings and cousins, maybe or maybe not with my aunts, depending on which one, and most definitely not in front of my grandmother.) Unless otherwise stated, the examples here come from the OPUS2 English corpus, a collection of 19.7 billion words of English texts. I searched it through the Sketch Engine web site, purveyor of fine linguistic corpora and tools for searching them.

to have one’s shit together is the most basic of this surprisingly large family of expressions. In its most central sense, it means something like to be functioning in an efficient way. Here are some examples of how it’s used.

That’s because I have my shit together and I prioritize properly.
If you don’t have your shit together chances are it’s because you surround yourself with people who don’t have their shit together (Twitter) (Note: “chances are” means “probably.”)
I know I probably sound like I have my shit together, but really I feel confused inside.
And pretending I have my shit together when it comes to deadlines and paperwork is one of my specialties, a skill to which I probably owe every job I’ve ever had.
I thought, perhaps naively, that by almost a year along I would have my shit together – or at least have some sort of clue and I do not.
Turns out she’s sharp as a tack and really has her shit together. (Note: “to be sharp as a tack” means “to be quite intelligent.”)
And so perhaps this is why he doesn’t find himself attracted to his students, and instead finds himself attracted to Audrey’s silver hair and faintly lined face: these things signify a woman who has her shit together, who has moved on to the next level.

With that established: if to have one’s shit together means to be in a particular state–the state of having one’s shit together–to keep one’s shit together means to maintain that state. Some examples:

I am now one of the countless unemployed because I could not keep my shit together.
I’m trying to navigate the holiday season as a crafter, keep my shit together at work, plan the holidaze both for Thanksgiving at my mom’s house and what will surely be a painful Christmas Eve . . .if we could decide who’s host/essing it.
“As long as you keep improving.” I raised my eyebrows. “Is that your way of telling me to keep my shit together?” (Michelle Hodkin, The evolution of Mara Dyer)
Ruby Pelletier put her hands on her skinny hips, threw her head back, and bellowed laughter. “You think you can keep your shit together when twelve CB cowboys pull in all at once and order scrambled eggs, bacon, sausage, french toast, and flapjacks?” (Stephen King, The dead zone. A CB cowboy is a truck driver–very, very old slang, although not quite as old as I am. 1970s, I would say. French toast is pain perdu. Flapjacks are pancakes.)

There are several more of these odd expressions where shit means something positive–so many that if I tried to get them all into one post, I would be writing this for the next two weeks. Watch this space for more, as the spirit moves me–and don’t say this stuff in front of my grandmother.
