May 2019 – Zipf's Law

I make my living as an academic, which is to say: I write papers, I write grant proposals, I write papers, I write grant proposals, I write grant proposals, I write grant proposals, I write grant proposals… The good thing is, I like to write, and it matters little to me what I write–papers, grant proposals, grant proposals, or grant proposals.

It matters little to me: this is a somewhat literary way of saying “I don’t care.” I think the French equivalent would be peu me chaut.

The nature of doing research and writing proposals to get grant funding to do yet more research is that you’re constantly pushing yourself into what you don’t know. Consequently, you have a lot of questions–you don’t know how to use some computer programming technique that you’ll need to use in order to take the next step in your research; you don’t know how to apply some statistical test; you don’t understand the format of some kind of data that you need to use.

The best way that I know of to predict who will be successful in academia, versus who will fail, is this: people who are afraid to ask questions will fail; people who are not afraid to ask questions might succeed. (It’s a tough line of work.) Of course, this goes way beyond academia. I’ve had a long and bizarre career path that has taken me through professions as different as being an ambulance attendant and developing mapping software, and I would say that the same is true of any field in which I’ve worked: if you’re afraid to ask questions, you’re probably going to find yourself out of a job at some point.

One way to feel OK about asking questions in a professional context: (a) know that you know that some ways of asking questions are better than others, and (b) know that you know the best way. The best way to ask a question in a professional context is to tell the questionee (I just made that word up, on an analogy with questioner) what you have already done to try to find the answer yourself. Here’s an example of doing that. I was writing something about silver standard corpora, and I needed a good diagram to illustrate the process of building them.

A corpus (plural corpora) is a set of linguistic data with other data added to it that tells you something about the contents of that data. A “gold standard” corpus is one that has had its contents labelled by humans; a “silver standard” corpus is one that has had its contents labelled by multiple computer programs. The idea behind a silver standard corpus is that if multiple programs agree about something, then they’re probably correct, and so you keep the labels that come from multiple programs and throw out the ones that come from only a single program. Perfect? No–hence, “silver standard,” not “gold standard.”

I looked around for such a diagram, but I couldn’t find one. So, I wrote the following email to a colleague who is an expert on the topic:

Hi there, Zellig,

I hope you’re doing well and enjoying the spring! I wonder if you have a good diagram that illustrates the construction of a silver standard corpus? What I’m envisioning is a picture of a string of text with some kind of highlighting or underlining or something of what multiple systems label in it. I’ve looked at your two 2010 papers and the 2011 paper, but didn’t find a diagram of that sort. I found this diagram:

journal.pone.0116040.g002 — Picture source: Oellrich, Anika, Nigel Collier, Damian Smedley, and Tudor Groza. “Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.” PLoS one 10, no. 1 (2015): e0116040.

…in this paper:

Oellrich, Anika, Nigel Collier, Damian Smedley, and Tudor Groza. “Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.” PloS one 10, no. 1 (2015): e0116040.

…and it’s pretty much what I’m looking for, except that it doesn’t actually show very much in the way of overlaps between the systems. I also looked at Kang et al. (2012) and the Krallinger et al. (2015) paper on CHEMDNER–no luck.

So, I thought that maybe you have a nice figure in a PowerPoint slide or something, and thought that I would ask…

Zipf

Notice the format of the message. I started out with the question:

I wonder if you have a good diagram that illustrates the construction of a silver standard corpus? What I’m envisioning is a picture of a string of text with some kind of highlighting or underlining or something of what multiple systems label in it.

Then I went on to tell the recipient what I had already done to try to answer the question myself:

I’ve looked at your two 2010 papers and the 2011 paper, but didn’t find a diagram of that sort. I found [a diagram] in this paper:

Oellrich, Anika, Nigel Collier, Damian Smedley, and Tudor Groza. “Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.” PloS one 10, no. 1 (2015): e0116040.

…and it’s pretty much what I’m looking for, except that it doesn’t actually show very much in the way of overlaps between the systems. I also looked at Kang et al. (2012) and the Krallinger et al. (2015) paper on CHEMDNER–no luck.

Now the questionee knows that I looked in six places, two of which are papers that the questionee himself wrote, before bugging him. How could he possibly have a problem with being asked the question? And of course, the best thing about asking questions this way is that in the process of doing the things that you’ve done to try to answer them yourself, you often come up with the answer!

“Zellig” is not actually the poor questionee’s name. It’s a reference to Zellig Harris, a famous linguist of the mid-late 20th century. He was a pioneer in the development of the idea of sublanguages (which, as far as I know, he invented). Since I work on a number of biomedical sublanguages, I cite his work a lot. See this blog post for some French-language vocabulary related to the topic, from Sur la notion de sous-langage, by the French scholar Roland Dachelet.

	Anonymous on The many ways to spell “…
	Anonymous on Nightmare after nightmare: How…
	zipfslaw1 on Estimate your vocabulary …
	Anonymous on Estimate your vocabulary …
	Anonymous on Estimate your vocabulary …

Month: May 2019

The best way to ask a question at work