How to design the methods for a data science, machine learning, or natural language processing project: Part I

Yay—I have a data science/machine learning/natural language processing project! Now what do I do??

I occasionally use this blog to try out materials for something that I will be publishing.  This post is a casual version of something that will go into a book that I’m writing about…writing.

So, you’re going to do a data science project.  Maybe you’re going to use natural language processing (processing: using a computer program to do something; natural language: human language, as opposed to computer languages) to analyze social media data because you want to find out how veterans feel about the medical care that they receive through the Veterans’ Administration.  (Spoiler alert: a number of my buddies are vets, and they do indeed use the Veterans’ Administration health care system, and they both (a) are happy with it, and (b) recommend it to the rest of us.)  Maybe you’re doing it as a project for a course; maybe you’re doing it as your first assignment at your high-paying brand-new data scientist job; maybe you’re planning to write a research paper for a journal on military health care.  How do you go about doing it?


An excellent piece of advice when you’re trying to figure out how to do any research project: write out what you’re going to do, in prose, before you start doing it.  As my colleague Graciela Gonzalez, of the Health Language Processing Laboratory at the University of Pennsylvania School of Medicine, puts it:

Most of us make some mistakes in the process of thinking through how we will test our hypothesis.  The advantage of writing down what you’re going to do–the Methods section of a research paper, the design of your research project–before you do it is that when you see it on paper, spelled out explicitly and step by step, you will often notice the logical or procedural errors in what you were thinking, and then you won’t spend weeks making those errors before realizing that they were never going to get you where you wanted to go.

OK, so: you know that you’re going to write out your methods, very explicitly and in the order in which you will do them.  But, how do you figure out what those methods should be?


An efficient way to go about this is to read research papers by other people who have done similar things.  As you read them, you’re going to look for a general pattern–think of this as an example of the frameworks that we’ve talked about in other parts of this book.  Returning to our example of using natural language processing to analyze social media data, you might go to PubMed/MEDLINE, the National Library of Medicine’s database of 27 million biomedical research articles, and search for papers that mention either natural language processing or text mining, and also have the words social media in the title or abstract.  (Click here if you would like to see the set of 190+ papers that this search would find.)
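Writing the search itself down explicitly is a small exercise in spelling out your methods. Here is a minimal sketch in Python that builds the query string and a matching URL for NCBI's E-utilities search service. The wording of the query and the function names (build_pubmed_query, build_search_url) are my assumptions about one reasonable way to express the search described above, not necessarily the exact search behind the link.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities search endpoint (esearch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(method_terms, topic_phrase):
    """OR the method terms together, then require the topic phrase in the title or abstract."""
    methods = " OR ".join(f'"{term}"' for term in method_terms)
    return f'({methods}) AND "{topic_phrase}"[Title/Abstract]'


def build_search_url(query, max_results=20):
    """Return an esearch URL that asks PubMed for matching article IDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmax": max_results, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_pubmed_query(["natural language processing", "text mining"], "social media")
url = build_search_url(query)
print(query)
print(url)
```

The code only builds the query and the URL; it does not contact PubMed, so you can read the query over (and argue with it) before you run anything.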

The results of that search will include three papers that study problems similar to yours: they’re using natural language processing to find women talking about their pregnancy, people talking about adverse reactions to drugs, or people talking about abuse of prescription medications–not exactly what you need to do, but similar. You’ll see two steps that are carried out in all of them.  I’ve highlighted the points where they’re mentioned in the abstracts of the three papers:

METHODS: Our discovery of pregnant women relies on detecting pregnancy-indicating tweets (PITs), which are statements posted by pregnant women regarding their pregnancies. We used a set of 14 patterns to first detect potential PITs. We manually annotated a sample of 14,156 of the retrieved user posts to distinguish real PITs from false positives and trained a supervised classification system to detect real PITs. We optimized the classification system via cross validation, with features and settings targeted toward optimizing precision for the positive class. For users identified to be posting real PITs via automatic classification, our pipeline collected all their available past and future posts from which other information (eg, medication usage and fetal outcomes) may be mined.

Sarker, Abeed, Pramod Chandrashekar, Arjun Magge, Haitao Cai, Ari Klein, and Graciela Gonzalez. “Discovering cohorts of pregnant women from social media for safety surveillance and analysis.” Journal of medical Internet research 19, no. 10 (2017): e361.

METHODS: One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies.

Sarker, Abeed, and Graciela Gonzalez. “Portable automatic text classification for adverse drug reaction detection via multi-corpus training.” Journal of biomedical informatics 53 (2015): 196-207.

METHODS: We collected Twitter user posts (tweets) associated with three commonly abused medications (Adderall(®), oxycodone, and quetiapine). We manually annotated 6400 tweets mentioning these three medications and a control medication (metformin) that is not the subject of abuse due to its mechanism of action. We performed quantitative and qualitative analyses of the annotated data to determine whether posts on Twitter contain signals of prescription medication abuse. Finally, we designed an automatic supervised classification technique to distinguish posts containing signals of medication abuse from those that do not and assessed the utility of Twitter in investigating patterns of abuse over time.

Weissenbacher, Davy, Abeed Sarker, Tasnia Tahsin, Matthew Scotch, and Graciela Gonzalez. “Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods.” AMIA Summits on Translational Science Proceedings 2017 (2017): 114.

Now we can abstract out the two steps that we found in all three papers:

  1. The authors built a data set.
  2. The authors used a technique called classification–a form of machine learning–to differentiate between the social media posts that did and did not talk about a person’s own pregnancy, or an adverse reaction to a medication, or abuse of prescription medications.

So, now you have a basic outline of your methodology.  Your goal being to use natural language processing to investigate, using social media data, how veterans feel about the care that they receive through the Veterans’ Administration health care system, maybe your methodology will look like this:

  1. Create a data set containing tweets in which veterans are talking about how they feel about the care that they receive in the VA health care system.
  2. Use machine learning to classify those tweets into ones where the vets feel (a) positive, (b) negative, or (c) neutral about that care.

OK, so: now you can expand that.  You’re quickly going to realize that Step 2–classifying those tweets–is actually going to require you to be able to do three classifications:

  1. You have to be able to differentiate tweets written by veterans from tweets written by everybody else.
  2. You have to be able to differentiate tweets where the vets are talking about the VA health care system from where they’re talking about things other than the VA health care system.
  3. You have to be able to classify whether the feelings that they express about the VA health care system are positive, negative, or neutral.
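One way to see why this gets expensive is to write the three classifications as a chain, where each step only sees what the previous step let through. The sketch below is just that chain. The three classifier functions are hypothetical stand-ins (crude keyword rules that I made up) for models that you would actually have to train, each on its own labelled data.

```python
from typing import Optional


def words(tweet: str) -> set:
    """Lowercase the tweet and split it on whitespace (no real tokenization)."""
    return set(tweet.lower().split())


# Hypothetical stand-ins for three trained classifiers. The keyword lists
# are invented for illustration only.
def is_by_veteran(tweet: str) -> bool:
    return bool(words(tweet) & {"veteran", "vet"})


def is_about_va_care(tweet: str) -> bool:
    return "va" in words(tweet)


def sentiment(tweet: str) -> str:
    tokens = words(tweet)
    if tokens & {"great", "happy"}:
        return "positive"
    if tokens & {"awful", "terrible"}:
        return "negative"
    return "neutral"


def classify(tweet: str) -> Optional[str]:
    """Run the three steps in order; return None as soon as a tweet is filtered out."""
    if not is_by_veteran(tweet):
        return None
    if not is_about_va_care(tweet):
        return None
    return sentiment(tweet)


print(classify("As a vet I am happy with my VA doctor"))
print(classify("Mario Brothers still nothin like it"))
```

Every one of those three functions needs its own labelled examples, which is exactly where the time goes.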

Now that you’ve started to flesh out your methodology, you realize something: creating that data set is going to take a really long time, since you essentially have to be able to label three different kinds of things in the social media posts.  You have a finite amount of time and resources with which to do it, so how are you going to make that possible?

Faced with an enormous amount of work to accomplish with limited time and resources, the most sane approach is this: go to your supervisor, show them your detailed methods plan, and let them come to the conclusion that they had better either (a) give you a lot more resources, or (b) modify your assignment.  Having gone through this multiple times over the course of my career, I can tell you that (b) is a hell of a lot more likely.  What is the modified assignment going to look like?  It’s probably going to be a reduction of the task to “just” the task of detecting tweets that were and weren’t written by veterans.  Now you can go back to your outline, and modify it:

  1. Create a data set containing tweets written by veterans, and tweets written by anybody else.
  2. Use machine learning to classify those tweets into the ones that were written by veterans, and the ones that weren’t.
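To make Step 2 concrete, here is a minimal sketch of supervised classification in plain Python: a bag-of-words Naive Bayes classifier with add-one smoothing. Naive Bayes is my choice for the illustration because it fits in a few lines; it is not the method used in the three papers above. The four training tweets are invented, and a real data set would need thousands of labelled examples.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real project would do much more."""
    return text.lower().split()


def train(examples):
    """examples: list of (text, label) pairs. Returns the counts Naive Bayes needs."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability under the model."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # skip words never seen in training
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only.
training_data = [
    ("served on the uss biddle back in 1980", "veteran"),
    ("got out of the navy in 1982 and miss my shipmates", "veteran"),
    ("mario brothers still nothing like it", "other"),
    ("raffle drawing today win that lawnmower", "other"),
]
model = train(training_data)
print(predict(model, "miss the uss biddle"))
print(predict(model, "win that raffle"))
```

Notice that the classifier can only be as good as the labels in Step 1, which is why the data set comes first in the outline.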

This is going to be hard enough, believe me.  Here are some examples of what those tweets might look like–I made them up, but they’re totally plausible:

  1. HM1 Zipf here, USS Biddle 1980-1982–BT3 Raven McDavid, you out there?
  2. AFOSC raffle drawing at 1500–win that lawnmower and help us buy books for the squadron?
  3. FTN today, FTN tomorrow, FTN and fuck Chief Chomsky til I get out this motherfucker
  4. Mario Brothers, still nothin like it, bitchboys

Have you figured it out?  Here are the answers:

  1. Clearly written by a veteran.
  2. Almost certainly written by the spouse of an active duty Air Force officer, so not written by a veteran.
  3. Clearly written by a sailor who is still on active duty, so not written by a veteran.
  4. No clue who it was written by, and/but there’s no reason whatsoever to think that it was written by a veteran, so it should be classified as not written by a veteran.

What’s that you say?  It wasn’t clear to you at all?  Think about this: if it wasn’t clear to you, it’s certainly not going to be clear to a computer program, so your classification step is going to be difficult.  In fact, if it’s not clear to you, you’re going to have a hell of a difficult time building the data set–time to go back to your supervisor and ask for the resources to hire some veterans to help you out!

…and (4) raises a super-difficult question: what the hell counts as a reasonable experimental control for this research project?  (Spoiler: I don’t know, and I have a doctoral degree in this particular topic.)


All of this to say:

  1. Your redefined project is going to be plenty hard, thank you very much.
  2. You wouldn’t know how crucial it was to redefine said project if you hadn’t started the process of writing out what exactly you’re going to do.

…and hell–you hadn’t even gotten to the “exactly” part yet!  So: take Graciela’s point seriously, and write some things down before you start doing anything else.

…and now you can think about what you’re going to measure to figure out whether or not you were successful in doing what you were trying to do.
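One common choice of what to measure for a classification task, and I'm assuming here that it would suit this project, is precision, recall, and their harmonic mean, F1 (the first abstract above mentions optimizing precision for the positive class). Here is a minimal sketch in Python. The gold labels are the answers that I gave for the four example tweets; the predicted labels are hypothetical output from an imaginary classifier.

```python
def precision_recall_f1(gold, predicted, positive="veteran"):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Gold labels for the four example tweets: only the first was written by a veteran.
gold = ["veteran", "other", "other", "other"]
# Hypothetical classifier output: it wrongly calls the second tweet "veteran".
predicted = ["veteran", "veteran", "other", "other"]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

Here the imaginary classifier finds the one veteran (recall of 1.0) but is right only half of the time when it says "veteran" (precision of 0.5).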


Linguistic geekery: Raven McDavid was a dialectologist back in the day.  He is said to be the inspiration for the Harrison Ford character in Raiders of the Lost Ark.  Chomsky is Noam Chomsky, the most important (although not the best, in my humble opinion) linguist of the 20th century.  Where they appear in the post:

  1. HM1 Zipf here, USS Biddle 1980-1982–BT3 Raven McDavid, you out there?
  2. AFOSC raffle drawing at 1500–win that lawnmower and help us buy books for the squadron?
  3. FTN today, FTN tomorrow, FTN and fuck Chief Chomsky til I get out this motherfucker
  4. Mario Brothers, still nothin like it, bitchboys

We who smell of anchovy pizza salute you

I thought how horrible it would be to kiss someone whose beard smelled of anchovy pizza.

The adventure of the moment has me in Philadelphia, one of the great eastern cities of the Colonial era, where I have the privilege of spending the summer as a guest of the Health Language Processing Laboratory of the University of Pennsylvania School of Medicine.  I’m living in a dormitory on the edge of the campus, which is a beautiful part of town–a beautiful part of town that borders what we call in French a quartier sensible.  In plain English: a shitty neighborhood.  The grocery store that I go to is right on the border between the two, so the clientele is a mix of two very different populations: hip young college students and faculty, and the kinds of people who live in shitty neighborhoods.  (Word to the wise: don’t go past 45th Street.)

So, yesterday I’m standing in line in said grocery store.  In front of me is a smart-sounding young woman.  Being an American, she’s talking on her cell phone.  Being an American, she’s talking loudly enough for everyone else to hear.  Being American, we’re all listening.  (In France, we do not have loud conversations on the phone in public–rude, rude, rude.)  She’s talking about what medical specialty she’ll be going into–it’s clear which side of the border she’s from.

In the line next to us is an old dude who clearly is from the other side of the border.  He stands in line with his hands behind his back.  Being an American, he’s fat.  Ratty jeans, and his nasty t-shirt reveals a jailhouse tattoo on his arm that says FTW, which for those of us of a certain age stands for Fuck The World.  His beard looks like it probably smells of anchovy pizza.

The smart-sounding young lady is talking about what medical specialty she’ll be going into.  Funny thing is, she’s not just talking about it–she’s apologizing for it.  Mom, dermatologists help people, too.  I mean, at least in small ways, we can affect their lives a little bit.  …No, it’s not like saving lives all day in the emergency room, but we can make a little bit of a difference for people. 


So, I’m standing there in line, and I notice something: my favorite soap is on sale for a dollar off.  Computational linguistics is not nearly as remunerative as one might think, and that dollar off could be translated into the luxury of a cup of coffee, so as the smart-sounding young lady says her goodbyes and hangs up, I think: I’m getting out of this line and I’m grabbing a bottle of soap, and the wait be damned.  Then the nasty-looking old dude with his hands behind his back says something, and I think: hearing this one out is going to be worth spending an extra dollar for a bottle of soap–stay right where you are.

  • “Are you a doctor?  You sound like a doctor.”
  • Well…I’m an intern.  I just finished medical school.
  • “If my grandma has a melanoma, is that a bad thing?”
  • She should definitely go to a doctor.  It can be fatal.
  • “Like, a skin doctor?”
  • Yes, a dermatologist.
  • “A…dermalologist?”
  • Yes, a dermatologist.  Melanoma can be fatal, but if it’s caught in time, it could save her life.
  • “So, a skin doctor could save someone’s life?  Who woulda thought?”  

For the first time, she looked at him, right in the face.  I’m guessing that she was thinking something like I was thinking: nobody as ancient as this old fuck has a living grandmother.  What the hell?

“I sure do love my grandma,” he said.  And he smiled.

She looked at him for a while.  You could see the wheels turning.  And then she smiled, too.  Thank you, she said.

She paid for her groceries, and left.  The old fuck turned to put his groceries on the counter, and I saw what was in those hands behind his back: a book about existentialist perspectives on psychoanalysis.  I looked at his FTW tattoo.  I thought how horrible it would be to kiss someone whose beard smelled of anchovy pizza.  I bet his dead grandma didn’t mind, though.


to say something in plain English: to say something clearly, with no big words or complicated sentences.  (Phil dAnge: is there a French equivalent?) Examples:

 

 

funny thing is: …means that you’re about to describe something that is strange about a situation.  Could also be weird thing is, strange thing is, and you could also put the in front of it.  Examples:

 

 

  • How I used it in the post: Funny thing is, she’s not just talking about it–she’s apologizing for it.

remunerative: adjective that means that something pays a good salary.  How I used it in the post:

  • Computational linguistics is not nearly as remunerative as one might think, and that dollar off could be translated into the luxury of a cup of coffee, so…

(dollar) off: “Off” here means the amount of a reduction in price.  Examples:

 

 

to hear something out: to listen to the end of a discourse of some kind–an idea, an explanation.  Similar construction: to hear someone out, which means to listen until they have said all that they have to say. Examples:

 

 

  • How I used it in the post: Then the nasty-looking old dude with his hands behind his back says something, and I think: hearing this one out is going to be worth spending an extra dollar for a bottle of soap–stay right where you are.

woulda: in informal spoken American English, “would have.”

sure do + verb: an emphatic construction.  Has a very rural flavor.

Looks like [it] smells of anchovy pizza is a marvelously evocative description, and I wish I could say that I came up with it myself, but I didn’t: it’s from an old Berke Breathed cartoon, where Opus the penguin uses it to describe Bruce Springsteen.

Why doing the laundry makes me happy

Doing the laundry will make you happy if you spend sufficient time contemplating the zombie apocalypse.

What will suck about the zombie apocalypse is….well, everything, really. For example: when the zombie apocalypse comes, most people will be completely filthy most of the time. For a while, you’ll at least be able to scavenge clean clothes–you won’t have many opportunities to bathe, but let’s face it: Old Navy will not be the first store to be looted. Eventually the clean clothes will all be gone. Eventually the day will come when you’ll strip a coat off of a reeking zombie whose head you’ve just smashed like a watermelon and be happy that you have something to keep yourself warm.

Today I woke up at 5:30–late for me–and headed down to the basement laundry room. Then I went to work–in clean underwear, clean jeans, and a clean t-shirt from the 2007 Association for Computational Linguistics meeting in Prague. (I learned to say gde je stan’ce metra–where is the subway station–which was undeniably useful. I also learned to ask questions about the National Theater, which amused the taxi drivers but did not accomplish much else.)

When you compare it with how bad life is going to suck during the zombie apocalypse, doing the laundry was actually pretty fun. Going to work in clean clothes was a pleasure, as it is every day, and it always will be if you spend sufficient time contemplating the zombie apocalypse.  There’s a reason I’m the happiest person you know. Hell, I’m the happiest person you don’t know.  Think about it.


English notes

In American English, “like a watermelon” is a common simile for describing actions of crushing, smashing, and the like.  Some examples:

 

 

 

 

How I used it in the post: The day will come when you’ll strip a coat off of a reeking zombie whose head you’ve just smashed like a watermelon and be happy that you have something to keep yourself warm.


Language geekery: similes versus analogy

Simile and analogy are similar (is that a pun? if so, it’s not a very sophisticated one), but they’re not quite the same.  Analogy starts with focusing on similarity between unlike items, and then typically is followed by pointing out the differences between them.  In contrast, simile does not require any actual similarity between the unlike items, and does not include pointing out the differences.

Thus, the heuristic Detached roles is like a Hearst & Schütze super-category, but not constructed on a statistical metric, rather on underlying semantic components. (Source: Litkowski, Kenneth C. “Desiderata for tagging with WordNet synsets or MCCA categories.” Tagging Text with Lexical Semantics: Why, What, and How? (1997).)

A recursive transition network (RTN) is like a finite-state automaton, but its input symbols may be RTNs or terminal symbols. (Source: Goldberg, Jeff, and László Kálmán. “The first BUG report.” In COLING 1992 Volume 3: The 15th International Conference on Computational Linguistics, vol. 3. 1992.)

Therefore, a conversation is like a construction made of LEGO™ blocks, where you can put a block of a certain type at a few places only.  (Source: Rousseau, Daniel, Guy Lapalme, and Bernard Moulin. “A Model of Speech Act Planner Adapted to Multiagent Universes.” Intentionality and Structure in Discourse Relations (1993).) Note that a native speaker probably would have put this somewhat differently.  Where the authors say “where,” a native speaker might have said “where you can only put a block of a specific type at a few places,” or more likely, “except that you can put a block of a specific type only in specific places.”

Given all of that: is this an analogy, or a simile? The day will come when you’ll strip a coat off of a reeking zombie whose head you’ve just smashed like a watermelon and be happy that you have something to keep yourself warm.  Scroll down past the gratuitous Lisa Leblanc video for the answer.

I sometimes use this blog to try out materials for something that I will be publishing.  This brief description of how to use analogy is intended for a book about writing about data science.  I would love to know what parts of it are not clear.  (My grandmother will tell me how great it is, so no need for you to bother with that.)

Answer: it’s a simile.  Note that we’re not asserting any difference between the way that you’re going to smash the zombie’s head and the way that you would smash a watermelon: a reeking zombie whose head you’ve just smashed like a watermelon.  Note also that we are not then contrasting the way that you’re going to smash the zombie’s head and the way that you would smash a watermelon.  Simile, not analogy.

 

 

 

Yes, please–do volunteer to be a reviewer

Yes, you CAN volunteer to be a peer reviewer!

Get any two researchers together in a bar at the end of a day at any randomly chosen conference.  They will get around to complaining about the difficulty of getting grant funding these days, service responsibilities in their institution, and how grad students don’t want to work as hard as we did back in the day.  But, before that, they will complain about the real pain point of academic work: reviewing.  (See the English notes below for an explanation of the expression “pain point.”)

“Peer review” is the process by which academic writing is considered for publication.  The mechanics of it are this:

  1. An author submits an article to a journal or conference.
  2. An “associate editor” at the journal or an “area chair” at the conference finds reviewers who are willing to read and comment on the paper–your “peers.”
  3. The reviewers read the paper, write up detailed comments on it, and make a suggestion regarding acceptance.
  4. The associate editor or area chair makes a decision about the paper.

That decision in step 4 can take a number of forms, including outright acceptance (rare), rejection (not rare), and giving the author the option of making changes in response to the reviewers’ comments and resubmitting the paper, in which case steps 3 and 4 repeat.  (They can repeat multiple times, too.)

At step 2, the associate editor or area chair needs to find three reviewers in the typical case–rarely fewer, and sometimes more.  (I once submitted a paper to a journal for which I am the deputy editor-in-chief, and the editor who handled it had it reviewed by SIX reviewers–the most I have ever seen.  To avoid the appearance of a conflict of interest, that made sense.)

Three reviewers per submission, and the big conferences in my area (computational linguistics) typically get between 1,000 and 2,500 submissions–that’s 3,000 to 7,500 reviews per conference.  There are several big conferences in my area–assume five per year, and that’s 15,000 to 37,500 reviews that need to get written per year.  And that’s just the conferences–journal publications are appearing faster than ever before in history, which is in itself not a surprise–most things are happening faster than ever before in history–but the publication rate has been growing exponentially, and if you’ve been reading about Zipf’s Law for a while, you know that that’s fast.   Journal submissions take quite a bit more time to review than conference papers, too–a conference paper in my field is typically limited to 8 pages, but most journals in my field no longer have page limits at all.
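If you like to check this kind of back-of-the-envelope arithmetic, here is a minimal sketch in Python. The numbers are the ones from the paragraph above, including the assumption of five big conferences per year; the function name is mine.

```python
REVIEWERS_PER_PAPER = 3   # the typical case
CONFERENCES_PER_YEAR = 5  # an assumption, as in the text


def reviews_needed(submissions, reviewers=REVIEWERS_PER_PAPER):
    """Number of reviews that one conference needs for a given number of submissions."""
    return submissions * reviewers


for submissions in (1000, 2500):
    per_conference = reviews_needed(submissions)
    per_year = per_conference * CONFERENCES_PER_YEAR
    print(f"{submissions} submissions: {per_conference} reviews per conference, "
          f"{per_year} per year")
```

Change either constant and you can see how quickly the workload moves.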

Just for grins, here are the page counts on my 5 most recent journal articles: 15, 8, 14, 24, and 12.  The 8-pager was in a journal with a page limit–of 7 pages!  We paid an extra-page fee.

Who writes those peer reviews?  Well…your peers.  You write your share of those 15,000 to 37,500 reviews, and the authors of those 5,000 to 12,500 papers write reviews of your papers, and… Well, it’s a huge workload.  How huge, exactly?  It’s hard to say what an average would be, but I have reviewed a couple hundred papers over the course of the past couple of years.  Is that typical?  Probably.  And the conference papers come in bursts–conferences are deadline-driven, so all of the 1,000 to 2,500 submissions to an individual conference are being reviewed at once.  A reviewer for a conference in my field is typically assigned 5 papers.  Of course, there is a limited set of time slots when conferences can happen–they mostly take place during breaks in the academic year, so either during the summer, or around the end-of-year holidays.  That means that their submission deadlines tend to cluster together, so you are probably reviewing for multiple conferences in the same time period.  How many?  I’ve written 14 in the past two weeks.  I may actually have spent more time reviewing other people’s papers than working on my current grant proposal–and it’s the grant proposals that bring in my salary.  Could I say no to review requests?  Of course.  But, it would not be fair to do so–while I’m reviewing those papers, someone else is reviewing mine.

….All of this en préambule to the answer to a question that I don’t get asked often enough: can you volunteer to be a reviewer?  The answer: yes.  Here’s a good example of a request that I got recently:

Dear Dr. Zipf:

I am a Ph.D. student at university name removed, majoring in computer science, under the supervision of advisor name removed. My main research fields are bioinformatics, deep learning, machine learning and  artificial intelligence.
I have done some researches in bimolecular function prediction, Nanopore sequencing, fluorescence microscope super resolution, MD simulation, sequence analysis, graph embedding and catastrophic forgetting, which were published in journals, such as PNAS, NAR and Bioinformatics, and conferences, such as ISMB, ECCB and AAAI. Attached please find my complete CV about my background.

I am very interested in serving the community and acting as a reviewer for the manuscripts which are related to my background. I know you are serving as an associate editor for a number of journals, such as BMC Bioinformatics. If you encounter some manuscripts which are highly related to my background, feel free to refer me as a reviewer.

Thank you very much for your consideration! Have a nice day!

My response:

Hi, name removed,

Thank you for writing–it is always nice to see a volunteer for reviewing!  However, I only handle articles on natural language processing, which seems outside of your areas of expertise.  I would recommend that you send your CV, and a similar email, to associate editors who specialize in your areas.  Your advisor could suggest some, and you could also look at the editorial board of relevant journals, especially ones in which you have published.
Thank you again for volunteering, and keep looking for opportunities–I am pretty sure that you will find them!
Best wishes,
Beauregard Zipf
Response to THAT:

Dear Dr. Zipf:

OK! Thank you very much for the clarification and the instruction! Have a nice day!

Notice what you do not see in this exchange: what people are afraid of, which is a response saying something along the lines of “who the hell do you think you are to dare to propose yourself as a reviewer?”  Of the 200 emails that I probably plowed through that day, this offer might have been the only message that actually brought me a little joy–even though I couldn’t use this particular reviewer, I’m certain that someone else will.  Yes: you can volunteer to be a peer reviewer!


French notes

en préambule (à): as a preamble, en guise d’introduction.
la relecture par les pairs: peer review.  WordReference.com also gives l’évaluation par les pairs and l’inter-évaluation, but I’ve never actually heard that last one.  Native speakers??
Want to read a French-language blog post about peer review in computational linguistics?  Here’s one by my colleagues Karën Fort and Aurélie Névéol.

English notes

pain point: a marketing term referring to the problem that a salesman is going to try to solve for you by selling you his product.  How I used it in the post: Before that, they will complain about the real pain point of academic work: reviewing.

The best way to ask a question at work

The best way to predict who will be successful, versus who will fail: people who are afraid to ask questions will fail; people who are not afraid to ask questions MIGHT succeed. 

I make my living as an academic, which is to say: I write papers, I write grant proposals, I write papers, I write grant proposals, I write grant proposals, I write grant proposals, I write grant proposals… The good thing is, I like to write, and it matters little to me what I write–papers, grant proposals, grant proposals, or grant proposals.

It matters little to me: this is a somewhat literary way of saying “I don’t care.”  I think the French equivalent would be peu me chaut.

The nature of doing research and writing proposals to get grant funding to do yet more research is that you’re constantly pushing yourself into what you don’t know.  Consequently, you have a lot of questions–you don’t know how to use some computer programming technique that you’ll need to use in order to take the next step in your research; you don’t know how to apply some statistical test; you don’t understand the format of some kind of data that you need to use.

The best way that I know of to predict who will be successful in academia, versus who will fail, is this: people who are afraid to ask questions will fail; people who are not afraid to ask questions might succeed.  (It’s a tough line of work.)  Of course, this goes way beyond academia.  I’ve had a long and bizarre career path that has taken me through professions as different as being an ambulance attendant and developing mapping software, and I would say that the same is true of any field in which I’ve worked: if you’re afraid to ask questions, you’re probably going to find yourself out of a job at some point.

One way to feel OK about asking questions in a professional context: (a) know that you know that some ways of asking questions are better than others, and (b) know that you know the best way.  The best way to ask a question in a professional context is to tell the questionee (I just made that word up, on an analogy with questioner) what you have already done to try to find the answer yourself.  Here’s an example of doing that.  I was writing something about silver standard corpora, and I needed a good diagram to illustrate the process of building them.

A corpus (plural corpora) is a set of linguistic data with other data added to it that tells you something about the contents of that data.  A “gold standard” corpus is one that has had its contents labelled by humans; a “silver standard” corpus is one that has had its contents labelled by multiple computer programs.  The idea behind a silver standard corpus is that if multiple programs agree about something, then they’re probably correct, and so you keep the labels that come from multiple programs and throw out the ones that come from only a single program.  Perfect?  No–hence, “silver standard,” not “gold standard.”
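The keep-what-multiple-programs-agree-on idea can be sketched in a few lines of Python. This is a minimal illustration, assuming that each program's output is a list of (start, end, label) tuples and that two programs "agree" only when their tuples match exactly; real silver standard projects have to decide what to do about partial overlaps, which this sketch ignores. The three systems and their outputs are made up.

```python
from collections import Counter


def silver_standard(system_outputs, min_votes=2):
    """Keep each annotation proposed by at least min_votes systems."""
    votes = Counter()
    for annotations in system_outputs:
        votes.update(set(annotations))  # each system gets one vote per annotation
    return {annotation for annotation, n in votes.items() if n >= min_votes}


text = "aspirin caused a rash"

# Made-up output from three hypothetical systems: (start, end, label).
system_a = [(0, 7, "drug"), (17, 21, "symptom")]
system_b = [(0, 7, "drug")]
system_c = [(0, 7, "drug"), (8, 14, "symptom")]

silver = silver_standard([system_a, system_b, system_c])
for start, end, label in sorted(silver):
    print(text[start:end], label)
```

All three systems label "aspirin" as a drug, so that label is kept; "rash" and "caused" were each labelled by only a single system, so they are thrown out, even though one of those labels was probably right. Perfect?  No.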

I looked around for such a diagram, but I couldn’t find one.  So, I wrote the following email to a colleague who is an expert on the topic:

Hi there, Zellig,

I hope you’re doing well and enjoying the spring!  I wonder if you have a good diagram that illustrates the construction of a silver standard corpus?  What I’m envisioning is a picture of a string of text with some kind of highlighting or underlining or something of what multiple systems label in it.  I’ve looked at your two 2010 papers and the 2011 paper, but didn’t find a diagram of that sort.  I found this diagram:
[Figure: journal.pone.0116040.g002]
Picture source: Oellrich, Anika, Nigel Collier, Damian Smedley, and Tudor Groza. “Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.” PLoS ONE 10, no. 1 (2015): e0116040.
…in this paper:
Oellrich, Anika, Nigel Collier, Damian Smedley, and Tudor Groza. “Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.” PLoS ONE 10, no. 1 (2015): e0116040.
…and it’s pretty much what I’m looking for, except that it doesn’t actually show very much in the way of overlaps between the systems.  I also looked at Kang et al. (2012) and the Krallinger et al. (2015) paper on CHEMDNER–no luck.
So, I thought that maybe you have a nice figure in a PowerPoint slide or something, and thought that I would ask…
Zipf

Notice the format of the message.  I started out with the question:

I wonder if you have a good diagram that illustrates the construction of a silver standard corpus?  What I’m envisioning is a picture of a string of text with some kind of highlighting or underlining or something of what multiple systems label in it.

Then I went on to tell the recipient what I had already done to try to answer the question myself:

I’ve looked at your two 2010 papers and the 2011 paper, but didn’t find a diagram of that sort. I found [a diagram] in this paper:

Oellrich, Anika, Nigel Collier, Damian Smedley, and Tudor Groza. “Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.” PLoS ONE 10, no. 1 (2015): e0116040.
…and it’s pretty much what I’m looking for, except that it doesn’t actually show very much in the way of overlaps between the systems.  I also looked at Kang et al. (2012) and the Krallinger et al. (2015) paper on CHEMDNER–no luck.

Now the questionee knows that I looked in six places, two of which are papers that the questionee himself wrote, before bugging him.  How could he possibly have a problem with being asked the question?  And of course, the best thing about asking questions this way is that in the process of doing the things that you’ve done to try to answer them yourself, you often come up with the answer!

“Zellig” is not actually the poor questionee’s name.  It’s a reference to Zellig Harris, a famous linguist of the mid-late 20th century.  He was a pioneer in the development of the idea of sublanguages (which, as far as I know, he invented).  Since I work on a number of biomedical sublanguages, I cite his work a lot.  See this blog post for some French-language vocabulary related to the topic, from Sur la notion de sous-langage, by the French scholar Roland Dachelet.

 

Prévert’s “Le balayeur:” French, Hungarian, and a bit of English

It’s taken me a long time to understand the idea of “the impossibility of translation.”  Jacques Prévert has given me some insight into what that might mean.

Prévert was a poet, playwright, and screenwriter who came to prominence in the post-war period.  (The odd word playwright is discussed in the English notes below.)  More than most poets I’ve run into in any language, he plays with the sounds of words.  For example, there is his poem Le temps haletant, “The panting time,” whose title sounds like le temps a le temps, “time has time.”  Or take one of my favorites, Il ne faut pas…, which starts Il ne faut pas laisser les intellectuels jouer avec les allumettes–“you must not let intellectuals play with matches”–and ends le monde mental ment monumentalement.  Notice all of the strings of ment, which is the 3rd person singular present tense of the verb to lie:

Le monde mental ment monumentalement.

Translatable in a way that doesn’t lose how wonderful that line is?  I think not.


Today’s National Poetry Month treat is his poem Le balayeur, which I have only found on a page with a translation into Hungarian–why not… Here’s the stanza that got me thinking about “the impossibility of translation.”  To establish the context: an angel is trying to convince a street sweeper to jump into the Seine to save someone who’s drowning.  The sweeper (le balayeur) eventually concedes:

Finalement
le balayeur enlève sa veste
puisqu’il ne peut faire autrement
Et comme c’est un très bon nageur
grimpe sur le parapet
et exécute un merveilleux « saut de l’ange »
et disparaît
Et l’ange
littéralement « aux anges »
louange le Seigneur

In the end
the sweeper takes off his jacket
because he can’t do otherwise
And because he’s a very good swimmer
climbs onto the parapet
and executes an excellent swan dive
and disappears
And the angel
literally beside himself with joy
praises the Lord

Here’s what makes that impossible to translate well: we’re talking about an angel here, l’ange, right?  And that whole part of the poem is full of expressions that are built on the noun “angel:”

Finalement
le balayeur enlève sa veste
puisqu’il ne peut faire autrement
Et comme c’est un très bon nageur
grimpe sur le parapet
et exécute un merveilleux « saut de l’ange »
et disparaît
Et l’ange
littéralement « aux anges »
louange le Seigneur

Here are the relevant French/English correspondences:

  • le saut de l’ange: literally “angel’s jump,” but in English, “swan dive”
  • aux anges: literally something like “with the angels,” which in English is an extremely euphemistic way of saying “dead;” meanwhile, the English equivalent of aux anges would be “over the moon; beside oneself; beside oneself with joy.”

The thing that I find really clever, though, is the last line of this part of the poem, where the angel praises the Lord with a word, louange, that itself contains ange:

louange le Seigneur

Sigh–Prévert is wonderful…


Here’s the poem, followed by its translation into Hungarian by Justus Pál, followed by some notes on the English in this post:

Le Balayeur

Au bord d’un fleuve
le balayeur balaye
il s’ennuie un peu
il regarde le soleil
il est amoureux
Un couple enlacé passe
il le suit des yeux
Le couple disparaît
il s’assoit
sur une grosse pierre
Mais soudain la musique
l’air du temps
qui était doux et charmant
devient grinçant
et menaçant

Apparaît alors
l’Ange gardien du balayeur
qui d’un très simple geste
lui fait honte de sa paresse
et lui conseille de reprendre le labeur

L’Ange gardien plante l’index vers le ciel
et disparaît
Le balayeur reprend son balai

Une jolie femme arrive
et s’accoude au parapet
regarde le fleuve
Elle est de dos
et très belle ainsi
Le Balayeur sans faire de bruit
s’accoude à côté d’elle
et d’une main timide et chaleureuse
la caresse
ou plutôt fait seulement semblant
mimant le geste de l’homme qui tout à l’heure
caressait son amie en marchant

La femme s’en va sans le voir
Il reste seul avec son balai
et soudain constate
que l’Ange est revenu
et l’a vu
et le blâme
d’un regard douloureux
et d’un geste de plus en plus affectueux
et de plus en plus menaçant

Le balayeur reprend son balai
et balaye
L’Ange gardien disparaît

Une autre femme passe
Il s’arrête de balayer
et d’un geste qui en dit long
lui parle de la pluie et du beau temps
et de sa beauté à elle
tout particulièrement

L’Ange apparaît
La femme s’enfuit épouvantée

L’Ange une nouvelle fois
fait comprendre au balayeur
qu’il est là pour balayer
puis disparaît

Le balayeur reprend son balai

Soudain des cris
des plaintes
venant du fleuve
Sans aucun doute
les plaintes de quelqu’un qui se noie

Le balayeur abandonne son balai
Mais soudain hausse les épaules et
indifférent aux cris venant du fleuve
continue de balayer

L’Ange gardien apparaît
Et le balayeur balaye
comme il n’a jamais balayé
Travail exemplaire et soigné

Mais l’Ange toujours l’index au ciel
remue des ailes courroucées
et fait comprendre au balayeur
que c’est très beau bien sûr
de balayer
mais que tout de même
il y a quelqu’un
qui est peut-être en train de se noyer
Et il insiste
le balayeur faisant la sourde oreille

Finalement
le balayeur enlève sa veste
puisqu’il ne peut faire autrement
Et comme c’est un très bon nageur
grimpe sur le parapet
et exécute un merveilleux « saut de l’ange »
et disparaît
Et l’ange
littéralement « aux anges »
louange le Seigneur
La musique est une musique
indéniablement céleste
Soudain
le balayeur revient
tenant dans ses bras
l’être qu’il a sauvé

C’est une fille très belle
Et dévêtue

L’Ange la toise d’un mauvais œil
Le balayeur
la couche sur un banc
avec une infinie délicatesse
et la soigne
la ranime
la caresse

L’Ange intervient
et donne au balayeur
le conseil de rejeter dans le fleuve
cette « diablesse »

La « diablesse » qui reprend goût à la vie
grâce aux caresses du balayeur
se lève
et sourit

Le balayeur sourit aussi
Ils dansent tous deux

L’Ange les menace des foudres du ciel

Ils éclatent de rire
s’embrassent
et s’en vont en dansant

L’Ange gardien essuie une larme
ramasse le balai
et balaye… balaye… balaye… balaye…
in-exo-ra-ble-ment.

Az utcaseprő (Balett) (Hungarian)

A folyó partján
seper az utcaseprő
unatkozik tán
felnéz a napra
kicsit szerelmes ő
Arra megy egy ölelkező
szerelmespár rájuk tapad a szeme
A pár eltűnik
leül
egy kőre ő
Most a zene
az idő dallama
mely eddig szép volt és szelíd
hirtelen megkeményedik
csikorgó lesz és fenyegető

Ekkor megjelenik
az utcaseprő őrangyala
nagyon egyszerű mozdulat
reápirít a lustaság miatt
s azt ajánlja jó lesz munkához látnia

Így inti ég fele emelt ujjal
majd eltűnik az angyal
Söprűt ragad az utcaseprő

Jön egy remek nő
a párkányra könyököl
a folyóba néz
Háttal fordul felé
így is nagyon szép
Az utcaseprő nesztelenül odalép
mellé könyököl
félénk meleg kezével
megsimogatja
azaz csak úgy tesz mintha simogatna
utánozza az előbbi férfit aki barátnőjével
erre sétált és simogatta

A nő elmegy észre sem veszi
ő meg ott marad a söprűjével
s hirtelen megállapítja
hogy az angyal közben visszalibbent
és látott mindent
megrovón néz rá
fájdalmas pillantással
mind szeretőbben
s fenyegetőbben

Az utcaseprő veszi a söprűjét megint
söpörni kezd
Az angyalnak hűlt helye mire feltekint

Arra megy egy másik nő
Abbahagyja a söprést
Sokatmondó mozdulatokkal
ezt is azt is elmeséli neki
hogy ő mármint a nő
milyen gyönyörű azt dicséri

Megjelenik az őrangyal
Rémülten menekül a nő

Az őrangyal még egyszer
megmagyarázza az utcaseprőnek
azért van ott hogy söpörjön
aztán lelép

Az utcaseprő seprűt ragad miként elébb

Hirtelen kiáltásokat hall
jajveszékelést
a folyó felől
Nyilván
fuldoklik valaki az kiabál

Az utcaseprő félreteszi a seprőt
De aztán meggondolja magát vállat von
és nem is hederít a kiáltásokra
söpör tovább

Megjelenik az őrangyal
Az utcaseprő pedig úgy söpör
hogy hasonlítaná sem lehet semmit a söpréséhez
Példás és pontos munkát végez

Ám az angyal ég felé emeli mutatóujját
Haragos szárnyát meglebbenti
s értésére adja az utcaseprőnek
hogy persze nagyon szép feladat
utcát seperni
de hogy viszont
valaki esetleg vízbe fúl ezalatt
És nyomatékosan rábeszéli
mivel az utcaseprő
hallani sem akar róla

A végén
az utcaseprő leveti zubbonyát
mást nem tehet nem hagyják békén
kitűnő úszó lévén
felkapaszkodik a párkányra
és csodálatos „angyal-fejessel”
eltűnik a habok között
Az angyal pedig ezalatt
a szó szoros értelmében angyali hangulatban
dicséri az Urat
A zene ezúttal
kétségtelenül mennyei jellegű
Az utcaseprő
hirtelen felmerül
karjában hozza
kit megmentett a habok közül

Nagyon szép lány
és meztelen

Az angyal rossz szemmel nézi a dolgot
Az utcaseprő
végtelen gyengéden
lefekteti egy padra
ápolja
éleszti
simogatja

Ám az angyal közbelép
s melegen ajánlja az utcaseprőnek
hogy dobja vissza a folyóba.
ezt a „nőarcú ördögöt”

közben a „nőarcú ördög” életkedve
hála az utcaseprő simogatásának visszatér
felkel
nevetve

Az utcaseprő is mosolyog
Mindkettő táncra perdül

Az őrangyal megfenyegeti őket a menny villámaival

Azik ketten kinevetik
átölelik egymást
elballagnak tánclépésben szemérmetlenül

Az őrangyal letörli kicsorduló könnyét
veszi a söprűt
és seper… seper… seper… seper…
kér-lel-hetet-lentül.


English notes

I used the word playwright to describe Jacques Prévert.  It seems somewhat bizarre, in that it means “someone who writes plays,” but it ends not with -write, but with -wright.  For example, here are some other words that refer to people who write things:

  • copywriter
  • ghostwriter
  • screenwriter
  • skywriter
  • speechwriter
  • songwriter

Copywriter, speechwriter, and songwriter are clear analogues: they mean someone who writes copy, speeches, and songs, respectively, while a playwright is someone who writes plays.  What gives?

As my theater professor explained it to me (decades ago–yikes…), the intent is to convey the idea that writing a play is a matter of craft, of arduous labor, of building.  How does that work?  Because other words that end in -wright refer to people who build things “by the sweat of their brows:”

  • wheelwright (person who builds wheels; le charron in French)
  • wainwright (person who builds wagons–I had to look that one up myself)
  • shipwright (person who builds ships)
  • cartwright (person who makes carts)

 

Why professors might be yawning and checking their cell phones

#Iamatiredhungrycrankyprofessorwhoneedstoeatdinnerandthenreviewthreemorepapers

This question landed in my Quora inbox:

In the past, I met some professors to talk about my research. When I explained my findings, some started heavily yawning and keep looking at their cell phones. Why do professors yawn when they have to listen?


Let’s start with the least controversial observation:

When people are tired, they sometimes yawn.

Abductive inference suggests the hypothesis that those professors are tired.  This would be pretty credible: professors generally work very long hours, and if you ran into them in the kind of context where one talks to lots of professors about one’s research–that is to say, at a conference–then they probably travelled to get there, and are jet-lagged on top of their usual exhaustion.  Of course, abductive inference is weaker than deductive inference, and weaker even than inductive inference (people yawn for lots of reasons, including boredom and discomfort), so I won’t belabor this point beyond mentioning that we have no data for inductive inference (“some” is not really data, as such) and no premises for deductive inference.

Let’s move on to another observation–not as uncontroversial as “people sometimes yawn when they are tired,” but still pretty accurate:

People sometimes receive texts that are super-important to them.  Maybe their kid is sick, or their spouse just had an automobile accident.  They might have just gotten laid off, and be needing an appointment with their Human Resources office to find out about unemployment insurance, health insurance, etc.  You might not care whether their spouse or child lives or dies; you might not care whether or not they can feed their family; but, it’s not very fair of you to expect THEM not to care.

I wasn’t able to find any indication in your question regarding whether or not you have evidence to share with us regarding the topics and contents of those texts.  I certainly don’t have any.  Do you?

Why do professors yawn when they have to listen?

Actually, nobody has to listen.  Not that I’m aware of, at any rate.  Most people will, though.  But, here’s a caveat: many researchers get bored very quickly by people who are not saying interesting things to them.  You need to grab their interest right away, or you will probably lose it; the higher up the food chain the researcher is, the quicker you are likely to lose their interest if you are boring them.  (Thanks to KJ for this observation–it made me be really careful about preparing to run into Famous Scientists, and it turns out that if you’re ready to interest Famous Scientists, you are ready to interest anybody.  Not that I’m claiming to be able to interest anybody, mind you.  (Catch the ambiguity?))

When I explained my findings, some started heavily yawning and keep looking at their cell phones.

So far, your question has not given the reader the ammunition to do any kind of inference but abductive, and as I said, that’s the weakest kind.  But, you may be onto something here.

Here’s the thing about findings:

  • They’re usually not nearly as interesting as the question.  Either they’re not surprising, or they don’t actually answer the question one way or the other, or they’re not convincing–maybe your experimental design was bad, maybe you didn’t have a large enough sample, etc.  A good question, though–a good question will grab people’s attention.
  • If you have to explain your findings, there might be a problem.  Here are the most likely things that I can say about any of my findings, personally:
    • They’re clear enough that anyone in my field could interpret them, so they don’t need to be explained to people in my field
    • They’re so unclear that I can’t even interpret them MYSELF, in which case I am not going to try to explain them to ANYBODY–I am going to ask THEM to explain my findings to ME.

So, there are a number of reasons that professors might yawn and/or check their cell phones while you’re explaining your findings to them.

  1. They’re tired
  2. There are urgent things going on in their lives
  3. They are jerks
  4. You bored them

My advice: consider option #4 carefully, and if you see a way to improve your interactions with people when you talk with (note that I did not say to) them about your research, then do so.  Then: assume that (1) and (2) are the case.  We should always consider the possibility that we need to do a better job, but if we need an explanation for someone else’s behavior, it’s best to be charitable both to them–and to ourselves.  Then: go back and try again!  Good luck with your research. #Iamatiredhungrycrankyprofessorwhoneedstoeatdinnerandthenreviewthreemorepapers