Natural language processing (sometimes also known as text mining or computational linguistics, although I disagree with that last one) is the use of computers to process language in some way, such as finding names of businesses, rating reviews as positive or negative, summarizing news stories, etc.
Natural language processing is often done with an approach called machine learning. Machine learning is a set of techniques for letting computers “learn” for themselves how to classify things (e.g., Is this word a name or not? Is this review positive or not? Does this sentence belong in a summary of this news story or not?), versus having humans write explicit rules telling the computer how to classify them.
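To make that contrast concrete, here is a toy sketch in Python (with made-up reviews, and scikit-learn as just one of many possible libraries) of the two approaches: a hand-written rule versus a classifier that learns from human-labelled examples.

```python
# Two toy ways to decide whether a review is positive: an explicit,
# human-written rule versus a model that "learns" from labelled examples.
# (Hypothetical data; scikit-learn is just one common choice of library.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The rule-writing approach: a human encodes the decision directly.
def rule_based_is_positive(review):
    return any(word in review.lower() for word in ("great", "excellent", "loved"))

# The machine-learning approach: the computer induces the decision
# from examples that humans have already labelled.
reviews = ["Great food and excellent service",
           "Terrible, I want my money back",
           "Loved it, will come again",
           "Awful experience, very rude staff"]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, labels)

print(rule_based_is_positive("The pasta was excellent"))  # True
print(model.predict(["The pasta was excellent"]))         # ['positive']
```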
In natural language processing in general, and in machine learning approaches to natural language processing in particular, we often use data that has been labelled by humans with the correct answers. These humans are called “annotators.” For various reasons, we often compute the agreement between the annotators: if two annotators look at the same items, how often do they agree about the classification?
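Here’s a minimal sketch of what that computation might look like, again in Python with invented labels: raw (observed) agreement is just the fraction of items the two annotators label the same way, and Cohen’s kappa is the common chance-corrected version of it.

```python
from collections import Counter

def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators gave the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement: (P_o - P_e) / (1 - P_e)."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Expected agreement if each annotator labelled items at random
    # according to their own label distribution.
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n)
              for label in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two annotators rating reviews as positive/negative.
annotator_1 = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg"]

print(observed_agreement(annotator_1, annotator_2))  # 0.75
print(cohens_kappa(annotator_1, annotator_2))        # 0.5
```

The kappa values of 0.66 and 0.64 quoted in one of the citations below are this chance-corrected kind of agreement.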
I sent the preceding text, along with the following question, to Wonder, a company that offers answers to arbitrary questions:
I’d like as many citations as you can find to support the assertion that inter-annotator agreement is probably an upper bound on possible system performance in natural language processing. Please give me the citations, along with text snippets that demonstrate that each citation does make this claim.
I was told in grad school that it is probably not possible for a computer program that does natural language processing (and probably machine learning more broadly) to perform at a rate that is higher than the inter-annotator agreement for the task. Do I have any good citations for this assertion? No. Please find me as many as you can. I’d like the citations, along with text snippets from the citations that show that they’re really making this assertion. Bonus points for Wonder if you give me the citations in BibTeX format!
Here’s what I got back from Wonder:
1) “It is generally agreed that human inter-annotator agreement defines the upper limit on our ability to measure automated performance…”
- Resnik, Philip, and Jimmy Lin. “Evaluation of NLP Systems.” The handbook of computational linguistics and natural language processing 57 (2010): 271.
2) The above source includes a citation from Gale et al. 1992: “our ability to measure performance is largely limited by our ability [to] obtain reliable judgments from human informants.”
- Gale, William, Kenneth Ward Church, and David Yarowsky. “Estimating upper and lower bounds on the performance of word-sense disambiguation programs.” Proceedings of the 30th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1992.
3) “The average inter-annotator agreement indicated by the kappa values of 0.66 for surface understanding and 0.64 for conceptual understanding should be seen as upper bounds on the accuracy of any classifier.”
- Ormandjieva, Olga, Ishrar Hussain, and Leila Kosseim. “Toward a text classification system for the quality assessment of software requirements written in natural language.” Fourth international workshop on Software quality assurance: in conjunction with the 6th ESEC/FSE joint meeting. ACM, 2007.
4) “Recent estimations of the inter-annotator agreement when using the WordNet inventory report figures of 72.5% agreement in the preparation of the English all-words test set at Senseval-3 (Snyder and Palmer, 2004) and 67.3% on the Open Mind Word Expert annotation exercise (Chklovski and Mihalcea, 2002). These numbers lead us to believe that a credible upper bound for unrestricted fine-grained WSD is around 70%, a figure that state-of-the-art automatic systems find it difficult to outperform.”
- Navigli, Roberto. “Meaningful clustering of senses helps boost word sense disambiguation performance.” Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.
5) “Human judgments of semantic relatedness provide a gold standard for evaluating the results of automatic methods. The inter-annotator agreement defines an upper bound for the evaluation of automatic methods (Resnik, 1995).”
- Gurevych, Iryna, and Hendrik Niederlich. “Computing semantic relatedness in German with revised information content metrics.” Proceedings of OntoLex. 2005.
6) “We rather measured the agreement between our algorithm M and both the human annotators A and B. Besides the inter-annotator agreement A–B, which serves as an upper bound, we tried the naive baseline approach 0 that always chooses the first target word sense.”
- Meyer, Christian M., and Iryna Gurevych. “Worth its weight in gold or yet another resource—A comparative study of Wiktionary, OpenThesaurus and GermaNet.” Computational linguistics and intelligent text processing. Springer Berlin Heidelberg, 2010. 38-49.
7) “We also compared our results to the upper bound given by the inter-annotator agreement on the calibration data set.”
- Padó, Sebastian, and Mirella Lapata. “Cross-linguistic projection of role-semantic information.” Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005.
As you can see, most of the citations I can find basically make the same assertion (that machine learning cannot outperform inter-annotator agreement) and treat it as fact, without giving much evidence for that assertion. I hope this information was helpful! –Alexandra G.
I was pretty pleased with these citations. I should note that Alexandra G. also gave me links to all of them. It’s also important to point out that if you’re going to use Wonder, you need to be careful about (a) what kinds of questions you ask (see their web site for the kinds of questions that they feel they can help with) and (b) asking for exactly what you want. It took me several tries to get it right; after speaking with someone on their team, I now know how to formulate questions (and which types of questions I can ask) to get what I’m looking for. If anyone reading this knows of the kind of evidence that Alexandra G. pointed out doesn’t seem to be out there (that is, actual evidence that inter-annotator agreement really is an upper bound on system performance), it would be great if you could add it to the Comments.