I pay the rent by researching the issues involved in getting computers to understand biomedical language–for example, the language of scientific journal articles, or the language of health records. I’m in the midst of writing a chapter about this topic for a handbook of computational linguistics. The audience is people who are interested in computational linguistics, but don’t have any experience with the biomedical domain. If you’re a reader of this blog, that’s probably not a bad description of you. So, it would be super-helpful to me to have your critique of my introduction. I’m looking for anything that isn’t clear, anything that makes it difficult to understand my prose–anything that you think could be improved. My grandmother will tell me how wonderful it is, so just feel free to plow into me with both fists–seriously, you’d be surprised at how much pain you can take in your old age.
What makes the biomedical domain an interesting one from the perspective of computational linguistics? Indeed, what makes any domain an interesting one from the perspective of computational linguistics? In fact, Roger Shuy has asserted that the notion of any specific kind of data defining a particular area of linguistics is unsupportable. As he puts it: “There is little reason for the data on which a linguist works to have the right to name that work” (Shuy 2002).
Shuy’s statement is surprising, since he himself is North America’s leading forensic linguist—a linguist whose career has been defined entirely by his excellent work on language as it appears in the legal system. And, indeed, many computational linguists describe themselves as doing biomedical natural language processing.
So, why study computational linguistics in the biomedical domain? One can identify at least three primary types of reasons: theoretical, practical, and use-case-oriented.
Theoretical aspects of biomedical language
Biomedical languages are of interest to computational linguistics for two reasons: their relevance to questions about the nature and limits of grammar, and the light that they can shed on issues of reproducibility in natural language processing.
Biomedical languages and grammaticality
Biomedical languages are of interest from the perspective of computational linguistics in part because they stretch the limits of what can possibly be grammatical in a natural language. Since the second half of the 20th century, much of linguistic argumentation has focused on grammaticality, which at a first approximation we can define as the question of whether or not an utterance is within the boundaries of some language (Partee et al. 2012). Early in the second half of the 20th century, the utterances that came under discussion in linguistic debates tended to be either quite ordinary (such as the famous “John loves Mary” (Fowle 1850)) or interestingly ambiguous: sentences like “John loves his wife, and so does Tom” (Duží 2012), whose grammaticality (as opposed to their interpretations) was mostly not in question. Although the discourse of that period of linguistic inquiry, particularly with respect to the development of syntactic theory, was often couched in terms of defining (and constraining) some set of sentences (“strings”), in practice it tended to be more about operations on those strings, and to a much lesser extent about their interpretation.
This changed in the 1970s and 1980s with the emergence of a research community that explored sublanguages: language associated with a particular genre and a particular kind of interlocutor. Harris (1976) laid out a number of the principles of the sublanguage approach: semantics was embraced, not pushed off to some later date. Although not always formalized as such, lexical preferences and statistical tendencies were taken advantage of (unusual in the era of a linguistics that had a complicated relationship with the lexicon and a famously open disdain for statistics (Harris 1995)). As Grishman (2001) explains, sublanguages were interesting for at least two reasons: they seemed amenable to syntactic description by reducing complex syntactic structures into simpler ones, reminiscent of the transformational analyses that were becoming dominant in linguistics, and they held the promise of mapping to a tractable model of the world, or semantics, something that had largely eluded linguistics up to that point.
The biomedical domain seemed like a fruitful area of research to the early investigators of sublanguages, and it was. Scientific journal articles constituted one genre, with the interlocutors being researchers; clinical documents provided another, with the interlocutors being physicians. Harris et al. (1989) provided an in-depth description of the language of scientific publications about immunology. That work set a standard for sublanguage research on biomedical languages that would remain unparalleled for years. The usefulness of the sublanguage model can be seen in the fact that researchers continue to find it fruitful (some prominent examples in the biomedical domain are reviewed in Demner-Fushman et al. 2009). Examples that illustrate particularly well the use of the sublanguage model for semantic representation include Dolbey (2009) in the molecular biology domain and Deléger et al. (2017), which also includes a review of the basic issues and of other approaches to resolving them. Clinical sublanguages soon turned out to be full of data that was ungrammatical on any standard treatment of syntax (see Table 1 for some examples), making it clear that they were good areas for investigating the limits of grammaticality at a time when grammaticality was generally considered a binary characteristic of language with strict semantic constraints.
|Chest shows evidence of metastatic disease.|
|Examination shows the same findings.|
|x-rays of spine showed extreme arthritic change.|
|Urinalysis shows 1% proteinuria.|
|Brain scan shows midline lesion.|
Table 1: Examples of ungrammatical sentences from radiology reports. In English, the verb “to show” is usually thought of as requiring a sentient subject. In these sentences, we see a wide range of non-sentient subjects: an anatomical organ (chest), an event (examination), x-ray films (x-rays of spine), a laboratory test (urinalysis), and the output of a computed tomography exam (brain scan). Four of the five sentences also have “generic” noun phrases where English would normally require an article or demonstrative (chest, examination, x-rays of spine, and brain scan). Source: Hirschman (1986). No human subjects approval or HIPAA training is required for use of these examples.
 Shuy, Roger. Linguistic battles in trademark disputes. Springer, 2002.
 The Association for Computational Linguistics Special Interest Group on Biomedical Natural Language Processing has over 100 members at the time of writing.
 Fowle, William B. (1850) “English Grammar: Goold Brown.” Common School Journal, pp. 245-249.
 Duží, Marie (2012) “Extensional logic of hyperintensions.” In Düsterhöft, Antje, Meike Klettke, and Klaus-Dieter Schewe, eds. Conceptual Modelling and Its Theoretical Foundations: Essays Dedicated to Bernhard Thalheim on the Occasion of His 60th Birthday. Vol. 7260. Springer Science & Business Media, 2012.
 See Chapter 18, Sublanguages and controlled languages, this volume.
 Harris, Zellig. “On a theory of language.” The Journal of Philosophy 73.10 (1976): 253-276.
 Harris, Randy Allen. The linguistics wars. Oxford University Press, 1995.
 Grishman, Ralph. “Adaptive information extraction and sublanguage analysis.” Proc. of IJCAI 2001. 2001.
 Harris, Z., Gottfried, M., Ryckman, T., Daladier, A., & Mattick, P. (2012). The form of information in science: analysis of an immunology sublanguage (Vol. 104). Springer Science & Business Media.
 Demner-Fushman, Dina, Wendy W. Chapman, and Clement J. McDonald. “What can natural language processing do for clinical decision support?” Journal of biomedical informatics 42.5 (2009): 760-772.
 Dolbey, Andrew. “BioFrameNet: a FrameNet extension to the domain of molecular biology.” (2009).
 Deléger, Louise, Leonardo Campillos, Anne-Laure Ligozat, and Aurélie Névéol. “Design of an extensive information representation scheme for clinical narratives.” Journal of biomedical semantics 8, no. 1 (2017): 37.
 Hirschman, Lynette. “Discovering sublanguage structures.” Analyzing Language in Restricted Domains: Sublanguage Description and Processing (1986): 211-234.
Harsh critiques in the Comments section below, please!