March 2018 – Zipf's Law

Veterans running for office as Democrats in 2018

It irritates me when people assume that US military people all vote Republican–plenty of us are reliable Democratic voters.

It irritates me when people assume that US military people vote Republican. We’re humans, which means that we are not all the same, and plenty of us are reliable Democratic voters. Trump the war-mongering draft-dodger? A poll in October of last year by the Military Times (the most recent one I can find today) showed the following:

53% of officers oppose him
Only 30% of officers support him
Only 47% of enlisted personnel support him
38% of enlisted personnel view him unfavorably

I’ve written a number of times about why his approval ratings are so low in the military–today I’ll just leave you with this link to a nice article about veterans running for office in 2018 as Democrats.

apple.news/AdOf4MjO4RGCibkB0mTBW3w

I tried to think of a different way to say this… Variability in biomedical languages

If ambiguity is the major problem in natural language processing, variability is the second.

This post is a draft of part of a piece that I’m writing at the moment, and on which I would like your feedback. The topic is variability in language. I pay the rent by researching the issues involved in getting computers to understand biomedical language–for example, the language of scientific journal articles, or the language of health records. I’m in the midst of writing a chapter about this topic for a handbook of computational linguistics. The audience is people who are interested in computational linguistics, but don’t have any experience with the biomedical domain. If you’re a reader of this blog, that’s probably not a bad description of you. So, it would be super-helpful to me to have your critique of this material. I’m looking for anything that isn’t clear, anything that makes it difficult to understand my prose–anything that you think could be improved. My grandmother will tell me how wonderful it is, so just feel free to plow into me with both fists–seriously, you’d be surprised at how much pain you can take in your old age, and I’m getting pretty old.

Variability is the property of being able to express the same proposition in multiple ways. If ambiguity is the major problem of natural language processing, variability is the second. From a theoretical perspective, the field of sociolinguistics sees the study of variation in language as the central problem of linguistics, and it makes a strong case for that claim (e.g. Labov 2004)[1]. From a practical perspective in natural language processing, the high degree of variability in natural language prevents us from ever being able to use a dictionary-like data structure (such as hash tables, B-trees, or tries) to accomplish our tasks: we will never have a “dictionary” of all possible sentences (Chomsky 1959)[2]. This kind of approach would be fast and efficient—if only it were possible (Gusfield 1997)[3].

Sources of variability

Some of the sources of variability in language are well-known even to the casual reader–for example, synonymy, or the availability of multiple words that have the same dictionary meaning. A kind of synonymy that is especially relevant in biomedical languages occurs when there is both a technical and a lay or common term for something, such as the lay term heart attack and the technical term myocardial infarction. Using technical terminology is important for the precision of scientific writing and of medical records (Rey 1979)[4]. However, the use of technical terminology can also make it difficult for patients and their families to learn about their illness or to understand their own health records (Kandula et al. 2010)[5]. One way to deal with this problem is to use natural language processing techniques to replace technical terms with their lay synonyms (Elhadad 2006[6], Elhadad and Sutaria 2007[7], Deléger and Zweigenbaum 2009[8], Leroy et al. 2013a[9], Leroy et al. 2013b[10]) or their definitions (Elhadad 2006)[11] in order to make clinical documents or scientific journal articles accessible to non-professionals. Doing this computationally, rather than manually, allows it to be done at enormous scales, or on demand. This is a good example of why to do natural language processing in the biomedical domain: the possibility of doing real good in the world.

Paraphrase is the phenomenon of different (and typically syntactically different) expressions in language of the same meaning (Ganitkevitch et al. 2013)[12]. Where synonymy operates at the level of words, paraphrase operates at the level of the phrase, or group of words. Paraphrasing is a source of variability that is especially interesting in the biomedical domain because of how it interacts with the technical vocabulary of the field (Deléger and Zweigenbaum 2008, Deléger 2009, Deléger and Zweigenbaum 2010, Grabar and Hamon 2014)[13],[14],[15],[16]. Funk et al. looked for possibilities to paraphrase or replace synonyms in 41,853 terms from the Gene Ontology, and found that 27,610 out of 41,852 were paraphrasable, or had synonyms, or both[17]. This indicates that the possibilities for variant forms of the same thing occurring in the biomedical literature are tremendous.

But, do those tremendous numbers of variants really occur? It appears that they do. Cohen et al. (2008) looked at the incidence of alternative syntactic constructions involving common nominalizations (nouns derived from verbs, such as treatment from to treat) in scientific journal articles—for example, drug treatment of cancer and cancer treatment with drugs. Figure 1 shows a typical finding: for some nominalizations, as many as 15 out of 16 possible variants could be found even in a relatively small corpus[18].

How different can these paraphrases be from each other? Technical terms in biomedical research can be quite long, which means that there can be multiple candidates for paraphrasing and for replacement of synonyms (see above). This means that the number of possible paraphrases of a long term can be explosive. Those paraphrases, even for a short term, can be quite different—for example, Cohen et al. (2017) examined the relationship between the length of terms in the Gene Ontology and the length of appearances of those terms in the CRAFT corpus of biomedical journal articles, and found that 2-word terms could show up with paraphrases as long as 15 words[19]. The high incidence of just these two forms of variability in language—synonymy and paraphrasing—as well as the large differences that can be seen in forms with the same meanings illustrate just how much of an issue variability is for natural language processing in general, and in biomedical texts in particular.

Harsh critiques in the Comments section below, please!

[1] Labov, William. “Quantitative reasoning in linguistics.” Sociolinguistics/Soziolinguistik: An international handbook of the science of language and society 1 (2004): 6-22.

[2] Chomsky, Noam. “A review of BF Skinner’s Verbal Behavior.” Language 35, no. 1 (1959): 26-58.

[3] Gusfield, Dan. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press, 1997.

[4] Rey, Alain. La terminologie: noms et notions. No. 1780. Presses Univ. de France, 1979, p. 56.

[5] Kandula, Sasikiran, Dorothy Curtis, and Qing Zeng-Treitler. “A semantic and syntactic text simplification tool for health content.” In AMIA annual symposium proceedings, vol. 2010, p. 366. American Medical Informatics Association, 2010.

[6] Elhadad, Noemie. “Comprehending technical texts: Predicting and defining unfamiliar terms.” In AMIA annual symposium proceedings, vol. 2006, p. 239. American Medical Informatics Association, 2006.

[7] Elhadad, Noemie, and Komal Sutaria. “Mining a lexicon of technical terms and lay equivalents.” In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pp. 49-56. Association for Computational Linguistics, 2007.

[8] Deléger, Louise, and Pierre Zweigenbaum. “Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora.” In Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora, pp. 2-10. Association for Computational Linguistics, 2009.

[9] Leroy, Gondy, David Kauchak, and Obay Mouradi. “A user-study measuring the effects of lexical simplification and coherence enhancement on perceived and actual text difficulty.” International journal of medical informatics 82, no. 8 (2013): 717-730.

[10] Leroy, Gondy, James E. Endicott, David Kauchak, Obay Mouradi, and Melissa Just. “User evaluation of the effects of a text simplification algorithm using term familiarity on perception, understanding, learning, and information retention.” Journal of medical Internet research 15, no. 7 (2013).

[11] Elhadad, Noemie. “Comprehending technical texts: Predicting and defining unfamiliar terms.” In AMIA annual symposium proceedings, vol. 2006, p. 239. American Medical Informatics Association, 2006.

[12] Ganitkevitch, Juri, Benjamin Van Durme, and Chris Callison-Burch. “PPDB: The paraphrase database.” Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013.

[13] Deléger, Louise, and Pierre Zweigenbaum. “Paraphrase acquisition from comparable medical corpora of specialized and lay texts.” AMIA Annual Symposium Proceedings. Vol. 2008. American Medical Informatics Association, 2008.

[14] Deléger, Louise. Exploitation de corpus parallèles et comparables pour la détection de correspondances lexicales: application au domaine médical. Diss. Paris 6, 2009.

[15] Deléger, Louise, and Pierre Zweigenbaum. “Identifying Paraphrases between Technical and Lay Corpora.” LREC. 2010.

[16] Grabar, Natalia, and Thierry Hamon. “Unsupervised method for the acquisition of general language paraphrases for medical compounds.” Proceedings of the 4th International Workshop on Computational Terminology (Computerm). 2014.

[17] Funk, Christopher S., K. Bretonnel Cohen, Lawrence E. Hunter, and Karin M. Verspoor. “Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition.” Journal of biomedical semantics 7, no. 1 (2016): 52.

[18] Cohen, K. Bretonnel, Martha Palmer, and Lawrence Hunter. “Nominalization and alternations in biomedical language.” PloS one 3.9 (2008): e3158.

[19] Cohen, K. B., K. Verspoor, K. Fort, C. Funk, M. Bada, M. Palmer, and L. E. Hunter. “The Colorado Richly Annotated Full Text (CRAFT) corpus: Multi-model annotation in the biomedical domain.” In Handbook of Linguistic Annotation, pp. 1379-1394. Springer, Dordrecht, 2017.


What computational linguists actually do all day: the relative frequencies edition

Scroll down past the picture of the mean-looking warthog.

Hi Zipf,

I spent my first hour this morning looking for papers that describe any tools that do any kind of enrichment analysis over terms found in text, but was generally unsuccessful. Searches containing the terms “concept,” “term,” “enrichment analysis,” “text,” and “natural language processing” have mainly pointed me towards GSEA and GSEA-like tools like Ontologizer that focus on gene sets. Tools that determine what a document is “about” might also be useful.

Do you know of any tools or papers you could point me towards?

Zellig

Hey there, Zellig,

I may be misunderstanding the question, so let me clarify. Do you want to know about terms enriched in a document, or in a set of documents? Gimme an idea about what the input looks like, and I think I’ll have an answer.

Zipf

Hi Zipf,

I think I am interested in looking at each document individually. And I’ll also clarify that the point of the task is not to find concepts, but to determine what effect a concept’s presence or absence in a document has on what it is “about.”

Zellig

OK, so in that case, the easiest thing to do would be… hm… relative frequency versus a background set of documents, or else tf*idf. Explaining relative frequencies first:

your document has 100 words in total
mouse occurs 45 times in your document, or frequency = 45/100
the occurs 50 times in your document, or frequency = 50/100
warthog (I just learned how to say it in French, so warthogs are on my mind–“le phacochère”, if you were wondering, which sounds like a lot to scream if one of those nasty things charges you) occurs 5 times in your document, or frequency = 5/100. Scroll down past the picture of the mean-looking warthog.

A male southern warthog. Picture source: By Charlesjsharp – Own work, from Sharp Photography, sharpphotography, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=37065293

your background data has 1000 words in total
mouse occurs 10 times, so frequency = 10/1000
the occurs 500 times, so frequency = 500/1000
warthog occurs 490 times, so frequency = 490/1000

relative frequencies, yours : background:

mouse = (45/100) : (10/1000), or 45.0
the = (50/100) : (500/1000), or 1.0
warthog = (5/100) : (490/1000), or about 0.1

…from which you conclude that your corpus is about mice, or at least it’s more about mice than the background data set is (’cause the word mouse occurs in your data at a ratio of 45:1 as compared to how often it occurs in the background data set). You conclude that “the” tells you nothing about either corpus (the ratio is 1.0, meaning that the frequency of the word is about the same in both data sets), and that “warthog” tells you nothing about your corpus, but it does tell you something about the background data (because it only occurs in your data at a ratio of once to every 10 times that it occurs in the background data set).
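If it helps, here is a minimal sketch of that calculation in Python, using the toy counts from above. The function name and the choice to skip words that never occur in the background data are my own assumptions for illustration; a real implementation would want the smoothing that Kilgarriff describes.

```python
from collections import Counter


def relative_frequencies(doc_tokens, background_tokens):
    """Ratio of each word's frequency in the document to its frequency in the background."""
    doc_counts = Counter(doc_tokens)
    bg_counts = Counter(background_tokens)
    doc_total = len(doc_tokens)
    bg_total = len(background_tokens)
    ratios = {}
    for word, count in doc_counts.items():
        if bg_counts[word] == 0:
            # Assumption: no smoothing in this sketch, so words that are
            # absent from the background data are simply skipped.
            continue
        ratios[word] = (count / doc_total) / (bg_counts[word] / bg_total)
    return ratios


# The toy data from the example: 100 words in the document, 1000 in the background.
doc = ["mouse"] * 45 + ["the"] * 50 + ["warthog"] * 5
background = ["mouse"] * 10 + ["the"] * 500 + ["warthog"] * 490

ratios = relative_frequencies(doc, background)
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}\t{ratio:.1f}")
```

Sorting by the ratio puts the words that are most characteristic of your document at the top of the list.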

The other easy approach: term frequency (count of occurrences of a word in a document), weighted by inverse document frequency (at its simplest, 1 over the number of documents in which the word occurs; in practice, usually the log of the total number of documents divided by that number). This is known as tf*idf (term frequency * inverse document frequency).
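A rough sketch of tf*idf in Python, under a couple of assumptions of mine: it uses the common log-scaled form of inverse document frequency, log(number of documents / number of documents containing the word), and the function name and three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: a list of token lists. Returns one {word: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur at least once?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in documents:
        term_freq = Counter(tokens)
        scores.append(
            {
                word: count * math.log(n_docs / doc_freq[word])
                for word, count in term_freq.items()
            }
        )
    return scores


# Hypothetical toy collection of three tiny "documents".
docs = [
    ["the", "mouse", "ate", "the", "cheese"],
    ["the", "warthog", "charged"],
    ["the", "mouse", "ran"],
]
scores = tf_idf(docs)
for word, score in sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}\t{score:.3f}")
```

A word like "the" that shows up in every document gets a score of zero, however often it occurs, while a word that is confined to one document scores highest.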

Back to relative frequencies: that analysis is due to the late Adam Kilgarriff. (I’m proud to say that we wrote a paper together before his untimely death, and lemme tell you: he really participated!) Here’s a link to his paper about it. He gives details about smoothing and the like that you’ll want to know about if you pursue this approach. I’ll say that people are more familiar with the tf*idf approach, but personally, I think that relative frequency is a lot more intuitively comprehensible.

Zipf

What makes something interesting? The biomedical language version

What makes any domain an interesting one from the perspective of computational linguistics?

I pay the rent by researching the issues involved in getting computers to understand biomedical language–for example, the language of scientific journal articles, or the language of health records. I’m in the midst of writing a chapter about this topic for a handbook of computational linguistics. The audience is people who are interested in computational linguistics, but don’t have any experience with the biomedical domain. If you’re a reader of this blog, that’s probably not a bad description of you. So, it would be super-helpful to me to have your critique of my introduction. I’m looking for anything that isn’t clear, anything that makes it difficult to understand my prose–anything that you think could be improved. My grandmother will tell me how wonderful it is, so just feel free to plow into me with both fists–seriously, you’d be surprised at how much pain you can take in your old age.

What makes the biomedical domain an interesting one from the perspective of computational linguistics? Indeed, what makes any domain an interesting one from the perspective of computational linguistics? In fact, Roger Shuy has asserted that the notion of any specific kind of data defining a particular area of linguistics is unsupportable. As he puts it: “There is little reason for the data on which a linguist works to have the right to name that work” (Shuy 2002)[1].

Shuy’s statement is surprising, since he himself is North America’s leading forensic linguist—a linguist whose career has been defined entirely by his excellent work on language as it appears in the legal system. And, indeed, many computational linguists describe themselves as doing biomedical natural language processing[2].

So, why study computational linguistics in the biomedical domain? One can identify at least three primary types of reasons: theoretical, practical, and use-case-oriented.

Theoretical aspects of biomedical language

Biomedical languages are of interest to computational linguistics for two reasons: their relevance to questions about the nature and limits of grammar, and the light that they can shed on issues of reproducibility in natural language processing.

Biomedical languages and grammaticality

Biomedical languages are of interest from the perspective of computational linguistics in part because they stretch the limits of what can possibly be grammatical in a natural language. Since the second half of the 20th century, much of linguistic argumentation has focused on grammaticality, which at a first approximation we can define as the question of whether or not an utterance is within the boundaries of some language (Partee et al. 2012). Early in the second half of the 20th century, utterances that came under discussion in linguistic debates tended to be either quite ordinary (such as the famous John loves Mary (Fowle 1850)[3]), or interestingly ambiguous–sentences like John loves his wife, and so does Tom (Duží 2012)[4] whose grammaticality (as opposed to their interpretations) was mostly not in question. Although the discourse of that period of linguistic inquiry–particularly with respect to the development of syntactic theory–was often couched in terms of defining–and constraining–some set of sentences (“strings”), in practice it tended to be more about operations on (and to a much lesser extent, interpretation of) those strings.

This changed in the 1970s and 1980s with the emergence of a research community that explored sublanguages: language associated with a particular genre and a particular kind of interlocutor[5]. Harris (1976) laid out a number of the principles of the sublanguage approach: semantics was embraced, not pushed off to some later date[6]. Although not always formalized as such, lexical preferences and statistical tendencies were taken advantage of (unusual in the era of a linguistics that had a complicated relationship with the lexicon and famously open disdain for statistics (Harris 1995)[7]). As Grishman (2001) explains, sublanguages were interesting for at least two reasons: they seemed amenable to syntactic description by reducing complex syntactic structures into simpler ones, reminiscent of the transformational analyses that were becoming dominant in linguistics, and they held the promise of mapping to a tractable model of the world, or semantics[8]–something that had largely eluded linguistics up to that point[9].

The biomedical domain seemed like a fruitful area of research to the early investigators of the topic, and it was. Scientific journal articles were one such genre, with the interlocutors being researchers; clinical documents provided another, with the interlocutors being physicians. Harris et al. (1989) provided an in-depth description of the language of scientific publications about immunology[10]. It set a standard for sublanguage research on biomedical languages that would remain unparalleled for years. The usefulness of the sublanguage model can be seen in the fact that researchers continue to find it fruitful (some prominent examples in the biomedical domain are reviewed in Demner-Fushman et al. 2009)[11]. Some examples that illustrate particularly well the use of the sublanguage model for semantic representation include Dolbey (2009) in the molecular biology domain[12] and Deléger et al. (2017), which also includes a review of the basic issues and of other approaches to resolving them[13]. Clinical sublanguages soon turned out to be full of data that was ungrammatical on any standard treatment of syntax (see Table 1 for some examples), making it clear that they were good areas for investigating the limits of grammaticality at a time when grammaticality was generally considered a binary characteristic of language with strict semantic constraints.

Chest shows evidence of metastatic disease.

Examination shows the same findings.

x-rays of spine showed extreme arthritic change.

Urinalysis shows 1% proteinuria.

Brain scan shows midline lesion.

Table 1: Examples of ungrammatical sentences from radiology reports. In English, the verb to show is usually thought of as requiring a sentient subject. In these sentences, we see a wide range of non-sentient subjects: an anatomical organ (chest), an event (examination), x-ray films (x-rays of spine), a laboratory test (urinalysis), and the output of a computed tomography exam (brain scan). All of the sentences have “generic” noun phrases where they would normally require an article or demonstrative (chest, examination, x-rays of spine, and brain scan). Source: Hirschman (1986)[14]. No human subjects approval or HIPAA training is required for use of these examples.

[1] Shuy, Roger. Linguistic battles in trademark disputes. Springer, 2002.

[2] The Association for Computational Linguistics Special Interest Group on Biomedical Natural Language Processing has over 100 members at the time of writing.

[3] Fowle, William B. (1850) “English Grammar: Goold Brown.” Common School Journal, pp. 245-249.

[4] Duží, Marie (2012) “Extensional logic of hyperintentions.” In Düsterhöft, Antje, Meike Klettke, and Klaus-Dieter Schewe, eds. Conceptual Modelling and Its Theoretical Foundations: Essays Dedicated to Bernhard Thalheim on the Occasion of His 60th Birthday. Vol. 7260. Springer Science & Business Media, 2012.

[5] See Chapter 18, Sublanguages and controlled languages, this volume.

[6] Harris, Zellig. “On a theory of language.” The Journal of Philosophy 73.10 (1976): 253-276.

[7] Harris, Randy Allen. The linguistics wars. Oxford University Press, 1995.

[8] Grishman, Ralph. “Adaptive information extraction and sublanguage analysis.” Proc. of IJCAI 2001. 2001.

[9] Harris, Randy Allen. The linguistics wars. Oxford University Press, 1995.

[10] Harris, Z., M. Gottfried, T. Ryckman, A. Daladier, and P. Mattick. The form of information in science: analysis of an immunology sublanguage. Vol. 104. Springer Science & Business Media, 2012.

[11] Demner-Fushman, Dina, Wendy W. Chapman, and Clement J. McDonald. “What can natural language processing do for clinical decision support?.” Journal of biomedical informatics 42.5 (2009): 760-772.

[12] Dolbey, Andrew. “BioFrameNet: a FrameNet extension to the domain of molecular biology.” (2009).

[13] Deléger, Louise, Leonardo Campillos, Anne-Laure Ligozat, and Aurélie Névéol. “Design of an extensive information representation scheme for clinical narratives.” Journal of biomedical semantics 8, no. 1 (2017): 37.

[14] Hirschman, Lynette. “Discovering sublanguage structures.” Analyzing Language in Restricted Domains: Sublanguage Description and Processing (1986): 211-234.

Harsh critiques in the Comments section below, please!

A time and a place for everything

When to correct the other guy’s grammar–and when not to.

One evening I was riding the métro home, minding my own business, when a very, very drunk man got on. He was carrying an open bottle of some sort of hard liquor, and occasionally took a swig. (This and other obscure vocabulary items discussed in the English notes below.) He was so plastered that he could barely stay on his seat as the train swerved. He ranted incoherently–really incoherently. (After he left, I asked the guy next to me: Pardon me sir, was he speaking French? (If it’s in italics, it happened in French.) He gave me that look that people in Paris (and New York) give you when approached by a stranger before deciding that you’re OK, and then said: Of a sort.)

A young woman got on the train and took a seat. She had her phone to her ear, and was talking. The drunk, ranting guy leaned over, put his finger to his lips, and said: Shhhhhh.

Bizarre, hein? No–in a Parisian context, this actually wasn’t surprising at all. The general French approach to politeness is: don’t do anything that would inconvenience the other person. A very noticeable way that this works out is that in general, the French tend to communicate more quietly than Americans do. Indeed, the first thing that I notice when I get off the plane in the US is how loud everyone is–I clear Customs, go sit in the United club, and find myself listening to the cell phone conversations of every random stranger within earshot. In Paris, if you see someone talking on the phone on the train, the chances are excellent that they’re not French–it’s just not done. So, it wasn’t that bizarre for a shitfaced lunatic to interrupt his raving to say shhhhh to someone talking on the phone on the métro—he might have been hammered, but she was being rude. In America, someone would have said some equivalent of “it’s a free country, she can talk on the phone if she wants to.” People did hush him up when he got too carried away, but no one criticized him for saying shhhhh to the girl on the phone–that’s just logical, quoi...

For an extended discussion of the “don’t inconvenience the other guy” principle in French culture, see Raymonde Carroll’s Cultural misunderstandings: The French-American experience, or the original French version, Évidences invisibles: Américains et Français au quotidien. Carroll’s book is the uber-citation on American/French cultural differences.

I thought about the drunk guy on the train and his shhhhh just now when I stepped out on my balcony (I have the good luck to have an apartment on the étage noble) for a cigarette–and overheard a delivery guy in the street below speaking on his phone. Avant qu’elle apparaisse, he said–before she appears. Even though I’m in France, where correcting other people’s grammar is just part of daily intercourse, I suppressed the urge to yell avant qu’elle N’apparaisse–who doesn’t hate to see a good opportunity to use the ne explétif be wasted?–on the theory that this guy’s day was already going poorly enough without the shame of having some random foreigner fuck with his langue de Molière. A time and a place for everything.

For the meaning of étage noble and the significance of what floor you live on, see this post on Parisian apartment buildings. English notes below.

English notes

swig: the amount drunk at one time; a gulp. (Merriam-Webster) Some examples from the English Preposition Corpus, courtesy of Sketch Engine, purveyor of fine linguistic data sources and search engines therefor (note the lack of an E at the end–therefor is a different word from therefore):

I scowled into the night, took a swig of my beer and dumped the rest over the side of the deck.
I picked up the bottle beside me and took another long swig.
If, after a stiff swig of nectar, we were to watch further developments, we’d find that in another 100,000 years or so, or even longer, exactly the same thing would happen again, and the compass would swing back suddenly to its original position.

How I used it in the post: He was carrying an open bottle of some sort of hard liquor, and occasionally took a swig.

plastered: slang for drunk. Some examples from the enTenTen corpus (just under 20 billion words of English scraped from the Web), again courtesy of Sketch Engine:

Once Dolly and I got really plastered together.
An hour or so later, the Englishman is really plastered.
Jonathan is so ugly; I could only have sex with that double bagger if I was really plastered.
And by “former glory,” of course, we mean “a time when college-aged people used beer pong as an excuse to get so plastered they sometimes made sexual overtures toward bar stools.”
Only to realise the switch happens yet again and you’re there staring at the mouth of Gingy the Gingerbread Man (Midgett in a triplicate role with Sugar Plum) so plastered on that baking sheet like an angry drunk.

How I used it in the post: He was so plastered that he could barely stay on his seat as the train swerved.

shitfaced: also slang for drunk. Don’t use this one in front of my grandmother.

The night ended with Patty directing my drunk ass to grab the mattress and set up the bed while I was completely stumbling and shitfaced.
Let me get this straight — this stuff supposedly gives you more energy … so you can stay out later, drink more and get more thoroughly shitfaced?
The end of the week and I’m tired, over-worked and really just in need of deep sleep so I can get to work the next day with a fresh brain that can fire on all six creative cylinders but I opt to get shitfaced on free beer instead.
In the Black Forest they celebrate by getting shitfaced, setting fire to 800-lb straw-packed oak wheels, rolling them down mountainsides into sleepy villages and making bets on the fates of the panicked peasantry as they flee in terror.

How I used it in the post: It wasn’t that bizarre for a shitfaced lunatic to interrupt his raving to say shhhhh to someone talking on the phone on the métro—he might have been hammered, but she was being rude.

hammered: …and, once again: slang for drunk.

By the time we got to the dessert, I was, to put it delicately, hammered, as you can see from the picture above.
Made me want to check out more, especially as I was so hammered that I was in danger of keeling over, and consequently remember very little, other than that it was good.
I think the only way I’ll ever feel the urge to try that is if I’m already so hammered that it seems like a really good idea.

How I used it in the post: It wasn’t that bizarre for a shitfaced lunatic to interrupt his raving to say shhhhh to someone talking on the phone on the métro—he might have been hammered, but she was being rude.

Conflict of interest statement: I don’t have one. Sketch Engine doesn’t pay me to shill their stuff–I pay them to use it.

Cursing incoherently

I’m sitting at the breakfast table one beautiful spring morning when I start cursing in some incoherent mixture of French and English.

I’m sitting at the breakfast table one beautiful spring morning when I start cursing in some incoherent mixture of French and English: fuck! Mais c’est pas possible ! Bordel de cul ! No!!! What had happened: I was reading a comic book, and the ending touched me, deeply. A comic book. A COMIC BOOK. I read Céline, and he mostly makes me laugh; I read Jean Genet, and he makes me laugh even more; reading Les liaisons dangereuses, I often shut the book just to let the beauty of a sentence that I had just read sink in. But, what led me to break out in inarticulate multilingual shouts of rage and sadness was a comic book. A fucking COMIC BOOK.

I’m hanging out in a bookstore not far from my little deux-pièces (a two-room apartment, very common in Paris). I’m browsing through a book, and all of a sudden I have to put it down and dash to a quiet, hidden corner of the store, where I burst into sobs. (For context: I am an American male in his 50s, and American men of my generation do not, not, not cry.) What caused this sudden storm of emotion: a comic book. A comic book. A fucking COMIC BOOK.

Comic books–les bandes dessinées–are considered literature in France, like any other high-brow written form. It’s not unusual to see men and women in business suits or stereotypically academic clothes (which is to say, blue jeans and a backpack full of journal articles on math or literature) reading one on the train on the way to work in the morning, and comic books can get literary prizes just like anything else. The series that had me screaming over my breakfast was this one by Peru and Cholet:

Zombies, Tome 1 : La divine comédie

My Uncle John immigrated to the US from the UK as a young man and promptly joined the Army, which sent him to Korea. Before he died, some oral history project sent someone to interview him about the experience, and we learned things that he had never, ever talked about, like the time that he had to pile up a couple bodies of his dead pals so that he could shelter behind them while he shot at the North Koreans (or Chinese, or whoever it was that was actually behind the triggers on the other side). When I was a little tike, he made me solemnly swear to never read a comic book. I still feel a little guilty every time I pick one up–I feel exempt from fulfilling that particular oath, since I made it as a small child, but as an adult, I take promises super-seriously, and rarely make them. Hopefully, the quality–and power–of this particular one takes it out of the realm of the kinds of comic books that Uncle John was talking about. Yes: I was moved to rage and sadness by a comic book. A comic book. A fucking COMIC BOOK.

To my surprise, I notice that this is the 500th post on the Zipf’s Law blog. It’s super-amazing to me that this thing that started out as a way to publicize information about the judo clubs of Paris, and then evolved into a way to keep my family and friends up to date on Parisian adventures that were too long for Facebook posts, has become something else entirely, with as of today, more than 45,000 page views and just under 28,000 visits. I thank Ellen Epstein for suggesting the blog in the first place, and all of you who comment on the posts–you give me the positive feedback of knowing that someone out there listens to what I say, and the helpful guidance of pointing out my errors in French, explaining French history and culture to me, and the like. Even beyond the relief of getting the shit that grouille dans ma tête out of it and “on the page,” you folks who leave comments make this an enriching experience for me. Thank you again.

English notes:

tike/tyke: a small child. “When I was a little tike” is a common way of introducing something that you’re going to say about your early childhood.

French notes:

la bande dessinée : comic book, graphic novel.

How to abandon ship

The most important thing is to look before you leap: you have to expect the water to be full of debris, as well as your shipmates, and you don’t want to land on either of them.

May 19th, 2018

United States

Dear Zipf,

Chlöé says that she and her uncle both passed the highest ARC water-safety tests, but that her uncle, who got his cert a generation earlier, had to learn to jump into the water from destroyer-height, wearing a Mae West, without having the vest break his neck on hitting the water.

She wondered whether you’d learned how to do this, and if so, how to do it.

Reynaud

March 20, 2018

Zurich

Dear Reynaud,

Yep, sure did. The most important thing is to look before you leap: you have to expect the water to be full of debris, as well as your shipmates, and you don’t want to land on either of them. The vest thing makes perfect sense, but I don’t remember what to do about it–the old kapok vests have a high collar, which is meant to keep your face out of the water if you lose consciousness, and indeed, if forced straight upwards hard enough, it could probably take out your cervical spine. What I do remember is that when you jump, you hold your balls. And, no: I’m not kidding about your balls. The idea is to avoid them getting racked up when you hit the water. Today there are women on board ship, but I don’t know what they’re told to do. You’re also taught to use a hat, your shirt, or your pants as flotation devices. That last one is effective, but fucking HARD to do–I got worn out the first time I tried, and had to do it again to pass the test.

The basic thing once you’re safely off of the vessel is to get as far from the ship as quickly as possible: you don’t want to get sucked down when it sinks, and depending on how deep it is when (if) the engines explode, you could get injured by the shock.

The thing that they didn’t have us practice is swimming with burning oil on the surface. They told us that at night, the burning oil lights up the water underneath it, so you look for a shaft of darkness, swim up to the surface through it, take a breath, and then submerge again to find your way away from the oil.

Zipf

Here’s a video showing how to use your pants as a flotation device. This is actually better than what they were teaching when I was a squid (slang for “sailor”), in that we were taught to tie each pants leg individually, which is a hell of a lot harder than what this guy does: tie them together. Note that this guy is using a floating technique, so he’s not expending very much energy while he prepares his pants–we just treaded water, which is exhausting when you can’t use your arms to help ’cause they’re occupied trying to get your pant legs tied and the @#$% things inflated.

My ship, the USS Biddle. It’s a cruiser, formerly called a destroyer escort–bigger than a destroyer, but smaller than a lot of other things. Picture source: https://www.helis.com/database/unit/1068_USS_Biddle/ Hey, guess who *didn’t* serve? Donald Trump–multiple draft deferments for college, and then a claim of a bone spur in his foot. Snowflake.

English notes

Royal Air Force pilot wearing an inflated “Mae West” flotation device. Note how it comes behind the pilot’s neck–that is meant to keep his head out of the water if he’s unconscious. Picture source: http://www.alamy.com/stock-photo/mae-west.html

ARC: American Red Cross.
cert: certification.
destroyer: a small ship, mostly used to screen big ships from submarines and aircraft.
Mae West: a kind of life vest. It’s named after Mae West, a film star of the epoque known for playing super-sexy roles.

Mae West showing how you get a life vest named after yourself. Picture source: http://www.selenie.fr/2014/04/mae-west-la-sandaleuse-de-hollywood.html

French notes

This vocabulary comes up in Jean Genet’s lyrical Le miracle de la rose, in the occasional flights of fancy about shipboard promiscuity.

la frégate: frigate.

le destroyer or le contre-torpilleur: destroyer.

le croiseur: cruiser.

Giving back: Pronouncing English words that end with -ive

Paradoxically, the better your skill in a second language, the more your mistakes stick out.

I work with a couple of French folks whose English is so good that they are effectively native speakers, as far as I can tell. It’s super-impressive—if my French were ever anywhere near as good as their English…

It’s their very skills themselves that make it obvious when they make a pronunciation error–it’s as if I were making a pronunciation error. It is not at all the case that I don’t make pronunciation errors in my native language, and people most definitely do notice them–but, I suspect that they’re all the more obvious precisely because (a) I’m a native speaker, and (b) I’m an “educated native speaker” (sounds hoity-toity, but it’s a technical term in linguistics). I would guess that many of my “smaller” mistakes in French go unnoticed because they get lost in the thick fog of all of my other mistakes–in my native language, though, they all stand out.

hoity-toity: pretentious.

So, when my French-speaking-colleagues-who-are-essentially-native-speakers-of-English-too make pronunciation errors in English, it is, indeed, noticeable. Happily, their English-language pronunciation errors often fall into a single category, and that’s what we’re going to go after today–my little attempt to repay more hours than I even want to think of that they’ve spent hammering on my pronunciation/lexicon/syntax/politeness/EVERYTHING in French.

You may have noticed that written vowels in English are pronounced differently than those vowels would be pronounced in essentially every other written language on the planet. (That’s just a fraction of all languages, by the way–the vast majority of languages have no writing system.)

The reason behind all of this English-versus-the-world divergence in vowel sound pronunciation is something called the Great Vowel Shift. It changed the pronunciation of many vowel sounds, and it happened after English spelling was mostly established. The result was that English vowel sounds didn’t line up with their spelling as well as they used to.

The Great Vowel Shift, with approximate dates–and yes, with some training in phonetics, it does make perfect sense. Picture source: http://sites.millersville.edu/bduncan/221/history/4.html

One of the changes in pronunciation affected words that happen to be spelled with an e at the end. It’s a silent e now, but it wasn’t always. The preceding vowel sound changed–in a very systematic way that requires knowing a bit about what you do with your mouth to make sense of–and one of the consequences was that if that preceding vowel was i, it went from being pronounced like i in most languages to being pronounced like the word eye is pronounced today.

So, today, if you’re an Anglophone kid, you grow up being taught that when a word ends in -iCe, where C means any consonant, the i indicates the sound of the word eye. There are plenty of examples of this:

five
drive
dive
thrive
alive
hive
archive
strive

But–and this is a big “but” (which is why I italicized and underlined it)–iCe (i followed by a consonant followed by an e at the end of the word) is not always pronounced that way. There are plenty of times when it is not, and those tend to be longer words that educated people would use, and my French co-workers are super-educated, so they use these words. For some of the native speakers of French that I know, mis-pronouncing these words is essentially the only mistake that I ever hear them make in English. So: let’s work through some of these.

You’ll notice something about the words that are pronounced the way that Anglophone kids are told you always pronounce -iCe: they tend to be single-syllable. Consider:

five
drive
dive
thrive
live (the adjective only, as in live bait)
hive

But, not all single-syllable words of this type are pronounced that way. Here’s the one counter-example that I can think of:

give

And, not all of the words in which -iCe is pronounced like “eye” are single-syllable words. The counter-examples that I can think of:

archive
derive
arrive
survive
revive
deprive

I know what you’re thinking now: Zipf, this is simple–regardless of the number of syllables, the i is pronounced as in five if it’s in a STRESSED syllable. And, yes, that almost works–but, consider archive, which is stressed on the first syllable, but is still pronounced like five.

…and live is weird–when it’s a verb, it’s pronounced like give, but when it’s an adjective, it’s pronounced like five.

OK, we’re more or less good with the words that end in iCe and get pronounced like five. What about the words that don’t get pronounced like five? Let’s take a look at some. Now, I’m not going to select these randomly. I went to this web page on the Morewords.com web site. What it gave me is a list of words that end in -ive, sorted by how frequent they are. Here’s what the output looks like. You’ll notice that every word is followed by two numbers. The first one is the length of the word in letters, while the second one is how many times the word occurs in every million words of text. (What collection of texts did they do their counts in? They don’t say.) So, give is 4 letters long and occurs 1735 times per million words, executive is 9 letters long and occurs 171 times per million words, and so on.

[Screenshot of the word list] Source: MoreWords.com
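For anyone wondering where a “per million words” figure comes from, here is a minimal Python sketch of the normalization. The function name and the counts in the example are invented for illustration (the site doesn’t say what it counted or in which texts), so treat this as one plausible way to get such a number, not as what Morewords.com actually does.

```python
def per_million(count, corpus_size):
    """Scale a raw count to occurrences per million words of text.

    count: how many times the word was seen (hypothetical number)
    corpus_size: total number of words in the collection (hypothetical number)
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


if __name__ == "__main__":
    # Invented example: a word seen 342 times in a 2-million-word collection.
    print(per_million(342, 2_000_000))
```

The point of the normalization is that it lets you compare counts made in collections of different sizes.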

With that list in my greedy little fingers, I’ll go through it and pull out some of the ones that are not pronounced like five. That gives us this:

receive
executive
alternative
objective
representative
conservative
effective
initiative
positive
relative
olive

…and there’s a little attempt to help with the already-almost-perfect English spoken by so many of my French colleagues. Got a funny story related to mispronunciation? Tell us about it in the comments…

The last duel in France: traces in syntax

The last duel in France leads to a discussion of syntactic theory, ’cause that’s how I roll.

Wanna watch the last duel in France? Here you go. Scroll down past the video for an excerpt from an article on the topic from Le monde and the definitions of some of the French vocabulary therein.

The article in Le monde: click here. Some relevant vocabulary:

retrousser [+ sleeves or pant legs] : to roll up. Elle avait un de mes pyjamas dont elle avait retroussé les manches. (Camus, L’étranger)

l’hôtel particulier : like a château, but it’s in a city, versus being in the country, and it could just as well be owned by a bourgeois as an aristocrat–I think it’s actually more likely to have been owned by a bourgeois, at least in Paris. Don’t quote me on this.

Dans un jardin ombragé par des arbustes bienveillants, enveloppé d’une douceur printanière, chemise blanche, col ouvert, manches retroussées, deux hommes, épée à la main, se jugent, se jaugent, puis, sur un signe de l’arbitre, croisent le fer. Quatre minutes plus tard, le combat cesse un des deux duellistes ayant été touché par deux fois au bras. Cette scène n’est extraite d’aucun roman ou film de cape et d’épée. Elle eut lieu il y a exactement cinquante ans, le 21 avril 1967, dans le parc d’un hôtel particulier de Neuilly-sur-Seine.

English notes

wanna: the written form of the contraction of want + to. One of the interesting things about this contraction is that it is only possible in specific syntactic contexts, and is absolutely impossible in others. This lets you distinguish between two readings of what looks like the same question. Suppose that the following situations exist:

1. There is going to be a contest. Whoever wins the contest will be awarded a horse. There are a number of horses available, and the winner of the contest will be able to choose the horse that they will receive.
2. There is going to be a horse race. One of the horses will win the race.

In situation number 1, if you want to ask someone which of the horses they would choose were they to win the contest, you could ask the question in either of two ways. The second one is more casual, but they are both completely acceptable from a linguistic point of view:

Which horse do you want to win?

Which horse do you wanna win?

In situation number 2, if you think that someone has a preference regarding the winner of the race, and you want to ask them which of the participating horses they hope will emerge the winner of the race, you only have one option:

Which horse do you want to win?

Google the quoted phrase “which horse do you wanna win” and you will get 5 results, all of them in Japanese. WTF, you’re wondering…

What you’re seeing in the Google results is sentences that illustrate interesting syntactic phenomena. Most of the literature on syntax is written about English syntax (blame Chomsky), mostly by (notoriously monolingual) anglophones, and the classic examples in the field are hence mostly in English. (Actually, the only classic non-English examples that I can think of are in Swiss German–more on that another time, perhaps.) The which horse do you want to/wanna win sentences are used in classic transformational-generative grammar to argue for the existence of something called a trace. This is held to be something that is present in the structure of the sentence, but that is not observable–the claim is that you can’t “see” it, but it’s there. What is that “it”? The idea is that underlying those two sentences are two “deeper” forms:

For situation 1 (there’s a contest, and the winner gets a horse): Which horse do you want to win [the horse]?
For situation 2 (there’s a horse race, and one of the horses will win): Which horse do you want [the horse] to win?

(Linguists in the audience: yes, I am simplifying this for didactic purposes–no hate mail, please.) In both cases, the bracketed [the horse] goes away; in the second case, the “trace” that is left behind blocks the contraction of want + to to wanna.

Now, I know what you’re thinking: It’s obsessing about things like this that keeps Zipf from ever getting a second date. …and you’re right, I imagine.

Matching game IV: Zipf’s Law in French

Zipf’s Law is why if someone is looking for a web page and types “dogs in marseilles” into the query box, your search engine should pay no attention to the word “in,” some attention to “dogs,” and quite a bit of attention to “marseilles.”

Zipf’s Law describes the frequencies of words: there is a very, very small number of words that occur very, very often, and a very, very large number of words that occur very, very rarely–but, they do occur. This blog is focused on one of the consequences of Zipf’s Law: it means that if you are seriously studying a second language, you are going to run into words that you don’t know every day for the rest of your life.
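For the programmers in the audience, here is a minimal Python sketch of how you might look at that distribution yourself. The sample sentence and the function names are invented for illustration, and a dozen words is far too few to show a real Zipfian curve; point it at a whole book to see the long tail of words that occur exactly once.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count word tokens, ignoring case. Tokenization here is deliberately crude."""
    return Counter(re.findall(r"\w+", text.lower()))


def rank_frequency(counts):
    """Return (rank, word, frequency) triples, most frequent word first.

    Ties are broken alphabetically so that the output is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]


def hapax_share(counts):
    """Fraction of distinct words that occur exactly once."""
    if not counts:
        return 0.0
    return sum(1 for freq in counts.values() if freq == 1) / len(counts)


# Invented toy text, not taken from any real corpus.
SAMPLE = "the dog saw the cat and the cat saw the bird"

if __name__ == "__main__":
    counts = word_frequencies(SAMPLE)
    for rank, word, freq in rank_frequency(counts):
        print(rank, word, freq)
    print("share of words seen only once:", hapax_share(counts))
```

Even in the toy text, one word accounts for a large share of the tokens while half of the distinct words show up only once.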

You know how the matching game works: we have words in English, words in French, and we match them. Today’s words (and a tiny bit of grammar) are taken from the discussion of Zipf’s Law in the book Recherche d’information: Applications, modèles et algorithmes, by Massih-Reza Amini and Éric Gaussier, second edition. Recherche d’information is information retrieval, the task of finding documents in response to an information need: what Google does for you every day. One of the great embarrassments of linguistics is the fact that information retrieval is mostly about language, in the sense that mostly what you’re looking for is web pages with stuff written for them and you use words to find them–and yet, most of the work of information retrieval is done without actually doing anything that looks very much like doing anything with language. At its heart, the technology of information retrieval is almost entirely done with counting and very simple arithmetic–nothing linguistic there. You could think of that very simple arithmetic as taking advantage of Zipf’s Law–the very simple arithmetic is used to figure out things like the fact that if someone is looking for a web page and types dogs in marseilles into the query box, your search engine should pay no attention to the word in, some attention to dogs, and quite a bit of attention to marseilles when it is making the decision about which web pages to put at the top of the search results. Scroll down to find today’s vocabulary items, and click on the pictures of the relevant pages from Amini and Gaussier’s book if you’d like to see those words in context. As for me: a second cup of coffee, go over these flashcards, and then off to the lab. Today’s goal: explain why researchers calculated the ratio of vocabulary size to length of conversation of a bunch of soldiers–after chasing them through the woods, catching them, depriving them of food and sleep, and then interrogating them.
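The “very simple arithmetic” mentioned above can be sketched in a few lines of Python. What follows is a standard inverse document frequency (IDF) weight, which is one common way to do this; I am not claiming that it is the formula on the pages of Amini and Gaussier shown here, or what any particular search engine uses. The four “documents” are invented, and giving a weight of zero to a word that appears nowhere is an arbitrary choice made for the sketch.

```python
import math


def idf(term, documents):
    """Inverse document frequency: log(N / number of documents containing the term).

    documents is a list of sets of words. A term found in every document gets
    weight 0; the rarer the term, the higher the weight.
    """
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # arbitrary choice for unseen terms in this sketch
    return math.log(len(documents) / containing)


# Invented toy collection, standing in for the web.
RAW_DOCS = [
    "dogs sleep in the sun",
    "cats live in paris",
    "dogs bark in the park",
    "the port in marseilles is old",
]
DOCS = [set(text.split()) for text in RAW_DOCS]

QUERY = ["dogs", "in", "marseilles"]
WEIGHTS = {term: idf(term, DOCS) for term in QUERY}

if __name__ == "__main__":
    for term in QUERY:
        print(term, round(WEIGHTS[term], 3))
```

With this toy collection, in appears in every document and so gets no weight at all, dogs appears in half of them and gets some, and marseilles appears in only one and gets the most. Nothing here knows anything about language: it is counting and a logarithm.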

I included La fréquence du second mot because I’ve been trying to understand when to use second and when to use deuxième. If I understand the Académie’s Dire/Ne pas dire page correctly, the Academy would prefer that this be deuxième, but not even the Académie thinks that it’s mandatory to make the distinction:

On peut, par souci de précision et d’élégance, réserver l’emploi de second aux énoncés où l’on ne considère que deux éléments, et n’employer deuxième que lorsque l’énumération va au-delà de deux. Cette distinction n’est pas obligatoire.

On veillera toutefois à employer l’adjectif second, plus ancien que deuxième, dans un certain nombre de locutions et d’expressions où il doit être préféré : seconde main, seconde nature, etc., et dans des emplois substantivés : le second du navire.

academie-francaise.fr/second-deuxieme

As the CarriereOnline.com web site puts it: C’est pour cela qu’on parle de la Seconde Guerre mondiale parce qu’on espère qu’il n’y en aura pas de troisième !
