The big thing in language these days is distance-based representations of semantics. The idea is that the meaning of a word can be discussed in terms of its closeness to, or distance from, other words.
How the hell would you measure that? Current approaches to distance-based semantics are based on something called the distributional hypothesis: the idea that a word’s meaning is, in essence, the set of words that it occurs with. (With which it occurs, if you prefer.) When you have sets, you can calculate the distance between (or closeness between–it doesn’t matter what you call it) those sets. I’ll give you an example of this in which we’ll use a distance metric (metric, in this case, means a number that measures something) called the Jaccard index. It’s based on counting the number of things that two sets have in common and then adjusting it with respect to the total number of things in the sets.
Let’s walk through the intuitions behind the Jaccard index. The first intuition: the more things that you share with another set, the more similar to that set you are. Let’s think about two sets of words:
What do those two sets share?
That’s three things. Now let’s look at Set 1 again, versus a third set:
How many things do they share?
Based just on the counts of things that these three sets have in common, you might say that Set 1 and Set 2 are the most similar to each other, since they have the most things in common.
Now, it’s a bit more complicated than this. Think about these two pairs of sets, and tell me which you think is closer: Set 3/Set 4, or Set 1/Set 2? Here’s Set 1/Set 2 again:
…and here’s Set 3/Set 4:
To brumate: similar to hibernating, but the state of dormancy is not as deep.
Set 1/Set 2 share 3 things. Set 3/Set 4 share even more–5 things. But how much more similar does that make them? I’m going to suggest that it’s not as much as you might think, because the amount that Set 3 and Set 4 share has to be weighed against the fact that Set 4 has more things in it than any of the other sets.
How can we take this difference in the set sizes into account? We’ll do something called “normalizing” the count of the things that they share: we’ll make it relative to the sizes of the sets that we’re comparing. How we’ll calculate the sizes of the sets: we’ll count up the total number of words that you would get if you added both sets of words together, and only counted each unique word one time. We’ll go back to Sets 1 and 2:
What are the unique words in the combination of both sets?
There are 10 total words in the two sets, but if you only count each word once–each unique word, that is to say–you have 7. Now let’s look at Sets 3 and 4, this time counting the unique words that are found in the combination of the two sets:
To normalize the number of things that two sets have in common by the total number of types of things, we divide the number of things that they have in common by the total number of unique things in the two sets combined. So, for Set 1 and Set 2:
3 things in common / 7 types of things = 0.43
For Set 3 and Set 4:
5 things in common / 10 types of things = 0.50
…and those are the Jaccard indexes for Set 1 and Set 2, and for Set 3 and Set 4.
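The whole calculation fits in a couple of lines of Python. One caveat: the actual words in the sets aren’t reproduced here, so the sets below are invented stand-ins–what matters is that they have the same overlap as the post’s examples (3 shared words out of 7 unique, and 5 out of 10).

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

# Invented stand-ins with the same overlap as Sets 1 and 2:
# 3 shared words, 7 unique words in the combination.
set_1 = {"tail", "fur", "paw", "leash", "vet"}
set_2 = {"tail", "fur", "paw", "whiskers", "purr"}

# ...and the same overlap as Sets 3 and 4:
# 5 shared words, 10 unique words in the combination.
set_3 = {"scales", "tail", "cage", "heat", "lamp", "brumate"}
set_4 = {"scales", "tail", "cage", "heat", "brumate",
         "venom", "fangs", "slither", "coil"}

print(round(jaccard(set_1, set_2), 2))  # 0.43
print(round(jaccard(set_3, set_4), 2))  # 0.5
print(jaccard(set_1, set_1))            # 1.0 -- a set compared with itself
```

Python’s built-in `set` type does the heavy lifting: `&` is intersection (the shared things) and `|` is union (the unique things in the combination).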
Let me give you one more pair: Set 1 and Set 1. If you calculate the similarity between a set and itself, you get a value of 1.0 (the things it shares with itself and the total unique things are the same, so you’re dividing a number by itself). What you should take from that is that the range of values for the Jaccard index is from 0.0 to 1.0. Knowing that, you have a point of reference: if the Jaccard index is close to 1.0, then the two things are very similar (because identical things give you a Jaccard index of 1.0). On the other hand, if two things are very different, then you’ll have a Jaccard index that’s close to zero. This might seem obvious, but imagine if there were no upper limit on how big the Jaccard index could get. What would 20 mean? What would 4,808 mean? Who the hell knows? Metrics in the range of 0.0 to 1.0 are the ginchiest.
So, now that we have a way to quantify the similarity (or difference) between two sets based on the quantity of things that they share, normalized by the total quantity of things: suppose that those things are the other words that some word occurs with. If you replace the names of the sets like this:
- Set 1: dog
- Set 2: cat
- Set 3: iguana
- Set 4: snake
…then you could imagine the words in those sets being the words that dog, cat, iguana, and snake occur with. When we calculate our numbers, we end up with dog being more like cat than it is like iguana or snake. In contrast, our numbers are consistent with the idea that iguanas are more like snakes than they are like dogs or cats…and that’s one way that you can think about quantifying the similarities between the meanings of words.
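To make that concrete, here’s a sketch that compares the four animals using invented context sets–these words are illustrative stand-ins, not drawn from any real corpus, but they show the pattern the post describes:

```python
def jaccard(a, b):
    """Jaccard index: shared things divided by total unique things."""
    return len(a & b) / len(a | b)

# Invented context sets: the words each animal name might occur with.
# (Illustrative only -- not counts from an actual corpus.)
contexts = {
    "dog":    {"tail", "fur", "paw", "leash", "vet"},
    "cat":    {"tail", "fur", "paw", "whiskers", "purr"},
    "iguana": {"scales", "tail", "cage", "heat", "lamp", "brumate"},
    "snake":  {"scales", "tail", "cage", "heat", "brumate",
               "venom", "fangs", "slither", "coil"},
}

for w1, w2 in [("dog", "cat"), ("dog", "iguana"),
               ("iguana", "snake"), ("iguana", "cat")]:
    print(w1, w2, round(jaccard(contexts[w1], contexts[w2]), 2))
```

With these made-up sets, dog/cat comes out at 0.43 while dog/iguana is only 0.1, and iguana/snake (0.5) beats iguana/cat–which is exactly the pattern of similarities we want the numbers to capture.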
These particular kinds of distributional representations of word meanings go back to the 1950s and the work of John Firth, who famously (OK: famously among linguists) said that “you shall know a word by the company it keeps.” Distributional representations suddenly became popular in the language processing world (surprising, to some extent, because the language processing world is populated much more by computer scientists than by linguists) a few years back, for two reasons:
- Thanks to the Internet, we now have access to quantities of textual data that are big enough to let you calculate reliable numbers–you need a lot of data to actually make this kind of approach work.
- People have recently had some success with figuring out ways to do calculations of these numbers in ways that are efficient enough to be able to handle those enormous quantities of data without bringing every supercomputer in the world to its knees. If you tried to do something like this naively, you would be calculating the similarity between every word and every other word; no one actually knows how many words there are in (to take one example) English, but if you assume a vocabulary of 100,000 words, you’re talking about a table with 10,000,000,000 cells in it. (If you could do one calculation a second, filling that table works out to a bit over 300 years.) A few years ago people came up with a couple of ways of reducing that number drastically, and that makes it practical to do the calculations and to store their results. Now my laptop can crunch the numbers for a few million words’ worth of text overnight.
We’ve talked about calculating the Jaccard index today (shared things divided by total things), and calculating it on the basis of words. That’s a very straightforward way of doing this–the Jaccard index is the simplest distance metric that I know of, and words are the easiest things to count. However: words are actually much more difficult to count in real life–or even to define–than they seem to be in the examples that we looked at, and there are lots of other things that one could count that might work out better. There are also different ways to define what counts as “occurring together.” To give you some examples of the kinds of questions that you need to think about in doing this kind of thing:
- Words: What is a unique word? Do you want to count Dog and dog as the same word, even though one starts with an upper-case letter, and the other starts with a lower-case letter? Do you want to count reproduisisse, reproduisît, reproduisissions, reproduisissiez, and reproduisissent (the forms of the imparfait du subjonctif of the French verb reproduire, “to reproduce”) as the same word? How about pet peeve and bête noire–do those count as one word, or two? Do you want to count bete noire as the same word as bete noir, bête noire, and bête noir? (More generally: do you count a misspelled word as the same word as its correctly spelled equivalent? If so: how the hell do you spell-correct the Internet?)
- Things to count: Do you want to count $1, $2.25, 50%, and 75% as 4 different things? Maybe you want to consider them all as numbers, in which case there is just 1 “thing”? Maybe you want to count $1 and $2.25 as prices, and 50% and 75% as percentages, in which case there are 2 things?
- What “occurring together” means: Is it occurring in the same sentence? The same newspaper article? The same book? Maybe it means occurring within two words to the right or within two words to the left–i.e., occurring within the four surrounding words?
…and that’s the kind of thing that will keep graduate students busy for the next 5 years or so, unless something else becomes au courant in the meantime (au courant discussed below in the French and English notes), in which case all of the grad students who were betting their careers on the latest cool thing will be spending some time engaging in some serious nombrilisme and then either starting all over again or quitting grad school and going into building better search engines for Twitter or something. Welcome to my world.
For more information on distance-based semantics and its alternatives, see Elisabetta Jezek’s The lexicon: an introduction.
I got into this 2400-word little essay in the course of trying to come up with a way to respond to a series of comments on my last post in which we got into a discussion of whether or not the English word bete noire means the same as the English word pet peeve (see how I snuck an assumption in there about how many “words” are in bete noire and pet peeve?). Obviously I went down a bit of a rabbit-hole here. More on the bete noire/pet peeve thing some other time, if Trump doesn’t nuke some country because that country’s president said something mean about him (remember how he was saying that Hillary wasn’t “tough enough?”) and bring the world as we know it to an end, along with all of the electricity. A quick discussion of some relevant French and English words follows.
French and English notes
au courant: This expression exists in both English and French, but with different uses in the two languages. In French, it means something like “up to date,” and is used to describe people. In English, it can be used in the same way, but is also (and I think more commonly, although I don’t have the data to demonstrate this, one way or the other) used to describe things, in which case it means something like “in fashion.” Additionally, in English, this is a very high-register expression–you wouldn’t use it with just anybody. Here are some French examples from the frTenTen12 corpus, a collection of 9.9 billion words of French scraped off of the Internet that I searched via the Sketch Engine web site:
- Nous tiendrons nos lecteurs au courant de cette tentative… (“We will keep our readers informed about this attempt…”)
- …que Dieu t’entende pas petite Marie Bon courage et tiens nous au courant… (“…may God hear you, little Marie. Good luck, and keep us posted…”)
- Vous êtes au courant de ces dangers, vous devez donc protéger votre PC contre toutes intrusions. (“You are aware of these dangers, so you must protect your PC against any intrusion.”)
- …mea culpa, je n’étais pas au courant (“…mea culpa, I wasn’t aware of it”)
- …ni les Etats-Unis ni l’URSS n’ont été au courant de cet événement… (“…neither the United States nor the USSR was aware of this event…”)
- Peut-être que le jeune mutant était au courant, aussi elle décida de l’attendre devant la porte. (“Perhaps the young mutant knew about it, so she decided to wait for him in front of the door.”)
I like the second-to-last one, because it describes two countries, rather than the people that you would expect.
To find examples of au courant in English, I went to the enTenTen13 corpus, a collection of 19.7 billion words of English-language text, which, again, I accessed through Sketch Engine. Here is some of what I found:
- …a library of au courant phraseology and jabber…
- Where once the adage “Things go better with bacon” was au courant, “Things go better with cheese” is timeless.
- That isn’t to say that paisley prints are reserved solely for custom-fitted, au courant French fashion houses; just the opposite.
- Pappardelle is the au courant cut of pasta right now…
- It’s all very au courant, yet it’s not at all.
- Being au courant can be its own sort of stultifying endgame.
Comparing the experience of putting these two lists together, I can tell you that I had to hunt to find examples of au courant in French where it wasn’t modifying a human, and I had to hunt to find examples of au courant in English where it was modifying a human (my last example here is the closest that I came). Here’s how it was used in the post: That’s the kind of thing that will keep graduate students busy for the next 5 years or so, unless something else becomes au courant in the meantime, in which case all of the grad students who were betting their careers on the latest cool thing will be spending some time engaging in some serious nombrilisme and then either starting all over again or quitting grad school and going into building better search engines for Twitter or something.