I don’t sleep well. That is to say: I don’t sleep very much. Not at night, anyway.
In the best-case scenario, the middle of the night, when in theory I should be sleeping, is my time to study vocabulary or to read. In the worst-case scenario, the middle of the night is when I return emails from people who are in North America, and therefore awake.
Tonight’s email brought a help-wanted ad from the School of Informatics at the University of Edinburgh, posted by the amazing Mirella Lapata. (I say “amazing” because her paper with Regina Barzilay at the Association for Computational Linguistics annual meeting in 2005 opened my eyes to the possibilities for inventive evaluation strategies in computational linguistics in a way that my eyes had not previously been opened.) For my money, the University of Edinburgh’s graduate program in computational linguistics is the best in the world, so I forwarded Mirella’s email to the students in our program, most of whom are not computational linguists, but most of whom would be quite suited for one of the advertised jobs in the School of Informatics. I added the following introduction to the email:
This got me the following response from one of my students in the US (and therefore awake):
Now, I love getting this kind of question, for many reasons. It lets me repay the apparently endless patience of my colleagues in France for my crappy command of their language. It lets me be the person who knows the answer to a question about language, which in French happens exactly never. It gives me a socially acceptable excuse for talking about language, which I enjoy way more than is cool. It suggests that someone actually both read and thought about what I wrote. (You pick whichever one you think portrays me in the best light.) In fact, I love that kind of question so much that I will often go out and find naturally occurring examples, which, like any good linguist these days, I do on the Interwebs. A trip to the Sketch Engine web site and a search of the Open American National Corpus found me these:
…which, of course, like most things of interest, leads to a question. In this case, the question is: what’s wrong with the Sketch Engine web site? Where did all of those spaces come from?
The answer: there’s nothing wrong with the Sketch Engine web site. Part of any analysis of written data is choosing an answer to this question: what is a word? It’s not typically obvious what the answer is. Give students in a beginning language processing class this sentence, and ask them what the words are:
My dog has fleas.
(For reasons that are obscure to me but that I think have something to do with playing the ukulele, that is a famous sentence.) Ask them what the words are, and the first answer will be anything separated by white space:
…at which point they quickly realize that they’ve just posited that “fleas.” is a word, and they modify their hypothesis to be anything separated by white space and stripped of punctuation:
(I’m not making this up–in fact, I did it in class last Tuesday.) Next they figure out that they probably want My and my to be considered the same word, which means that they need to do something about the case of letters, and if they speak any of the bazillion languages that have more inflectional morphology (example in a minute) than English does, then they might want to do something with aller/allais/allai/allasse, etc.
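If you want to follow along at home, here is a minimal Python sketch of those first three hypotheses. The function names are invented for illustration, the punctuation list is Python's ASCII-only string.punctuation (so curly quotes would survive), and nothing here attempts the inflectional morphology problem.

```python
import string

def by_whitespace(sentence):
    # Hypothesis 1: a word is anything separated by white space.
    return sentence.split()

def strip_punctuation(tokens):
    # Hypothesis 2: ...with leading and trailing ASCII punctuation removed.
    return [t.strip(string.punctuation) for t in tokens]

def fold_case(tokens):
    # Hypothesis 3: ...with case folded, so that My and my are the same word.
    return [t.lower() for t in tokens]

sentence = "My dog has fleas."
print(by_whitespace(sentence))
print(strip_punctuation(by_whitespace(sentence)))
print(fold_case(strip_punctuation(by_whitespace(sentence))))
```

The first line of output still contains fleas. with its period attached; the second and third lines show each revision of the hypothesis in turn.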
Things get pretty complicated pretty quickly, though. Suppose that you’re dealing with English. What do you do with
wouldn’t don’t haven’t didn’t
Seems pretty straightforward–you want something like this:
would n’t do n’t have n’t did n’t
…except that it’s not straightforward at all, because then you have to propose
…which people generally aren’t happy about.
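To see how quickly the "straightforward" rule goes sideways, here is a small Python sketch of it. The rule (split a word-final n’t off as its own token) and the function name split_negation are assumptions made for illustration, not the behavior of any particular tool.

```python
import re

# Naive rule: anything ending in n't (straight or curly apostrophe)
# is split into a stem plus the token n't.
NEGATION = re.compile(r"^(.+)(n['’]t)$", re.IGNORECASE)

def split_negation(word):
    match = NEGATION.match(word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]

for word in ["wouldn’t", "don’t", "haven’t", "didn’t", "won’t", "can’t"]:
    print(word, "->", " ".join(split_negation(word)))
```

The first four come out as would, do, have, and did plus n’t. The same rule applied to won’t and can’t leaves you with the stems wo and ca, neither of which you will find standing alone in a dictionary.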
There are a variety of ways to answer these sorts of questions, and it does actually matter. From a practical point of view, the choices that you make about how you do this (the process is called tokenization) are important enough that they affect the performance of computer programs that do things with language. (Here’s a recent paper on the topic.) From a theoretical point of view, your choice takes a position on a hugely controversial topic in linguistics: what a word is. (The best discussion of the controversy that I’m aware of is in the book Le mot, by Maurice Pergnier.)
So, why are those spaces there in the Sketch Engine output? Let’s look at it again:
One of the immediately obvious things is that they have “tokenized” the punctuation off, so that “personal growth” becomes “ personal growth ” and (1995) becomes ( 1995 ). The next thing that you might notice is that there is some ambiguity in the output. Look at what happens to that’s and people’s:
that ’s and people ’s
Now we have two ’s tokens, and they are different (one is a contracted verb, the other a possessive marker), but they look the same. What is a computer program to do with that? Welcome to my world. Nobody said that computational linguistics was going to be all about suicide prevention and curing cancer, right?
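For the curious, the effect is easy to reproduce. What follows is not Sketch Engine’s actual tokenizer; it is a minimal Python sketch under the assumption that punctuation and a trailing ’s are each split off as separate tokens. The pattern and the function name tokenize are made up for illustration.

```python
import re

# Alternatives are tried in order at each position:
#   ['’]s\b   an apostrophe-s (straight or curly apostrophe)
#   \w+       a run of letters or digits
#   [^\w\s]   any single punctuation character
TOKEN = re.compile(r"['’]s\b|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(" ".join(tokenize("“personal growth” (1995)")))
print(" ".join(tokenize("that’s")))
print(" ".join(tokenize("people’s")))

# The two second tokens are the same string, even though one came from
# a contracted verb and the other from a possessive.
print(tokenize("that’s")[1] == tokenize("people’s")[1])
```

The last line prints True: once the tokens are split apart, nothing in the string itself tells the two kinds of ’s apart, so any program that cares has to look at the context.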
The Regina Barzilay and Mirella Lapata paper that I mentioned above:
Regina Barzilay and Mirella Lapata. 2005. Modeling Local Coherence: An Entity-based Approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 141-148. Ann Arbor.
The declaration of competing interests: I don’t have any. Sketch Engine doesn’t pay me–I pay them, and I get a hell of a lot of use out of it.
— Je vous regardais tout à l’heure, vous étiez marants tous les deux le flicmane et vous.
— A tes yeux, dit la veuve Mouaque.
— A mes yeux? Quoi, “à mes yeux”?
— Marants, dit la veuve Mouaque. A d’autres yeux, pas marants.
— Les pas marants, dit Zazie, je les emmerde.
—Raymond Queneau, Zazie dans le métro
How would you say “for my money” in French, or more generally, label something as someone’s opinion, yours or otherwise? There are a lot of options, and unfortunately, I don’t know the status of any of them with respect to register of language, contexts in which they are or aren’t appropriate, etc. Here’s what I’ve come across so far. I should point out that I also don’t know which of these can only gracefully be used to introduce your own opinion, and which could also be used to introduce someone else’s. I’ll also mention (and then I’ll shut up) that of all of these, I’ve heard the first one (à mon avis) the most, the second one exactly once (in Raymond Queneau’s Zazie dans le métro), and the rest never, as far as I know. If any of you native speakers out there can offer suggestions about when and where to use which of these, it would be great.
- à mon avis
- à mes yeux
- selon moi
- à mon gré
- de mon point de vue
- d’après moi
- d’après mon point de vue
- à mon sens
- de l’aveu de qqn: I think that this one implies something negative, along the lines of “as Chomsky himself admits,” as opposed to relaying an opinion about which you’re not necessarily making any judgment one way or the other.