For my money, I really need to get more sleep

I can’t sleep, which leads to tokenization issues and the definition of “for my money.”

I don’t sleep well.  That is to say: I don’t sleep very much.  Not at night, anyway.

In the best-case scenario, the middle of the night, when in theory I should be sleeping, is my time to study vocabulary or to read.  In the worst-case scenario, the middle of the night is when I return emails from people who are in North America, and therefore awake.

Tonight’s email brought a help-wanted ad from the School of Informatics at the University of Edinburgh, posted by the amazing Mirella Lapata.  (I say “amazing” because her paper with Regina Barzilay at the Association for Computational Linguistics annual meeting in 2005 opened my eyes to the possibilities for inventive evaluation strategies in computational linguistics in a way that my eyes had not previously been opened.)  For my money, the University of Edinburgh’s graduate program in computational linguistics is the best in the world, so I forwarded Mirella’s email to the students in our program, most of whom are not computational linguists, but most of whom would be quite suited for one of the advertised jobs in the School of Informatics.  I added the following introduction to the email:

Picture source: me.

This got me the following response from one of my students in the US (and therefore awake):

Picture source: also me.

Now, I love getting this kind of question, for many reasons.  It lets me repay the apparently endless patience of my colleagues in France for my crappy command of their language.  It lets me be the person who knows the answer to a question about language, which in French happens exactly never.  It gives me a socially acceptable excuse for talking about language, which I enjoy way more than is cool.  It suggests that someone actually both read and thought about what I wrote.  (You pick whichever one you think portrays me in the best light.)  In fact, I love that kind of question so much that I will often go out and find naturally occurring examples, which, like any good linguist these days, I do on the Interwebs.  A trip to the Sketch Engine web site and a search of the Open American National Corpus found me these:

Picture source: screen shot of the Sketch Engine web site.

…which, of course, like most things of interest, leads to a question. In this case, the question is: what’s wrong with the Sketch Engine web site?  Where did all of those spaces come from?  

The answer: there’s nothing wrong with the Sketch Engine web site.  Part of any analysis of written data is choosing an answer to this question: what is a word?  It’s not typically obvious what the answer is.  Give students in a beginning language processing class this sentence, and ask them what the words are:

My dog has fleas.  

(For reasons that are obscure to me but that I think have something to do with playing the ukulele, that is a famous sentence.)  Ask them what the words are, and the first answer will be anything separated by white space:

My dog has fleas.

…at which point they quickly realize that they’ve just posited that fleas. is a word, and they modify their hypothesis, to be anything separated by white space and stripped of punctuation: 

My dog has fleas .
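Here is a minimal Python sketch of those first two hypotheses, using only the standard library. The function names are mine, invented for illustration, and this is not a claim about how Sketch Engine or any other tool does it.

```python
import re


def whitespace_tokens(text):
    """First hypothesis: a word is anything separated by white space."""
    return text.split()


def punctuation_split_tokens(text):
    """Second hypothesis: punctuation marks become tokens of their own."""
    # \w+ matches a run of letters or digits; [^\w\s] matches a single
    # character that is neither a word character nor white space.
    return re.findall(r"\w+|[^\w\s]", text)


print(whitespace_tokens("My dog has fleas."))
# ['My', 'dog', 'has', 'fleas.']
print(punctuation_split_tokens("My dog has fleas."))
# ['My', 'dog', 'has', 'fleas', '.']
```

Note that the second function would also split an apostrophe out of the middle of a contraction, which is exactly the kind of trouble that comes up next.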

(I’m not making this up–in fact, I did it in class last Tuesday.)  Next they figure out that they probably want My and my to be considered the same word, which means that they need to do something about the case of letters, and if they speak any of the bazillion languages that have more inflectional morphology (example in a minute) than English does, then they might want to do something with aller/allais/allai/allasse, etc.
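The case-folding step can be sketched in a few lines of Python. The function name and the example sentence are hypothetical, made up for illustration. Collapsing aller/allais/allai/allasse into one word would need a lemmatizer or morphological analyzer, which this sketch does not attempt.

```python
from collections import Counter


def count_types(tokens, fold_case=False):
    """Count distinct word forms, optionally ignoring letter case."""
    if fold_case:
        # casefold() is a more aggressive lower() intended for
        # caseless string comparison.
        tokens = [token.casefold() for token in tokens]
    return Counter(tokens)


tokens = "My dog bit my cat".split()
print(count_types(tokens))                  # My and my counted separately
print(count_types(tokens, fold_case=True))  # my counted twice
```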

Things get pretty complicated pretty quickly, though.  Suppose that you’re dealing with English.  What do you do with

wouldn’t don’t haven’t didn’t

Seems pretty straightforward–you want something like this:

would n’t do n’t have n’t did n’t

…except that it’s not straightforward at all, because then you have to propose

wo n’t

…which people generally aren’t happy about.
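To see where wo n’t comes from, here is a rough sketch of a rule that splits the negative off the end of a token. The pattern and the function name are my own assumptions for illustration, not anybody's official tokenizer. The rule has no idea that won’t is irregular, so it treats it like all the others.

```python
import re

# A word ending in n't, written with a straight or a curly apostrophe.
NEGATIVE_CONTRACTION = re.compile(r"^(\w+)(n['’]t)$", re.IGNORECASE)


def split_negative_contraction(token):
    """Split e.g. wouldn’t into would + n’t; leave other tokens alone."""
    match = NEGATIVE_CONTRACTION.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


for word in ["wouldn’t", "don’t", "haven’t", "didn’t", "won’t", "dog"]:
    print(word, "->", split_negative_contraction(word))
```

Getting will n’t out of won’t instead would take a list of exceptions, which is one more decision that somebody has to make.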

The table of contents of “Le mot,” by Maurice Pergnier. The point of the picture is that the first 46 pages of the book address the various arguments for and against the whole idea of the word. Picture source: me.

There are a variety of ways to answer these sorts of questions, and it does actually matter.  From a practical point of view, the choices that you make about how you do this–the process is called tokenization–are important enough that they affect the performance of computer programs that do things with language.  (Here’s a recent paper on the topic.)  From a theoretical point of view, your choice takes a position on a hugely controversial topic in linguistics: what a word is.  (The best discussion of the controversy that I’m aware of is in the book Le mot, by Maurice Pergnier.)

So, why are those spaces there in the Sketch Engine output?  Let’s look at it again:

Picture source: screen shot of the Sketch Engine web site.

One of the immediately obvious things is that they have “tokenized” the punctuation off, so that “personal growth” becomes “ personal growth ” and (1995) becomes ( 1995 ).  The next thing that you might notice is that there is some ambiguity in the output.  Look at what happens to that’s and people’s…

…which become

that ’s and people ’s

Now we have two ’s …and they are different, but look the same.  What is a computer program to do with that?  Welcome to my world.  Nobody said that computational linguistics was going to be all about suicide prevention and curing cancer, right?
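The problem is easy to reproduce. This is a small sketch, with a pattern and a function name that I made up for the purpose: once the clitic is split off, the contracted is and the possessive marker are the same string, and nothing in the token itself says which one you have.

```python
import re

# A word followed by 's, written with a straight or a curly apostrophe.
CLITIC_S = re.compile(r"^(\w+)(['’]s)$")


def split_clitic_s(token):
    """Split that’s into that + ’s; leave other tokens alone."""
    match = CLITIC_S.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


contraction = split_clitic_s("that’s")   # ’s stands for "is"
possessive = split_clitic_s("people’s")  # ’s marks possession
print(contraction, possessive)
print(contraction[1] == possessive[1])   # True: identical strings
```

Telling the two apart takes information from the context, for example a part-of-speech tagger, which is beyond what this sketch does.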

The Regina Barzilay and Mirella Lapata paper that I mentioned above:

Regina Barzilay and Mirella Lapata. 2005. Modeling Local Coherence: An Entity-based Approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 141-148. Ann Arbor.

The declaration of competing interests: I don’t have any.  Sketch Engine doesn’t pay me–I pay them, and I get a hell of a lot of use out of it.  

French notes

— Je vous regardais tout à l’heure, vous étiez marants tous les deux le flicmane et vous.

— A tes yeux, dit la veuve Mouaque.

— A mes yeux?  Quoi, “à mes yeux”?

— Marants, dit la veuve Mouaque.  A d’autres yeux, pas marants.

— Les pas marants, dit Zazie, je les emmerde.

Raymond Queneau, Zazie dans le métro

How would you say for my money in French, or more generally, label something as someone’s opinion, yours or otherwise?  There are a lot of options, and unfortunately, I don’t know the status of any of them with respect to register of language, contexts in which they are or aren’t appropriate, etc.  Here’s what I’ve come across so far, and I should point out that I also don’t know which of these can only gracefully be used to introduce your own opinion, versus which could also be used to introduce someone else’s opinion.  I’ll also mention (and then I’ll shut up) that of all of these, I’ve heard the first one (à mon avis) the most, the second one exactly once (in Raymond Queneau’s Zazie dans le métro), and the rest never, as far as I know.  If any of you native speakers out there can offer suggestions about when and where to use which of these, it would be great.

  • à mon avis
  • à mes yeux
  • selon moi 
  • à mon gré
  • de mon point de vue
  • d’après moi
  • d’après mon point de vue
  • à mon sens
  • de l’aveu de qqn: I think that this one implies something negative, along the lines of “as Chomsky himself admits,” as opposed to relaying an opinion about which you’re not necessarily making any judgment one way or the other.

6 thoughts on “For my money, I really need to get more sleep”

  1. All the choices you provide above belong to a very correct register of language and are interchangeable in any situation . I see two intruders in the list : “à mon gré” doesn’t concern opinions but how people want or like things being done “J’étais satisfait parce que je pouvais travailler à mon gré”, “Il est heureux que les choses se fassent à son gré”. “De l’aveu de …” hardly concerns opinions, although it can happen, but more often feelings ” De l’aveu de Paul l’accident de sa belle-mère a été une heureuse surprise” .
    You can add ” Pour ma/sa part” that introduces a different opinion . “Il/je pense ça, pour ma/sa part je/il pense plutôt que … ” Same for “de mon/son côté”.
    Thank you about “for my money” . In repaying “marrant” takes two Rs .

    1. OK: I searched monolingual (French-French) dictionaries going back to 1606, and can’t find any word spelled “marant,” with one “r.” However: Queneau uses this spelling several times, in widely separated parts of the book. Your thoughts?

