English – Page 21 – Zipf's Law

Compound nouns: why my kid said friendgirl instead of girlfriend

The errors of a child learning their native language can be tremendously interesting.

Picture: French knife vocabulary.

When my kid was about four years old, he went through a period where he switched the order of certain kinds of words. It wasn’t random–this happened only with a particular kind of word formed by putting two nouns together. For example, he would say:

light kitchen instead of “kitchen light”
friendgirl instead of “girlfriend”

On the other hand, if there were a noun preceded by an adjective, he got the order right:

big kitchen
mean girl

The phenomenon has some implications for theories of how children learn language. In particular, it’s difficult to give a simple behaviorist explanation for this phenomenon, where the kid gets exposed to stimuli, repeats them, and gets reinforced for producing them correctly: to my knowledge, the kid was never exposed to things like friendgirl. There are also interesting things about his pronunciation of these things on a smaller scale, though, and in particular, how we make compounds–read on, if you want to know more.

One of the most difficult problems in getting a computer to understand language is understanding compound nouns. These are nouns that are made up of two or more words in a sequence. The toughest ones can be compounds where the words that make up the compound are both nouns. For example, in English:

school bus
kitchen cupboard
fire engine

I’ve given you examples where the two nouns are written with a space between them, but they might also be spelt with a hyphen, or without a space. For example:

gunboat (no space)
timesheet (no space)
rainbow (no space)
gun-carriage (hyphen)
train-spotting (hyphen, and yes, you are allowed to argue about whether or not spotting is a noun)

From a theoretical perspective, there isn’t a distinction between these–they’re all compound nouns. From the point of view of writing a computer program that deals with language, we would tend to treat the ones that are written with a hyphen or with no space as single words that don’t necessarily get analyzed further, but the ones written with a space usually need special treatment. (In fact, amongst people who do natural language processing, there’s a whole field of research concerning what are called multi-word expressions.)

From both a theoretical and a practical perspective, the big question about compound nouns is: how can you describe, understand, and get a computer to deal with the different kinds of relationships that can exist between the nouns? It’s not a random thing–languages tend to exploit particular kinds of relationships in compounds. Even describing these things from the perspective of theoretical linguistics is tough, though, separately from the practical problem of getting a computer program to process them. A classic English example (due, I believe, to the recently departed linguist Chuck Fillmore) is the names for different kinds of knives in English.

bread knife: a knife for cutting bread
butter knife: a knife for spreading butter
pocket knife: a knife that is carried in a pocket
butcher knife: a knife that is used by a butcher
palette knife: a knife that is shaped like a palette
utility knife: a knife that is used in food preparation
paring knife: a knife that is used for paring
steak knife: a knife that is used for cutting steak
boning knife: a knife that is used to trim meat from a bone
boot knife: a knife that’s meant to be carried on or in a boot

Just with this partial list, we can see some patterns of semantic relationships between the nouns in the compound:

intended material	bread knife, butter knife, steak knife
used by	butcher knife
used for	paring knife, boning knife
carried in	pocket knife, boot knife
shaped as	palette knife
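
For readers who think in code, the grouping above can be written down as a small lookup table. This is only a sketch in Python: the names `RELATIONS` and `relation_of` are made up for illustration, and the table holds nothing but the compounds and relations listed above.

```python
# Hypothetical table: semantic relation -> compounds from the list above.
RELATIONS = {
    "intended material": ["bread knife", "butter knife", "steak knife"],
    "used by": ["butcher knife"],
    "used for": ["paring knife", "boning knife"],
    "carried in": ["pocket knife", "boot knife"],
    "shaped as": ["palette knife"],
}


def relation_of(compound):
    """Return the relation listed for a compound, or None if it is unclassified."""
    for relation, compounds in RELATIONS.items():
        if compound in compounds:
            return relation
    return None


print(relation_of("boot knife"))     # carried in
print(relation_of("utility knife"))  # None
```

A plain lookup like this has nothing to say about a compound that nobody has classified yet, and that is exactly where the trouble starts.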

Dog bones at a Hungarian butcher shop in Cleveland, Ohio. Picture source: me.

How should we classify utility knife? Or dog bone? I don’t know. As I said, this is difficult–it’s not like this is something that they teach you in linguistics grad school. And, do you get to just make these kinds of relationships up on an ad hoc basis? If so, you’ve got descriptions that couldn’t possibly be shown to be wrong, and from a scientific point of view, that’s bad–your theories need to be testable, and falsifiable. (Generally we assume that we can’t prove anything, but we do try to construct theories in such a way that if they’re wrong, in principle we should be able to demonstrate that.) Some people have proposed limited sets of relationships that they hope can capture all such compound nouns–for example, the Generative Lexicon theory of James Pustejovsky. It’s not clear that all of the issues that are involved in this are resolved, though.

Rather than this kind of noun-noun compound, French generally has nouns modified by prepositional phrases. That is, you have the noun, then a preposition, and then another noun. For example, compare these English and French nouns:

railroad (rail + road)	chemin de fer
windmill	moulin à vent
wine glass	verre à vin
goods transport	transport de marchandises

For more examples, see the picture in this post, which shows the vocabulary for a variety of kinds of knives in French.

It’s not the case that all French nouns of this sort follow the prepositional phrase pattern–for example, we have homme grenouille, “frogman.” But, the pattern with the prepositional phrase is much more common. Having said that: one of the biggest mysteries of French for me is how you know when the preposition will be de versus à. Is there some principle that would let me know that it’s a boîte à gants (glovebox) and a cuillère à café (coffee spoon), but an animal de compagnie (pet) and a crème de cacao? A boîte à bijoux (jewelry box), but a boîte d’allumettes (matchbox)? A boîte à chaussures (shoebox), but a boîte de nuit (nightclub)? I have no clue.

Some details of compound nouns in English: the pronunciation of these things is different from phrases with adjectives. In general, in a compound noun, you’ll have the stress on the first noun, e.g.:

chef’s knife is pronounced CHEF’S knife, while David’s knife would usually be pronounced with equal stress on both words.
coffee spoon is pronounced COFFEE spoon, while yellow spoon would be pronounced with stress on both words.
beat box is pronounced BEAT box, while big box would be pronounced with stress on both words.

Some details of compound nouns in French: I have no clue how to pluralize these things, and I’m not sure that all French people do, either. Here’s what the Wikipedia page on French compound nouns has to say on the topic. It breaks the compounds down to what they’re made up of: a noun plus a noun, a verb plus a noun, a noun plus a verb, etc.:

noun + noun: pluralize both. Example: oiseau-mouche, oiseaux-mouches (hummingbird). Exception: I don’t understand the Wikipedia explanation for this, but sometimes you only pluralize the first noun: des chefs-d’œuvre (masterpiece), des arcs-en-ciel (rainbow).
verb + noun: plural only at the end. Example: cure-dent, cure-dents. Exception: I don’t understand the Wikipedia explanation for this, either, but sometimes you don’t mark the plural at all: des chasse-neige (snowplow) (= chasser la neige, devenu variable dans l’orthographe de 1990), des trompe-l’œil… (direct quote from Wikipedia)
adjective + noun: pluralize both. Example: la basse-cour, des basses-cours (farmyard; chickens and rabbits; outer courtyard).
verb + verb: don’t mark the plural at all. Example: des garde-manger (pantry).

If you’d like to know more about the Generative Lexicon theory and how it accounts for these kinds of relationships between nouns, but don’t feel like you want to tackle the primary sources (I have a PhD in linguistics and I’ve never been able to finish working my way through the last chapter), there’s a book called Generative Lexicon theory: A guide, by James Pustejovsky and Elisabetta Jezek, coming out. For a detailed discussion of relationships in this kind of noun in French and Italian, see this paper by Pierrette Bouillon, Elisabetta Jezek, Chiara Melloni, and Aurélie Picton. (I got some of the examples in this post from there.)

So, back to my poor kid: why friendgirl and light kitchen, but mean girl and big kitchen? He seems to have come up with some conception of there being a difference between the compound nouns and a sequence of an adjective and a noun. Remember that he was maybe 4 years old, so no one taught him this. As is characteristic of kids learning their native language(s), he came up with a hypothesis about how to produce the difference between these things, and what he came up with was an ordering difference for the compound nouns. So: don’t freak out if your kid comes up with some weird things in the language department, and be aware that it’s mostly not worth trying to correct them–it’s not like they’re consciously aware of these “rules,” and nothing that you can say to them is going to change them. However: they’ll figure it out. Keep Calm And Keep Talking.

Some French vocabulary on the topic:

le mot composé: compound word

The Paris hustling ecosystem: the bad side

There are scammers all over the world, but there are some scams that are especially Parisian.

The good meaning of hustle. Picture source: http://www.top-law-schools.com/forums/viewtopic.php?t=177021&start=50.

The verb to hustle can have a couple different meanings in English, one of which is good, and one of which is bad.

The good meaning of hustle: behaving with what the Merriam-Webster dictionary calls “energetic activity.” Someone who’s hustling in this sense is working hard; moving around a lot; expending a lot of effort, in a good way. If you want to get into a good college, you’re going to have to hustle this year. She really hustled, and she finished the program early. Commonly said to athletes: Come on, get out there and show some hustle!
The bad meaning of hustle: “to sell something to or obtain something from by energetic and especially underhanded activity…to lure less skillful players into competing against oneself at (a gambling game)” (Merriam-Webster dictionary again). (“Underhanded” means through trickery or dishonesty.) This is basically the same meaning as to con someone–to trick them out of money—and a hustle (it can be a noun, too) can also be known as a con, or a con game, or a confidence game (which is where the shorter name comes from).
A pool hustler is more or less the archetype of the hustler. Pool hustlers are excellent pool players. They trick people into betting with them by pretending to not be very good, and then reveal their true skill after the bets are laid. Picture source: http://bankingwiththebeard.com/?p=1425.

You will find people running hustles (or cons) pretty much everywhere you go in the world, including places where there are no tourists–people try to hustle the locals, too. But, there are some hustles that are especially common in Paris, and some that I haven’t seen anywhere else. Read on for descriptions of how they work.

The common Parisian hustles

There are some pretty common hustles in Paris, and you will probably see at least one of these if you go to any of the famous tourist sites (and you totally should–I firmly believe that everyone should do as many of the stereotypical Paris tourist things as they can, at least once). Here are the things that you’re likely to see:

The ring hustle
The friendship bracelet
3-card Monte, or whatever
The fake petition
The fake deaf/mute

What I find especially interesting about all of this is that there is a system in operation here–an ecosystem, if you will. We saw in a previous post that there are specific kinds of beggars that do their thing in specific areas–the guys who make speeches on subways, the Roma ladies on the Champs-Élysées, etc. There’s a similar kind of system in effect with regard to hustles–different groups more or less own specific hustles, and specific hustles are associated with specific areas of Paris.

In addition, there are some common types of robbery: picking pockets, and snatch-and-runs. You can find countless web pages on the subject of how to avoid getting your pocket picked in Paris, and I won’t belabor the point. Of course, the vast majority of people will have no trouble with thieves at all (although I do have a friend who had his pocket picked twice during the same visit to our fair city–just rotten luck). The only thing that I would add to the bazillion web pages on not getting your pocket picked in Paris is this: don’t lay your cell phone on the table while you’re talking, or even while you’re reading emails or something–you should have it in your hands at all times, and if you’re standing in the middle of the sidewalk looking at it, you should have it tightly in your hands. Now that cell phones can be worth hundreds of dollars, picking them up off of a table on the patio outside of a cafe, or even snatching them out of someone’s hands, and running off is unfortunately a thing.

The ring hustle

British police officer with confiscated fake rings used in the ring scam. They use identical rings in France. Picture source: http://content.met.police.uk/cs/Satellite?blobcol=urldata&blobheadername1=Content-Type&blobheadervalue1=image%2Fjpeg&blobkey=id&blobtable=MungoBlobs&blobwhere=1283551938574&ssbinary=true.

The basic principle of this is that you and someone else find a gold ring at the same time, and they try to convince you that you should give them money in exchange for “their share” of the ring. The ring is a piece of crap. I once had the same guy try this one on me twice within twenty minutes on the same bridge. He tried it as I was crossing the bridge in one direction, and then again as I crossed back the other way–I think he might not have been very focussed that day. How exactly you both happen to discover this thing at the same time can vary, and how exactly the person tries to talk you out of your money can vary, but the basic principle is the same: ring, money.

This is pretty much a Roma thing, as far as I can tell. In Paris, you should especially watch for this one on the bridges over the Seine–why, I have no clue.

The friendship bracelet

This lady made the mistake of being polite to the guy and not ignoring him and walking off–now she’s been snagged. Picture source: https://www.corporatetravelsafety.com/safety-tips/watch-out-for-the-infamous-paris-string-scam/.

The basic principle of this is that you are offered a free friendship bracelet by a friendly guy. In fact, you don’t even have to accept it–he’ll just grab your hand and start putting it on you, if you don’t avoid him well. Once it’s on you, it’s no longer free, and he demands a lot of money for it. Part of what makes this work is that the guy uses the bracelet as a handle to keep you physically under control–in the best (for him)/worst (for you) case, by using your finger to make the thing for you (see below). This is almost entirely a West African thing, and the hotbed is the steps of the Sacré Coeur basilica. Why? I have no idea.

The shell game

Make no mistake: the people who are doing the things that I’m describing on this page are scumbags. They steal–they just mostly don’t use violence to do it. In the case of the shell game (and its card-based relative, known as 3-card Monte in English) though, I have to admit that I find it somewhat difficult to feel as much empathy for the victims as I usually do. This is despite the fact that if you fall for this one, you are probably going to lose much, much more money to this con than you would to anything else on this page. More on that in a minute.

Hieronymus Bosch’s painting “The Conjurer,” painted between 1475 and 1480. Notice that the guy on the left in white with a black top is stealing the purse of the guy who’s watching closely. Picture source: https://commons.wikimedia.org/wiki/File:Hieronymus_Bosch_051.jpg.

The basic idea here is that the guy running the con has three cups. He’ll put something under one of them, move the three cups around, and then give free money to anyone who can guess which cup it’s under. It’s easy–you see the guy just giving money away. He gets you to put up some of your own money. You do, and all of a sudden you guess wrong. I watched a guy doing this a couple weeks ago–he was trying to get people to put up 100 euros.

The reason that I find it harder to empathize with people who get caught by this one than with people who fall for the other cons that I describe on this page is this: people have been pulling this shit for over 2,000 years. The shell game existed in Ancient Greece. It was already all over Europe in the Middle Ages. How can people not have heard of this?? I have no clue.

This is mostly a Roma thing, although I saw what appeared to be a South Asian guy doing it once. I’ve often seen it in Paris in the near surroundings of the Eiffel Tower–mostly on the Iéna Bridge, and I don’t remember seeing it anywhere else. I have to say that this is the rarest of the Paris hustles–it requires a fair amount of set-up, and a number of confederates (when I was watching the other night, there were four adult males involved, one of whom was pretending to be a stranger playing the game, and two others of whom were hanging around discreetly nearby and watching–if you get pissed and try to take your money back from the guy, good luck duking it out with four adult males at the same time). It’s also super, super illegal, so although the potential benefits to the crooks are large, the potential costs are, too.

The fake petition

A pretty young girl who looks like pretty much every pretty young girl I’ve ever seen doing this hustle in Paris. Picture source: https://www.corporatetravelsafety.com/safety-tips/deaf-mute-scams-in-europe/.

The basic idea: a pretty girl asks you to sign a petition. For no reason that I understand, it’s typically about better treatment for the deaf, and indeed, she pretends to be deaf. Once you’ve signed, you’re pressured to donate some money for the cause. She’s not deaf, nor are the other pretty girls who are with her with their own identical petitions, nor are the other pretty girls who you’ll see in other parts of Paris with their identical petitions on the same day. In a variant of the usual approach, while you’re signing the petition, someone is picking your pocket. This is mostly a Roma thing, and it’s common in front of Notre Dame and the surrounding areas, as well as the Hôtel de Ville.

The fake deaf-mute

This one happens on the local trains. A guy gets on board and walks up and down the train leaving little printed notes on the empty seats, explaining that he is deaf/mute/whatever, and do you have a little spare change? These guys are actually the least objectionable of all of the folks who I describe on this page–they don’t pester you. I saw a variant of this in Slovenia last week–the guy went through restaurants, leaving his little cards (trilingual–Slovenian, Italian, and German) on the tables, with a couple little trinkets that you were invited to buy.

The free flower/rosemary/herb of some variety or another

This is a variety of the here’s-something-free-that-suddenly-isn’t-free-anymore scam. I haven’t actually seen it in France, but I include it for completeness. In the Spanish version, it’s a little old lady on the steps of a church. If you don’t give her money, you are threatened with a Roma curse. (I actually find this somewhat charming–who gets cursed anymore?) I ran into a wonderful version involving an attractive woman in an extremely short dress in Turkey. Wonderful mostly not in that there was an attractive woman involved, but in that I was able to participate in the ensuing mess with only as much knowledge of Turkish as you get from the Pimsleur course:

My little adventure with a “free flower” lady in Istanbul.

There are indeed lots of guys wandering through the restaurants in tourist areas trying to sell you roses in Paris, but there’s no deception involved (at least, not that I’ve experienced, and I did double-check this with a local), and they’re typically not pushy (pushiness being an identifying feature of hustling in its bad sense–see above)–it’s not really a hustle (in the bad way), per se. I would call it the good kind of hustle–see a later post on the subject.

Videos of these folks in action

Here are some videos of these folks in action. I didn’t shoot these–more on why you shouldn’t try to, either, below. This is all stuff that I found on YouTube.

First, some pretty good footage of the friendship bracelet thing, shot in Italy. I haven’t seen the shoulder thing in France, but the principle is similar–the guy does whatever he can to establish a situation such that you are physically in possession of the bracelet. Other interesting points: notice the repeated use of a question that the guys know you’ve been answering automatically several times a day, and that it feels rude not to respond to: where are you from? It’s also a question that lets the guy quickly establish some sort of rapport with you. Another cute thing about this: notice the guy who keeps saying waka waka? That’s not a Sesame Street thing–it’s Cameroonian English (Cameroon is a country in West Africa with two official languages: French, and English.) It’s an exhortation–literally, it means something like “walk while working.” You can hear it in Shakira’s theme song for the 2010 soccer (football, sorry) World Cup.

There’s a lot of dead footage in the beginning of this next video, but right about at the middle there’s some great footage of an attempt to snatch someone’s bags as they’re boarding the subway. It’s a good view of how proximity to the door of a metro car is used to snatch stuff. Atypically, these young ladies were unsuccessful, but you get the picture of how it works.

Don’t try to film these guys in action

Don’t try to film any of this shit! I think it’s great that people can get footage of this kind of shitty behavior and then post it on YouTube for the edification of the rest of us, but photographing or shooting video of a criminal in action is an excellent way to get punched a couple times and to have your expensive cell phone stolen. Déconseillé, as we say in these parts.

Final words: don’t berate yourself, don’t be scared, don’t let it ruin your vacation, and don’t feel obliged to be polite to these folks

If you get snagged by the evil kind of hustler, it’s really easy to berate yourself afterwards for being a fool, a sucker. Don’t. Unless you go for the shell game, you’re not–these people are pros, they make their living this way. This kind of incident can really sour you on wherever you happen to be, too, and really cast a cloud over your trip. Don’t let that happen! These people are the tiniest, tiniest, tiniest, tiniest fraction of the people you’ll meet, and they’re pretty unlikely to be Parisians, or even French. Plus, unless you fall for the shell game thing, these guys don’t actually take that much money off of you, and there are far, far more expensive hustles being worked in China and Turkey right now. It’s also worth pointing out that there is very little violent crime in this country. In America, you can get shot to death in a road rage incident pretty much any day of your life–it’s just a fact of life in our gun-cursed country. In France, you might get robbed, but the chances of your being physically attacked if you’re not visibly Jewish are very, very low (and even if you are visibly Jewish, your chances of being physically attacked are still pretty low). So, use some common sense, be aware that all you have to do is ignore these people, or in the case of a friendship bracelet guy handing you something, feel free to drop it on the ground and walk off without a word. The truth is, these people are trying to rip you off, and you do not owe them one single tiny bit of the typical American friendly politeness to strangers. You should also realize that there are plenty of people out there on the streets of Paris trying to make a living via the good meaning of “hustle”–just getting out there and working long hours in all kinds of weather, perhaps not totally within the law, but not hurting anyone, either. We’ll talk about those in another post.

un tour de passe-passe: one French expression for the shell game–can native speakers help me with others?
l’arnaque (n.f.): rip-off, swindle, fraud, con.
arnaquer qqn: to rip off, swindle, or con someone.
c’est de l’arnaque: that’s highway robbery!
se faire arnaquer: to get ripped off, to be had.

How to flunk your rotation in informatics: insights from burrowing mammals

Trigger warning: this post contains graphic descriptions of Talpidae-phobic violence. Sorry, no French language stuff here–come back tomorrow (or so) for our usual exploration of the implications of the statistical properties of language for second-language learners.

Woodchuck poo. If you’d like to know how woodchuck poo can be relevant to your career in informatics, read on. Picture source: http://www.harpercollege.edu/ls-hs/bio/dept/guide/gallery/evidence/scat/original/Woodchuck_Scat.jpg.

Here’s some advice on how to flunk your rotation in informatics. I’ve written this with details that are specific to my particular field–natural language processing–but the broader ideas apply to informatics in general, to dissertation-writing in most academic fields that I can think of, and outside of academia, to software development jobs, to grant-writing, or to almost anything with a deadline at which you will be evaluated at some point. Following this advice won’t guarantee that you’ll flunk your rotation, but not following it is an excellent way to improve your chances of passing.

Be afraid to ask questions

This is the biggie. Afraid that people will think you’re stupid if you ask questions? Don’t be–they’ll definitely think that you’re stupid if you don’t, and then don’t figure stuff out some other way. The absolute best students I’ve known were two people who had weekly appointments with me while they were doing their studies, specifically to ask questions. One of them is a rapidly rising star at a government research institute now, and the other is running a bioinformatics program. If you can’t get over your fear of asking questions, your chances of professional success are low. (I don’t mean to imply that I’m any good at answering questions–but, something about the nature of that interchange seems to have made some sort of contribution to their educations.)

Don’t make a schedule

As soon as you figure out what you’re doing for your project, don’t do what we do in the military—if you want to flunk your rotation. What we do in the military: write down a list of every step that has to be accomplished to get from where you are now to where you need to be at the end of the rotation. Are you going to think of everything? No–but, you’re going to think of most things. Don’t obsess about that.

Now put the due date by the last thing in your list of things that have to be done. Work backwards, estimating the time by which you will hit each of the preceding steps.

Now ask a question: is the date by which you would need to have started in order to get done on time already past? If so: go back to your advisor, because you need to modify your project–now. If not: great! So far, you’re on track!

A good way to flunk your rotation is to not have any way to estimate whether or not you’re on schedule to finish on time. If you don’t want to flunk your rotation: make a realistic schedule that lists everything that you have to do, and by when each step needs to be finished. (See the sidebar for one way to do this.) Go back to your timeline frequently, and make sure that you’re on track to finish by the due date. If you’re not on track: figure out what you need to do differently to get back on schedule. If you are on track: great! Part of the beauty of working out your timeline early is that you find out quickly if you’re falling behind, but to my mind, the real beauty of working out your timeline is that if you see that you’re on schedule, you have a license not to be anxious. No point in sweating if you’re on track to finish on time at the moment. Schedules can be anxiety-inducing if you fall behind, but that’s OK–if you’re falling behind, you want to figure that out now, not a month from now. The thing is, schedules can also be reassuring–if you know that you’re not behind, then there is no reason at all to lie awake at night worrying.
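
If you’d rather let a computer do the working-backwards part, here is a minimal sketch in Python. Everything specific in it is an assumption made up for illustration: the function name `plan_backwards`, the steps, the day counts, and the dates. It also pretends that steps happen strictly one after another and that every calendar day is a working day, which is rarely quite true.

```python
from datetime import date, timedelta


def plan_backwards(steps, due):
    """Work backwards from the due date.

    steps: list of (name, estimated_days), in the order they must be done.
    Returns a list of (name, start_date, finish_date) in the same order.
    """
    plan = []
    finish = due
    for name, days in reversed(steps):
        start = finish - timedelta(days=days)
        plan.append((name, start, finish))
        finish = start  # the preceding step has to be finished by this date
    plan.reverse()
    return plan


# Hypothetical steps and estimates, purely for illustration.
steps = [
    ("confirm that test data is available", 5),
    ("build a first version of the system", 20),
    ("run the evaluation", 10),
    ("write up and prepare the talk", 7),
]
due = date(2025, 6, 30)
today = date(2025, 5, 1)  # fixed so that the example is reproducible

plan = plan_backwards(steps, due)
for name, start, finish in plan:
    print(f"{start} -> {finish}  {name}")

must_start_by = plan[0][1]
if must_start_by < today:
    print("Already behind: go talk to your advisor about changing the project.")
else:
    print(f"On track, as long as you start by {must_start_by}.")
```

Re-run it with the real `date.today()` whenever you revisit your timeline, and revise the estimates as you learn how long things actually take.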

Don’t establish immediately that there’s data available on which to test your system

This is the number-one informatics-specific rookie mistake. (The being-afraid-to-ask-questions thing is an indiscriminate killer of everyone.) Suppose that your rotation project is to build a system that whacks moles. (English note: the verb to whack means “to strike with a smart or resounding blow.” (Source: Merriam-Webster.) It can also mean “to kill,” especially when talking about organized crime.) You’re going to want to demonstrate that it does, in fact, whack moles: if you can’t actually get your hands on any moles, you’re going to be asking the faculty to just take your word for it that this would be a really, really great mole-whacker, and that’s not likely to happen. If you find out two weeks before your rotation ends/your conference submission deadline/your grant submission deadline that there’s no data available with which to test your interesting hypothesis, it’s probably game over–come back next semester/next year/next shift in national scientific priorities and try again. On the other hand, if you realize very quickly that there’s this interesting hypothesis but no existing data with which to test it, and then you propose a way to create the data and an associated evaluation methodology, that’s an excellent approach to doing a rotation/writing a paper/getting a grant. You can use the data to test your hypothesis in the next rotation/paper/grant proposal, and you’ll be the first one to do so (important in academia), ’cause there was never any data around that would have let anyone do the experiment before.

Neil Sarkar, the Founding Director of Brown University’s Brown Center for Biomedical Informatics, makes a related point that is crucial for people doing rotations in biomedical informatics: “One thing to also consider is importance of knowing when an Institutional Review Board protocol must be filed… And not trying to evade the process of getting Institutional Review Board approval…” It’s important to think about this up front, and if you need this kind of institutional approval, you want to ask for it early, because these things can take an amazing amount of time just to prepare the request, and then you have to wait through the approval process, too.

An aside: I’m guessing that all of you non-informatics people out there are thinking that I’m just making things up with this whole issue of mole availability or lack thereof–click here for the search page of Jackson Labs, which exists in large part to connect researchers with mice that have very specific genetic characteristics needed for an incredible variety of experimental investigations. You say that you need some Chinese hamster ovary cells? I ask: what kind? Click here for the CHO-K1 line. 575 euros. They’re super-important in research on therapeutic recombinant proteins. You say you’re a surgeon who does kidney transplants, and you want to do a better job of getting kidneys to survive between when you take them out of the recently-departed and put them into the recipient? You need to understand metabolism at low temperatures. You say you want to understand metabolism at low temperatures? You need to understand hibernation. You say you want to understand hibernation? You need a lab full of arctic ground squirrels. How does a surgeon who does kidney transplants get their hands on a bunch of arctic ground squirrels? Go to the Arctic Circle (during the summer, obviously, ’cause they hibernate in the winter) with a bunch of carrots–see here for an article about how fun this is (warning: graphic picture of an arctic ground squirrel on an anesthesia machine), here for how to figure out where to put your traps, and here for details on things like the trade-offs associated with large traps versus small traps, the relative effectiveness of selective site trapping versus grid trapping, how to use a girth hitch sling to allow a single person to handle an arctic ground squirrel alone, and some stuff about toe amputation that we don’t need to go into. This undoubtedly sounds like a lot of work, and it is. 
It could be worse, though–if what your research requires is woodchucks (useful for the study of a particular kind of liver cancer called hepadnavirus-associated hepatocellular carcinoma), you may have to raise them in the lab yourself. This is a huge big deal if you’ve got a deadline, because they only breed in March and April, and then they’re pregnant for a month, and then they don’t actually have very large litters after all of that. Now, if you’re reading this, you probably are studying some forms of informatics, and thinking: this guy’s full of shit–I don’t need no stinking woodchucks. But, keep in mind that long time-lags are common in informatics research. For example, the CRAFT corpus took over three years to build, and PropBank has been growing for well over a decade. Data is precious, and sometimes it’s expensive, and it’s not always there when you need it–unlike Chinese hamster ovary cells, it’s often not possible to just go to a web site and buy what you need. So, if you don’t want to find yourself doing the informatics equivalent of scooping the woodchuck litter boxes while the rest of your classmates are giving triumphant rotation talks, the question of availability of data for testing your system has to be the very first thing that you resolve after you walk out of your new rotation supervisor’s office to go sit in your carrel with a warm feeling in your heart and visions of an endowed professorship at Stanford. Let me repeat the word available–the fact that your medical school has 10 petabytes of electronic health records with all of the data that you need in them does you no good whatsoever if you can’t get access to them.

Don’t establish scoring criteria up front

You want to have a conversation with your rotation supervisor very early in the process about what will constitute success. Suppose that your project is to build a system that whacks moles. What does it mean to have built a system that whacks moles? Does it have to be a successful system, or can it just exist? If it has to be successful: what does “successful” mean? Does it have to kill the moles, or is it OK to just tap them on the head? Maybe it’s actually preferable to just tap them on the head? If you don’t ask, you won’t know. Does it have to whack every mole, or is it OK if it focusses on whacking the moles that smell bad? If it whacks one mole one time, does that satisfy the requirements of the mole-whacking-system-building project, or does it need to continue whacking moles unto eternity, and if so, what are the requirements regarding the ability of the system to continue whacking moles when the zombie apocalypse comes and there is no more electricity? If it misses 1 mole out of 10, would that still constitute mole-whacking? What about if it misses 5 moles out of 10? 
Suppose that what’s really wanted is a system that whacks every mole, every time, exactly on the top of the head, with uniformly fatal results, all the way through the zombie apocalypse until the spirit of cooperation, mutual assistance, and recognition that we are all connected in a web of interdependence restores humanity to its rightful zombie-free position on the planet. Your system is only catching 50% of the moles, sometimes it punches them in the stomach instead of whacking them on the head, and you don’t really have a good plan for the whole what-happens-when-there’s-no-more-electricity thing–but, in the process of building the system, you’ve come across a really novel approach to thinking about mole-whacking that is likely to yield real insight into the nature of moles, the nature of whacking, and how to think about speciesist violence in terms of a general framework with applicability to subterranean mammals as a whole, and possibly also some of the smaller lizards–but, not until a couple months after your project is over and grades are submitted. Does that count as success? If you haven’t asked, you don’t know. This might seem persnickety, but I have most definitely seen the situation where the student (or software engineer, or grant writer, or whatever) thought that they were supposed to be whacking moles in the sense of small fossorial mammals, but what their rotation supervisor was looking for was a system that whacks moles in the sense of a spy who has integrated themselves into an organization, and those situations most definitely did not end in a way that led to the student feeling happy. (See above for how you can use fear of asking questions about things like this to increase the chances of flunking your rotation.)

A pithier version of the preceding, very long paragraph: the great suicidologist Ed Shneidman used to say that “the most dangerous four-letter word in the English language is only.” (If you’re not a native speaker of English: a “four-letter word” is an idiom meaning a curse word–fuck, shit, piss, etc.) The biggest warning sign of an impending rotation-failure (or comprehensive exam, or missed grant deadline, or whatever) is the word something in your topic. If your description of your topic is I’m going to do something with mole-whacking/semantic role labelling/protein structure prediction, then you still have major gaps in your conception of the project, and you have no idea what will constitute success–or a failing grade, either. Seriously: sounds simplistic, but the presence of the word something is a strong diagnostic.

Spend a lot of time obsessing about minor details early in the process

Have you been tasked with building a mole-whacker? Put a lot of time into thinking about moles with bad breath, moles with nice breath, and moles that would be really cute if only they did something about their taste in Restoration essayists. Are you going to build a system that does deep analysis of subtle differences between different kinds of change-of-state verbs? Spend a lot of time thinking about how you’re going to detect the ends of sentences. (If you’re not a language processing person: getting a computer program to recognize the ends of sentences is a lot harder than you might be thinking. But, it’s not super-crucial to the bigger problem of deep analysis of subtle differences between different kinds of change-of-state verbs.) If there’s one thing that I’ve learnt from spending a lot of time around French people, it’s that minor details are important. But, you need to have the big picture in your mind all the time, and if you have a 10-week rotation and you spend two weeks of that time thinking about how to do a perfect job of finding the ends of sentences, then you have reduced your chances of successfully completing your project quite a bit, unless it’s about improving the ability of computer programs to find the ends of sentences. (If you’re not a language processing person and you think that I’m just making this shit up: click here for a paper on the role of finding the ends of sentences in the task of finding bacteria habitats, or here for a paper on event-related potentials as they relate to prospective and retrospective processes at sentence boundaries, or here for a paper on why you need a support vector machine with a linear kernel (or so the authors claim) to tell the difference between a period at the end of an abbreviation and a period at the end of a sentence in clinical documents (health records).)
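
If you’d like to see for yourself why finding the ends of sentences is harder than it sounds, here’s a minimal sketch in Python. It is not any of the systems from the papers mentioned above: the function names, the abbreviation list, and the example text are all made up for illustration. It compares splitting after every period with a version that checks a tiny abbreviation list first.

```python
import re

# Hypothetical abbreviation list, for illustration only.
# Real clinical text would need a far longer one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "vs.", "pt."}


def naive_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def abbreviation_aware_split(text):
    """Like naive_split, but glue a piece onto the next one
    when it ends in a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        last_token = buffer.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to the abbreviation; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


# Made-up example text.
example = "Dr. Smith saw the pt. at noon. He ordered labs, e.g. a CBC. Follow up in 2 wks."

for sentence in naive_split(example):
    print("naive:", sentence)
for sentence in abbreviation_aware_split(example):
    print("aware:", sentence)
```

The naive version chops the example into six pieces; the abbreviation-aware version finds three. Note that the “better” version is still wrong whenever a sentence really does end with an abbreviation, which is exactly the kind of rabbit hole that can eat two weeks of a 10-week rotation.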

Don’t differentiate between aspects of the approach that do and don’t test your hypothesis

By now you might accept that it’s important not to spend a lot of time obsessing about minor details early in the process. But: how do you know what makes something a “minor detail”? Minor details are things that have very little to do with actually testing your hypothesis. Now, you’re thinking: I’ve discussed what counts as success with my rotation supervisor, and we reached the consensus that analyzing subtle details of different kinds of change-of-state verbs means reaching an F-measure of 0.80 on the Semantics Evaluation Conference Official Subtly Different Change-Of-State Verb Test Set. What if I pick the wrong find-the-ends-of-sentences system, and that reduces my performance to 0.79, when it could have been 0.81 if only I’d picked the right find-the-ends-of-sentences system? In that case, I would suggest that you renegotiate what you’re doing with your rotation supervisor. The question with which you would start the conversation: what’s interesting about getting an F-measure of 0.80 versus 0.79? How would that change our knowledge of the world, or software for analyzing subtle differences in the various and sundry kinds of change-of-state verbs, or moles, or whatever? Can we frame the project in terms of a question of some sort that might have broader implications for how one might approach this kind of task in the future, such that my career doesn’t succeed or fail on the basis of whether or not I’m good at finding the ends of sentences?
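
For the non-language-processing people: here’s a minimal sketch of how an F-measure gets computed, assuming the usual balanced version (F1, the harmonic mean of precision and recall). The function name and the counts are made up for illustration; the test set mentioned above is not real, and neither are these numbers.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw counts.

    Each value is 0.0 when its denominator would be zero.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 100 verbs to find; the system gets 80 right,
# proposes 20 wrong ones, and misses 20.
baseline = precision_recall_f1(true_positives=80, false_positives=20, false_negatives=20)

# Same system, but one sentence-boundary mistake costs it one correct answer.
one_more_miss = precision_recall_f1(true_positives=79, false_positives=20, false_negatives=21)

print("baseline F1:      %.3f" % baseline[2])
print("one more miss F1: %.3f" % one_more_miss[2])
```

With counts this small, a single extra miss moves the F-measure by less than one point, which is one more reason to ask what a difference between 0.80 and 0.79 would actually tell anybody.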

Don’t have a hypothesis

If you would like to flunk your rotation, it’s helpful to not have a hypothesis. If you don’t have a hypothesis, then you’re less likely to know whether or not you’ve tested anything, which means that neither you nor the faculty who will be grading your rotation project will know whether or not you finished your rotation project. That’s not a guaranteed way to flunk your rotation–you’ll leave the faculty in the position of guessing whether or not you finished it, and maybe they’ll guess that you did–but, it’s a pretty good one.

Don’t know why you’re doing your project

On some level, you always know why you’re doing your project–you’re doing it because your advisor thinks that it would be a good idea. But, why? Let’s step back a bit. Suppose that you have a hypothesis in hand. From a practical perspective, you care about knowing why you’re investigating that particular hypothesis out of a universe of possible hypotheses because if you know why you’re investigating that particular hypothesis, you’re more likely to do a good job of investigating it, or so I assert. Some reasons that I assert that: we discussed above the importance of being able to differentiate between things that take up a lot of time but don’t actually test the hypothesis and things that do contribute to testing the hypothesis. In fact, if you know why you’re testing the hypothesis, then you might realize (hopefully early in the process) that your specific hypothesis isn’t actually going to contribute very much to achieving whatever it is that was your rotation advisor’s motivation for suggesting the project in the first place. That’s the practical reason. There’s a more general reason, too: you’re a graduate student. You want to get a graduate degree. In most fields, we give people graduate degrees when they have contributed some significant piece of knowledge to the stock of what we know. You can certainly contribute pieces of knowledge to the stock of what we know without having any kind of broader conceptual framework (say, a theory) for understanding why those pieces of knowledge would be relevant to someone somewhere, but it’s harder to contribute a significant piece of knowledge to what we know without some kind of broader conceptual framework. 
It’s that broader conceptual framework that establishes the context that defines your piece of knowledge as significant or not; your piece of knowledge consists, in some sense, of whether or not your results are consistent with your hypothesis; your hypothesis is more likely to be a useful hypothesis if you know why you’re evaluating it. There has been far more written about what makes a hypothesis a useful hypothesis (or not) than I will ever understand before I retire, but it’s worth your while to check out at least some of it. You can find relevant stuff in epistemology, or in philosophy of science, or in statistics–there’s something for every taste.

The epistemology of flunking rotations: Where I got all of this stuff

Some of this stuff comes from my own experience of flunking things–I left graduate school feeling like I knew a lot more about how to not get a PhD than I did about how to get one. I asked a number of people who teach in graduate programs of computer science, medical informatics, bioinformatics, and linguistics to look at the post, and incorporated their comments. The rest comes from years of watching people flunk rotations, as well as flunk master’s thesis defenses, comprehensive exams, prelims… Also watching people miss deadlines for conference submissions, grant submissions, software releases–and I’ve missed more than one of those myself. Learn from my mistakes–it’s a hell of a lot less painful than learning from your own!

The picture at the top of this post shows a hibernating arctic ground squirrel in the gloved hands of a researcher. It comes from https://www.independent.co.uk/news/science/arctic-squirrel-hibernation-recycle-nitrogen-b1767464.html.

Paris’s begging ecosystem

There are entire genres of begging in Paris, some unique to this city.

Picture source: https://mcfarlandcampbell.co.uk/tag/toblerone/

One evening I was on the RER (a regional train) on the way home from work when a woman of indeterminate age got on. She was eating a Toblerone. Excuse me, ladies and gentlemen, she said loudly. (If it’s in italics, it happened in French.) Could you give me some change, perhaps a euro? She pulled out another Toblerone and examined it closely, turning it from side to side. Sometimes I lure a man into a parking lot, and I bite him. She put it slowly into her mouth. Sometimes in Cameroon, I would eat a man. Another Toblerone, which she chewed on meditatively.

By this point, I was seriously questioning my ability to understand spoken French. I looked at my French coworker who happened to be sharing the train with me. Did she just say… Yep, he answered. Parisians most definitely do not speak to strangers on trains, but this time a young woman sitting next to him joined in: “She says she eats men.” (It’s pretty easy to tell that I’m not French, and she spoke English.) The lady examined another Toblerone before putting it in her mouth. I’m hungry. If you have some money, some spare change…

This was a very strange little speech to hear, and the whole box-of-Toblerone thing added a certain hallucinatory element to the experience. But, in a Parisian context, it made a certain amount of sense. Visitors to Paris usually notice pretty quickly that there are a lot of beggars here. We talked in a previous post about why there are so many beggars here, and there are perfectly good reasons for it. Although there are a lot of folks who are out there asking for money in this town, they actually fall into a finite number of classes, at least one of which is specific to Paris, and the cannibalistic Toblerone eater was an instance of one of them. Here in France we love to classify things, so let’s run through the categories. Beyond the intrinsic interest of the facts that there are categories at all and the nature of the categories themselves, it’s interesting to think about how the various and sundry categories manage to live together in an ecosystem of sorts–different kinds of beggars fill different niches in the city.

Métro: You will occasionally see someone–usually a man–get onto a métro car or a regional train and ask for money. There’s a set ritual for this. Basically, the guy makes a speech. It tends to follow a specific pattern.

Apology: Ladies and gentlemen, I’m sorry to disturb you during your trip.
Statement of problems to be solved: I am homeless/jobless/I have four children and a sick wife and need a hotel room/money for food/diapers.
Request: If you have some spare coins/restaurant tickets/a euro or two…

…and then they walk through the car with a paper cup or with their hand out. These guys don’t necessarily make much in a single car, but they typically do make something–more if they’re old, less if they’re young and look like they could be working for a living like the rest of us. Then it’s off of that car and on to the next one. In the light of the existence of this genre of begging, the Toblerone lady makes a certain amount of sense, and you have to give her credit for originality (or for insanity–I’m actually betting on the latter).

Roma woman begging on the Champs Elysées. Picture source: http://flickrhivemind.net/Tags/beggar,paris/Interesting.

Eastern European Roma women on the Champs Elysées: There’s a genre of begging which until recently I’d only ever seen in Eastern European countries. The way it works is that the beggar kneels on the bare sidewalk with his head on the concrete and his cupped hands held out to receive alms. It looks really, really painful. For the past couple years, I’ve seen Roma women doing this on the Champs Elysées. Only Roma women so far, and only on the Champs Elysées so far. Why them, and why there? I have no idea. Clearly, they’re Eastern European, but there are lots of Eastern Europeans in Paris, and I’ve yet to see any others begging like this. Occasionally the police will come by and roust them. They pick up their water bottles (this is, after all, 2016) and move on, then return later.

Disabled: One day this past winter I was on the metro on the way to work. I was bundled up like everyone else in Paris, as it was cold–hat, leather jacket, neck warmer (I still haven’t been here long enough to wear a scarf), gloves. Into the car climbed a guy in short-shorts. His legs were these skinny, twisted things–maybe as big around as my forearm, and oddly bent. He didn’t say a word to anyone–just struggled down the aisle with his hand out. For a year or so, there was a guy sitting on the ground outside my metro station all day–no feet. There’s a kid (I say “kid”–I would guess that he’s in his twenties) who has a spot outside the grocery store. He sits there, silent, his head hanging, with a paper cup in front of him. I’m pretty sure that he’s schizophrenic.

With kids: An Eastern European friend taught me that there’s a special place in hell for people who abuse their kids by using them for begging when they should be in school. As far as I can tell, it’s mostly a Roma thing in Paris. You park your family on the sidewalk under a blanket, children prominently displayed, and hold your hand out to passersby. You occasionally also see Roma women with a baby panhandling–be especially careful, as some of them do a trick such that they only appear to be holding a baby, as it’s actually supported by a sling. That’s the hand that picks your pocket. (Let me point out that the vast majority of these ladies are just begging–but, the pocket-picking thing does happen, too.)

Parisian beggar with dogs. Picture source: http://www.newsner.com/en/2015/11/12-dogs-that-love-their-owners-no-matter-how-little-money-they-have/.

With animals to pet: You’ll see a lot of people with an animal or two on their lap. Drop some money in their cup and give doggie/kittie/bunny a scratch, if you feel like it. Most weeks petting beggars’ dogs and cats is my only physical contact with another living being, so a lot of my change goes into these folks’ cups. One of my favorite guys is usually in the Latin Quarter on weekend nights. He has these two little spaniel mixes, and it’s clear that he adores them and they adore him. The last time I saw him, I leaned over to drop a coin in his cup and pet the dogs. It’s Orthodox Easter tomorrow, you know, he said. (If it’s in italics, it happened in French.) Really?, I asked. Yeah, Easter–Orthodox Easter. Cabbage, I said. Have a good night. (My French continues to suck.) I still haven’t figured out why we had that particular conversation, other than the possibility that the next day might actually have been Orthodox Easter. Lately I’ve been noticing shiftless young people with ill-kempt animals trying to do the pet-my-animal thing. Their animals look like shit–not loved or cared for at all. You can tell the difference, I think. Note: be sure that the animal is there to be petted before you try to pet it! This sounds obvious, and I guess that it would be to any non-stupid person. However: I bent over to pet a kid’s pit-bull-looking dog one day without checking him out first, and he snapped at me. I had no clue whatsoever that I was capable of jumping that far that fast–backwards, no less. Obviously, if this dog had felt like ripping my arm off, he could have–he just gave me a little warning. Learn from my stupidity.

Finally, there are plenty of run-of-the-mill beggars. If they’re young, people mostly walk right by them, because there are plenty of frail old run-of-the-mill beggars that probably need your money even more.

Now, I’m not talking here about people who hustle–“hustle” in the good sense, or “hustle” in the bad sense. With the exception of the people with animals, the people that I’m describing here are straight-up beggars. Street musicians, mimes, comedians, dancers–that’s a whole nother genre. Pick-pockets, 3-card monte, the ring scam, the bracelet scam–that’s yet another genre, and they each have their niches in the hustling ecosystem of Paris.

English notes

Short-shorts: very, very short pants. Line from an advertisement for Nair, a leg-hair remover: Who wears short-shorts? Nair wears short-shorts. How it was used in the post: One day this past winter I was on the metro on the way to work. I was bundled up like everyone else in Paris, as it was cold–hat, leather jacket, neck warmer (I still haven’t been here long enough to wear a scarf), gloves. Into the car climbed a guy in short-shorts.

bunny: an informal/children’s word for rabbit. On my first visit to Belgium, I knew just barely enough French to order a meal in a restaurant. Seeing a meat on the menu whose name I didn’t recognize, and being an adventurous eater, I ordered it. It being pre-Internet, I had to ask a coworker the next day what I had had for dinner. His response (in English): You ‘ave eaten, ‘ow you say… Bugs Bunny. How it was used in the post: You’ll see a lot of people with an animal or two on their lap. Drop some money in their cup and give doggie/kittie/bunny a scratch, if you feel like it.

French notes

Cameroun: Cameroon. Pronunciation: the e is silent, so [kamrun].

Roma: there are many ways to say “gypsy” in French. In part, I know this because my favorite neighborhood bum gave me a lecture on the topic one day, with statistics. I have very little clue as to the current social acceptability of any of them; as far as I know, Roma or Rom is OK (just as it is in the US, where the word gypsy is definitely not OK in all circles), but I’m pretty sure that all of the others have varying levels of pejorativeness. How it was used in the post: For the past couple years, I’ve seen Roma women doing this on the Champs Elysées. Only Roma women so far, and only on the Champs Elysées so far.

Who has a sagittal crest?

Before you hit your dog, remember that he can bite your hand hard enough to break it–but, he chooses not to.

“What if I never find out who’s a good boy?” Picture source: https://twitter.com/m_pendar.

In America, we do love our dogs. A culturally common way for us to show our dogs affection is this: we pet them, while saying Who’s a good boy? (or Who’s a good girl?, depending on gender). In my family, we do it a little differently: we pet the dog while saying Who’s got a sagittal crest? Dogs don’t look at you with any more or less puzzlement regardless of which one you pick, so: feel free to go crazy with this one.

Badger skull. The arrow is pointing at the sagittal crest. Picture source: http://www.jakes-bones.com/2010/09/my-new-badger-skull.html.

What’s a sagittal crest? The next time you run into a dog, run your hand along the center of the top of his skull. That ridge that you feel is his sagittal crest. Sagittal means along a plane that runs from the front to the back of the body. A sagittal crest runs along that plane. This sense of crest means something sticking out of the top of the head–think the plume on top of a knight’s helmet. Many animals have a sagittal crest, but not us modern humans. You see them in species that have really strong jaw muscles. A sagittal crest serves as one of the points of the attachment of the temporalis muscle, which is one of the main muscles used for chewing. If you have a sagittal crest, you can have a bigger temporalis muscle, which means that you can bite/chew harder.

Gorilla skull. Picture source: http://alfa-img.com/show/new-gorilla-skull.html.

If you look at relatively close relatives to humans, you see sagittal crests on some of them. To the left, you see a gorilla. You wouldn’t want to get bitten by this guy. (Note that some gorilla species, especially their males, have really enormous sagittal crests–this is actually a pretty modest one, for a gorilla.)

Excellent replica of a Pan troglodytes (common chimpanzee) skull. Picture source: http://www.connecticutvalleybiological.com/product-full/product/chimpanzee-skull-pan-troglodytes.html.

Here’s (an excellent replica of) a Pan troglodytes (common chimpanzee) skull. This guy (I think it was a guy) had more of a sagittal crest than you (you don’t have any), but he didn’t have much, compared to that gorilla. Other chimps vary. Monkey species vary pretty widely regarding the presence or absence of a sagittal crest.

An Australopithecus robustus species. This specimen is known as “The Black Skull.” Picture source: https://commons.wikimedia.org/wiki/File:Paranthropus_aethiopicus.JPG.

Some hominids that were ancestral to us had sagittal crests, but they disappeared pretty early in the course of our evolution. Here is a picture of the “Black Skull,” about 2.5 million years old. It’s from a type of Australopithecus robustus. By the time Homo erectus comes along (starting about 1.9 million years ago and lasting until about 70,000 years ago), the sagittal crest is gone. Picture below.

So: feel free to express your affection for your dog any way you want–you can’t possibly be any geekier than my son and me. Scroll down past the picture for French vocabulary.

Homo habilis skull, dated at 1.9 million years ago. Picture source: https://commons.wikimedia.org/wiki/File:Homo_habilis-KNM_ER_1813.jpg.

Relevant French vocabulary (see the Comments section for more):

la crête sagittale: sagittal crest
le muscle masticatoire: chewing muscle (note: the “c” in muscle is pronounced in French)
le muscle temporal: temporalis muscle
la morsure (action de mordre): bite (noun)
la morsure (marque de dents): teeth marks

Data mining, text mining, natural language processing, and computational linguistics: some definitions

Parsing, data mining, and encryption are not going to get you. That pistol in your nightstand might, though.

Every once in a while an innocuous technical term suddenly enters public discourse with a bizarrely negative connotation. I first noticed the phenomenon some years ago, when I saw a Republican politician accusing Hillary Clinton of “parsing.” From the disgust with which he said it, he clearly seemed to feel that parsing was morally equivalent to puppy-drowning. It seemed quite odd to me, since I’d only ever heard the word “parse” used to refer to the computer analysis of sentence structures. The most recent word to suddenly find itself stigmatized by Republicans (yes, it does somehow always seem to be Republican politicians who are involved in this particular kind of linguistic bullshittery) is “encryption.” Apparently encryption is now right up there with dirty bombs in terms of things that terrorists are about to use to kill us all. (“All” might be an exaggeration. I find it interesting that the United States had 33,169 firearm deaths in 2013–roughly 11 times as many deaths as on 9/11–and yet, Republicans seem to think that it’s important that we make firearms as widely available as possible. I guess they just don’t like people very much.) As a moderately technical person, this strikes me as odd, since I’ve always thought of encryption as that nifty mathematical technique (I was about to say “algorithm,” but I think the Republicans are down on that one now, too) that keeps you from intercepting my text messages, me from reading your Ashley Madison profile, and so on.

In between the Republican outrage over parsing and the current panic over encryption, we had the sudden appearance in the public consciousness of data mining. As far as I knew up to that point, data mining was a bunch of statistical techniques for finding relationships between things. Suddenly it was showing up in scary news stories–Google the phrase “data mining is evil” (you have to put the quotes around it to search for the phrase, as opposed to the individual words) and you will get 1,400 hits as of the time of writing (May 2016).

Besides being bemused by this intrusion of American know-nothingness into public discourse, I have a personal stake in the issue, because people often refer to what I do for a living as text data mining. This is a misnomer–by its nature, data mining is not something that you can do with texts. Bear with me and I’ll explain why, and then we’ll look at some French vocabulary for talking about all of this.

Data mining is basically about databases. In a database, the statistical techniques of data mining can help you do things like discover that Republicans with HBO subscriptions are more likely to consider voting for Romney in a primary than Republicans who don’t have HBO subscriptions. (Real one, if I remember the facts correctly.) You can do that because you have a table in the database that tells who’s a Republican, a table that tells who has HBO subscriptions, and a table that tells you which members of a random sample told the interviewer that they would/wouldn’t consider voting for Romney in a primary. Data mining is the science/art of figuring out what things are related (HBO subscription/willingness to vote for Romney) and what things aren’t related (making one up here: having bought an Escalade and being willing/unwilling to vote for Romney in a primary)–this among probably thousands and thousands of variables. Doing data mining research requires things like knowing particular kinds of math, understanding how to sample a population, getting computers to do complicated calculations in a way that is time-efficient–stuff like that.
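
To make the HBO example a little more concrete, here’s a minimal sketch in Python. Everything in it is made up for illustration: the rows, the field names, and the function name are mine, and the numbers have nothing to do with any real survey. It does the simplest possible version of the comparison, for one pair of variables.

```python
# Toy, made-up survey rows standing in for the joined database tables.
respondents = [
    {"republican": True,  "hbo": True,  "considers_romney": True},
    {"republican": True,  "hbo": True,  "considers_romney": True},
    {"republican": True,  "hbo": True,  "considers_romney": True},
    {"republican": True,  "hbo": True,  "considers_romney": False},
    {"republican": True,  "hbo": False, "considers_romney": True},
    {"republican": True,  "hbo": False, "considers_romney": False},
    {"republican": True,  "hbo": False, "considers_romney": False},
    {"republican": False, "hbo": True,  "considers_romney": False},
]


def romney_rate(rows, has_hbo):
    """Share of Republicans with the given HBO status who would consider Romney.

    Returns None if nobody falls into the group.
    """
    group = [r for r in rows if r["republican"] and r["hbo"] == has_hbo]
    if not group:
        return None
    return sum(r["considers_romney"] for r in group) / len(group)


with_hbo = romney_rate(respondents, has_hbo=True)
without_hbo = romney_rate(respondents, has_hbo=False)
print("with HBO:    %.2f" % with_hbo)
print("without HBO: %.2f" % without_hbo)
```

The point to notice is that every column already has a known meaning, so the comparison is a few lines of arithmetic. The hard parts of real data mining are the ones the sketch skips: sampling properly, checking whether a difference is more than noise, and doing all of that across thousands of variables.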

With data mining, you have that database, and you know what everything is. With “text mining,” or “text data mining,” as some people call it, you have texts, and you don’t know what anything is. (By “you,” I mean a computer program.) This is usually talked about as a difference between “structured” data (i.e., the database)–you know what everything “is”–what it “means”–in some sense, its semantics. Whoops–that sentence got a little out of control. “Unstructured” data: that’s typically how we would describe text. With text, you know what nothing is–you don’t know what anything means–in a very literal sense, you don’t know its semantics.

“Text mining” could be thought of as turning unstructured data into structured data. You’ve got a bunch of texts, and you want to use them to populate a database, perhaps. Maybe you have 23 million journal articles in the National Library of Medicine, and you want to find every statement that those 23 million articles make about which genes are affected by which drugs. Maybe you have a huge collection of French fairy tales, and you want (the computer) to find every time that a stepmother is mentioned and whether the portrayal of the stepmother is positive or negative. You could think of both of those as turning unstructured data into structured data–you’re taking that unstructured data and using it to build a database about genes and drugs, or a database about stepmothers. You can see now why we tend to prefer the term “text mining” to “text data mining”–to the extent that “data mining” is about structured data, it doesn’t really make sense to talk about “data mining” with respect to language. Where the data mining person basically just needs to know math, the text mining person needs to know something about how people write about whatever it is that you’re interested in. I do a bit of text mining. People will have really specific requests–tell me whether or not the genes from some experiment show up in the cancer literature, say; tell me if this is a suicide note or not; read this doctor’s note and tell me if this kid is a candidate for epilepsy surgery; stuff like that. It’s not really linguistics, but it pays the bills, and it suits my need to do something that might actually make the world a better place.
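
Here’s a toy version of the stepmother task, as a minimal sketch in Python. It is nothing like a real text mining system: the cue-word lists, the function name, and the three-sentence “fairy tale” (in English, to keep things simple) are all made up for illustration. What it shows is the shape of the job, with text going in and database-style rows coming out.

```python
import re

# Hypothetical cue words; a real system would need far more than word lists.
POSITIVE = {"kind", "gentle", "loving"}
NEGATIVE = {"wicked", "cruel", "jealous"}


def stepmother_rows(text):
    """Return one row per sentence that mentions a stepmother."""
    rows = []
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for number, sentence in enumerate(sentences, start=1):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if "stepmother" not in words:
            continue
        has_positive = bool(words & POSITIVE)
        has_negative = bool(words & NEGATIVE)
        if has_negative and not has_positive:
            portrayal = "negative"
        elif has_positive and not has_negative:
            portrayal = "positive"
        else:
            portrayal = "unknown"  # no cue words, or conflicting ones
        rows.append({"sentence": number, "portrayal": portrayal})
    return rows


# Made-up text.
tale = (
    "The wicked stepmother sent the girl into the forest. "
    "The girl walked all night. "
    "In another village, a kind stepmother baked bread for the children."
)

for row in stepmother_rows(tale):
    print(row)
```

The rows that come out could go straight into a table, at which point you’re back in structured-data territory. Everything that makes the real task hard (negation, sarcasm, a stepmother who is only ever called “she”) is exactly what the word lists can’t see.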

A related field is natural language processing. Natural language means human language, as opposed to computer languages. Natural language processing is about building tools to handle specific linguistic tasks–parse a sentence, figure out parts of speech, stuff like that. You might use a combination of different language processing programs to do a text mining task. I find this more interesting, since the questions are less about some set of facts than they are about the language itself. Where the data mining person needs to know math and the text mining person needs to know how people write about genes and drugs, or stepmothers, or whatever, the natural language processing person needs to know something about language itself–what kinds of structures sentences can have, how word frequencies are distributed, how to build linguistic resources for letting a computer process things that can’t be directly observed (e.g. semantics). I do a lot of this kind of stuff. Recently I’ve been working on coreference resolution–how to get a computer to recognize that Obama, President Obama, and Barack Obama are all referring to the same thing in the world, while Mrs. Obama and Michelle Obama are referring to something else in the world. (Recognizing that those “things” in the world are people, as opposed to, say, locations, or the names of companies, is a whole different story.)
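
Here is a toy sketch of the coreference idea, in Python. It is not how real coreference systems work, and every name in it is hypothetical: the two tiny lexicons (TITLES and FIRST_NAMES) are hand-made for this one example, and the rule is simply "two mentions can go together unless something we know about them conflicts."

```python
# Hypothetical mini-lexicons; a real system would need far more than this.
TITLES = {"President": None, "Mrs.": "female", "Mr.": "male"}
FIRST_NAMES = {"Barack": "male", "Michelle": "female"}

def describe(mention):
    """Split a mention into (first name, last name, gender cue)."""
    tokens = mention.split()
    last = tokens[-1]
    first, gender = None, None
    for token in tokens[:-1]:
        if token in TITLES:
            gender = gender or TITLES[token]
        elif token in FIRST_NAMES:
            first = token
            gender = gender or FIRST_NAMES[token]
    return first, last, gender

def compatible(a, b):
    """Two descriptions can corefer if no known attribute conflicts."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))

def cluster(mentions):
    """Greedily put each mention into the first cluster it fits."""
    clusters = []  # each cluster is a list of (mention, description) pairs
    for mention in mentions:
        desc = describe(mention)
        for members in clusters:
            if all(compatible(desc, other) for _, other in members):
                members.append((mention, desc))
                break
        else:
            clusters.append([(mention, desc)])
    return [[m for m, _ in members] for members in clusters]

mentions = ["Obama", "President Obama", "Barack Obama", "Mrs. Obama", "Michelle Obama"]
groups = cluster(mentions)
print(groups)
```

Note that the result depends on the order of the mentions: a bare Obama is compatible with either person, and this sketch just drops it into the first cluster that will take it. That ambiguity is a small taste of why the real problem is hard.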

Yet another field is computational linguistics. This is about using computational models to test theories about language. This is my favorite, but it’s the hardest to pay the bills with. I do some of this, too. Nowadays a lot of my time goes into large-scale attempts to model the semantics of biomedical language. I’m trying to investigate differences in the semantic primitives of biomedical language versus “general” English by building a large set of data-driven semantic representations of predicates found in journal articles; I’ll then compare that resource to a similar resource built for general English and look for things like whether or not the semantic primitives seem to come from the same set, whether or not given verbs have different representations in the two types of language, etc. My hope is to get a sense of the range of types of semantic variability from this particular project. You could imagine using computational linguistics work to build natural language processing tools, and then using those to carry out practical text mining tasks. You could use the text “data” mining results to do actual data mining.

Mathematical representations of semantics can define how the gender binary gets manifested in English. This diagram transforms gendered word relationships into a map-like space. Pairs like girl/boy and aunt/uncle have the same “spatial” relationship. Picture source: http://www.offconvex.org/2015/12/12/word-embeddings-1/.
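
The "same spatial relationship" idea can be sketched with a few lines of Python. The coordinates below are made up by hand so that the arithmetic comes out cleanly; real word embeddings have hundreds of dimensions and are learned from large amounts of text, so treat this only as an illustration of the geometry.

```python
# Made-up 2-D coordinates, chosen by hand so that the offsets line up.
vectors = {
    "boy": (1.0, 1.0),
    "girl": (1.0, 3.0),
    "uncle": (4.0, 1.5),
    "aunt": (4.0, 3.5),
}

def offset(a, b):
    """Vector pointing from word b to word a."""
    return tuple(x - y for x, y in zip(vectors[a], vectors[b]))

def nearest(point, exclude=()):
    """Word whose vector is closest (squared Euclidean distance) to point."""
    candidates = [w for w in vectors if w not in exclude]
    return min(
        candidates,
        key=lambda w: sum((p - q) ** 2 for p, q in zip(point, vectors[w])),
    )

girl_minus_boy = offset("girl", "boy")
aunt_minus_uncle = offset("aunt", "uncle")

# Analogy: start at uncle, move by the girl-minus-boy offset, see where we land.
target = tuple(u + d for u, d in zip(vectors["uncle"], girl_minus_boy))
answer = nearest(target, exclude=("uncle",))

print(girl_minus_boy, aunt_minus_uncle, answer)
```

In this toy space the two offsets are identical, so moving from uncle by the girl-minus-boy offset lands exactly on aunt. With real embeddings the offsets are only approximately equal.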

As you can tell from my examples, I’m very much in the world of biomedical language. There’s also a lot that you can do in the humanities with this kind of stuff. A hot topic in the future might be using mathematical representations of semantics to study things that are/are not thought of as binaries–gender, sexuality, race, political economy, whatever. However, I would not claim to do ANY of that–I can just barely explain it. For more on that kind of stuff, see this excellent post by Ben Schmidt.

In practice, even people in the field don’t always differentiate between these terms, or at least don’t draw sharp boundaries between them. My business card says that I’m the director of a text mining group, but I identify most strongly as a computational linguist. We figured that “text mining” makes more sense as a practical field of inquiry to have within a medical school (which is where I work), so that’s what we called the group when we formed it. If you go to the annual conference of the Association for Computational Linguistics, you will see almost no computational linguistics, but rather a ton of natural language processing. If you go to the annual Biomedical Natural Language Processing meeting, you’ll see a mix of text mining, natural language processing, and a bit of computational linguistics. Sometimes the distinctions really matter, though. This post started its life as a response to someone who asked me to be on a panel about data mining, to talk specifically about text data mining. When I responded that I don’t do data mining, they asked what the difference is–this blog post started out as my response.

As far as I can tell, the relevant community in France doesn’t make these distinctions in any kind of rigid fashion, either, despite the much-vaunted French penchant for categorization (see Nadeau and Barlow’s excellent book for a discussion of where it comes from). However, French does have technical vocabulary for all of these fields. Here it is:

fouiller: to excavate; to rummage through, to search (see also here)
la fouille de données: data mining
la fouille de texte(s): text mining
le traitement automatique des langues naturelles: natural language processing
la linguistique informatique: computational linguistics

Why there are so many beggars in Paris

There are historical reasons for the large number of beggars in Paris.

Le mendiant et son enfant Yves, “The beggar and his son Yves,” dated to 1317. Picture source: http://classes.bnf.fr/ema/grands./084.htm

The typical stereotype of Paris is as a beautiful, majestically historical city that just oozes romance, and indeed, Paris is all that. But, visitors are often surprised to find that it is also a city with a sometimes astounding number of beggars on the street. The reasons behind this are many, and varied, and, I think, interesting.

In the pre-modern period, the vast majority of the French (like the vast majority of everyone else in the world) were farmers. Most children didn’t live to adulthood, and you needed a lot of hands to work the farm, so people had big families.

In the 1500s, the French death rate took a relatively sudden drop. People were still having those big families, so there were a relatively large number of people making it to adulthood. The inheritance laws of the time included primogeniture, i.e. inheritance of everything by the oldest son, so lots of those people wouldn’t have a farm of their own to work. Options were limited, and if they couldn’t find other employment, a lot of people hit the road. (There’s an excellent description of the mechanics of this phenomenon in Robert Darnton’s The Great Cat Massacre and other episodes in French cultural history.)

If you hit the road in France, you’re eventually going to end up in Paris, if for no other reason than that it’s the hub of the road system (and today, the rail system). If you can’t find other employment, your options come down to begging or stealing, and most people aren’t thieves. So: begging.

Begging actually has a very long and somewhat respectable history in Europe. As Robert Cole puts it: “In the middle ages, ‘Christian charity’ perceived the poor as God’s special children and therefore deserving of alms.” Begging can be a profession, really. (Old Eastern European Jewish joke: beggar hits a guy up for money. Guy gives him some helpful hints on improving his approach. Beggar responds: YOU’RE telling ME how to beg? This would make total sense in a French context: a métier (profession) is a métier, whether you’re a doctor, an engineer, or an elevator operator.)

If you’re gonna be a beggar, though, it helps to have a schtick. Physical lack of ability to work was a good one, and Parisian beggars were known for faking such a disability, leading to their squatting areas being known as Cours des miracles (“Courts of miracles”) for their recovery at the end of the working day. (There was one just to the north of what is now the Place des Vosges, I believe.) By the 1500s, begging wasn’t viewed quite as kindly. Robert Cole again:

In sixteenth-century Paris the poor were viewed as merely layabouts who preferred to live off public welfare. Meanwhile bad harvests, plagues, inflation and religious war increased their number dramatically. Public begging was outlawed in 1536, and in 1551 laws were enacted which limited eligibility for public assistance and forbad women to have their children in tow when selling candles outside churches. To do so, went the rationale, evoked sympathy from prospective customer, which proved that such women were really only begging. A traveller’s history of Paris.

So: there have been a lot of beggars in Paris for centuries. In 2007, the European Union was enlarged to include a couple countries with large Roma populations. There have always been Roma in France, but now a lot more came (the Roma rights group FNASAT says 12,000 currently, and that’s after 10,000 being expelled in 2009 and another 8,000 in 2011; other estimates range from 20,000 to 400,000), and they are a prominent part of the Parisian begging ecosystem. (There is, indeed, a Parisian begging ecosystem, and there are actually a number of distinct genres of begging in Paris–a subject in and of itself.)

To be clear: if you don’t give charity, your life is pointless. Let me point out that this is a teaching of at least Christianity, Judaism, Islam, and Hinduism, and–for my fellow secularists in France–Rousseau, the revolutionary Constituent Assembly, National Convention, and Directory, and modern French philosophers from Sartre to Alain Finkielkraut. (All of those links are to citations on the subject, not to their biographies.) The Buddhist view of charity is especially appealing to me, as a (really bad) student of judo:

Buddhism views charity as an act to reduce personal greed which is an unwholesome mental state which hinders spiritual progress. What Buddhists believe, Venerable K. Sri Dhammananda Maha Thera.

Judo’s view of the best human relationships is mutual welfare–we’re taught that human interactions should be mutually beneficial. So, if it’s the case that charity benefits both the giver and the receiver, then it’s very judo. Seriously, give charity–if for no other reason than that you’ll feel better about humanity if you take part in it being more humane.

le mendiant: beggar.
le gueux/la gueuse: beggar (literary). A number of other, more pejorative meanings–highwayman for men, whore for women, etc. Probably obsolete, but keep it in mind for when you read Tartuffe.
le clochard: beggar; also bum. (Slang.)
le/la clodo: beggar; also homeless person, tramp, hobo.

Some additions from native speaker Phildange:

le vagabond: wandering beggar, hobo.
le chemineau: same as above.
faire la manche: to beg.

Bilingual dictionaries: how to pick them, how to use them

I was in the Navy with an Armenian woman. (No, you don’t have to be a citizen to serve in the American military, and that’s probably true in most countries. In France, you can get citizenship by serving in the military–you are français par le sang versé, “French by spilt blood.” This isn’t the case in the United States–you can apply for citizenship as a member of our military, but there actually isn’t any guarantee that you’ll get it.) We’ll call her Nairi (not her real name). Like many members of the Armenian diaspora, Nairi was massively multilingual–she spoke Armenian, Arabic, and Spanish natively, and French and English as very strong second languages. (I once saw her mother test her to make sure that she wasn’t forgetting any of them.) One day Nairi came back from leave (what we call vacation in the military) with a seven-language dictionary. I admired it, and she insisted that I take it. I refused, she insisted, I refused, she insisted, I refused, she insisted, and finally, I took it. What I didn’t realize was that in Armenian culture, if someone admires something of yours, you must insist that they take it. Armenians know that they most certainly should not take it–I didn’t. Now I do. Stupid me–every time I see that dictionary on my bookshelf, I feel like a total jerk.

In a recent post, we talked about monolingual dictionaries–that is, dictionaries that list words in some language and give definitions of them in that same language. Today, let’s talk about bilingual dictionaries–that is, dictionaries that list words in some language and give corresponding words in another language. Of course, anything that we might say about bilingual dictionaries applies equally to dictionaries with even more languages, like the one that I stupidly took from poor Nairi.

I carefully said “corresponding” words just above–I carefully didn’t say “equivalent” or “the same” words. This is because it’s often the case that there isn’t a single translation from one word in one language to one word in another language. Even when there is one, it doesn’t necessarily “mean” the same thing, in some sense of the word “meaning.” To give you an example from my college French 101 textbook: a fenêtre in French is a window in English–fine so far. But, say window in English, and the referent is most likely a sash window, specifically–one that slides up and down. Say fenêtre in French, and the referent is most likely a window that opens in the middle–horizontally. (We would call this a French window in English. See this post for a list of things that we call French something-or-other in English that aren’t called anything of the sort in French.) And, as I said, there often isn’t just one. A language that I worked on in grad school has the word invert. But: invert what? If you’re inverting a hollow object, that’s one verb–if you’re inverting a solid object, it’s another verb. French has maybe two words for snow–la neige, and la poudreuse (powder snow). Depending on how you count, English has 13 or 55 or 120 (scroll down past the Inuit words) or 182 words for snow. So: not a 1-to-1 correspondence.

Having at least mentioned some of the theoretical issues, let’s look at the practical points of buying and using a bilingual dictionary. In these days of Amazon, you can use reader reviews in a way that we never could before–it’s really a nice advantage over the old pre-Internet days. However, there are also some specific things to look for.

Example sentences: you want a dictionary with example sentences, at least in the language that’s foreign to you.
Verb + preposition combinations: a good dictionary should tell you which prepositions, if any, go with which verbs. You need to know, for instance, that in English you shoot at something, you lean toward (have a preference for) something, and you stop doing something, with no preposition. Likewise, in French you need to know that you tirer sur or tirer contre (shoot “on” or shoot “against”) something, you pencher pour (lean “for”) something, and you arrêter de (stop “from”) doing something.
If you are working with language(s) that have gender, you want the gender to show up both in the Language1 -> Language2 section and in the Language2 -> Language1 section. If you look up kitchen towel and find that the translation to French is torchon, you don’t want to then have to go to the French -> English section to see whether it’s le torchon (it is) or la torchon (it isn’t).
This might seem obvious, but make sure that the pronunciation is given for the words in any language whose pronunciation isn’t obvious from the spelling–and, yes, that includes both English and French.
This takes a while, but: when you find the word that you’re looking for in the other language, you might want to look it up in the other direction. For example: suppose that you look up the English word towel in a crappy bilingual English/French dictionary. In a crappy dictionary, you might find the following: serviette, torchon. Both of those can, indeed, be used to translate towel from English to French–but, they’re not equivalent. Serviette is for a bath or beach towel, while torchon is for a kitchen towel. You want a dictionary that will distinguish between the various possible translations. It’s often useful to look the French words up in turn (or the English words, if you’re going from French to English). If you do that, you’ll find that a serviette can be a towel, but also a napkin, or a briefcase. A torchon, you’ll find, can also be a messy document, or a rag. It’s good to be on top of this kind of thing when you’re trying to choose between supposed synonyms.
Labelling of registers, or levels of appropriateness: you most definitely want a dictionary that includes slang, obscenities, informal words, etc., or you’re not going to get very far in real life. However, you also want a dictionary that labels words that are non-standard–offensive words, etc. This kind of thing can be really, really hard to catch when you’re learning a language from movies, your neighbors, etc.
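
The "look it up in the other direction" trick from the towel example above can be sketched in Python. The entries are just the handful of words from that example, stored in a hypothetical EN_TO_FR table; no real dictionary is organized this simply.

```python
# Toy entries taken from the towel example; a real dictionary has thousands.
EN_TO_FR = {
    "towel": ["serviette", "torchon"],
    "napkin": ["serviette"],
    "briefcase": ["serviette"],
    "rag": ["torchon"],
    "messy document": ["torchon"],
}

def reverse_lookup(word, forward):
    """For each translation of word, list every source word that maps to it."""
    result = {}
    for translation in forward.get(word, []):
        result[translation] = sorted(
            source for source, targets in forward.items() if translation in targets
        )
    return result

senses = reverse_lookup("towel", EN_TO_FR)
for french, english in senses.items():
    print(french, "->", english)
```

Seeing that serviette also comes back as napkin and briefcase, while torchon comes back as rag and messy document, is exactly the information that you need when choosing between the two.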

The always-awesome Lawless French web site has a good page on the subject of how to use a bilingual dictionary, and it has much better examples than I do. You can find it here.

So, what are some good bilingual English/French dictionaries? Here are some options.

The best thing out there these days is almost certainly WordReference.com. It has lots of language pairs, example sentences, colloquial expressions, pronunciations, male and female forms of adjectives, plurals, a verb conjugator, and a reverse look-up feature that does exactly what I suggest you do in the last bulleted item above. The auto-complete feature in the search box saves me enormous amounts of time (and guessing about spellings). There’s an excellent WordReference iPhone app. Be aware, though, that the iPhone app will not generally let you look up obscenities–you have to go to the web site for that.
For the Kindle or for the Kindle app on your phone, the Collins English-French and French-English dictionaries are quite good. They’re quite highly rated on Amazon.com. I have the Collins dictionaries on my phone, and use them whenever I don’t have Internet access and therefore can’t get to WordReference.com. The Collins dictionaries also have an advantage over WordReference: they don’t give as many super-subtle translations. The only bad thing about WordReference is that it can sometimes give an overwhelming number of other-language translations. That’s great when you want it, but when you don’t, you might prefer the Collins dictionary. As it happens, there is a Collins dictionary tab on the WordReference site, and it’s easy to click on that.
Linguee.fr is fantastic for seeing things in context. You will generally get lots of example sentences. There’s an iPhone app for that, too.
Reverso.net is another good one for seeing things in context. It sometimes has better coverage of colloquial, slang, and obscene language than Linguee does. Again, there’s an iPhone app.

I found Nairi on Facebook recently. I sent her a friend request–no response. Is it because she doesn’t remember who the hell I am? Is it because she hates me for taking her dictionary? I have no idea. Nairi, if you’re reading this: I’m sorry!

Refugees are dying and I can’t understand the word for “capsize”

Refugees and migrants are dying in shocking numbers in the Mediterranean. Here is some vocabulary that you’ll need to know to talk about the tragedy in French.

Map of the European migrant crisis as of 2015. Picture source: https://en.wikipedia.org/wiki/European_migrant_crisis#/media/File:Map_of_the_European_Migrant_Crisis_2015.png.

One of the ways that the world is sucking right now is the migrant crisis in Europe. As I write this (in April 2016), there are tens of thousands of refugees and migrants stranded in Greece. Many of these people cross from Turkey to Greece by boat, and many go from North Africa to Italy by ship. Tragically high numbers of these sink; in April of last year, five vessels sank, with a death toll of about 1,200 people.

The other day I was listening to the news on the radio. It was yet another story about the refugee crisis. The word aufrage kept coming up, but I couldn’t find it in my dictionary. Un aufrage, I kept hearing. Looking up similar stories online solved the mystery: it was not un aufrage, but un naufrage–a capsizing or shipwreck. I had “segmented” (as linguists say) the n of naufrage as part of a separate word, coming up with un aufrage.

This isn’t an uncommon phenomenon. One of the surprises for students in introductory linguistics classes is that in speech, there are no breaks between words–if I showed you a spectrogram (a sort of picture of a sound wave, showing its frequencies over time) of a sentence, you would see a continuous sound. “Segmenting” that stream of speech into smaller units is something that humans do–it’s not something that’s there in the acoustics.

Occasionally speakers of a language will, over time and as a community, “reanalyze” words in a way that changes the segmentation, and eventually the pronunciation. The word uncle has undergone this process: a variant of the word in English is nuncle. Oxford describes it as archaic or dialectal, but it’s there. You can see it in Shakespeare:

Can you make no use of nothing, nuncle?

–King Lear, Act 1, Scene 4

The word is thought to have come from a segmentation of phrases like mine uncle as my nuncle, thine uncle as thy nuncle, etc.
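
If you want to see why a continuous stream invites more than one segmentation, here is a small Python sketch. The respellings ("mai", "main", "unkel", "nunkel") are a made-up ASCII stand-in for pronunciation, not real phonetic transcription, and the four-word lexicon is hypothetical.

```python
# Hypothetical mini-lexicon in a made-up ASCII respelling of pronunciation.
LEXICON = {
    "mai": "my",
    "main": "mine",
    "unkel": "uncle",
    "nunkel": "nuncle",
}

def segmentations(stream, lexicon):
    """Return every way to split stream into a sequence of lexicon entries."""
    if not stream:
        return [[]]  # one way to segment nothing: the empty sequence
    results = []
    for end in range(1, len(stream) + 1):
        prefix = stream[:end]
        if prefix in lexicon:
            for rest in segmentations(stream[end:], lexicon):
                results.append([lexicon[prefix]] + rest)
    return results

parses = segmentations("mainunkel", LEXICON)
for parse in parses:
    print(" ".join(parse))
```

The same unbroken stream comes out both as my nuncle and as mine uncle. Nothing in the stream itself says which one the speaker meant, which is the listener's problem in a nutshell.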

The same thing can happen in other languages, too–any time people speak, there’s an opportunity for segmentation errors. Children who are learning their mother tongue often try out different segmentations. For example: in a past post, we looked at some bear-related vocabulary in French and English. Here are various and sundry relevant phrases:

un ours: a male bear.
une ourse: a female bear.
un ourson: a baby bear; a teddy bear.
un nounours: a teddy bear.

I once read a great blog post in which a French guy wrote about his toddler producing three different pronunciations of the word ours (male bear) in one day: ours, nours, and I believe lours (the last one would be a reanalysis of l’ours, “the bear”). (Sorry I’m guessing about that last one–I can’t find the guy’s post.)

Linguistics geekery, which you should feel free to skip: one of my homeworks in Phonetics 101 was to look at spectrograms and find indications of syllabic association, which can correspond to word segmentation, on occasion. It’s possible to do so–sometimes. For nasals in French, as far as I know, it would be restricted to some variability in when a vowel is nasalized before a nasal consonant, versus when it’s produced as a sequence of an unnasalized vowel before a nasal consonant. American English speakers, who have no contrast in nasalization versus lack of nasalization before a vowel, are unlikely to be able to perceive it, and I don’t know at what age a French kid would be likely to acquire it.

I have no clue how the current situation will or should be resolved. Obviously, if your town is being destroyed by the Syrian government, or ISIS, or whatever other assholes are causing death and misery in the Middle East these days, it makes sense that you would take your family and go elsewhere, and it’s simple human decency to shelter people in that situation. However, the situation is not clear in other ways–even the fact that the Wikipedia article on the subject is titled European migrant crisis and not European refugee crisis is a loaded choice, and one that has implications about how the people who are affected should be treated. The situation continues to evolve, with European and world sympathies tilting now one way and now the other–in favor of sheltering the affected people after a tragedy like the widely-publicized drowning of a Syrian toddler, and in opposition to it after the despicable assaults on women by crowds of migrant men last New Year’s Eve in Germany. Certainly the situation will have long-range effects on Europe. I began this post by talking about one of the ways in which the world sucks right now–the existence of this crisis. One of the ways in which the world doesn’t suck right now is that many people in many countries have been very active in welcoming refugees, providing real support services for them, and generally acting like decent human beings. This will get worked out.

It’s raining, it’s pouring, the old man is snoring: how to talk about rain in English and French

How to talk about rain in English and French.

It’s raining, it’s pouring, the old man is snoring,

He went to bed and he bumped his head and he didn’t get up ’til the morning.

–Children’s song

Adam Gopnik once described Paris as “a scowling gray universe, relieved by pastry.” The “gray” part comes from the observation that it’s very often cloudy here. Actually, one of the things that I love about Paris is that it rains here. In the US, I live in a very sunny, dry part of the country–300 days of sunshine a year. However, I grew up in a very, very wet part of the country, and I miss that. So, coming to Paris in March and seeing flowers bursting from wet earth on my walk to work through the forest is a real treat.

Being from a very wet place, I have a large vocabulary for talking about rain in English. Here are some examples of relevant verbs. These are all impersonal verbs, using what linguists call a pleonastic pronoun, i.e. it:

to rain: the default verb.
to pour: to rain hard–see the children’s song above.
to rain cats and dogs: to rain hard.
to rain/pour buckets: to rain hard.
to mist: to rain very lightly.
to drizzle: to rain, especially if it’s cold. (I’ve seen a couple definitions of this as “to rain lightly.”)
to sprinkle: to rain, especially for a short period of time.
to storm: to rain very hard, often with thunder and lightning.

Usage examples:

If it rains tomorrow, the game will be cancelled. (http://forum.wordreference.com/threads/if-it-rains-its-raining.1045608/)
It’s pouring outside. (https://www.youtube.com/watch?v=7Gsdme47oUw)
It’s raining cats and dogs. (http://english.stackexchange.com/questions/14273/the-etymology-of-the-phrase-its-raining-cats-and-dogs)
After lunching, Caesar went out to post some cards, and as it was raining buckets, he took refuge in the arcades of the Piazza Esedra — Caesar or Nothing by Pio Baroja
It misted in the morning, but by noon it was clear. (https://www.google.fr/search?q=%22it+misted%22&ie=utf-8&oe=utf-8&gws_rd=cr&ei=xC8fV_nGMMLYgAaktIS4Bg)
Five activities for Paris when it drizzles (Note: it is NOT Paris that is drizzling!) (http://www.eurocheapo.com/blog/paris-in-the-rain-5-activities-for-paris-when-it-drizzles.html)
My mom told me it’s sprinkling, not raining. (Two adorable kids argue about whether it’s raining or sprinkling: https://www.youtube.com/watch?v=3sKdDyyanGk)
It stormed for hours, and when the thunder and lightning finally quit, it just kept on pouring. I got scared. (Click here–very long URL.)

pleuvoir: to rain. Il pleut: it’s raining. (I always seem to confuse this with il pleure, “he’s crying.”)
Il pleut à verse: it’s pouring. (Native speakers: can we do the liaison here?, i.e. il pleu tà verse?)
Il pleut des cordes: it’s raining cats and dogs, it’s pouring rain.
Il tombe des cordes: same thing.
Il bruine: it’s misting.
Il crachine: it’s sprinkling.
y avoir de l’orage: to storm.
faire de l’orage: to storm.

I’ve focussed entirely on verbs here. For lots of nouns and adjectives related to rain in English, see this great post from the EngVid.com web site.
