Parallel corpora, collocations, and crazy people on the Métro

In which an encounter with a crazy guy on the subway leads to a statistical analysis of French adverbs.

One evening I was riding the metro home when a guy got into the car with some used books to sell.  A man sitting across the aisle from me asked to see them.  He flipped through one of them, then took a pen out of his jacket pocket and began circling words–in this book that the other guy was trying to sell.  Are you going to buy that?, the would-be bookseller asked the guy with the pen.  They exchanged words–the bookseller was not happy about having his books marked up.  The bookseller said something that Mr. Pen apparently thought was obvious or stupid.  Il est fort, lui, he snorted–he’s a sharp one. 

The central meaning of fort/forte is “strong,” but it can also be used adverbially.  You hear it a lot that way, and I’ve been trying to figure out exactly when you can use it in that way–it’s often the case that there are word combinations that are possible in a language, but that don’t sound right.  Rather, there are particular words that are conventionally used in very specific combinations.  Violeta Seretan of the University of Geneva gives some examples of English words that are used to describe the magnitude of various nouns.  The semantics of each of these is the same, but the words that are typically used are quite different.  We talk about big problems, heavy rain…  How about injury?  (Answer below.)  It would certainly be possible to say large problem, but it’s nowhere near as likely, and it sounds odd to a native speaker.  I wanted to be able to demonstrate that this corresponds to some actual statistical tendency, not just my intuitions, so I searched the enTenTen corpus, a collection of almost 20 billion words of written English, looking for big problem and large problem.  Here are the frequencies:

  • big problem: occurs 6 times per million words.
  • large problem: occurs 0.5 times per million words.

Big problem occurs twelve times more often than large problem–the latter is possible, but it’s not really what you would expect to hear from a native speaker.  We call things like big problem “collocations”–combinations of words that occur together more often than you would expect by chance.
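For anyone who wants to check the arithmetic, here is a minimal sketch in Python.  The two per-million rates are the ones reported above; the corpus size is my rounding of "almost 20 billion words," so the raw counts it prints are rough estimates, not figures from the corpus itself.

```python
def ratio(rate_a, rate_b):
    """How many times more frequent is A than B?"""
    return rate_a / rate_b


def approx_raw_count(rate_per_million, corpus_size):
    """Convert a per-million rate back into an approximate raw count."""
    return rate_per_million * corpus_size / 1_000_000


BIG_PROBLEM = 6.0      # occurrences per million words (from the post)
LARGE_PROBLEM = 0.5    # occurrences per million words (from the post)
CORPUS_SIZE = 20_000_000_000  # assumption: "almost 20 billion", rounded up

print(ratio(BIG_PROBLEM, LARGE_PROBLEM))
print(approx_raw_count(BIG_PROBLEM, CORPUS_SIZE))
print(approx_raw_count(LARGE_PROBLEM, CORPUS_SIZE))
```

Even the "odd-sounding" combination would show up thousands of times in a corpus that big, which is why the ratio is more informative than either count alone.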

You can find collocation dictionaries for English, and they’re quite useful for second-language learners.  I don’t know of any for French, though, or at least not where to find them in the US, which is where I am at the moment.  (I’ve seen similar things in Canada.)  I additionally want to know how these adverbial uses of fort should be translated into English, so I need a way to figure this kind of thing out for myself.

First step: find a whole lot of French text in some easily searchable form.  I started with the French section of EUROPARL–a collection of documents from the European Parliament, translated to/from a wide variety of languages.  The French section of EUROPARL contains about 59 million words–so, a whole lot–and you can access it through the Sketch Engine web site–so, easily searchable.  A quick search showed me that fort is quite common in that data set:

Screenshot 2016-04-10 13.23.54
Fort shows up 17,130 times in the French section of the EUROPARL corpus–257 times per million words.  That’s pretty frequent.

Once I know that, I know that there will be enough data to calculate the collocations–recall that this is a statistical thing, so you need plenty of data.  The Sketch Engine interface gives me a number of options for how to do the calculations (scroll down to get past the screen shot):

Screenshot 2016-04-10 13.26.44

…which I show you just so that you’ll see that there are a lot of approaches to doing this. I just went with the defaults.

The calculations yielded quite a few possibilities.  Here are some of them:

Screenshot 2016-04-10 13.30.59

If you’re a stickler for data, you might have noticed that the collocations are ordered by the log of the Dice coefficient, which you could think of as a measure of the statistical effect, I guess.  I am really looking for the most common collocations involving fort, though, so I’ll reorder by the cooccurrence count, i.e. the raw count of how often the collocations occurred:
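If you are curious what that score looks like, here is a minimal sketch of logDice as it is commonly defined: 14 plus the base-2 log of the Dice coefficient.  I'm assuming that this is what the default setting computes.  The counts for fort and for fort bien are the ones from this post; the count for bien on its own is a made-up placeholder, since I didn't record it.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: how often the two words occur together
    f_x, f_y: how often each word occurs on its own
    """
    dice = 2 * f_xy / (f_x + f_y)
    return 14 + math.log2(dice)


F_FORT = 17_130        # from the post
F_FORT_BIEN = 683      # from the post
F_BIEN = 120_000       # hypothetical placeholder, NOT a real corpus count

print(round(log_dice(F_FORT_BIEN, F_FORT, F_BIEN), 2))
```

The score tops out at 14, when the two words never occur without each other.  A very frequent word like de has an enormous count of its own in the denominator, which is why it can co-occur with fort all the time and still get a mediocre score.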

Screenshot 2016-04-10 13.53.36

Crap–that basically tells me nothing.  Why not?  Zipf’s Law.  Remember that Zipf’s Law tells us not only that most words are pretty rare, but also that some words are really, really common, and in French, that certainly includes de (“of”), et (“and”), une (“a”), and the rest of what we’re seeing here.  (Moral of the story: don’t expect the most frequent things in a language to necessarily be the most revealing things in a language.)  If I scroll down a bit, though, I see bien on the list.  683 examples of this–a frequency of 10.25 per million words.  Bien is often an adjective, which would presumably make fort adverbial in these cases, so we’re on to something now.  Let’s check out some of those examples:

Screenshot 2016-04-10 13.58.14.png

So, now I have some cases where it would make sense to use fort, but I want to know how they would correspond to English, too.  This requires that I have access to the corresponding English text.  No problem–recall that the EUROPARL corpus is multilingual.  In particular, it is what is known as a parallel corpus, which means that it contains the same contents in multiple languages, not just similar contents (although that kind of corpus can be useful, too).  I searched for the phrase fort bien.  Here’s an example of the output:

Screenshot 2016-04-10 14.12.24

So, now I have some French/English equivalents for fort bien:

  • Étant donné les prévisions de la politique structurelle – que je connais fort bien  With these forecasts of the structural policy – which I know very well
  • ce que Jean-Pierre Chevènement a fort bien nommé récemment… referred to recently, and very aptly, by Jean-Pierre Chevènement
  • C’est pourquoi, comme l’a déjà fort bien expliqué M. Kalas  Hence, as Mr Karas has stated to his credit
  • je comprends fort bien la préoccupation  … I have a great deal of sympathy for the unease
  • Vous savez fort bien que…  You know very well that
  • non seulement parce que le président le connaît fort bien…  …not only because the President is very familiar with it…
  • Il est fort bien d’ organiser des réunions, mais ce sont les résultats qui comptent.  Meetings are all very well, but it is the result that counts.
  • ils se tirent fort bien d’affaire.  …they are managing really rather well.
  • et je les comprends fort bien.   …which I fully understand.
  • Ils les connaissent fort bien et un par un.  They recognise each and every one of them very well.

I’m feeling good about how to use fort bien now, but I want to know about other ways that fort could be used with an adjective.  So, I’ll do another search of the parallel corpus (i.e. the matched French and English texts), but this time I’ll just search for fort, and I’ll specify that I want it to be an adverb.  Here are some of the results:

Screenshot 2016-04-10 13.39.56

Now I have some general examples of how to use fort:

  • Nous estimons fort positif que  We see it as a very positive sign that
  • Le rapporteur constate également fort justement que The rapporteur has also quite rightly stated that
  • Ce que nous faisons maintenant est probablement fort important…  What is being done may well be very important
  • …l’ Union européenne a fort justement octroyé  …the European Union was right to support…
  • nous entretenons des relations bilatérales fort satisfaisantes avec  …We have very satisfactory bilateral relations with

I don’t know every adjective with which it would be OK to use fort, but I know one more than I did when I got out of bed this morning, and I’m cool with that–one less time when I’ll have to use très, which is all that they teach us in school.

A colleague had some observations on this:

On top of being used in collocations, it also marks a style / genre which is somewhat formal or elevated (“soutenu”). This might explain why it remains frequent mostly in collocations and is less frequent (or more marked) in freer combinations. This gives the expression a literary turn or a pretense to a higher register.  Both in speech and in writing, it is “soutenu.”

Another native speaker had this to say about it:

“Fort” is used as a synonym of “très”, before adjectives or adverbs. You can use it in about any case, it’s just more elegant than “très”, but not really literary.

The Mr. Pen guy on the subway turned out to be pretty crazy, as far as I could tell.  At one point he snapped at my adorable cousin, who happened to be visiting, and I told him to cut it out.  This was followed by an initially amusing conversation between him and me that at some point degenerated into a loud tirade on his part.  I kept telling him that my French wasn’t that good and I couldn’t understand him, but he just kept going and going.  Eventually French people around us began telling him to stop being an asshole and words to that effect, so I assume that it wasn’t very nice, but honestly, I couldn’t tell you.  At some point a large and very drunk French guy got on the subway car, and started seriously getting in Mr. Pen’s face–it was clear that this was going to turn violent.  Mr. Pen was a very diminutive Haitian man, and I wasn’t going to watch him get the shit beaten out of him no matter how bizarre he was being, so I got involved.  The train stopped, Mr. Pen jumped out, and Mr. Drunk Guy launched into an animated discussion with me about American heavy metal, punctuated by snatches of Metallica songs.  All in all, an unusual evening on the metro, but not an unpleasant one by any means–just part of life in The Big City, as we say in English.

Oh: it’s serious injury.

 

 

The Great Sardine Can Mystery

My search for a healthier breakfast leads to a three-day investigation of a one-syllable word.

2016-03-16 17.33.33
Picture source: me.

I’ve been struggling to get up the hill on the way to work lately.  I decided that this was due to my proteinless French breakfast of coffee, bread, and Nutella, and stopped at the little store just outside the train station and picked up a can of sardines for after the climb.  Zipf’s Law being what it is, this set off a three-day struggle to figure out how to read the label on the can.  I spend a lot of time in France trying to differentiate and remember the meanings of words that look alike, and this was one of those occasions.  After three days of this, I still don’t have it all figured out.  At its base, this is an issue of various and sundry words that look or sound like forms of the word arrêter.  Read on if you want to feel my pain.

2016-03-16 17.33.08
“Nothing stops you.”  Picture source: me–it’s a water bottle that I got from the cafet’.

arrêter: The basic meaning of this verb is “to stop,” which is simple enough, but there are some subtleties involving the pronominal form of the verb (s’arrêter) and “direct” versus preposition-specific forms of the verb.

arrêter: The verb can also mean “to arrest,” as in taking someone into police custody.  Scroll down–this picture is too good to shrink.

Screenshot 2016-03-27 03.15.40

Screenshot 2016-03-27 03.17.13
“VIDEO. Ukraine: Chewbacca arrested after campaigning for Darth Vader.” Picture source: http://www.lexpress.fr/insolite/video-ukraine-chewbacca-arrete-apres-avoir-fait-campagne-pour-dark-vador_1729418.html.

arrêter de: this is followed by an infinitive, and would translate as “to stop verbing.”

Screenshot 2016-03-25 00.52.09
“The ten rules for stopping smoking.” Picture source: screen shot of http://www.stop-tabac.ch/fr/10-regles-pour-arreter-1.
Screenshot 2016-03-25 00.54.53
“I’m going to stop judging myself.” Picture source: screen shot of https://www.facebook.com/pages/Jarr%C3%AAte-de-me-juger/634289229988133.

l’arrêt: a stop, as in a bus stop, a work stoppage, etc.  Also: a decision, as of a court.

Medical-care-specific: The verb arrêter can have a very specific meaning, which is to put someone on sick-leave.  The subject of the verb presumably has to be someone who is capable of putting you on sick-leave.   (Linguistics geekery: this kind of behavior, where the meaning of a verb can change substantially depending on the subject and/or object of the verb, is probably best accounted for by the Generative Lexicon theory, pioneered by James Pustejovsky of Brandeis, and more recently elaborated by Elisabetta Jezek of the University of Pavia.)

Screenshot 2016-03-25 00.59.40
Mon médecin m’a arrêté: “my doctor put me on sick leave.” Picture source: screen shot of http://www.viedemerde.fr/travail/7503604, the #shitlife web site.  (Sorry–that’s what vie de merde means.)

However, just because “doctor” is the subject doesn’t mean that the verb necessarily has that meaning:

Screenshot 2016-03-25 01.02.41
“It’s because my doctor took me off of The Pill.” Picture source: screen shot of a page from the book “Lettre ouverte d’un médecin à une société malade,” by Alain Bellaiche, taken from Google Books.

Here, the m’ is an indirect object pronoun, and it’s la pilule (“The Pill”) that is the direct object.  (That bit of extra information probably doesn’t help much–sorry!)

l’arrêté (n.m.): this is a noun meaning “order” or “decree.”  It shows up quite a bit in official communications of various and sundry sorts–see below.

Screenshot from 2016-03-25 11:34:02
“The decree concerning fighting termites.” Picture source: screen shot from http://www.ville-guyancourt.fr/Cadre-de-vie/Urbanisme/L-arrete-de-lutte-contre-les-termites.

l’arête (n.f.): another noun, meaning (among other things) “fishbone.”  This is the one that finally drove me over the edge to look up all of these various and sundry meanings.  I would’ve gotten this one a lot quicker, but it took me, like, three days to realize that there was only one R.

2016-03-16 17.33.33
“Sardines in extra-virgin olive oil–boneless.” Picture source: me.

  • un arrêt: a judgement or decision, in a legal context.  Un arrêt de la cour: a court ruling.
  • un chien d’arrêt: a pointing breed of dog.
  • être à l’arrêt: to be on point (of a dog).
  • Le chien est à l’arrêt: the dog is on point.

(Thanks to native speaker phildange for these.)

Even after this in-depth investigation, I don’t understand all of the various and sundry permutations of these words with their as, their rs, their ês, and their ts.  Here’s an example that I came across–I have no clue whatsoever what it means.  (Various native speakers have suggested that it’s an error–see the Comments section.)  PS: in the end, sardines aren’t that great of a solution to the whole I-need-more-protein problem–they smell, and I do have office mates.  Time to ramp up my already-enormous cheese consumption yet again, perhaps…

Screenshot 2016-03-25 00.47.33
Picture source: screen shot of the lemonde.fr web site.

 

I guess it’s not so secret any more: why linguists study children’s language games

Why would a linguist study children’s language games? It turns out that they can tell us a lot.

Screenshot 2016-04-08 13.19.56
Picture source: screen shot from the LINGUIST mailing list.

I subscribe to a mailing list that gets me news about current events in linguistics–upcoming conferences, tables of contents of new journal issues, fellowship opportunities–and notices about newly published books.  Often I look at some of this stuff and wonder: what the hell must non-linguists think when they see something like this?  Today’s email brought me an excellent example of the phenomenon: the book notice that you see above.  How could a Berber equivalent of the Pig Latin that most of us learned as kids (unless you’re French, in which case maybe you learnt Louchebem) possibly be worth a book-length treatment?  Actually, secret languages, also known as language games (described as game-like variants on some actual language, typically used by kids to mystify the uninitiated), can be quite interesting, from a linguistic point of view.  For example: in teaching introductory linguistics, many of my fellow grad students would use an example from Pig Latin to illustrate a non-intuitive fact about the English sound system.  In English, the sound that we spell ch is actually a combination of two sounds–t, as in tip, and sh, as in ship.  Say t-ship with the t and the sh immediately next to each other–tship–and you’ll find that it comes out as chip.  If you survey a large class of Ohio State undergraduates, you’ll find some for whom the Pig Latin word for chip is ipchay.  For others, it’s shiptay.  What does that tell you?  For the Buckeyes (Ohio natives) with the shiptay form of chip, it’s pretty clear that ch is, on some level, represented mentally as the sequence of two sounds that it actually is.
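To make the ipchay/shiptay difference concrete, here is a toy sketch in Python.  It assumes a simplified Pig Latin rule (move only the first sound to the end and add ay), and the two lists of "sounds" are just my spelling-based stand-ins for the two mental representations, not real phonetic transcriptions.

```python
def pig_latin(sounds):
    """Toy rule: move the first sound to the end, then add 'ay'."""
    first, rest = sounds[0], sounds[1:]
    return "".join(rest) + first + "ay"


# Two hypothetical ways a speaker might represent "chip":
chip_as_one_sound = ["ch", "i", "p"]        # ch treated as a single unit
chip_as_two_sounds = ["t", "sh", "i", "p"]  # ch treated as t + sh

print(pig_latin(chip_as_one_sound))   # ipchay
print(pig_latin(chip_as_two_sounds))  # shiptay
```

The rule is identical in both cases; the only thing that changes is what counts as the "first sound," which is exactly why the game output tells you something about the representation.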

Googling around a bit for examples of the use of secret languages in linguistic research, I came across this paper from Ruth Day at the famous Haskins Labs.  Day developed a simple secret language (take English words and substitute an r for every l and an l for every r) and taught it to subjects.  She also put them through what are known in the psychology literature as dichotic fusion tests.

normal bimodal distribution
Normal distribution (upper left): results cluster around a single typical value, plus or minus a bit.  Bimodal distribution (lower right): results fall into one of two groups, plus or minus a bit.  Picture source: click here.

Dichotic fusion tests assess how people process sounds.  They have an unusual property.  Most tests of sound processing have what’s called a normal distribution.  This means that there’s some typical result, and the results mostly cluster around that value.  In contrast, dichotic fusion tests are bimodally distributed–rather than everyone clustering around some typical value, people fall into one of two categories.  Day found that some of the subjects were good at learning the secret language, and some of them weren’t.  She also found a relationship between how people behave on dichotic fusion tests and how adept they are at learning the secret language: people who were good at learning the secret language mostly fell into one group on the dichotic fusion test, and people who were bad at learning the secret language mostly fell into the other group on the dichotic fusion test.  She speculates that this might be related to individual differences in how “bound” different speakers are by the nature of what the pioneering structural linguist Ferdinand de Saussure called langue and parole–two very different ways of categorizing language.  It’s not immediately obvious that these two different categories exist, and you could interpret Day’s experimental findings as being consistent with the hypothesis that they do.  (And, yes–linguistics as we know it today was invented by a French-speaking Swiss guy.  Even the English-language technical vocabulary of linguistics has kept Saussure’s original French terms, langue and parole.)
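If the normal-versus-bimodal distinction is new to you, here is a small simulation that draws the two shapes as text histograms.  The numbers are invented purely for illustration (a single cluster around 50 versus two clusters around 30 and 70); they have nothing to do with Day's actual data.

```python
import random


def sample_unimodal(n, rng):
    """One cluster of scores around a single typical value."""
    return [rng.gauss(50, 10) for _ in range(n)]


def sample_bimodal(n, rng):
    """Two clusters: half the 'subjects' near 30, half near 70."""
    return [rng.gauss(30 if i % 2 == 0 else 70, 5) for i in range(n)]


def text_histogram(values, lo=0, hi=100, bins=10, width=40):
    """Count values per bin and draw each bin as a row of # marks."""
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            index = min(bins - 1, int((v - lo) / (hi - lo) * bins))
            counts[index] += 1
    peak = max(counts) or 1
    rows = []
    for i, count in enumerate(counts):
        left_edge = lo + i * (hi - lo) // bins
        rows.append(f"{left_edge:3d} | " + "#" * round(count / peak * width))
    return counts, "\n".join(rows)


rng = random.Random(0)  # fixed seed so the picture is reproducible
uni_counts, uni_plot = text_histogram(sample_unimodal(1000, rng))
bi_counts, bi_plot = text_histogram(sample_bimodal(1000, rng))
print(uni_plot)
print()
print(bi_plot)
```

In the first picture the middle rows are the longest; in the second, the middle is nearly empty and the marks pile up in two separate places.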

Language games are sometimes presented as a form of evidence regarding speakers’ models of syllable structure.  You didn’t know that you have a model of syllable structure?  That’s the nature of knowledge about language–it’s mostly not conscious, and, as we say, “not accessible to introspection”–meaning, even if you think about the rules of language and try to figure them out, you mostly can’t.  (If you’re an English speaker: can you explain when to use the and when to use a?  Probably not, but you certainly know, on some unconscious level, how to do so, and you certainly recognize when someone who doesn’t natively speak a language that has an equivalent of the and a messes them up in English.)  The ship-tay speakers were surprised to have this pointed out.  It wasn’t something that they were consciously aware of, but on some level, they seemed to “think” of ch as a sequence of t and sh. 

The French connection: there’s a form of slang in France called Verlan.  It’s not clear whether Verlan should be considered a secret language/language game as such, versus a form of slang, but even if it should be considered a slang, it is clear that its words are formed by a language game.  Phildange explained a bit about how it works in his comments on a recent post.  From a cross-linguistic perspective, it’s quite unusual.  If you observe secret languages from around the world, they tend to work on the basis of one or two of four different kinds of phonological processes (phenomena involving doing something to sounds):

  • insertion of sounds
  • rearrangement of sounds
  • substitution of sounds
  • deletion of sounds

From the perspective of this kind of classification, France’s Verlan is unusual in that it combines a multitude of different kinds of phonological processes.  For lots of details, see this set of lecture notes from Stuart Davis of Indiana University.
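Here is a rough sketch of the four kinds of processes, operating on words that I have already chopped into syllables by hand.  The rearrangement examples (métro becoming tromé, bizarre becoming zarbi) are real Verlan; the insertion and deletion "games" are invented for illustration, and everything works on spelling rather than on actual sounds.

```python
def insert_sound(syllables, extra):
    """Insertion: add the same extra material after every syllable."""
    return [s + extra for s in syllables]


def rearrange(syllables):
    """Rearrangement: reverse the order of the syllables."""
    return syllables[::-1]


def substitute(syllables, mapping):
    """Substitution: replace letters according to a mapping."""
    table = str.maketrans(mapping)
    return [s.translate(table) for s in syllables]


def delete(syllables, sound):
    """Deletion: drop every occurrence of one sound."""
    return [s.replace(sound, "") for s in syllables]


print("".join(rearrange(["mé", "tro"])))                       # tromé
print("".join(rearrange(["bi", "zar"])))                       # zarbi
print("".join(substitute(["re", "al", "ly"], {"r": "l", "l": "r"})))  # learry
print("".join(insert_sound(["mé", "tro"], "ba")))              # hypothetical game
print("".join(delete(["mé", "tro"], "r")))                     # hypothetical game
```

A game that uses only one of these functions fits the typical cross-linguistic pattern; Verlan is the unusual case where you would need to chain several of them together.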

I hope that no Berbers were planning on using the waw/ra? secret language to pass messages around linguists in the future, as I guess it’s not so secret any more.  Are you thinking that the book about it would make a great Christmas present for someone?  You can pick up a copy here.

Ground game: broken arms and politics

In the US, politics and judo have some things in common. Here’s some English vocabulary for talking about them.

ronda rousey mesha tate
Ronda Rousey has one of the best ground games in the world. Here she arm-bars Miesha Tate. Go to Google Images to find pictures of what Tate’s arm looked like afterwards. Picture source: http://www.mmamania.com/2012/5/4/2998793/miesha-tate-arm-injury-update-ronda-rousey-strikeforce-ufc-video.
France is the #2 judo country in the world, after Japan.  The population of France is about 66 million people, and about 550,000 of them do judo.  (For comparison: the population of the US is about 330 million people, and about 20,000 of them do judo.)  The first person I met in France was a diminutive, beautiful woman in her 50s or so who I ran into at a judo practice.  She’s nowhere near my size, but can arm-bar me every 7 minutes or so, on average.  She’s a great example of French judo: she beats me (over and over) not with strength, but with a subtle, contemplative approach to the sport that relies on imagination and on a deep understanding of how to move in three dimensions and apply basic principles of leverage and physics efficiently–and gently.  (Sorta like the famous French diplomacy, I guess.)  In judo, we would say that she has a great ground game–the ability to fight on the mat, off your feet, where we use not the throws of standing judo, but arm-bars, chokes, and pins.

The phrase ground game has been in the news quite a bit lately.  We often hear about what a great ground game Bernie Sanders has, or about how Trump keeps winning state primaries despite not having a good ground game.  In the context of politics, your ground game is how good your campaign is at the very local tasks that require actual personal involvement–particularly, getting your supporters to the polls.  A good ground game requires two things.

  1. You have to know who your supporters are.
  2. You have to have engaged, committed volunteers everywhere.

Regarding the first: today, this is mostly a matter of data science.  Sasha Issenberg’s book The Victory Lab does a very good job of telling the story of the development of today’s personalized, data-driven politics.  Once, politicians and political parties put a lot of effort into trying to convince people to get behind their ideas.  Today, it’s generally thought that trying to change people’s minds is expensive and inefficient; on the other hand, getting the people who already support you to actually go to their polling place and vote is relatively inexpensive, and it’s quite effective.  In 2008, the Obama campaign was able to develop pretty good guesses about who was going to vote for their candidate (how they did it is really interesting, but somewhat sobering–see the above-mentioned book), and they focussed their get-out-the-vote effort on those people.

Regarding the second: this is the essence of the ground game.  Cruz’s win in the Iowa caucuses this nominating cycle was widely attributed to his strong ground game.  One of the many, many mysteries of the Republican race for the nomination has been that Trump has done quite well despite not having much of a ground game anywhere.

 

Gender and (you got no) class

swahili-noun-classes
Swahili noun class markers. Picture source: http://www.kiswahili-mangat.com/.

Many languages have a phenomenon such that nouns belong to groups that affect things about the words with which they occur.  French is such a language.  You can more or less put French nouns into two groups, as follows:

  • For one group, the singular definite article (“the”) is le, the singular indefinite article (“a”) is un, the adjective “big” is grand, and the adjective “boring” is ennuyeux.
  • For the other group, the singular definite article (“the”) is la, the singular indefinite article (“a”) is une, the adjective “big” is grande, and the adjective “boring” is ennuyeuse.
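One way to picture this is as a lookup table: the noun's class picks the row, and everything that accompanies the noun gets its form from that row.  Here is a toy sketch using the forms listed above.  The two nouns are my own examples, and always putting the adjective after the noun is a simplification (grand normally goes before it).

```python
# Forms taken from the two bullet points above; "m" and "f" are just labels.
AGREEMENT = {
    "m": {"the": "le", "a": "un", "big": "grand", "boring": "ennuyeux"},
    "f": {"the": "la", "a": "une", "big": "grande", "boring": "ennuyeuse"},
}

# Example nouns (my choice, not from the post): film is in the first
# group, réunion ("meeting") is in the second.
NOUN_CLASS = {"film": "m", "réunion": "f"}


def noun_phrase(article, noun, adjective):
    """Build article + noun + adjective with forms chosen by the noun."""
    forms = AGREEMENT[NOUN_CLASS[noun]]
    return f"{forms[article]} {noun} {forms[adjective]}"


print(noun_phrase("a", "film", "boring"))       # un film ennuyeux
print(noun_phrase("the", "réunion", "boring"))  # la réunion ennuyeuse
```

Notice that nothing about the noun itself changes; its class only shows up in the words around it, which is what makes it a grammatical fact rather than a fact about meaning.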

When a language has two or three of these classes, the language is typically said to have a gender system.  So, French has two of these classes, and we call the nouns in these classes masculine and feminine nouns.  German has three of these classes, and we call them masculine, feminine, and neuter nouns.  Most dialects of Yiddish have three of these classes, but Lithuanian Yiddish has two.  English has basically no such classes–we have words that are sort of intrinsically masculine, like father, and words that are sort of intrinsically feminine, like mother, but since they don’t affect the forms of the words with which they appear (you say the mother and the father, with no differences in the word the), linguists wouldn’t call it a gender system.  On the other hand, Old English (spoken from around 450 to around 1150) had three noun classes.  (Look at the different forms of the word the in these three Old English nouns, taken from Wikipedia: sēo sunne (“the sun”), se mōna (“the moon”), þæt wīf (“the woman/wife”).)  A language on which I did research in graduate school only has two such classes, but referring to anything by the wrong class is a way to insult it.  It doesn’t matter which of the two classes it belongs to–if you use the wrong modifiers, it’s an insult.  I was terrified to ever open my mouth, and don’t speak it at all.  (My son often played in the corner of the office while I collected data.  It’s quite amazing to hear dô páráná come–correctly–out of the mouth of that blond-haired, blue-eyed, video game addict today.)

There’s nothing magic about the numbers two and three–languages can have more or less arbitrary numbers of these classes.  We tend to refer to them as genders when there are just two or three, and to refer to them as noun classes when there are more than that, but there is no difference between what we call the gender system in French, with two noun classes, and what we call the noun class system in Shona, which has twenty noun classes.  It’s a difference of numbers, not of kind–in both cases, you have this more-or-less arbitrary slicing up of the nominal lexicon (noun vocabulary) of the language into groups of nouns that affect the forms of articles, adjectives, etc. in various and sundry ways.

I say “various and sundry” because gender/noun class systems can work out in lots of different ways.  In Semitic languages, verbs agree with the gender of their subjects.  For example, in Hebrew, he studied is lamad, while she studied is lamda.  In the first case, it’s the pattern of having the two a-a vowels that makes it the masculine form of the verb, and in the second case, it’s the a in the middle, the md coming together (versus mad in the masculine form), and the a at the end that make it feminine.  Different verbs, tenses, and numbers (that is, singular versus plural) have different forms, so don’t get excited about the fact that there’s an a at the end of the third person singular past tense feminine form of the verb–it’s not that way all the time.  For example, he goes is holekh, while she goes is holekhet. 

Does having classes of nouns in your language–or not having them–make a culture more or less sexist?  I only have anecdotes here, and–counter to what you might hear–anecdote is not the singular of data.  For what it’s worth: my undergraduate advisor always used to point out that Hebrew is about as gendered of a language as you can get (see above–even verbs have to have gender in Hebrew), and probably close to everyone in Israel speaks either Hebrew or Arabic (which has the identical system), but Israel was the fourth country in the world to elect a woman as the head of state.  In contrast, Finnish has no gender whatsoever, but has never had a female head of state, as far as I know.  (This is not to imply anything bad about Finland–there are a bazillion countries with genderless languages that have never had a female head of state.  I don’t know why my professor picked on the Finns.)

English note concerning the title of this post: using the word got (or gots) as the present tense of the verb to have is a social marker of class–that’s “class” in the sense of couche sociale.  Lower class, specifically.  Other speakers might use it for humorous effect.  “To have class” means something like to have elegance of style or manners.  So, if you say you got no class, man, part of the flavor of the expression comes from the fact that you’re using a “low (social) class” verb form to talk about “class” in the sense of elegance.

 

Academia, industry, and graduate degrees: the speech/language technology version

People often ask what the job situation in language processing and speech technology is like for people who don’t have PhDs. Here’s my perspective.

No French language stuff in this post–it’s a response to a question that I get fairly often from people who are thinking about leaving graduate programs in natural language processing or speech technology.

I was recently asked: What’s it like to work in the speech and/or language technology industries with a master’s degree?  To understand the question, you have to realize that the alternative would be to work in the speech and/or language technology industry with a PhD.  People in these fields will typically have one graduate degree or another (or both)–the question relates to differences in how you will experience the industry world depending on which of the two you have.  To understand the context of my answer, it’s probably helpful to know that I got a master’s degree, went into industry, went back into academia with a master’s degree and became a researcher, and then got a PhD.  So, I have a little bit of familiarity with academia and with industry, as well as with the experience of working when you have a PhD versus working when you don’t.

The short answer: there isn’t necessarily any difference between what it’s like to work in industry with a master’s degree and what it’s like to work in industry with a PhD.  If you can take a problem and come up with an answer by yourself, and the answer works; if you can identify useful new problems; and if you can propose, implement, and design the evaluation for some project yourself, then you’re going to be treated like a PhD, and if you negotiate reasonably well, you’re going to get paid like one.  Now, the same is actually true in academia, the primary difference being that with a master’s degree, you’re very likely to be working in someone else’s lab and writing grants for someone else, versus having your own lab and writing grants for yourself. (I did this for quite a while, and I loved it.)  This is also true of working in a private think tank like MITRE or BBN, except that you can write your own grants in places like that, at least for internal funding–external agencies like NSF and NIH are not very likely to fund you if you don’t have a PhD.

A longer answer: in industry, having a master’s degree versus having a PhD is likely to affect the position into which you are first hired, but it won’t necessarily have an impact on what your job is like.  Companies that do speech and/or language processing work are accustomed to using the level of your graduate degree as an indicator of how likely you are to be able to do independent work when they’re considering hiring you, but once you’ve talked your way into a job, they generally care much more about the results that you do (or don’t) produce than they do about your academic pedigree.  A friend’s ex-husband tried to get a PhD in computer science, failed his comps, and then went into the robotics industry–a good example of a field that’s very oriented towards writing code and towards engineering, but also very much an area where there is a need to be able to do real research in addition to development.  He is a guy who did quite well in industry, eventually starting his own company, which he sold for “major bank,” as the kids say (the expression means “a lot of money”).  My friend described her ex’s experience like this:

His fear was that he’d never be able to be the principal investigator on a grant (still the case, I believe) and that he wouldn’t be taken seriously.  I think that was the case initially, too…certainly in terms of starting position and compensation.  He had to prove himself once in the door and build credibility that was conveyed more automatically with a PhD (though I’m sure a PhD could lose that credibility if they didn’t prove worthy).  At this point, I don’t think he’s in any different place than he would have been with a PhD.  He just had to have more to offer to get those initial doors open than he would have with a PhD.

A master’s degree won’t keep you from rising through the ranks of the industry world, and a PhD won’t keep you from getting fired–I’ve certainly seen both of these happen.

You’ll find that what you learnt in graduate school can be super-helpful in industry.  The opposite is also true–the time that I spent in industry was the best thing that ever happened to my academic career.  Another opposite is true, too–the things that you learn in grad school can hurt you in industry.  I’ll give you some examples of both.

Things you learn in graduate school that can be helpful in industry: the ability to define a problem, state it clearly, figure out how to evaluate it, and communicate what you’ve done is super-useful in industry (and in life in general!).  In fact, I suspect that a lot of my success in industry (to the extent that I had it–objectively, I have been offered a promotion in every industry job I’ve ever had, which I guess is a sort of success) was related to the fact that if I had an idea, I could communicate it more clearly than pretty much everyone else–not because I’m smarter than anyone else, but because I have a degree in English (among other things), and in the process of getting that degree, I learnt to write (reasonably) clearly, and quickly.  All other things being equal, a well-crafted email will generally trump a crappily written email.

Things you learn in graduate school that will hurt you in industry: this may be very culture-specific, but the picture that I got as a student in my linguistics graduate program was that you need to stake out a position and then defend it to the death, and you don’t get to change that position very often.  Again, this may be specific to the culture of linguistics, or even just specific to the culture of linguistics at the time when I was in graduate school–certainly the community around Chomsky and the subfield of linguistics in which he specialized (syntax) is pretty notorious for brutal fights around theoretical issues.  Be clear that this is an attitude that will hurt you in industry.  Theory can be important in industry–often a company has a stake in some particular kind of approach to a problem, at least in very broad terms–but, theoretical purity per se is not typically valued in industry.  In fact, probably the opposite is true–the path to success in industry lies more in humility about your ideas and a willingness to seriously consider the other person’s take on things than it does in defending whatever your take on a question happens to be.  Industries that don’t do this get overthrown, companies that don’t do this fail, and engineers that don’t do this get fired.

From the perspective of quite a few years of doing natural language processing and computational linguistics in an environment that has a hell of a lot more physicians and biologists in it than it has computational people, I’m starting to wonder if this isn’t just a matter of cultural mores, but of differences in philosophy of science.  The classic linguist’s philosophy of science is that of Thomas Kuhn, where the conception (if I understand it correctly, which is not a given) is that science advances when the old ideas collapse under the weight of their clearly stupid inadequacies, and the new ideas succeed by being brilliant, and right, and new–even sui generis.  In contrast, you could say that the industrial world is underlyingly more influenced by the philosophy of science of Karl Popper, where the idea (again, if I understand it correctly, and again, that’s not a given) is that science proceeds only by falsifiability.  On this model, you should be happy if your hypothesis is not supported, because now you really know something, and you can move forward.  I’m not claiming that this happens universally in daily life in industry–you bet you will run into people who will get pissed if they’re questioned about the approach that they’re taking to something, or if the testers uncover a bug in their implementation, or whatever.  But, underlyingly, in industry you want someone to find your problems–before the customer does.  I may be overthinking the issue with respect to my philosophy of science explanation–Chris Brew, who knows both the academic and the industry computational linguistics and language processing worlds very well and has been a long-term mentor of mine, sees it like this:

I’m not totally clear on the “why” of the bad thing that linguistics graduate school teaches you. There is certainly a mismatch in culture, and humility and listening is rewarded in industrial settings. But I think the central factor is not philosophy of science but willingness to step into the other person’s shoes and awareness that their perfectly valid priorities may not be your priorities.  Laser-like focus on a personal agenda also sucks in academia, but is more likely to be tolerated if you are somehow brilliant and high-status.  “Not a team player” is the standard industry complaint about ex-academics.

Things you learn in industry that will help you in graduate school: see the entire preceding section.  As far as I can tell, picking a position and sticking to it come hell or high water won’t actually help you do better science in academia any more than it will help you build good software in industry.  A little humility about your ideas can go a really long way towards helping you understand whatever it is that your science is about understanding.  There are really practical things that industry will help you with once you get back to graduate school, too (or even if you don’t).  One of these is deadlines.  Industry and academia are both pretty deadline-driven, and that was never apparent to me in graduate school, or at least not until it was too late.  Some time in industry helped me understand the role of deadlines, and also helped me develop methods for making sure that I never missed them–methods that worked for me, at any rate.  Another practical thing is the importance of all of those things that they teach you in software engineering classes–documenting your code, testing your code, defining requirements at the beginning of a project (or having a solid plan for how you’ll do it iteratively throughout the project, if you do something like the Agile method of software development), testing your code, taking usability issues very seriously, and testing your code.  For me, thinking a lot about testing my code led to me thinking a lot about how similar the theoretical bases of software testing are to the theoretical bases of linguistics, and that ultimately led to a bunch of publications on the subject and to me doing my dissertation on approaching software testing as a problem in descriptive linguistics.

Another important lesson to take back to academia from your time in industry is the importance of edge cases and the phenomena in the “long tail.”  You probably know the joke about what happens if you ask a phonetician, a phonologist, and a syntactician if all odd numbers are prime–the phonologist says “one is an odd number and one is prime, 3 is an odd number and 3 is prime, 5 is an odd number and 5 is prime, 7 is an odd number and 7 is prime, 9 is an odd number and–9 is not prime, but if we say that it’s not prime, then we miss the generalization that a lot of odd numbers are prime, so let’s just assume that 9 is prime.”  In linguistics, we tend to like generalizations, and generalizations by their nature tend to cover most of the data points in question, not just a few of the data points in question, or they wouldn’t be generalizations.  The infrequent phenomena that don’t fit our analysis but that don’t seem to be very primary, we tend to leave to the side, at least in non-empirical approaches to linguistics.  This does not fly in the industry world, ever.  You have to take care of every case, and that includes the special cases, and if you need some special code for them, some special ad hoc solution for them, then that’s just the way it is.  I don’t care if your F-measure is 0.98 or your word error rate is not statistically significantly different from zero–show your product to a potential customer, and the first thing they’re going to do is to ask it to run on their name, or the name of their company, or their birthday, and if that doesn’t work, then you will be shown the door.  I’ve seen this over and over–in industry, you have to account for everything.

It turns out that this is a good attitude to take back to academia with you.  I’m reminded of an anecdote involving the phonologist Michael Broe.  I once saw him give a talk that was focussed on an analysis of “regularly irregular” forms in some language or another–that is, things that are irregular, but that are irregular in a way that is similar to the way that some other things are irregular.  Think about mouse/mice and louse/lice in English, or the very few French verbs in the ir class that take the same present indicative inflectional morphemes as er verbs, or what have you.  Michael was asked why he was bothering to work on these regularly irregular forms when they’re so uncommon in the language that he was interested in.  I’ve never forgotten his answer: It depends on what you think the goal of phonology is–is the goal of phonology to understand patterns in sound systems, or is the goal of phonology to understand frequent patterns in sound systems?  Many people in my field have some story like this: a physician or a biologist asks you to build a system to do something or other.  You build one, and you have an amazingly high F-measure, or an amazingly low word error rate, or whatever.  You proudly demo your system for the physician or the biologist.  On the entire computer screen, there are tons of correct outputs, and one fucking error.  The physician or biologist points at the error, shrugs, and says, OK–let me know when it works.  If you want your research to actually fix some problems in the world, and you want those solutions to be taken seriously by the people who might actually have a use for them, then you need to think about those edge cases, those exceptions, those rare events–even if taking care of them means sacrificing some purity of theory or some elegance of design.  Those edge cases, “exceptions,” and rare events are perhaps the things that uncover the flaw in your theory, and you should be very happy to come across them.

This little essay started with a very specific question: What’s it like to work in the speech and/or language technology industries with a master’s degree?  I want to generalize it a bit, and answer a more general question: What’s it like to work in industry with a master’s degree?  The short answer to this more general question is the same as the short answer to the more specific question: which graduate degree you have doesn’t necessarily make any difference with respect to what it’s like to work in industry.  However, the long answer is somewhat different.  This is probably counter-intuitive to academics, but in industry in general, it can actually be easier to get a job with a master’s degree than with a PhD.  Now, I’m talking here specifically about high-tech industries where you’re basically being hired to write computer programs of one sort or another, versus ones where you’re being hired to do things that are closer to the research and development end of the continuum of high-tech jobs.  In these environments, having a master’s degree or not isn’t necessarily considered important, one way or the other–they want to know if you have the technical skills that they need and whether or not they think you’ll be non-painful to work with, and that’s about it.  On the other hand, having a PhD is often not looked on kindly by private companies.  Your potential future co-workers may be–often are, in my experience–quite suspicious of people with doctorates, suspecting that they might be strong on theory but weak on implementation, and–worse–rigid defenders of whatever their position happens to be, unwilling to seriously consider alternate approaches.  I didn’t just make up the phenomena that I describe for several paragraphs above!  Now, this is not true of the speech and/or language technology industries, where there’s a long tradition of the interesting, innovative, and successful work coming from PhDs.  But, in the broader industrial world, the skepticism-about-PhDs phenomenon is widespread.

Getting shot in the leg with a .22 is better than being hit in the head with a 2×4

English has a number of words that are made of numbers. Here are some of them.

No French in this post–this is all about obscure English vocabulary that you can bet Zipf’s Law will bring into your life sooner or later.

I recently wrote a post about what we call in the US 3x5s (pronounced “three by fives”), and that got me thinking about words in English that are formed in similar ways.  There are a number of them, and if you can use them, it will definitely add an American flavor to your English.

  • 2×4 (pronounced “two by four”): a kind of wooden board that measures about 2 inches by four inches, and about six feet in length.  In America, 2x4s are commonly used in the construction of homes and the like.
  • 4×4 (pronounced “four by four”): a kind of truck or similar vehicle that can provide power to all four wheels simultaneously.  (More traditionally, cars would only power the front axle or the back axle.)
  • 8×10 (pronounced “eight by ten”): a particular size of photographic print, measuring eight inches by ten inches.
  • 24/7 (pronounced “twenty-four seven”): absolutely constantly.  24 hours in a day, and seven days in a week, so 24/7 is all the time.
  • 7-11 (pronounced “seven eleven”): originally the name of a convenience store that was once open from 7 AM to 11 PM.  Today you can use it to refer to pretty much any 24-hour convenience store, I think.
  • 69 (pronounced “sixty-nine”): a verb referring to a specific sexual act.  Consider the relationship between the numbers 6 and 9 and you can probably figure it out for yourself, which will save me from feeling like I have to put a trigger warning on a blog post about numbers.
  • soixante-neuf: same thing.  Yes, we can use it in English, and if you’re sleeping with people who are educated enough to know what it means, then you probably already know what it means yourself.
  • 10-4 (pronounced “ten four”): I heard you, I got your message; there’s also some implication that you agree.  Often heard in the contexts 10-4, good buddy, or that’s a big 10-4, or, if you wear a cap with the name of a feed company on the front, or just listened to a lot of AM radio in the 1970s, that’s a big 10-4, good buddy.  That’s how we said it when I was a little tyke, at any rate.  OK: a teenager.
  • .45 (pronounced “forty-five”): a kind of pistol, known for its “stopping power”–that is, if someone is charging you, a shot from one of these things will keep them from moving forward if it hits them.  The projectiles are short, but very big around, and heavy.  It’s not very accurate at a long distance, but at a short distance, it’s very effective at what it’s intended for.
  • .32 (pronounced “thirty-two”): another kind of pistol.  They’re also not very accurate, as they usually have a pretty short barrel, and they really don’t have any use other than killing people at close range, as far as I know.
  • .36 (pronounced “thirty-six”): another kind of pistol.  They sometimes have longer barrels, in which case you can use them to kill people at close range, and also a bit further away.
  • .25 (pronounced “twenty-five”): another kind of pistol.  I don’t recall ever actually seeing one.
  • .22 (pronounced “twenty-two”): another kind of pistol, and also a small-caliber rifle.  The bullet is quite small, and unless you get shot in some place really vital–head, heart, an artery–it may not do that much damage.  On the other hand, I did once see a young guy who shot himself in the head with one of these–he didn’t die immediately, but he sure as hell died eventually.  Again, there is not much that’s actually useful about these things…

You can use the number-by-number construction productively (in the linguistic sense of the word productive, which means that the construction can be used to produce new things) to talk about the sizes of pieces of wood in general.  However, if you’re not talking about 2x4s specifically, the context needs to be clear if you want to be understood.  The assumption is that the pieces of wood in question will be 6 feet long unless otherwise specified, so if you ask for a 3×6 (pronounced “three by six”) in a lumber yard, people will know what size board to give you.   For a version of this kind of construction that is also productive, although quite obscure, see this entry from the Urban Dictionary for a description of how it is used to refer to pairs of male characters in the Gundam Wing anime series.

Having gotten the basics out of the way, here’s a useful expression: to get/be hit with a 2×4.  You know what a 2×4 is by now–a solid wooden board.  If someone smacks you upside the head with it, you will have been smacked really, really hard.  To get/be hit with a 2×4 means to be stunned by something that you’ve learnt.

Here are some real-life examples.  This woman wrote on her blog about needing to be forced to face the facts that she was (a) eating too much, and (b) not exercising enough:

Picture source: screen shot of http://www.sparkpeople.com/mypage_public_journal_individual.asp?blog_id=6049756.

This blog post suggests that if you “get hit with a 2×4,” the answer is to just surrender to whatever God’s wishes for you might happen to be:

Picture source: screen shot of http://www.derekandsarahgill.com/blog/when-you-get-hit-with-a-2×4/7_8_13-what-to-do-when-you-get-hit-with-a-2×4/.

Here’s a story from the Washington Post (a reputable and very famous American newspaper known for its coverage of national politics) about Chris Christie, describing the experience of the explosion of the Bridgegate scandal during the period before his unsuccessful run for the Republican presidential nomination:

Picture source: screen shot of https://www.washingtonpost.com/blogs/right-turn/wp/2014/01/20/who-got-hit-by-a-2×4/.

Here, a bigwig of the investment world talks about the effects of getting two pieces of bad news about the financial world, one right after the other:

Picture source: screen shot of http://www.theglobeandmail.com/news/national/markets-staggered-again-by-us-economic-fears/article4143354/.

So, now you see why it would, in fact, probably be better to get shot in the leg with one of those tiny little .22s than to get hit in the head with a 2×4.  Native speakers of English (or non-native speakers who just like to collect funny words), do you have any other all-number words to add to the list?

Full-screen and coffee breaks: conferences in France

There is essentially nothing that I do in France that doesn’t involve an encounter with Zipf’s Law.  One thing that I find quite useful here in France is to go to talks, conferences, and what are called journées–literally “days,” but in practice, a day-long mini-conference on some subject or another.  It’s a good way to learn the technical vocabulary of my field in French, and also to have casual conversations with my peers about it.  The other day, I went to a journée on natural language processing (what I do for a living) and artificial intelligence.

As far as I can tell, French researchers (at least in my field) primarily publish in English.  My field is much more oriented around conference papers than around journal papers–our conferences are peer-reviewed and often quite competitive, while our journals are more oriented towards essentially archival coverage of long-term research projects.  So, the latest and greatest research shows up in conferences, not journals.  The conference papers are published, and they’re cited quite a bit more than journal articles.  Although my French colleagues do primarily publish in English, there are also French conferences and journals in my field.  The French conferences and journals ask for papers written in French from Francophones, but allow non-Francophone scientists to submit work in English.  Being able to read some French has opened up quite a bit of stuff to me that I wouldn’t otherwise have been able to read (and cite).  I especially enjoy some of the work in French on lexical semantics; it isn’t necessarily any different in terms of topics, approaches, or the flavor of the results, but some of it is written so much more clearly than similar stuff that I’ve read in English.

One thing that still surprises me about French conferences is that during the question-and-answer period after a talk, the speaker and members of the audience address each other as tu, using the informal pronoun.  You can read more about this  phenomenon of French conference participation here, along with some speculations about where it comes from.

For official purposes, the French system often differentiates between French conferences and what the paperwork refers to as “international” conferences, which in practice seems to mean any conference outside of France.  (That’s not obvious–for example, a conference in Germany, attended primarily by a local audience, apparently would count the same as a conference like the Association for Computational Linguistics annual meeting, which is attended by people from all over the world.  I suppose that evidence of an international reputation is, indeed, supplied by presentation of your work anywhere outside of your home country.)

Just following the schedule gave me trouble, which doesn’t exactly make me feel bright.  Here are some really basic words that I came across in the course of the day:

  • la pause-café: this is what WordReference.com gives as the translation of “coffee break.”
  • plein écran: full screen.

How to sound French: March 2016 edition

This winter, the expression on everyone’s lips seems to be pas mal.  We learn this in college as meaning not bad.  However, colloquially it can also mean something like a lot.  In fact, I’ve been hearing it quite a bit even from a friend who prefers the 17th-century language of Molière to the language of today and doesn’t speak very colloquially at all.  You can find good examples of how to use pas mal in this sense on Laura Lawless’s Lawless French web site.  Here are some more:

“I saw Manon this morning and we talked a lot it’s cool” Picture source: Twitter screen shot.

If you use pas mal to modify nouns, it’s a quantity term, and is followed by de:

“I think that we have a lot of things in common” Picture source: Twitter screen shot.

Be aware that even the expression as we learn it in school–that is, with the meaning not bad–can be difficult to understand, with the intended meaning being conveyed in part by intonation.  There’s a very nice video on the intonational subtleties of the expression here, on the Français avec Pierre YouTube channel.

Spleens, 3x5s, Molière, and French grad students

3x5s or index cards. The name “3×5” comes from their size, which is 3 inches by 5 inches.

My morning routine includes studying French vocabulary, which means flash cards.  I make my flash cards from what we call in English index cards or 3x5s (pronounced “three by fives”–they take that name from the fact that they normally measure 3 inches by 5 inches).  Recently I’ve been amusing my younger coworkers by sharing my current vocabulary flash cards, and I have been impressed beyond belief by the breadth and depth of the English vocabulary that these kids have.  “Talon”?  No problem.  “Greenhouse”?  They’ve got it.  Yesterday I ran into the word rate, “spleen,” in the play Le malade imaginaire, a 17th-century French play by Molière.  One of them explained to me the various and sundry forms with which the English word “spleen” can be translated into French.  The word has at least five meanings in English.  The most common meaning is the internal organ that most vertebrates have, located on the left side in humans near the stomach and playing a role in a variety of processes, including ridding the body of old red blood cells and being involved in the immune response.  Another meaning, which is not nearly as common but is still found in the language, is given by Merriam-Webster as “feelings of anger or ill will often suppressed.”  These get split into two different words in French.

  • la rate: spleen (the internal organ).
  • le spleen: melancholy or ennui, and archaically, the same anger-related meaning as in English.
The spleen, or la rate. Picture source: Public Domain, https://commons.wikimedia.org/w/index.php?curid=1394146.

It was quite impressive to hear a computer scientist explain the 17th-century meaning of the word–that’s not something that I would expect an American computer science grad student to be able to do for the English equivalent.  I’ve been reading Molière, and apparently she has done so, too–again, I wouldn’t expect an American computer science grad student to be familiar with Shakespearean vocabulary.

It amazes me that there seems to be no French equivalent to the 3×5.  (I saw friends exchange the sidelong glances that I inspire so often here by accidentally saying inappropriate things when I referred to them by the Canadian term, fiches vierges–it can mean “blank cards,” but also “virgin cards.”)  I only survived my education by using these things to obsessively memorize pretty much every term, equation, and random fact that I was taught.  Considering the very demanding nature of the French educational system, I’m baffled by how French students manage to pass the  exams that are required to progress through the system without some equivalent of index cards.  I throw several packs of them into my luggage every time that I come to France, and can’t imagine learning as much as I do without them.
