Doing computational lexical semantics with your web browser: An approach to using data to build semantic representations

Here’s how you can do computational lexical semantics in the comfort of your own home–and how to talk about it in French.

A lot of my work involves something called lexical semantics.  Lexical semantics is the study of how words mean things.  (That means that there’s some interaction with the question of how sentences mean things, since part of the meaning of a sentence comes from the words that it contains, but in lexical semantics, the focus is on the words and how they contribute to and interact with the semantics and the syntax (the phrasal relationships) of a sentence.)  In particular, I do something called computational lexical semantics.  That means that I use large bodies of data as a crucial part of my work, and I evaluate my work in part by trying to use it as the basis of computer programs.  If that doesn’t work, then I figure that what I’ve done needs to be improved.

My advisor is one of the world experts on computational lexical semantics.  (I won’t name her, since I try to keep this blog anonymous.)  As far as I know, she was the first person to demonstrate that large bodies of naturally-occurring data can, in fact, be used to test theories of lexical semantics.  This was important because semantic theories often haven’t really been tested in any way that would count as a “test” in science, and as we’ve seen in other posts, linguistics is the scientific study of language.  She often says that semantics is not a suitable subject of study for linguistics, since it’s so subjective.  I’m never sure whether or not she’s kidding; regardless of whether she is or not, one of my professional ambitions is to take the subjectivity out of computational lexical semantics.

Part of my approach to that has been to try to develop a systematic methodology for developing semantic representations of words.  In particular, I work with verbs, and with nouns that are derived from those verbs–for example, the verb phosphorylate (I specialize in biomedical language) and the related noun phosphorylation, or the verb receive and the related noun receptor.  (You’ll notice that there are different relationships between the verb and the noun in the two examples–phosphorylation is a noun that refers to the action of the verb, while receptor is a noun that refers to the thing that does the receiving.)

One of the bedrocks of my approach is that I try to base my representations of the meanings of words on data that I didn’t come up with myself.  (Note that I didn’t invent this idea, or any of the other aspects of the approach that I describe here—this is just my recipe for putting them all together.)  I mostly work with scientific journal articles.  There are two parts to what I do:

  1. Coming up with the representation of the meaning of the verb (or noun).
  2. Coming up with examples that let me test the representation, both by providing examples of the different effects that I think that the meaning of the verb has on how it behaves in sentences, and by doing a quick check to make sure that I don’t see any examples that argue against my representation of the meaning of the verb.

This is a pretty iterative, complementary process–I typically start out by looking at a bunch of examples of the verb to get a general sense of how it works, then write up a quick representation of the semantics, and then look for examples more systematically to see if my representation works.  Some of the goals that I keep in mind when I’m searching for these examples are:

  1. I want to know whether or not humans can be the agent of this verb.
  2. I want to be sure to get the full range of prepositional complements, as these can mark a variety of semantic relations.
  3. I want to get a variety of semantic classes as the subject and the object of the verb.
  4. If there is a deverbal nominalization, I want to get that, too.

Knowing whether or not humans can be the agent of the verb is important to me for a number of reasons.  People often question whether or not humans can perform the actions of particular verbs in the biomedical domain.  For example, Wikipedia describes the action of the verb phosphorylate as “the addition of a phosphoryl group (PO₃²⁻) to a molecule.”  That doesn’t sound like something that a human could do, right?  But, you can find sentences like this:

In order to determine the number of phosphorylated sites in human cardiac MyBP-C samples, we phosphorylated the recombinant MyBP-C fragment, C0-C2 (1-453) with PKA using (gamma32)P-ATP up to 3.5 mol Pi/mol C0-C2.

–Source: http://www.ncbi.nlm.nih.gov/pubmed/18573260

I’m trying to represent the semantics of the language, not the semantics of phosphorylation, so I need to take into account all of the data about the language, and that includes this kind of counter-intuitive use of the verb.  Why do we care about humans so much, though?  It’s because humans are the prototypical example of things that act with volition, or as a result of their will–what we call agents in the English-language terminology of linguistics–and agents get represented in a special way in lexical semantics.  So, we need to know if there can be a human agent for these biomedical verbs so that we can know if they can have agents at all, essentially.

Getting examples of the full range of prepositional complements (e.g. phosphorylate at, phosphorylate to) is important to me because different prepositions sometimes mark different aspects of the semantics of the verb.  For example, when we investigate phosphorylate at, we see that the semantics of phosphorylation involve a specific location on a molecule, and when we investigate phosphorylate to, we see that the semantics of phosphorylation involve something becoming something else–the to marks not a location at which the molecule ends up, but what the molecule becomes–like I had it converted to a round-trip ticket, in “normal” English.
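To make the idea of collecting prepositional complements concrete, here is a minimal Python sketch. It is not how Sketch Engine's word sketch works; it is just a crude window-based count, and the preposition list, the window size, the function name, and the two example sentences are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-picked preposition inventory; a real study would use a fuller one.
PREPOSITIONS = {"at", "to", "by", "with", "in", "on", "from"}


def preposition_counts(sentences, verb_forms, window=4):
    """Count prepositions that occur within `window` tokens after a form of the verb."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenizer: lowercase runs of letters, digits, and hyphens.
        tokens = re.findall(r"[a-z0-9\-]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token in verb_forms:
                for following in tokens[i + 1 : i + 1 + window]:
                    if following in PREPOSITIONS:
                        counts[following] += 1
    return counts


# Invented toy sentences, not taken from any corpus.
toy_sentences = [
    "The kinase phosphorylated the protein at serine 10.",
    "The enzyme phosphorylates glucose to glucose-6-phosphate.",
]
toy_counts = preposition_counts(toy_sentences, {"phosphorylated", "phosphorylates"})
print(toy_counts)
```

A window-based count like this will both miss distant complements and pick up prepositions that belong to other phrases, which is one reason to prefer a tool that works from parsed or part-of-speech-tagged data.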

Getting examples of a variety of semantic classes as the subject and the object of the verb is important to me for two reasons.  One reason is that I’m doing computational lexical semantics, specifically, which, as you might recall, means that I test my semantic representations by trying to use them as the basis of a computer program.  I know that it can be important to know what kinds of things are taking part in the action of a verb in order to know how to interpret both the verb itself, and the sentence that it occurs in.  Imagine these situations: the author finished the book, the student finished the book, and the goat finished the book.  In the first, this means that the author completed the writing of the book.  In the second, this means that the student completed the reading of the book.  In the third, this means that the goat finished the eating of the book.  Can there be other interpretations of these sentences?  Of course–authors also read, students also write, and in a work of fiction, you could certainly imagine a goat reading a book.  But, none of these are the intuitively obvious interpretations of those sentences, and the reason for that is the expectations that the different subjects–author, student, goat–lend to our interpretations of the sentences.  The other reason that I want to get a decent range of the types of semantic classes that can be the subjects and the objects of a verb is that I work with ontologists quite a bit.  I find that their models of the domain often don’t seem to have taken full advantage of what the literature of the domain has to say about how those models would need to look if they’re going to be adequate, and collecting examples of lots of different semantic classes taking part in an action is my stab at being helpful.
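Here is a toy Python sketch of the author/student/goat point: the subject's semantic class picks out the default reading of finished the book. The lookup table and the function name are made up for illustration; a real system would need actual semantic classes rather than three hard-coded nouns.

```python
# Hypothetical mapping from the subject to the event that we most readily infer.
DEFAULT_EVENT = {
    "author": "writing",
    "student": "reading",
    "goat": "eating",
}


def interpret_finish(subject, obj="book"):
    """Return the default paraphrase of '<subject> finished the <obj>'."""
    event = DEFAULT_EVENT.get(subject)
    if event is None:
        # No expectation is recorded for this subject, so the event stays unspecified.
        return f"the {subject} completed some unspecified action on the {obj}"
    return f"the {subject} completed the {event} of the {obj}"


for who in ("author", "student", "goat"):
    print(interpret_finish(who))
```

The table only encodes the default reading; as noted above, other readings remain possible in the right context.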

So, how does one go about doing this with a minimum of subjectivity and a maximum of data-centeredness?  I follow roughly the following steps, pretty much in this order, allowing for some going back and forth between them as I fine-tune things:

  1. Look at what other people have done.  I didn’t always do this, as I wanted to see how different what I came up with was from what other people had come up with, but by now I have a decent feel for what kinds of differences there are likely to be (they’re related both to the different content matter and to the different writing styles that my work and previous work are based on), and I usually start by looking at the representations in the Unified Verb Index.  (Search for the verb of your choice.)
  2. Look at some random examples of the verb in use to get a general sense for how well the representation in the Unified Verb Index matches up with biomedical data.  I use the Sketch Engine interface to do my search for random examples, but you can use Google, specialized textual search tools, or whatever is easy for you.
  3. Look for examples of human agents.  I usually go to Google for this one, as the data that I have uploaded to Sketch Engine doesn’t have very many humans, in general.  My two tricks:
    1. I use Google’s site: operator to search just within the National Library of Medicine’s web site.  That way I can be almost positive that I’ll get examples of how the word is used in the biomedical domain.
    2. The first thing that I try is a Google exact phrase search with we plus the past tense of the verb.  You mark a phrasal search by putting the exact phrase that you’re looking for in double quotes.  So, my search for we phosphorylated looked like this: site:http://www.ncbi.nlm.nih.gov/pubmed/ "we phosphorylated"
  4. Look for the full range of prepositional complements.  I do this with Sketch Engine’s word sketch function.
  5. Look for a variety of semantic classes as the subject and the object of the verb.  Again, I use Sketch Engine’s word sketch function for this.
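If you want to generate the human-agent searches from step 3 for a whole list of verbs, a small Python sketch like this one would do it. The function names are my own invention, and the assumption that a Google search can be opened with a q parameter on https://www.google.com/search is mine; the query string itself is the same one shown above.

```python
from urllib.parse import urlencode


def human_agent_query(past_tense_verb, site="http://www.ncbi.nlm.nih.gov/pubmed/"):
    """Build a site-restricted exact-phrase query: "we" plus the past tense of the verb."""
    return f'site:{site} "we {past_tense_verb}"'


def google_search_url(query):
    """Turn a query string into a URL that can be pasted into a browser (assumed URL form)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = human_agent_query("phosphorylated")
print(query)
print(google_search_url(query))
```

You still have to supply the past tense form yourself, and you still have to read the hits: the search only finds candidate sentences, it doesn't tell you whether we is really the agent.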

Then it’s time to see if the semantic representation actually covers everything that I’ve found using the strategy above.  If it does, then we’ll do a larger-scale project of marking up all of the examples of the verb in some large body of data, followed by trying to write a computer program that can make use of the representations and the examples to learn how to identify the semantics of the verb when shown new examples.
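As a rough illustration of what "covers everything that I've found" could mean, here is a minimal Python sketch. The dictionary format for the representation and for the examples is entirely hypothetical (it is not the Unified Verb Index format or anything that Sketch Engine produces), and the deliberately incomplete representation is there only to show the check failing.

```python
def uncovered_examples(representation, examples):
    """Return the examples that the representation does not license."""
    problems = []
    for example in examples:
        preposition = example["preposition"]
        bad_preposition = (
            preposition is not None
            and preposition not in representation["prepositions"]
        )
        bad_agent = example["human_agent"] and not representation["allows_human_agent"]
        if bad_preposition or bad_agent:
            problems.append(example)
    return problems


# A deliberately incomplete, hypothetical first-draft representation.
draft = {
    "verb": "phosphorylate",
    "allows_human_agent": False,
    "prepositions": {"at"},
}

# Hand-annotated observations based on the patterns discussed in this post.
observed = [
    {"text": "we phosphorylated the recombinant MyBP-C fragment",
     "preposition": None, "human_agent": True},
    {"text": "phosphorylate at", "preposition": "at", "human_agent": False},
    {"text": "phosphorylate to", "preposition": "to", "human_agent": False},
]

for example in uncovered_examples(draft, observed):
    print("not covered:", example["text"])
```

Anything that gets reported here is a reason to go back and revise the representation before starting the larger-scale markup.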

Here is some of the vocabulary that you will need if you’re going to talk about this kind of stuff in French.  Here is some data from the French Wikipedia page about semantics.  This will give us some of the vocabulary of semantics in general–then we’ll move on to lexical semantics.

La sémantique est une branche de la linguistique qui étudie les signifiés, ce dont on parle, ce que l’on veut énoncer. Sa branche symétrique, la syntaxe, concerne pour sa part le signifiant, sa forme, sa langue, sa graphie, sa grammaire, etc ; c’est la forme de l’énoncé.

  • la sémantique: semantics
  • le signifié: the “signified,” the concept or mental representation that is the locus of meaning.  (I should point out that it is unfortunately rare for English-speaking linguists to use this old Saussurean terminology, at least in my corner of linguistics.)
  • énoncer: to formulate, state, or pronounce (definition from Wikipedia.org).
  • la syntaxe: syntax.
  • le signifiant: the “signifier,” the spoken (or, in my field, written) form that corresponds to the signifié or “signified.”  (See above about unfortunate tendencies to not use Saussurean terminology.)
  • la graphie: written form (definition from WordReference.com).
  • un énoncé: in linguistics, this usually corresponds to the technical term “utterance,” but since we’re talking specifically about syntax here, it may be better translated as “wording” (see WordReference.com).

Now let’s move on to some vocabulary that’s more specific to lexical semantics.  We’ll take this material from the book Introduction au TALN et l’ingénierie linguistique, by Isabelle Tellier.

La sémantique lexicale est l’étude du sens des “mots” -ou plutôt des morphèmes- d’une langue. Cette définition est en réalité assez problématique, puisque la notion même de “sens” n’a rien d’évidente. Le problème tient précisément à ce que, pour définir le “sens” d’un mot, on recourt en général à d’autres mots. Pourtant, la consultation d’un dictionnaire d’une langue donnée est de bien peu d’utilité si on n’a pas déjà d’un minimum de connaissance de cette langue. Comment échapper à cette “circularité du sens” ? Nous envisageons dans ce chapitre (et le suivant) diverses tentatives qui peuvent être regroupées en trois familles…

  • la sémantique lexicale: lexical semantics.
  • le morphème: morpheme.
  • avoir rien d’évidente: I don’t know!  Can someone help out with this?
  • tenir à qqch:  to come from, stem from, arise from.  (Note: tenir à has a bazillion other meanings–see WordReference.com for this definition and many others.)
  • recourir à: to resort to, appeal to.  (Definition from WordReference.com.)

Want to learn more about the kind of approach to (computational) lexical semantics that I’m talking about here?  Check out my advisor’s book on the subject–Martha Palmer, Daniel Gildea, and Nianwen “Bert” Xue’s Semantic role labeling.  (I’m not telling you which of these people was my advisor–still anonymous!)

6 thoughts on “Doing computational lexical semantics with your web browser: An approach to using data to build semantic representations”

  1. This seems like a comprehensive method for discovering all the important patterns of a verb’s behavior. It would be great if your results could find their way into PropBank and VerbNet in order to expand coverage there. The example you use, phosphorylate, points up a feature of many change-of-state verbs (many of which occupy classes in VerbNet in the range 40.* to 45.*). However these verbs begin their lives (whether as transitive or intransitive), they often end up being ergative because speakers find it convenient to coerce them into the behavior they haven’t displayed previously (rather than, e.g., inventing a new verb that no one could guess the meaning of). It poses a challenge for classifying them, and there are inconsistencies among VN, PB, and dictionaries. “Germinate” is a bit like “phosphorylate,” representing a change of state that is entity specific. PB treats it comprehensively, but VN doesn’t account for its transitive behavior. Dictionaries I looked at treat “phosphorylate” as transitive only, but your discussion suggests that it can also be intransitive.

    1. Thanks, Orin. As you point out, I’ve let intransitivity fall by the wayside, and it’s really interesting. I haven’t figured out a way to search for intransitives with Sketch Engine. Do you know of one?
