If you’re a regular reader of this blog, you probably have some interest in language: language in the abstract (le langage in French), as opposed to (or in addition to) any particular language (la langue in French). I’m reblogging here the reading list for one week of a course on language processing that I’m teaching at the moment. The theme of the week is data in language processing: what you (might) mean when you talk about “data” with respect to language; what kinds of data there are; where that data comes from; and how to make some data if you can’t find the kind that you need.
I’m posting this particular reading list because I often suspect that many people who know that I’m a linguist imagine that I spend my days sitting around discussing how funny irregular verbs are, or how cool it is that French has three verbs that mean “go back,” or whatever. What you’ll find on this list has very little to do with coolness or lack thereof, and a lot to do with data formats, data set sizes, statistics, and a bit of ethics. Personally, I find this stuff fascinating, but either way, it’s often worth getting a glimpse of what we call in my field “the sausage-making process.” Enjoy! (Or go watch the latest episode of “The Walking Dead”; it’s pretty good.)
Here are some suggested readings for Week 5. Remember that I do not distribute my lecture notes. Note also that you are responsible for all of the material on which I lecture. These readings are not required, but they are intended to cover everything that I talk about in our lectures (modulo the caution in the preceding sentence). All of them are available for free online except for the books (although the Good and Hardin book is available for free, as well). All of them should be available in an academic library. Feel free to contact me if you have trouble finding a copy of either book.
- Banko, Michele, and Eric Brill. “Scaling to very very large corpora for natural language disambiguation.” Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2001.
- Miller, Greg. “A scientist’s nightmare: software problem leads to five…