Computational linguistics and misinformation

Computational linguistics takes on the infected swamp that the World-Wide Web has become

In the late 1990s, I worked at a start-up. At the time, it was one of the 25 largest web sites in the world.

Why “largest web site,” and not “biggest web site”? English tends to use “big” to refer to physical objects, and “large” to refer to abstract concepts. Note that I said “tends to”–this is a statistical tendency, not an absolute.

Like a lot of people working for internet-related businesses or causes, we thought that we were making the world a better place. The World-Wide Web was going to democratize so much: access to information, and everyone’s ability to communicate their message to a broader world.

20+ years later, we all realize that everyone includes a lot of assholes. From a former president of the United States to random evil-doers in the former Soviet Union, there are people who use the technologies that so many of us well-intentioned people worked so hard on to spread hate, to attack democracy, to spread lies.


Misinformation: things that are not true. Disinformation: deliberately created untrue things. Unlike a simple mistake, misinformation is widely spread about. Unlike a lie, disinformation is widely spread about, too. “Diffused,” if you prefer a technical term. “Propagated.”


People like me who had a hand in developing the kinds of technologies that assholes use to propagate misinformation and disinformation have–belatedly, I would say–begun to try to address the kinds of problems that we helped create. One of these efforts is a shared task on detecting online misinformation. A “shared task” involves a bunch of computer-sciencey-types getting together to define a task–say, finding emails that would be relevant to a court case. They come to an agreement about the definition of the task, about the right contents for a shared data set on which to evaluate performance on that task, and about a metric for evaluating that performance. You put together a schedule, everybody goes off and builds a computer system for doing the task, you distribute the data, and on some agreed-upon date, everybody submits their systems’ output to the people who organized the task. Then everyone gets together for a workshop in which we compare systems, compare outputs, and see what we can learn from those comparisons.
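To make the "agreed-upon metric" part concrete, here is a minimal sketch of how the organizers of a task like the email example might score one submission. Everything in it is an assumption made for illustration: the email IDs are invented, and precision, recall, and F1 are simply common choices for this kind of task, not the metric of any particular shared task.

```python
def precision_recall_f1(system_output, gold_standard):
    """Score one system's output against the organizers' answer key."""
    system_output = set(system_output)
    gold_standard = set(gold_standard)

    # Items that the system returned and that the answer key says are relevant.
    true_positives = len(system_output & gold_standard)

    precision = true_positives / len(system_output) if system_output else 0.0
    recall = true_positives / len(gold_standard) if gold_standard else 0.0

    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: the emails judged relevant by human annotators...
gold = {"email_03", "email_07", "email_12", "email_19"}
# ...and the emails that one participant's system submitted as relevant.
submitted = {"email_03", "email_07", "email_08"}

p, r, f = precision_recall_f1(submitted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Because every participant is scored with the same function on the same data, the numbers can be compared across systems, which is the whole point of agreeing on the metric up front.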


A day or two ago, an email appeared in my inbox about just such a shared task. Its goal is to deal with misinformation on the Internet. That’s a pretty goddamn big thing to take on, though, isn’t it? So, the participants agreed on a subpart of the misinformation problem that is a bit more tractable:

The TREC Health Misinformation track fosters research on retrieval methods that promote reliable and correct information over misinformation for health-related decision making tasks.

https://trec-health-misinfo.github.io/

Right away, we know some of the ways that the organizers have defined their hopefully-tractable task:

  1. The word retrieval suggests that participants will be given a set of documents, and that their output should be documents from that set. This mimics the basic structure of the World-Wide Web: a set of documents (on a loose definition of the term “document”) that users search in order to find information.
  2. The word health-related suggests that participants will not need to be able to deal with every possible kind of misinformation–only health-related misinformation. This makes the task considerably more (potentially) achievable, and given the amount of misinformation that has recently been spread on health-related issues such as the current global COVID-19 pandemic, there is potential benefit to the world as a whole if it can be accomplished. (Notice how I snuck in there the inference that health-related is a word, not a something…more than a word? I don’t actually think that–just showing you how discourse works.)
  3. Promote reliable and correct information over misinformation refers to a common aspect of any “retrieval” task (see #1 above): your system is expected to present not just a set of documents, but a ranked list of those documents. Think about it like the page of results that Google gives you when you do a search: you want the most relevant web page to be at the top of the page, not at the bottom, right? So, that’s what the shared task organizers are asking your system to do: rank correct information over misinformation. Of course, if all of the web pages that your system presents to the user are correct, then that is wonderful. (Normally only the top results are considered in terms of scoring your system’s performance.)
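
The ranking idea in #3 can be sketched in a few lines. This is a toy illustration under my own assumptions, not the track's official scoring: the labels "correct", "misinformation", and "other" and both function names are hypothetical.

```python
def top_k_counts(ranked_labels, k=10):
    """Count correct and misinformation documents among the top k results."""
    top = ranked_labels[:k]
    correct = sum(1 for label in top if label == "correct")
    misinfo = sum(1 for label in top if label == "misinformation")
    return correct, misinfo


def promotes_correct(ranked_labels):
    """True if no misinformation document is ranked above a correct one."""
    seen_misinfo = False
    for label in ranked_labels:
        if label == "misinformation":
            seen_misinfo = True
        elif label == "correct" and seen_misinfo:
            return False
    return True


# Two hypothetical rankings of the same four documents, best result first.
good_ranking = ["correct", "correct", "other", "misinformation"]
bad_ranking = ["misinformation", "correct", "correct", "other"]

for name, ranking in [("good", good_ranking), ("bad", bad_ranking)]:
    correct, misinfo = top_k_counts(ranking, k=2)
    print(name, correct, misinfo, promotes_correct(ranking))
```

Both rankings contain exactly the same documents; only the order differs, and only the top of the list (here k=2) is counted. That is why a retrieval system is judged on what it ranks first and not just on what it returns.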

Want more details? See the TREC Health Misinformation Track web page. Note that all opinions expressed in this post are mine, and they especially do not represent those of TREC, the Text Retrieval Conference, an organization that has run shared tasks for…over twenty years now, wow… And if you feel like slapping a computational person of my advanced age for having helped to create the stinking swamp that the World-Wide Web has become: go for it. But, also recognize that computational linguists are trying to do something to…wait for it…drain that swamp.

The picture at the top of this post is from an article published by the New York Times on April 13th, 2020.
