I know, I know–you think that computational linguists spend their time sitting around discussing morphological typology and its implications for the monogenetic versus polygenetic hypotheses of the origin of language. And we do–after work, over beers.
For most of us, though, our professional life consists of trying to get computers to do things that involve language in some way. One of the biggies is helping people find information. To teach a computer to do something like that, you need to have access to a lot of data to test the system that you are building.
Enter the COVID-19 coronavirus. There is suddenly an enormous amount of research being done on something that we did not even know about just six months ago. At the same time, there is a lot of research already published on other coronaviruses, and it would be idiotic to try to do research on a novel coronavirus without taking advantage of what we already know about the others. But how can anyone go through the 15,000+ papers on coronavirii (spelling?) that are already in the US National Library of Medicine’s PubMed/MEDLINE database?
Enter computational linguists. Our field is sometimes considered a branch of artificial intelligence, and we work on computer programs that do things like summarize large sets of publications. There are lots of things that you have to be able to do in order to do that–figure out what’s being talked about (coronavirus and medications? Coronavirus and transmissibility? Coronavirus and respiratory failure?); tell the difference between a positive statement, a negated statement, a speculative statement, and a negated speculative statement:
- Positive: The person-to-person transmission routes of 2019-nCoV included direct transmission, such as cough, sneeze, droplet inhalation transmission, and contact transmission, such as the contact with oral, nasal, and eye mucous membranes. (Source: this paper)
- Negated: The person-to-person transmission routes of 2019-nCoV did not include indirect transmission over the Internet. (I made this sentence up)
- Speculative: This other coronavirus might be specific to deer species. (From this paper published in 1995 about a different coronavirus)
- Negated speculative: This other coronavirus might not be specific to deer species. (I made this sentence up)
…and many other tasks that all have to be handled in order to solve the problem of summarizing those 15,000+ papers–and many other problems in getting computers to understand human language, too.
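To make the four-way distinction above concrete, here is a toy sketch in Python that labels a sentence by looking for cue words. The cue lists and the function name `classify_statement` are my own assumptions for illustration, not anything from the papers or the dataset. Real systems use much larger lexicons or trained models, and they also have to work out which part of the sentence each cue applies to.

```python
import re

# Hypothetical, hand-picked cue lists for illustration only. Real systems
# use far larger lexicons and also resolve the scope of each cue.
NEGATION_CUES = {"not", "no", "never", "without", "cannot"}
SPECULATION_CUES = {"might", "may", "could", "possibly", "perhaps"}


def classify_statement(sentence: str) -> str:
    """Label a sentence by which kinds of cue words it contains."""
    # Lowercase and split into alphabetic tokens so that "nasal" is not
    # mistaken for the cue "no".
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    negated = bool(tokens & NEGATION_CUES)
    speculative = bool(tokens & SPECULATION_CUES)
    if negated and speculative:
        return "negated speculative"
    if speculative:
        return "speculative"
    if negated:
        return "negated"
    return "positive"


if __name__ == "__main__":
    examples = [
        "This other coronavirus might be specific to deer species.",
        "This other coronavirus might not be specific to deer species.",
    ]
    for sentence in examples:
        print(classify_statement(sentence), "|", sentence)
```

A word list like this breaks down quickly: it would call "It is not impossible that..." negated, and it has no idea whether "not" applies to the claim you care about. That gap is a big part of why these tasks need real data to test against.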
Like I said, though: in order to test our systems, we need data. Enter a number of the big players in computational linguistics, who have created, and made freely available to the public, a large dataset of relevant papers. Their hope? That computational linguists around the world will dive into it, using it to develop and test tools for dealing with all of those papers. Here’s an excerpt from the White House’s web site describing the effort to create and release the data, followed by an appeal that my colleague Pierre Zweigenbaum sent out to the francophone computational linguistics community, asking them to work on it. C’est parti…
“One of the most immediate and impactful applications of AI is in the ability to help scientists, academics, and technologists find the right information in a sea of scientific papers to move research faster. We applaud the OSTP, WHO, NIH and all organizations that are taking a proactive approach to use the most advanced technology in the fight against COVID-19,” said Dr. Oren Etzioni, Chief Executive Officer of the Allen Institute for AI. “The Allen Institute for AI, and particularly the Semantic Scholar team, is committed to updating and improving this important resource and the associated AI methods the community will be using to tackle this crucial problem.”
“It’s difficult for people to manually go through more than 20,000 articles and synthesize their findings. Recent advances in technology can be helpful here. We’re putting machine readable versions of these articles in front of our community of more than 4 million data scientists. Our hope is that AI can be used to help find answers to a key set of questions about COVID-19,” said Anthony Goldbloom, Co-Founder and Chief Executive Officer at Kaggle.
Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset
Here is a series of information retrieval / information extraction / text mining / question answering tasks, launched on March 16, on a topical subject:
in a collection of 29,000 articles (13,000 of them in full text) about coronaviruses (not only the “new” one, of course). The questions are listed under the “Tasks” heading, and each generic question is broken down into specific questions. See for example “What is known about transmission, incubation, and environmental stability?”
In addition, a corpus on Covid-19 (LitCovid) is being continuously updated at the National Library of Medicine:
https://www.ncbi.nlm.nih.gov/research/coronavirus/ (1263 articles as I write this message, versus 1120 two days earlier).
The DBCLS in Tokyo has set up a space on its annotation management platform to centralize the information extracted from the LitCovid corpus in the form of annotations:
All NLP specialists are therefore encouraged to apply their methods to these data and run them on Kaggle (CORD-19), to apply them to the LitCovid corpus, and to deposit the annotations on PubAnnotation.
“Today, researchers and leaders from the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) at the National Institutes of Health released the COVID-19 Open Research Dataset (CORD-19) of scholarly literature about COVID-19, SARS-CoV-2, and the Coronavirus group.”