One month after the debut of the COVID-19 Open Research Dataset, or CORD-19, the database of coronavirus-related research papers has doubled in size – and has given rise to more than a dozen software tools to channel the hundreds of studies that are being published every day about the pandemic.
In a roundup published on the arXiv preprint server this week, researchers from Seattle’s Allen Institute for Artificial Intelligence, Microsoft Research and other partners in the project say CORD-19’s collection has risen from about 28,000 papers to more than 52,000. Every day, several hundred more papers are being published, in peer-reviewed journals and on preprint servers such as bioRxiv and medRxiv.
CORD-19 aims to make sense of them all, using the Semantic Scholar academic search engine developed by the Allen Institute for AI, also known as AI2.
“We commit to providing regular updates to the dataset until an end to the crisis is foreseeable,” the project’s organizers say.
Since mid-March, the dataset has been viewed more than 1.5 million times and downloaded more than 75,000 times. But it’s not just a question of quantity: CORD-19 has sparked the development of spin-off projects aimed at visualizing and organizing COVID-19 research to answer key questions about the pandemic and how to stop it.
One of the highest-profile spin-offs is the Text Retrieval Conference-COVID, or TREC-COVID, launched last week by the Commerce Department’s National Institute of Standards and Technology and the White House Office of Science and Technology Policy.
Among other organizers of TREC-COVID are AI2, the National Library of Medicine, Oregon Health and Science University and the University of Texas Health Science Center at Houston. The goal of the project is to assess systems on their ability to rank COVID-19 research papers based on their relevance to topical queries – for example, “How does the coronavirus respond to changes in the weather?”
“AI experts worldwide are responding to the White House’s call to action, developing approaches that help scientists gain insights from thousands of articles of COVID-19 scholarly literature,” Michael Kratsios, U.S. chief technology officer, said in a news release. “The TREC-COVID program expands upon these efforts by creating powerful and accurate search engines that extract knowledge from this literature, tailored to the needs of the health-care and medical research communities.”
Another partner in CORD-19 is the Kaggle online data science community, which is conducting a text-mining competition to extract answers to key research questions surrounding the pandemic. More than 550 teams are participating in the competition, and they’re already finding new ways to blend machine-based analysis with human-based curation.
“A few Kagglers are collaborating with a group of medical students to create a semi-automated living literature review page,” said AI2’s Lucy Lu Wang, a member of the CORD-19 team. “The machine-learning experts are creating systems to extract answers out of the CORD-19 dataset, and the medical students are helping to evaluate those results and present them in a form that’s suited for public consumption.”
Wang and the other team members say they’ve faced a few obstacles in their efforts to build the database. One has to do with access to research. “Though many publishers have generously made COVID-19 papers available during this time, there are still bottlenecks to information access,” the organizers say in their report.
Securing release rights to papers that haven’t yet been made available to CORD-19 is one of the top items on the organizers’ to-do list, with the National Institutes of Health’s PubMed Central COVID-19 Initiative taking a leading role.