People are dying because scientific knowledge, in the form of research literature, is hidden behind paywalls. We intend to liberate this knowledge. People outside universities who work in areas affected by COVID-19, from healthcare professionals to government and many industry sectors, do not have access to the scientific literature.
Five years ago the Liberian government shocked us by revealing that the Ebola epidemic had been predicted in an obscure paywalled scientific journal 30 years earlier.
The text was clear:
A regular literature search for “Liberia” and “Ebola” would have alerted policymakers. We set out to build a universal warning system from the literature. How? We read the FULL TEXT of all published science and index it for key facts using Wikidata.
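As a rough illustration of the idea, here is a minimal Python sketch of dictionary-based indexing followed by a co-occurrence query. The two sample texts, the term list and the function names are invented for illustration. This is not ContentMine's actual code, and a real system would run Wikidata-derived dictionaries over full-text articles rather than single words over toy sentences.

```python
import re
from collections import defaultdict

def build_index(documents, dictionary):
    """Map each (single-word) dictionary term to the ids of documents that mention it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        words = set(re.findall(r"\w+", text.lower()))
        for term in dictionary:
            if term.lower() in words:
                index[term.lower()].add(doc_id)
    return index

def cooccurring(index, *terms):
    """Return the ids of documents that mention every one of the given terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

# Toy corpus: invented sentences, not real publications.
documents = {
    "doc1": "A report that mentions Ebola antibodies found in Liberia.",
    "doc2": "A survey of malaria prevalence in Liberia.",
}
index = build_index(documents, ["Liberia", "Ebola", "malaria"])
print(cooccurring(index, "Liberia", "Ebola"))
```

With this toy corpus only `doc1` mentions both terms, which is the kind of signal a policymaker's alert would be built on.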
Scientific knowledge is one of the few tools we have in an otherwise near-empty toolbox to fight COVID-19. We currently lack other tools in technology, medicines, healthcare and industry. Healthcare professionals and many other sectors are denied access to the world’s research knowledge, which they need to develop solutions.
Two things need to happen:
- Immediate and permanent Open Access to all research via an EC programme and economic package. All research must be made open: the effects of the virus have touched so many sectors that drawing boundaries is pointless;
- Research publishing needs to be modernised to be machine readable, to speed up research cycles.
If action is not taken the suffering will persist for longer. Both points can be actioned immediately and the benefits will be enormous.
We offer non-academics a single point of entry to the scientific literature that makes it easy for them to search. We aim to give quick answers to simple questions and to support ‘scoping questions’, so that users can find out whether their question is being addressed by researchers.
ContentMine mines open research repositories such as Europe PubMed Central and indexes the results with Wikidata terms. This makes the research more powerful, because the content is indexed against a global open classification system. The integration of ContentMine and Wikidata indexing has two valuable aspects: 1. Wikidata makes research from professionals such as clinicians intelligible at different levels, closing the knowledge loop between researchers and practitioners; 2. Wikidata terms are available in over 100 languages, meaning searches can be multilingual. For example, if a user searches for ‘Middle East respiratory syndrome’ in English they will also get results for ‘Síndrome respiratorio de Oriente Medio’ in Spanish.
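The multilingual search described above can be sketched as a query expansion over per-language labels. This is a minimal sketch under stated assumptions: the label table is hand-written in the shape of Wikidata labels, `Q_MERS` is a placeholder rather than a real Wikidata identifier, and the documents and function names are invented. It does not show how ContentMine itself performs the lookup.

```python
# Hypothetical label table in the shape of Wikidata labels:
# entity id -> {language code: label}. "Q_MERS" is a placeholder id.
LABELS = {
    "Q_MERS": {
        "en": "Middle East respiratory syndrome",
        "es": "Síndrome respiratorio de Oriente Medio",
    },
}

def expand_query(query, labels=LABELS):
    """Return every language label of the entity whose label matches the query."""
    q = query.casefold()
    for entity_labels in labels.values():
        if any(label.casefold() == q for label in entity_labels.values()):
            return sorted(entity_labels.values())
    return [query]  # unknown term: search for it unchanged

def search(query, documents, labels=LABELS):
    """Return ids of documents containing the query in any known language."""
    variants = [v.casefold() for v in expand_query(query, labels)]
    return [doc_id for doc_id, text in documents.items()
            if any(v in text.casefold() for v in variants)]

# Invented example documents.
documents = {
    "en-1": "Notes on Middle East respiratory syndrome.",
    "es-1": "Notas sobre el síndrome respiratorio de Oriente Medio.",
    "en-2": "Notes on influenza.",
}
print(search("Middle East respiratory syndrome", documents))
```

An English query here matches both the English and the Spanish document, while the unrelated influenza note is left out.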
Impact on the crisis
The majority of research literature is not Open Access and is not accessible to the public; as much as 70% could be closed access. It is also important to remember that the public has paid for the majority of this research, yet is not permitted to access it freely.
On top of the Open Access problem, publishing also needs digital modernisation. Interoperable and machine-readable systems need to be implemented to replace the outdated PDF.
Progress has been made on Open Access, such as open repositories and preprints, but more is needed.
Nearly all sectors are affected by COVID-19, and it is difficult to draw boundaries around which research is not needed, so it is better to make all research open.
In terms of scaling, ContentMine could be implemented almost immediately on the globally available Open Access content.
Two examples that ContentMine has worked on during the hackathon are:
- Alternative medicine partner Holistic Health. The idea is to recommend to patients treatments that are alternative or supplementary to those prescribed by doctors (purely clinical), but always based on scientific data: for example, what kind of psychological support suits the problem, or which kind of physical activity is best given the comorbidities, so that we can improve patients’ quality of life.
- Global book sprint on ‘contact tracing and tracking’ to compare public health systems, by Health Sprints. The project documents the German system for ‘contact tracing’ in a MOOC and ‘crisis management’ in a GitHub-based OA book. We want to use ContentMine in the rapid publishing process. GitHub Project https://tinyurl.com/y8zj9n79 | German contact tracing MOOC https://tinyurl.com/ya3mefr2 | German Crisis Management book https://tinyurl.com/yagqmpsu
How we built it
ContentMine is fully open-source and integrates with a variety of technologies and other services. It consists of open-source modules that download search results in bulk, analyze them on your own machine and link them to Wikidata. It is integrated into the Open Science ecosystem of open tools and content.
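To give a feel for the bulk-download step, here is a small Python sketch that builds a search request for Europe PMC's public REST search service. The endpoint, the `OPEN_ACCESS:y` filter and the parameter names reflect our understanding of the Europe PMC documentation and should be checked against it; the function name and the search terms are hypothetical, and this is not the code of the ContentMine modules themselves. The sketch only constructs the URL and does not send the request.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed base URL of the Europe PMC REST search service.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_search_url(terms, page_size=25, open_access_only=True):
    """Build a search URL that requires every term, optionally restricted to Open Access."""
    query = " AND ".join(f'"{t}"' for t in terms)
    if open_access_only:
        query += " AND OPEN_ACCESS:y"
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"

# Hypothetical search terms.
url = build_search_url(["coronavirus", "masks"])
print(url)
```

The JSON returned for such a URL could then be saved locally and passed to the analysis and Wikidata-linking steps.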
Challenges we ran into
Most scientific literature is unacceptably hidden behind paywalls.
The European Commission and individual EU countries have begun moving towards full Open Access, but this must now be accelerated immediately. As an example, Switzerland is implementing 100% Open Access by the end of 2020. Plan S is one model for making research literature open, but it is not a global solution to what is a global problem: it relies on Author Payment Fees, which are prohibitive for the Global South and will be counterproductive, reducing knowledge exchange. AmeliCA policies of free to publish, free to read should be used instead.
The scientific literature is not fit for modern purposes, for example because of its reliance on PDF and archaic gateways. We're changing that.
- The first system to annotate the scientific literature corpus with Wikidata (the new universal system for scientific and medical knowledge).
- Andrew Jackson - thesis analysis
- Andrew Whitehouse - scraper download with Lezan Hawizy evaluating
- Lezan Hawizy - Ferret scraping system
- Peter Murray-Rust - increasing the sources being mined
- Richard Light - building dictionary system
- Clyde Davies - interfacing DOAJ and searching 4 million abstracts
- Nick England - Containerization
- Jordi Soriano Mesa - partner use case
- Remko Popma - built platform
- Simon Worthington - Coordination and Health Sprints use case
Thanks also to our EUvsVirus mentors: Mariló Vallecillo, Dmitri Zaitsev, and Tautvydas Strazdas.
A prototype is available on GitHub at the following address: https://github.com/petermr/openVirus. The prototype is a command-line tool that serves as a proof of concept. It demonstrates the core functions needed and has been tested by searching major open research literature repositories.
With a team of two, the product could be a production-ready tool in three months, with a UI that lets the general public make use of it. In addition, a UX person and a full-stack web developer would be needed for three months.
ContentMine is produced by a non-profit company as open-source software and has contributions from a variety of public sector, university and private sector partners.
As with any open-source project, the software is available for anyone to use gratis and to build on as an infrastructural commons. ContentMine would provide a paid-for service using the software, as could any other company or organisation. This open-source model has proven to be self-sustaining in many cases where private and public sector innovation mesh.
ContentMine would need in the region of €80,000 for a full implementation. A consortium of partners could then contribute to running costs and ongoing development, on the model of ORCID or DataCite.
After the crisis
The value of open research literature and modernised scholarly publishing is not limited to the COVID-19 pandemic. All sectors of industry, civil society and government would benefit from a similar root-and-branch reform. The case for the free flow of knowledge is not new, but since we can assume that more crises are coming down the line, we need to be better prepared. Investment now will protect against the heavy price of inaction that we are seeing today.
“He who receives an idea from me, receives instruction himself without lessening mine; as he who lights his taper at mine, receives light without darkening me.”
Thomas Jefferson, Letter to Isaac McPherson, “No Patents on Ideas,” 13 August 1813.