Faculty members from the University of Tennessee Health Science Center have led the development of PubSeq, a global digital data repository of COVID-19 RNA sequences. PubSeq is open to all interested individuals, primarily researchers working to understand the SARS-CoV-2 coronavirus and the pattern of spread of its numerous variants (including the Delta variant).
“We created an online resource where researchers can upload RNA data and associated metadata from the coronavirus, so it gets collected into a database, and people can query those data,” said Pjotr Prins, PhD, a bioinformatician and assistant professor in the Department of Genetics, Genomics and Informatics at UTHSC.
Dr. Prins explained that in the first days of the pandemic, sequencing information was available only at large repositories and only under relatively strict access control. “Our initiative started as a response to the fact that data about the virus was not sufficiently public, especially in the first part of the epidemic,” he said. “If a researcher sequenced the virus, like we are doing in Memphis, they send the data to the central repository, but first they have to register, and when they register, they have to agree to a specific license, and that license essentially means that they cannot share their data, other than with people who are part of that repository team. This is not truly open science.”
Believing that fighting the virus effectively required sharing information much more rapidly and without impediments, Dr. Prins and other scientists held an international “biohackathon” last year with the goal of creating an open-source resource for storing sequence data on the coronavirus.
More than 150 scientists contributed software to the biohackathon and PubSeq. It is now linked to other major resources, such as the NCBI (National Center for Biotechnology Information) GenBank and the EBI/ENA (European Bioinformatics Institute/European Nucleotide Archive), and adds value by providing quality control and by normalizing metadata, including information on geographical location. PubSeq is an ongoing global and open initiative, and anyone can contribute to both software development and data repositories.
“Currently, we are working with Joep de Ligt, PhD, who leads the pandemic sequencing effort in New Zealand, on uploading viral sequence data from the handheld Oxford Nanopore sequencer (a next-generation sequencing technology) into PubSeq, using inexpensive hardware for primary analysis,” Dr. Prins said. He said the group is also working with Peter Amstutz, principal software engineer for Curii Corp., and Michael Crusoe, cofounder and lead at the Common Workflow Language project, to expand online reproducible workflows for this data. “This can only happen when data is public and abides by FAIR principles (of scientific data management and stewardship). We hope this effort will help early identification of new viral variants around the globe in this and future pandemics.”
The PubSeq website is hosted in Memphis with the GeneNetwork program, a free scientific web resource developed at UTHSC. Amazon has also made PubSeq part of the Open Data Program, further extending its reach and availability. PubSeq now has more than 86,000 viral sequences for researchers studying the virus.
“Dr. Prins and colleagues at UTHSC are at the vanguard of making sure that the results on COVID that are paid for mainly by the public purse are freely available to those developing vaccines, treatments, and rational policies to reduce risk of infection,” said Robert Williams, PhD, chair and professor in the UTHSC Department of Genetics, Genomics and Informatics.