A paper published last month in Bioinformatics presents a new statistical tool that can map population data faster and more accurately than previous methods (1). The software, called TeraPCA, can analyse huge genomic datasets in far less time, potentially unlocking secrets of the human genome sooner.
Genomes of human populations are being sequenced at a rapid rate, presenting an incredible opportunity to study human populations and, importantly, disease. By collecting more and more data, scientists hope to uncover the complex mechanisms underlying human disease, with a view to developing targeted gene therapies and other preventative measures.
Projects like the Human Genome Project and the UK BioBank have made, and continue to make, huge contributions to our understanding of human genetics. Further, as the price of gene sequencing continues to decline, many more individuals are opting to have their genomes sequenced. As a result, direct-to-consumer genetic testing, provided by companies like 23andMe, has expanded massively in the past decade, providing yet another important source of data.
Although high-throughput technologies, or next-generation sequencing (NGS), have allowed scientists to sequence human genomes on epidemiological scales, challenges remain. Current genome sequencing approaches trade off the frequency and type of errors against throughput, that is, how fast they can sequence.
Most differences in the human genome are due to single nucleotide polymorphisms, or SNPs, which occur nearly once in every 1,000 nucleotides. This means around 4 to 5 million SNPs in every genome, and considering data has already been collected on millions of people, that’s a lot of data.
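To put "a lot of data" into rough numbers, here is a back-of-the-envelope sketch in Python. Every input is an illustrative assumption rather than a figure from the paper: one million individuals, 4.5 million SNPs treated as the columns of a single genotype matrix (a simplification), and two possible encodings, a packed one at 2 bits per genotype and a dense one at 8 bytes per genotype.

```python
# Back-of-the-envelope size of a genotype matrix (individuals x SNPs).
# All inputs below are illustrative assumptions, not figures from the paper.
N_INDIVIDUALS = 1_000_000   # "millions of people": take one million
N_SNPS = 4_500_000          # midpoint of the 4 to 5 million range
BITS_PACKED = 2             # assumed packed encoding, 2 bits per genotype
BITS_FLOAT64 = 64           # one double-precision number per genotype


def matrix_bytes(n_rows: int, n_cols: int, bits_per_entry: int) -> float:
    """Return the storage needed for an n_rows x n_cols matrix, in bytes."""
    return n_rows * n_cols * bits_per_entry / 8


packed = matrix_bytes(N_INDIVIDUALS, N_SNPS, BITS_PACKED)
dense = matrix_bytes(N_INDIVIDUALS, N_SNPS, BITS_FLOAT64)

print(f"packed (2 bits/genotype): {packed / 1e12:.3f} TB")  # 1.125 TB
print(f"dense float64:            {dense / 1e12:.1f} TB")   # 36.0 TB
```

Under these assumptions even the packed form exceeds a terabyte, and a dense numerical copy would be far beyond the main memory of an ordinary workstation.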
All this new genetic information could lead to a considerable bottleneck, as scientists grapple with how and where to store, manage, and analyse the information they are generating at a rapid pace.
Typically, to study the genetics of human populations, scientists use a technique called Principal Component Analysis (PCA), which reduces huge numbers of variables to a small set of principal components. But achieving this requires a large amount of computer memory, presenting cost, time, and technological challenges.
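For readers who want to see what PCA does in practice, here is a minimal sketch in Python using NumPy. It is not the TeraPCA code: the function name `pca_scores`, the simulated two-population data, and the 0/1/2 genotype coding are all assumptions made for illustration.

```python
import numpy as np


def pca_scores(genotypes: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project individuals (rows) onto the top principal components."""
    # Centre each SNP (column) so the components describe variation, not means.
    centred = genotypes - genotypes.mean(axis=0)
    # Thin SVD: the columns of u, scaled by s, are the component scores.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy data (hypothetical): two populations of 50 individuals and 200 SNPs,
# with allele frequencies that differ between the populations.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(50, 200))
pop_b = rng.binomial(2, freq_b, size=(50, 200))
genotypes = np.vstack([pop_a, pop_b]).astype(float)

scores = pca_scores(genotypes)
print(scores.shape)  # (100, 2): two coordinates per individual
```

Each individual is summarised by two numbers instead of 200, and plotting those two numbers is how population structure is usually visualised. The catch is that this version holds the whole matrix in memory, which is exactly what stops working at very large scale.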
So, the researchers set out to develop a so-called out-of-core algorithm, one that can process data too big to fit in a computer’s main memory, and to make it much faster than existing approaches.
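The out-of-core idea can be sketched in a few lines of Python. The example below is an assumption-laden illustration, not a description of how TeraPCA organises its input and output: the helper `blockwise_gram_product`, the file name, and the block size are hypothetical. It computes the matrix product that iterative PCA methods rely on while reading only a slice of rows from disk at a time.

```python
import os
import tempfile

import numpy as np


def blockwise_gram_product(path, shape, x, block_rows=1000):
    """Compute A.T @ (A @ x) while holding only block_rows rows of A in RAM."""
    n_rows, n_cols = shape
    # memmap maps the file lazily; slices are read from disk on access.
    a = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
    result = np.zeros((n_cols, x.shape[1]))
    for start in range(0, n_rows, block_rows):
        block = np.asarray(a[start:start + block_rows])  # current block only
        result += block.T @ (block @ x)
    return result


# Small stand-in for a matrix that would not fit in memory.
rng = np.random.default_rng(1)
a_full = rng.standard_normal((2500, 40))
x = rng.standard_normal((40, 3))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genotypes.bin")  # hypothetical file name
    a_full.tofile(path)
    out_of_core = blockwise_gram_product(path, a_full.shape, x)

# Reference answer computed with the whole matrix in memory.
in_memory = a_full.T @ (a_full @ x)
print(np.allclose(out_of_core, in_memory))
```

The design choice is to trade extra disk reads for a memory footprint that depends on the block size rather than on the size of the full dataset.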
According to the authors, analyses that would normally take a couple of days ran in as little as five or six hours with their new programme, TeraPCA. They achieved this by “multithreading” (performing several computations at the same time) and by reducing the precision of the output from 16 digits to a more reasonable 2 to 4 decimal places. And the results were just as accurate.
The researchers demonstrated PCA of large-scale datasets using a method called Randomized Subspace Iteration, implemented in the C++ programming language. The code is now available on GitHub, a global open-source software development community.
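For the curious, the general shape of randomized subspace iteration can be sketched as follows. This is a textbook-style Python version under stated assumptions, not the authors' C++ implementation: the function name, the oversampling of 10 extra vectors, the five iterations, and the synthetic test matrix are all choices made here for illustration.

```python
import numpy as np


def randomized_subspace_iteration(a, k, oversample=10, n_iter=5, seed=0):
    """Approximate the top-k left singular vectors and singular values of a."""
    rng = np.random.default_rng(seed)
    n_rows, _n_cols = a.shape
    # Start from a random block of k + oversample vectors.
    q = rng.standard_normal((n_rows, k + oversample))
    for _ in range(n_iter):
        # One power step with A A^T, then re-orthonormalise for stability.
        q, _r = np.linalg.qr(a @ (a.T @ q))
    # Rayleigh-Ritz step: solve the small projected problem exactly.
    b = q.T @ a
    u_small, s, _vt = np.linalg.svd(b, full_matrices=False)
    return (q @ u_small)[:, :k], s[:k]


# Synthetic matrix (hypothetical): five strong directions plus weak noise.
rng = np.random.default_rng(42)
signal = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 200))
a = signal + 0.01 * rng.standard_normal((300, 200))

u, s = randomized_subspace_iteration(a, k=5)
s_exact = np.linalg.svd(a, compute_uv=False)[:5]
print(np.max(np.abs(s - s_exact) / s_exact))  # relative error, close to zero
```

Only products of the data matrix with a thin block of vectors are needed, which is what makes the method a natural fit for reading data in blocks and for stopping early once a few decimal places of accuracy have been reached.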
However, other potential problems still lurk. In another study, published earlier this year in Frontiers in Microbiology, scientists uncovered a troubling number of errors in publicly available genomic data (2). This seems inevitable given the immense amount of data being created by next-generation sequencing technologies. The authors, Broschat and Brayton, are working to develop a new tool that can find annotation errors in biological datasets.
Moreover, improvements in DNA sequencing technologies and the continued development of novel data-sorting algorithms will help maximize the potential benefits of this relatively new and exciting source of information – for both individual patients and entire populations.
(1) Bose, A. et al. TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics (2019). DOI: 10.1093/bioinformatics/btz157
(2) Lockwood, S. et al. Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues. Frontiers in Microbiology (2019). DOI: 10.3389/fmicb.2019.00383