Researchers pioneering long-read sequencing studies explain why long reads matter

New technologies are filling in gaps in the human genome and opening major areas for discovery. UAB researchers explain the pros and cons and how they are using long reads at UAB.
Written by: Matt Windsor
Media contact: Hannah Echols

Zechen Chong, Ph.D. (left), has developed a tool called Inspector that improves whole-genome assembly in long-read sequencing. Robert Kimberly, M.D. (right), is using long-read sequencing in partnership with Chong and HudsonAlpha to study structural variations in the genomes of patients with lupus.

Accurately mapping genetic variation between people is crucial to uncovering the causes of rare diseases and the increased susceptibility to a range of conditions within population groups. Until last summer, a surprisingly large proportion of the human genome remained uncharted, partly due to limitations of short-read sequencing. A new software program developed by University of Alabama at Birmingham researchers helps map out the human genome by improving the accuracy of long-read genomic sequencing.

The standard next-generation genome sequencing used today, including the vast majority of research and clinical sequencing at UAB, is done on machines that work via short reads. The preparation process chops up the deoxyribonucleic acid in a sample into strands roughly 150 base pairs long, or less, and the sequencing machine reads the base pairs found on each strand. Software then assembles the tiny chunks into a complete picture. The process works fine for much of the genome. But for regions with long stretches of repeated bases — such as the sequence GAGAGA repeated a few thousand times — or small insertions or deletions, it is difficult or impossible to determine the proper order from short reads alone.
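To make the repeat problem concrete, here is a minimal toy sketch in Python. It is an illustration only, not how any real sequencer or assembler works: the two genomes, the `all_reads` helper, and the read lengths of 6 and 26 bases are all hypothetical, the reads are assumed to be error-free, and coverage depth is ignored. The two made-up genomes differ only in how many times GA is repeated.

```python
def all_reads(genome, read_length):
    """Return every distinct error-free read of the given length (toy model)."""
    last_start = len(genome) - read_length
    return {genome[i:i + read_length] for i in range(last_start + 1)}


# Two hypothetical genomes that differ only in the number of GA repeats.
genome_a = "ACGT" + "GA" * 10 + "TTCA"
genome_b = "ACGT" + "GA" * 14 + "TTCA"

# Short reads (6 bases) never span the whole repeat, so both genomes
# yield exactly the same set of distinct reads.
short_reads_match = all_reads(genome_a, 6) == all_reads(genome_b, 6)

# Long reads (26 bases) reach from one flank across the repeat,
# so the two genomes yield different sets of reads.
long_reads_match = all_reads(genome_a, 26) == all_reads(genome_b, 26)

print("short reads identical:", short_reads_match)
print("long reads identical:", long_reads_match)
```

In this toy case the short reads from the two genomes are indistinguishable, while the long reads are not, which is the ambiguity described above in miniature.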

“Long reads can generate more accurate assemblies than short-read technologies, especially when there is no reference genome to check against or in repetitive sections of the genome and regions with complex genetic rearrangements,” said Zechen Chong, Ph.D., assistant professor in the UAB Marnix E. Heersink School of Medicine Department of Genetics. “The downside of long-read sequencing is higher error rates and a lack of effective tools for accurately evaluating assembly results.”

Those high error rates are why Inspector, software from Chong’s lab for evaluating long-read de novo genome assemblies that was described in a November 2021 article in the journal Genome Biology, is generating attention in the field and was celebrated as one of Heersink’s Featured Discoveries in February 2022. Inspector substantially reduces assembly errors and thereby improves assembly quality, according to Chong.

Robert Kimberly, M.D., director of the UAB Center for Clinical and Translational Science, says Chong’s work is an important step forward in the sequencing field. Kimberly’s lab, which has a major focus on lupus research, is working with Chong and with HudsonAlpha Institute for Biotechnology in Huntsville, Alabama, to use long-read sequencing to study structural variations in patients with lupus. This is the same category of work that has been done using genome-wide association studies for more than a decade, Kimberly points out.

“The difference is those studies are focused (necessarily, because of the nature of the technology) on changes in individual bases of nucleic acids,” Kimberly said. “Long-read sequencing gives you a much better understanding of structural variations: insertions, deletions and duplications of genetic material on a given chromosome. The larger read length gives you more real estate on the chromosome. Structural variation in relationship to disease phenotypes is a major area for discovery.”

Generating long reads is one thing. Analyzing them is another problem entirely, and one where Chong’s Inspector tool shines. 

In evaluations reported in the Genome Biology paper, Inspector outperformed two other long-read assembly evaluators on a simulated genome task. Chong and his team have uploaded the source code for Inspector to GitHub to allow open access, and they have “addressed dozens of questions regarding usage of Inspector from users through GitHub and email,” Chong said.

“A major challenge for reference-based analysis is distinguishing true variations from assembly errors,” Chong said. “Inspector is the first tool to facilitate the discovery of long-read assembly errors, including both small- and large-scale errors. Accurate assembly results are the basis for variant discovery, genome annotation and subsequent functional discoveries. It improves whole-genome assembly by identifying and correcting assembly errors and is not affected by genetic variants.”
