BioTechniques gives a broad introduction to how next-generation sequencing works and the technologies involved.
Over the past six decades, researchers have developed sequencing techniques and technologies to determine nucleic acid sequences in biological samples. This work has culminated in next-generation sequencing (NGS), which researchers and clinicians use to diagnose, monitor, and manage disorders and diseases by identifying germline or somatic mutations. NGS also proves useful in metagenomic studies. And in 2020, researchers utilized NGS methods to characterize the SARS-CoV-2 genome and monitor the spread of COVID-19 on a global scale.
Here, the life sciences journal BioTechniques explains the:
- History behind NGS
- Sequencing methods available to researchers
- Key stages of NGS
- Difference between short- and long-read sequencing
- Difference between whole-genome and whole-exome sequencing
- Data analysis that follows NGS
- Bottlenecks of NGS.
The History Behind Next-Generation Sequencing
In 1953, Watson and Crick determined the structure of DNA, drawing on Rosalind Franklin's X-ray diffraction and crystallography work. Then, in 1965, Robert Holley sequenced the first nucleic acid molecule, a tRNA. Various research groups have since adapted these methods to advance DNA sequencing.
In 1977, Frederick Sanger and his colleagues developed the chain-termination method, also known as Sanger sequencing. And by 1986, researchers had developed the first automated DNA sequencing method. This marked the beginning of a golden era for the progression of sequencing platforms, including the capillary DNA sequencer. Such advances in sequencing saw researchers complete the Human Genome Project in 2003 and launch the first commercially available, second-generation (2G) NGS platform in 2005. This platform made it possible to amplify millions of copies of a DNA fragment in parallel.
Although 2G NGS shares some similarities with Sanger sequencing, 2G NGS has a much higher sequencing volume. This means that, with NGS, researchers can process millions of reactions in parallel, resulting in higher sensitivity, throughput, and speed at a lower cost. Theoretically, genome sequencing projects that took years to complete with Sanger sequencing could now be finished within hours using NGS.
Next-Generation Sequencing Methods
Many individuals think of 2G technologies when they hear the term ‘NGS,’ but this term also encompasses third- (3G) and fourth- (4G) generation technologies that have evolved over the years.
2G Next-Generation Sequencing
Although 2G sequencing methods share many features, we can divide them according to their underlying detection chemistries, such as sequencing by ligation (including DNA nanoball sequencing) and sequencing by synthesis (SBS), which divides further into pyrosequencing, reversible terminator, and proton detection.
2G NGS technologies offer many advantages over previous sequencing techniques, most notably the ability to generate sequencing reads quickly and sensitively at a low cost. That said, these technologies aren’t without their weaknesses. Drawbacks include the incorporation of incorrect dNTPs by polymerases and poor interpretation of homopolymers, both of which result in sequencing errors. Plus, the short read lengths create a need for deeper sequencing coverage. And all 2G NGS techniques require researchers to complete PCR amplification before they begin sequencing.
3G Next-Generation Sequencing
The advent of 3G NGS removed the need for PCR: researchers can sequence single molecules without completing amplification steps first. They obtain sequence information by monitoring a DNA polymerase as it incorporates fluorescently labeled nucleotides into the DNA strand, with single-base resolution.
Depending on the exact technique and tools utilized, the benefits of 3G sequencing can include longer read lengths, non-biased sequencing, and real-time monitoring of nucleotide incorporation. That said, challenges can arise from the large quantities of sequencing data, high error rates, low read depth, and high costs.
4G Next-Generation Sequencing
4G NGS involves combining the single-molecule sequencing of 3G sequencing with nanopore technology. Like 3G technologies, nanopore technology doesn't require amplification and reads single molecules. However, it does so by passing each molecule through a nanopore and detecting changes in electrical current as the molecule threads through.
4G technologies enable the fastest whole-genome sequence scans. That said, they are more costly and error-prone than 2G techniques. As a result, there are fewer data available for this technique.
The Stages of Next-Generation Sequencing
Whichever 2G NGS method a researcher follows, they must complete four main steps and tailor these to the target RNA or DNA and their chosen sequencing system.
1. Sample Preparation
The researcher extracts nucleic acids (DNA or RNA) from the selected samples, which may be blood, bone marrow, sputum, or similar. They quality-control check the extracted samples using standard methods, such as gel electrophoresis, fluorometry, or spectrophotometry. If they are working with RNA, the researcher reverse transcribes the samples into cDNA. (Some library preparation kits include this step.)
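For the spectrophotometric check, purity is typically judged from the ratio of absorbance at 260 nm to 280 nm, with ~1.8 conventionally considered "pure" for DNA and ~2.0 for RNA. Here is a minimal Python sketch of that rule of thumb; the function name and acceptance window are illustrative, not part of any standard toolkit:

```python
def assess_purity(a260: float, a280: float, nucleic_acid: str = "DNA") -> str:
    """Rule-of-thumb purity check from UV absorbance readings.

    A260/A280 ratios of ~1.8 (DNA) or ~2.0 (RNA) conventionally indicate
    a sample largely free of protein contamination.
    """
    ratio = a260 / a280
    target = 1.8 if nucleic_acid.upper() == "DNA" else 2.0
    if abs(ratio - target) <= 0.1:  # illustrative tolerance
        return f"A260/A280 = {ratio:.2f}: acceptable for {nucleic_acid}"
    return f"A260/A280 = {ratio:.2f}: check for contamination (expected ~{target})"

print(assess_purity(1.00, 0.55, "DNA"))  # A260/A280 = 1.82: acceptable for DNA
```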
2. Library Preparation
The researcher performs random fragmentation of the cDNA or DNA, usually by sonication or enzymatic treatment. The optimal fragment length depends on the platform the researcher uses. They may need to run a small amount of fragmented sample on an electrophoresis gel when optimizing the process.
They can then end-repair and ligate the fragments to adapters (smaller, generic DNA fragments). Adapters have defined lengths with known oligomer sequences. This makes them compatible with the applied sequencing platform and identifiable when researchers perform multiplex sequencing. Multiplex sequencing allows researchers to pool and sequence large numbers of libraries at the same time.
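To make the role of adapter indices concrete, here is a toy Python sketch of demultiplexing: reads are assigned back to their source libraries by matching the index (barcode) read against a lookup table. The barcodes and reads are hypothetical, and real demultiplexers also tolerate sequencing errors in the index.

```python
# Toy demultiplexing: assign each read to a sample by its index (barcode).
barcodes = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}  # hypothetical indices

reads = [
    ("ACGTAC", "TTGACCA..."),   # (index read, insert sequence)
    ("TGCATG", "GGCATTA..."),
    ("NNNNNN", "CCATGGA..."),   # unreadable index
]

pools = {name: [] for name in barcodes.values()}
pools["undetermined"] = []

for index, sequence in reads:
    pools[barcodes.get(index, "undetermined")].append(sequence)

for sample, seqs in pools.items():
    print(sample, len(seqs))
```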
The researcher then performs size selection, either by gel electrophoresis or by using magnetic beads. This process removes any fragments that are too short or too long for the selected sequencing platform and protocol. Next, they achieve library enrichment/amplification through PCR. They might then apply a “clean-up” step (perhaps using magnetic beads) to remove undesired fragments. This improves sequencing efficiency.
To complete this stage, the researcher can use qPCR to perform a quality control check on the final libraries and confirm the quality and quantity of the DNA. This enables the researcher to prepare the correct concentration of the sample for sequencing.
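The loading concentration is usually expressed as molarity, which can be derived from the measured mass concentration and the mean fragment length using the standard approximation of 660 g/mol per base pair of double-stranded DNA. A minimal sketch of that conversion follows; the values and target concentration are illustrative, and the platform's own loading guidelines take precedence.

```python
def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: int) -> float:
    """Convert a library's mass concentration to molarity (nM).

    nM = (ng/uL * 1e6) / (660 g/mol per bp * mean fragment length in bp).
    """
    return (conc_ng_per_ul * 1e6) / (660 * mean_fragment_bp)

def dilution_volumes(stock_nm: float, target_nm: float, final_ul: float):
    """Volumes of stock library and diluent needed for a loading dilution."""
    stock_ul = target_nm * final_ul / stock_nm
    return stock_ul, final_ul - stock_ul

# Hypothetical library: 12 ng/uL with a 400 bp mean fragment length.
stock = library_molarity_nm(conc_ng_per_ul=12.0, mean_fragment_bp=400)
print(f"stock: {stock:.1f} nM")                         # ~45.5 nM
print(dilution_volumes(stock, target_nm=4.0, final_ul=100.0))
```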
3. Sequencing
Depending on the platform and chemistry, the researcher may perform clonal amplification of the library fragments before sequencer loading (emulsion PCR). Alternatively, they may perform this amplification on the sequencer itself (bridge PCR). They detect and report the sequences according to the chosen platform.
4. Data Analysis
Finally, the researcher analyzes the generated data files. The analysis method they select should depend on the workflow and the aim of the study. Paired-end and mate-pair sequencing are ideal for downstream data analysis, especially for de novo assemblies. They link sequencing reads that are read from both ends of a fragment (paired-end) or have been separated by an intervening DNA region (mate-pair).
When selecting a library preparation and sequencing platform, researchers should consider:
- The research question posed
- The sample type
- Whether they need to sequence the whole genome or only specific regions
- Whether short- or long-read sequencing would be more appropriate
- Whether they need to look at the genome or transcriptome (DNA or RNA)
- The most appropriate extraction method
- The read depth (coverage) needed (see the sketch after this list)
- The required sample concentration
- The read length required
- Whether they need to use single-end, paired-end or mate-pair reads
- Whether they could multiplex samples
- Whether they need any bioinformatic tools.
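On the read-depth point, the relationship between coverage, read length, and read count follows the Lander-Waterman relation C = L × N / G. A minimal Python sketch, using hypothetical numbers for a human-sized genome:

```python
import math

def reads_for_coverage(genome_size_bp: int, read_length_bp: int,
                       target_coverage: float, paired_end: bool = True) -> int:
    """Estimate the reads needed via the Lander-Waterman relation C = L*N/G."""
    bases_needed = genome_size_bp * target_coverage
    bases_per_unit = read_length_bp * (2 if paired_end else 1)
    return math.ceil(bases_needed / bases_per_unit)

# Hypothetical target: 30x coverage of a ~3.2 Gb genome with 2 x 150 bp reads.
print(reads_for_coverage(3_200_000_000, 150, 30))  # 320,000,000 read pairs
```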
Short-Read Versus Long-Read Sequencing
Researchers can choose between two main approaches in NGS technology: short-read sequencing and long-read sequencing. Each of these offers its own benefits and limitations.
Short-read sequencing is a low-cost solution that enables researchers to sequence fragmented DNA and achieve higher sequence fidelity. That said, researchers can’t use short-read sequencing to resolve structural variants, phase alleles, or distinguish highly homologous genomic regions.
On the other hand, long-read sequencing enables researchers to sequence genetic regions that are difficult to characterize with short-read sequencing because of repeat sequences; resolve structural rearrangements or homologous regions; read an entire RNA transcript and determine its specific isoform; and assist de novo genome assembly. That said, long-read sequencing has lower per-read accuracy, and bioinformatic challenges can arise from coverage biases, scalability, high error rates in base allocation, and the limited availability of appropriate pipelines.
Whole-Genome Versus Whole-Exome Sequencing
Whole-genome sequencing is the most commonly used form of NGS and involves the analysis of a genome's entire nucleotide sequence. Meanwhile, whole-exome sequencing is a form of targeted sequencing that addresses only protein-coding exons. Whole-exome sequencing is much more cost-effective than whole-genome sequencing because of its lower sequencing burden and the lower volume and complexity of the resulting sequencing data.
That said, sequencing only a fraction of the genome reduces the opportunity for novel discoveries, and researchers may miss key information. Whole-genome sequencing, by contrast, tends to reveal a more complete picture, and its rapidly declining costs are making this option ever more prevalent.
Next-Generation Sequencing Data Analysis
All NGS technologies generate large amounts of output data. As a result, data analysis usually involves a raw read quality control step, pre-processing and mapping, post-alignment processing, variant calling, variant annotation, and visualization.
Assessing the raw sequencing data enables the researcher to determine the quality of the data and lay the foundations for downstream analyses. These assessments can provide a general view of the number and length of reads and identify any contaminating sequences or reads with low coverage.
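At its core, this assessment works on the Phred quality scores stored in the FASTQ file, where each quality character encodes a score as its ASCII value minus 33 (the Sanger/Illumina 1.8+ convention). A toy Python sketch of the per-read statistic such tools aggregate; the record shown is hypothetical:

```python
# Minimal sketch of the statistic raw-read QC tools compute per read:
# mean Phred quality, decoded from the FASTQ quality line.
def mean_phred(quality_line: str) -> float:
    return sum(ord(ch) - 33 for ch in quality_line) / len(quality_line)

# Hypothetical FASTQ record (header, sequence, separator, qualities).
record = ["@read_1", "GATTACA", "+", "IIIIHHF"]
print(f"mean Q = {mean_phred(record[3]):.1f}")  # 'I' = Q40, 'H' = Q39, 'F' = Q37
```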
FastQC is one of the most well-established applications for computing quality-control statistics on sequencing reads. However, researchers need additional tools for further pre-processing, like trimming and read filtering. Trimming bases at the ends of reads and removing leftover adapter sequences tends to improve data quality. Meanwhile, researchers can use modern tools like fastp, which combines quality control checks, read filtering, and base correction in a single application and runs 2-5 times faster than traditional tools performing these steps separately.
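As a sketch of how such a pre-processing step might be scripted, the following Python snippet invokes fastp on a hypothetical paired-end sample. The file names and threshold values are placeholders to adapt to the experiment.

```python
import subprocess

# Sketch of a paired-end fastp run; file names are placeholders.
subprocess.run([
    "fastp",
    "-i", "sample_R1.fastq.gz", "-I", "sample_R2.fastq.gz",
    "-o", "trimmed_R1.fastq.gz", "-O", "trimmed_R2.fastq.gz",
    "-q", "15",                  # per-base quality threshold
    "-l", "36",                  # discard reads shorter than 36 bp after trimming
    "--detect_adapter_for_pe",   # infer adapters from read-pair overlap
    "--json", "fastp_report.json",
    "--html", "fastp_report.html",
], check=True)
```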
The next stage depends on whether a reference genome exists. If it doesn't, the researcher performs de novo genome assembly, aligning the generated sequences into contigs using their overlapping regions. They may achieve this using processing pipelines, which may include scaffolding steps to increase assembly contiguity.
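The core idea of overlap-based contig building can be illustrated with a toy Python sketch that merges two reads via their longest suffix-prefix overlap. Production assemblers use overlap-layout-consensus or de Bruijn graph approaches to do this at scale and to handle sequencing errors.

```python
def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge(a: str, b: str) -> str:
    """Join two reads into a longer contig via their suffix-prefix overlap."""
    return a + b[longest_overlap(a, b):]

print(merge("ATGGCCTA", "CCTATTGA"))  # ATGGCCTATTGA (4-base overlap "CCTA")
```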
Where the generated sequences are mapped to a reference genome or transcriptome, it is possible to identify variations compared with the reference sequence. Researchers can choose from a huge range of mapping tools designed to handle large volumes of reads, and can then analyze the mapped reads in an experiment-specific manner. During this analysis, the researcher can identify indels (insertions or deletions of bases), haplotypes, single nucleotide polymorphisms (SNPs), inversions, and differential gene transcription in RNA sequences.
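To ground the mapping-and-variant-calling step, here is a hedged sketch of one common open-source workflow (BWA-MEM for alignment, samtools for sorting and indexing, bcftools for calling), driven from Python. File names are placeholders, and a production pipeline would add steps such as duplicate marking and variant filtering.

```python
import subprocess

ref, r1, r2 = "reference.fa", "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz"

subprocess.run(["bwa", "index", ref], check=True)        # build alignment index
subprocess.run(["samtools", "faidx", ref], check=True)   # FASTA index for mpileup
subprocess.run(                                          # align, then coordinate-sort
    f"bwa mem {ref} {r1} {r2} | samtools sort -o aligned.bam -",
    shell=True, check=True,
)
subprocess.run(["samtools", "index", "aligned.bam"], check=True)
subprocess.run(                                          # pile up and call variants
    f"bcftools mpileup -f {ref} aligned.bam | bcftools call -mv -o variants.vcf",
    shell=True, check=True,
)
```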
Finally, the researcher can complete the visualization stage, during which data complexity can pose challenges. Depending on the study and research question, researchers can choose from several visualization tools. For example, the Integrative Genomics Viewer (IGV) and the Genome Browser are popular tools when reference genomes are available. The Variant Explorer is ideal for whole-genome and whole-exome experiments, as it can sieve through thousands of variants and allows researchers to focus on the most valuable findings. Meanwhile, the VISTA tool enables researchers to compare genomic sequences.
Next-Generation Sequencing Bottlenecks
Although NGS has made it possible to study and explore genomes at a whole new level, the complexity of sample processing has exposed bottlenecks in the ways that researchers manage, analyze, and store data. For example, NGS requires huge computational resources for the assembly, annotation, and analysis of sequencing data. The sheer volume of data generated can also pose major challenges, especially as data centers struggle to cope with the rising demand for storage capacity. However, strategies to improve efficiency, maximize reproducibility, minimize sequencing error, and facilitate sound data management are underway.
An Invaluable Tool
NGS has made its mark as an invaluable tool since the early 2000s. Methods like whole-genome sequencing, whole-exome sequencing, targeted sequencing, transcriptome, epigenome, and metagenome sequencing each offer unique benefits to research settings around the world and are only set to develop further. Meanwhile, the growing power of NGS, paired with its ever-lowering costs, has also enabled many clinicians to adopt the technique in their practices.
The Latest in Lab Techniques, Tools, and Tech
BioTechniques publishes the latest updates on laboratory techniques, technologies, and tools for lab researchers and other life sciences professionals. When BioTechniques released its first issue in 1983, it was the first publication to feature peer-reviewed research on lab methods and instrumentation, providing a cutting-edge voice for the life sciences sector. Today, readers of various scientific and medical disciplines use both the journal and BioTechniques’ multimedia website to enhance their knowledge.