Perplexed by the many terms used in sequencing? We have summarized the most commonly used ones below. Should you need more explanation of any term, please feel free to contact us; our sequencing specialists will be happy to help.
16S ribosomal DNA (rDNA) : The 16S rRNA is a structural component of the bacterial ribosome (part of the 30S small subunit). The 16S rDNA is the gene that encodes this RNA molecule. Owing to its essential role in protein synthesis, this gene is highly conserved across all prokaryotes. There are portions of the 16S gene that are extremely highly conserved, so that a single set of “universal” PCR primers can be used to amplify a portion of this gene from nearly all prokaryotes. The gene also contains variable regions that can be used for taxonomic identification of bacteria. Amplification and taxonomic assignment of 16S rDNA sequences is a widely used method for metagenomic analysis.
Algorithm : A step-by-step method for solving a problem (a recipe). In bioinformatics, it is a set of well-defined instructions for making calculations. The algorithm can then be expressed as a set of computer instructions in any software language and implemented as a program on any computer platform.
Alignment : See Sequence alignment.
Alignment algorithm : See Sequence alignment.
Allele : In genetics, an allele is an alternative form of a gene, such as blue versus brown eye color. However, in genome sequencing, an allele is one form of a sequence variant that occurs in any position on any chromosome, or a sequence variant on any sequence read aligned to the genome—regardless of its effect on phenotype, or even if it is in a gene. In some cases, “allele” is used interchangeably with the term genotype.
Amplicon : An amplicon is a specific fragment or locus of DNA from a target organism (or organisms), generally 200–1000 bp in length, copied millions of times by the polymerase chain reaction (PCR). Amplicons for a single target (i.e., a reaction with a single pair of PCR primers) can be prepared from a mixed population of DNA templates such as HIV particles extracted from a patient’s blood or total bacterial DNA isolated from a medical or an environmental sample. The resulting deep sequencing provides detailed information about the variants at the target locus across the population of different DNA templates. Amplicons produced from many different PCR primers on many different DNA samples can be combined (with the aid of multiplex barcodes) into a single DNA sequencing reaction on an NGS machine.
Assemble : See sequence assembly.
Assembly : See sequence assembly.
BAM file : BAM is a binary sequence file format that uses BGZF compression and indexing. BAM is the binary compressed version of the SAM (Sequence Alignment/Map) format, which records, for each sequence read in an NGS data set, its alignment position on a reference genome, variants in the read versus the reference genome, mapping quality, and the base qualities as an ASCII string that represents PHRED quality scores.
BED file : BED is an extremely simple text file format that lists positions on a reference genome with respect to chromosome ID and start and stop positions. NGS reads can be represented in BED format, but only with respect to their position on the reference genome; no information about sequence variants or base quality is stored in the BED file.
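As a minimal sketch of how the three required BED columns might be read, the following assumes tab-separated fields and the usual BED convention of a 0-based start with an exclusive end. The function names are illustrative only and do not come from any particular library.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from the first three BED columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return chrom, start, end


def interval_length(start, end):
    # Assumption: BED starts are 0-based and ends are exclusive,
    # so the interval length is simply end - start.
    return end - start


chrom, start, end = parse_bed_line("chr1\t100\t250\tread1\n")
print(chrom, start, end, interval_length(start, end))
```

Any columns after the third (name, score, strand, and so on) are ignored here.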
BLAST : The Basic Local Alignment Search Tool was developed by Altschul and other bioinformaticians at the NCBI to provide an efficient method for scientists to use similarity-based searching to locate sequences in the GenBank database. BLAST uses a heuristic algorithm based on a hash table of the database to accelerate similarity searches, but it is not guaranteed to find the optimal alignment between any two sequences. BLAST is generally considered to be the most widely used bioinformatics software.
Burrows–Wheeler transformation (BWT) : BWT is a method of indexing (and compressing) a reference genome into a graph data structure of overlapping substrings, known as a suffix tree. It requires a single computational effort to build this graph for a particular reference genome, then it can be stored and reused when mapping multiple NGS data sets to this genome. The BWT method is particularly efficient when the data contain runs of repeated sequences, as in eukaryotic genomes, because it reduces the complexity of the genome by collapsing all copies of repeated strings. BWT works well for alignment of NGS reads to a reference genome because the sequence reads generally match perfectly or with few mismatches to the reference. BWT methods work poorly when many mismatches and indels are present in the reads, because many alternate paths through the suffix tree must be mapped. Highly cited NGS alignment software that makes use of BWT includes BWA, Bowtie, and SOAP2.
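The transform itself can be illustrated with a naive sketch that sorts all rotations of the text and keeps the last column. This is for illustration only: it assumes a "$" sentinel that sorts before every other symbol, and it is not how BWA, Bowtie, or SOAP2 build their indexes, since sorting full rotations is far too slow for a genome.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations."""
    text = text + sentinel  # the sentinel marks the end of the string
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))  # annb$aa
```

Note how identical letters cluster together in the output, which is what makes the transformed text easy to compress and index.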
Capillary DNA sequencing : This is a method used in DNA sequencing machines manufactured by Life Technologies Applied Biosystems. The technology is a modification of Sanger sequencing that contains several innovations: the use of fluorescent labeled dye terminators (or dye primers), cycle sequencing chemistry, and electrophoresis of each sample in a single capillary tube containing a polyacrylamide gel. High voltage is applied to the capillaries causing the DNA fragments produced by the cycle sequencing reaction to move through the polymer and separate by size. Fragment sizes are determined by a fluorescent detector, and the bases that comprise the sequence of each sample are called automatically.
ChIP-seq : Chromatin immunoprecipitation sequencing uses NGS to identify fragments of DNA bound by specific proteins such as transcription factors and modified histone subunits. Tissue samples or cultured cells are treated with formaldehyde, which creates covalent cross-links between DNA and associated proteins. The DNA is purified and fragmented into short segments of 200–300 bp, then immunoprecipitated with a specific antibody. The cross-links are removed, and the DNA segments are sequenced on an NGS machine (usually Illumina). The sequence reads are aligned to a reference genome, and protein-binding sites are identified as sites on the genome with clusters of aligned reads.
Cloning : In the context of DNA sequencing, DNA cloning refers to the isolation of a single purified fragment of DNA from the genome of a target organism and the production of millions of copies of this DNA fragment. The fragment is usually inserted into a cloning vector, such as a plasmid, to form a recombinant DNA molecule, which can then be amplified in bacterial cells. Cloning requires significant time and hands-on laboratory work and creates a bottleneck for traditional Sanger sequencing projects.
Consensus sequence : When two or more DNA sequences are aligned, the overlapping portions can be combined to create a single consensus sequence. In positions where all overlapping sequences have the same base (a single column of the multiple alignment), that base becomes the consensus. Various rules may be used to generate the consensus for positions where there are disagreements among overlapping sequences. A simple majority rule uses the most common letter in the column as the consensus. Any position where there is disagreement among aligned bases can be written as the letter N to designate “unknown.” There is also a set of IUPAC ambiguity codes (YRWSKMDVHB) that can be used to specify specific sets of different DNA bases that may occupy a single position in the consensus.
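A minimal sketch of the simple majority rule follows. It assumes the reads are already aligned and of equal length, uses "-" for gaps, and writes N when the top two bases tie. The function name and the tie rule are illustrative choices rather than a standard.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Build a consensus by taking the most common base in each column."""
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        ranked = counts.most_common()
        if not ranked:
            consensus.append("N")  # column contains only gaps
        elif len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus.append("N")  # tie between the top two bases
        else:
            consensus.append(ranked[0][0])
    return "".join(consensus)


print(majority_consensus(["GATTACA", "GATTACA", "GACTACA"]))  # GATTACA
```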
Contig : A contiguous stretch of DNA sequence that is the result of assembly of multiple overlapping sequence reads into a single consensus sequence. A contig requires a complete tiling set of overlapping sequence reads spanning a genomic region without gaps.
Coverage : The number of sequence reads in a sequencing project that align to positions that overlap a specific base on a target genome, or the average number of aligned reads that overlap all positions on the target genome.
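Both senses of coverage can be sketched in a few lines. The example assumes each aligned read is given as a (start, end) pair with a 0-based start and an exclusive end. The function names are hypothetical.

```python
def per_base_coverage(genome_length, alignments):
    """Count how many aligned reads overlap each position."""
    depth = [0] * genome_length
    for start, end in alignments:
        for position in range(start, end):
            depth[position] += 1
    return depth


def mean_coverage(depth):
    """Average depth over all positions of the target."""
    return sum(depth) / len(depth)


depth = per_base_coverage(10, [(0, 5), (3, 8)])
print(depth)                 # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0]
print(mean_coverage(depth))  # 1.0
```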
de Bruijn graph : This is a graph theory method for assembling a long sequence (like a genome) from overlapping fragments (like sequence reads). The de Bruijn graph is a set of unique substrings (words) of a fixed length (a k-mer) that contain all possible words in the data set exactly once. For genome assembly, the sequence reads are split into all possible k-mers, and overlapping k-mers are linked by edges in the graph. Reads are then mapped onto the graph of overlapping k-mers in a single pass, greatly reducing the computational complexity of genome assembly.
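The construction can be sketched as follows. This uses one common formulation in which each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. That choice, the function name, and the error-free toy reads are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Duplicate k-mers from overlapping reads collapse into one edge.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn_edges(["GATTA", "ATTAC"], k=3)
for prefix in sorted(graph):
    print(prefix, "->", sorted(graph[prefix]))
```

Following the edges from GA spells out GATTAC, the sequence from which both reads were drawn.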
De novo assembly : See De novo sequencing.
De novo sequencing : The sequencing of the genome of a new, previously unsequenced organism or DNA segment. This term is also used whenever a genome (or sequence data set) is assembled by methods of sequence overlap without the use of a known reference sequence. De novo sequencing might be used for a region of a known genome that has significant mutations and/or structural variation from the reference.
Diploid : A cell or organism that contains two copies of every chromosome, one inherited from each parent.
DNA fragment : A small piece of DNA, often produced by a physical or chemical shearing of larger DNA molecules. NGS machines determine the sequence of many DNA fragments simultaneously.
Exon : A portion of a gene that is transcribed and spliced to form the final messenger RNA (mRNA). Exons contain protein-coding sequence and untranslated upstream and downstream regions (5′ UTR and 3′ UTR). Exons are separated by introns, which are sequences that are transcribed by RNA polymerase, but spliced out after transcription and not included in the mature mRNA.
FASTA format : This is a simple text format for DNA and protein sequence files developed by William Pearson in conjunction with his FASTA alignment software. The file has a single header line that begins with a “>” symbol followed by a sequence identifier. Any other text on the first line is also considered the header, and any text following the first carriage return/line feed is considered part of the sequence. Multiple sequences can be stored in the same text file by adding additional header lines and sequences after the end of the first sequence.
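A minimal reader for this layout might look like the sketch below. It assumes the input is an iterable of text lines and takes the first word after ">" as the identifier. The function name is illustrative.

```python
def parse_fasta(lines):
    """Return a dict mapping each sequence identifier to its sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = [">seq1 test sequence", "GATT", "ACA", ">seq2", "CCGG"]
print(parse_fasta(example))  # {'seq1': 'GATTACA', 'seq2': 'CCGG'}
```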
FASTQ file : A text file format for NGS reads that contains both the DNA sequence and quality information about each base. Each sequence read is represented as a header line with a unique identifier for each sequence read and a line of DNA bases represented as text (GATC), which is very similar to the FASTA format. A second pair of lines is also present for each read, another header line and then a line with a string of ASCII symbols, equal in length to the number of bases in the read, which encode the PHRED quality score for each base.
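The four-line record layout can be sketched as below. The quality decoding assumes an ASCII offset of 33 (the Sanger convention also used by recent Illumina software). Older files may use an offset of 64, so treat the offset as an assumption to check against your data.

```python
def parse_fastq(lines, offset=33):
    """Return a list of (identifier, sequence, quality_scores) tuples."""
    lines = [line.rstrip("\n") for line in lines]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character encodes one Phred score as ASCII code - offset.
        scores = [ord(symbol) - offset for symbol in quality]
        records.append((header[1:], sequence, scores))
    return records


print(parse_fastq(["@read1", "GATC", "+", "II5!"]))
# [('read1', 'GATC', [40, 40, 20, 0])]
```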
Fragment assembly : To determine the complete sequence of a genome or large DNA fragment, short sequence reads must be merged. In Sanger sequencing projects, overlaps between sequence reads are found and aligned by similarity methods, then consensus sequences are generated and used to create contigs. Eventually a complete tiling of contigs is assembled across the target DNA. In NGS, there are too many sequence reads to search for overlaps among them all (the number of pairwise comparisons grows with the square of the number of reads). Alternate algorithms have been developed for de novo assembly of NGS reads, such as de Bruijn digraphs, which map all reads to a common matrix of short k-mer sequences (a step whose cost grows roughly linearly with the number of reads).
GenBank : The international archive of DNA and protein sequence data maintained by the National Center for Biotechnology Information (NCBI), a division of the U.S. National Library of Medicine. GenBank is part of a larger set of online scientific databases maintained by the NCBI, which includes the PubMed online database of published scientific literature, gene expression, sequence variants, taxonomy, chemicals, human genetics, and many software tools to work with these data.
Heterozygote : Humans and most other eukaryotes are diploid, meaning that they carry two copies of each chromosome in every somatic cell. Therefore, each individual carries two copies of each gene, one inherited from each parent. If the two copies of the gene are different (i.e., different alleles of that gene), then the person is said to be a heterozygote for that gene. A homozygote has two identical copies of that gene. In genome sequencing, every base of every chromosome can be considered as a separate data point; thus any single base can be genotyped as heterozygous or homozygous in that individual.
High-performance computing (HPC) : High-performance computing (HPC) provides computational resources to enable work on challenging problems that are beyond the capacity and capability of desktop computing resources. Such large resources include powerful supercomputers with massive numbers of processing cores that can be used to run high-end parallel applications. HPC designs are heterogeneous, but generally include multicore processors, multiple CPUs within a single computing device or node, graphics processing units (GPUs), and multiple nodes grouped in a cluster interconnected by high-speed networking systems. The most powerful current supercomputers can perform several quadrillion (10^15) operations per second (petaflops). Trends for supercomputing architecture are for greater miniaturization of parallel processing units, which saves energy (and reduces heat), speeds message passing, and allows for access to data in shared memory caches.
Histone : In eukaryotic cells, the DNA in chromosomes is organized and protected by wrapping around a set of scaffold proteins called histones. The histone family comprises six different proteins (H1, H2A, H2B, H3, H4, H5). Two copies of each of the four core histones (H2A, H2B, H3, and H4) bind together to form a spool structure, while H1 and H5 act as linker histones. DNA winds around the histone core about 1.65 times, using a length of 147 bp to form a unit known as the nucleosome. Methylation and other modifications of the histone proteins affect the structure and function of DNA (epigenetics).
Human Genome Project (HGP) : An international effort including 20 sequencing centers in China, France, Germany, Great Britain, Japan, and the United States, coordinated by the U.S. Department of Energy and the National Institutes of Health, to sequence the entire human genome. The effort formally began in 1990 with the allocation of funds by Congress and the development of high-resolution genetic maps of all human chromosomes. The project was formally completed in two stages, the “working draft” genome in 2000 and the “finished” genome in 2003. The 2003 version of the genome was declared to have fewer than one error per 10,000 bases (99.99% accuracy), an average contig size of >27 million bases, and to cover 99% of the gene-containing regions of all chromosomes. In addition, the HGP was responsible for large improvements in DNA sequencing technology, mapping more than 3 million human SNPs, and genome sequences for Escherichia coli, fruit fly, and other model organisms.
Human Microbiome Project : An effort coordinated by the U.S. National Institutes of Health to profile microbes (bacteria and viruses) associated with the human body—first to inventory the microbes present at various locations inside and outside the body and the normal range of variation in healthy people, then to investigate changes in these microbial populations associated with disease.
Illumina sequencing : The NGS sequencing method developed by the Solexa company, then acquired by Illumina Inc. This method uses “sequencing by synthesis” chemistry to simultaneously sequence millions of ∼300-bp-long DNA template molecules. Many sample preparation protocols are supported by Illumina including whole-genome sequencing (by random shearing of genomic DNA), RNA sequencing, and sequencing of fragments captured by hybridization to specific oligonucleotide baits. Illumina has aggressively improved its system through many updates, at each stage generally providing the highest total yield and greatest yield of sequence per dollar of commercially available DNA sequencers each year, leading to a dominant share of the NGS market. Machines sold by Illumina include the Genome Analyzer (GA, GAII, GAIIx), HiSeq, and MiSeq. At various times, with various protocols, Illumina machines have produced NGS reads of 25, 36, 50, 75, 100, and 150 bp as well as paired-end reads.
Indels : Insertions or deletions in one DNA sequence with respect to another. Indels may be a product of errors in DNA sequencing, the result of alignment errors, or true mutations in one sequence with respect to another—such as mutations in the DNA of one patient with respect to the reference genome. In the context of NGS, indels are detected in sequence reads after alignment to a reference genome. Indels are called in a sample (i.e., a patient’s genome) after variant detection has established a high probability that the indel is present in multiple reads with adequate coverage and quality, and not the result of errors in sequencing or alignment.
Intron : A portion of a gene that is spliced out of the primary transcript of a gene and not included in the final messenger RNA (mRNA). Introns separate exons, which contain the protein-coding portions of a gene.
ktup, k-tuple, or k-mer : A short word composed of DNA symbols (GATC) that is used as an element of an algorithm. A sequence read can be broken down into shorter segments of text (either overlapping or non-overlapping words). The length of the word is called the ktup size. Very fast exact matching methods can be used to find words that are shared by multiple sequence reads or between sequence reads and a reference genome. Word matching methods can use hash tables and other data structures that can be manipulated much more efficiently by computer software than sequence reads represented by long text strings.
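As a sketch of word matching with a hash table, the example below indexes every overlapping k-mer of a reference in a Python dictionary and then looks up the k-mers of a read by exact match. The function names and the toy sequences are illustrative assumptions.

```python
from collections import defaultdict


def kmer_index(sequence, k):
    """Map each overlapping k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def seed_hits(read, index, k):
    """Return (read_position, reference_position) pairs for shared k-mers."""
    hits = []
    for i in range(len(read) - k + 1):
        for reference_position in index.get(read[i:i + k], []):
            hits.append((i, reference_position))
    return hits


index = kmer_index("GATTACAGATT", k=4)
print(index["GATT"])                  # [0, 7]
print(seed_hits("TACAG", index, k=4))  # [(0, 3), (1, 4)]
```

Each lookup is a constant-time dictionary access, which is why word matching scales so much better than comparing long strings directly.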
Mate-pair sequencing : Mate-pair sequencing is similar to paired-end sequencing; however, the DNA fragments used as sequencing templates are much longer (1000–10,000 bp). To accommodate these long template fragments on NGS platforms such as Illumina, additional sample preparation steps are required. Linkers are added to the ends of the long fragments, then the fragments are circularized. The circular molecules are then sheared to generate new DNA fragments at an appropriate size for construction of sequencing libraries (200–300 bp). From this set of sheared fragments, only those fragments containing the added linkers are selected. These selected fragments contain both ends of the original long fragment. New primers are added to both ends, and standard paired-end sequencing is performed. The orientation of the paired sequence reads after mapping to the genome is opposite from a standard paired-end method (outward facing rather than inward facing). Mate-pair methods are particularly valuable for joining contigs in de novo sequencing and for detecting translocations and large deletions (structural variants).
Metagenomics : The study of complete microbial populations in environmental and medical samples. Often conducted as a taxonomic survey using direct PCR (with universal 16S primers) of DNA extracted from environmental samples. Shotgun metagenomics sequences all DNA in these samples, then attempts both taxonomic and functional identification of genes encoded by microbial DNA.
Microarray : A collection of specific oligonucleotide probes organized in a grid pattern of microscopic spots attached to a solid surface, such as a glass slide. The probes contain sequences from known genes. Microarrays are generally used to study gene expression by hybridizing labeled RNA extracted from an experimental sample to the array, and then measuring the intensity of signal in each spot. Microarrays can also be used for genotyping by creating an array of probes that match alternate alleles of specific sequence variants.
Multiple alignment : A computational method that lines up, as a set of rows of text, three or more sequences (of DNA, RNA, or proteins) to maximize the identity of overlapping positions while minimizing mismatches and gaps. The resulting set of aligned sequences is also known as a multiple alignment. Multiple alignments may be used to study evolutionary information about the conservation of bases at specific positions in the same gene across different organisms or about the conservation of regulatory motifs across a set of genes. In NGS, multiple alignment methods are used to reduce a set of overlapping reads that have been mapped to a region of a reference genome by pairwise alignment, to a single consensus sequence; and also to aid in the de novo assembly of novel genomes from sets of overlapping reads created by fragment assembly methods.
Next-generation (DNA) sequencing (NGS) : DNA sequencing technologies that simultaneously determine the sequence of DNA bases from many thousands (or millions) of DNA templates in a single biochemical reaction volume. Each template molecule is affixed to a solid surface in a spatially separate location, and then amplified to increase signal strength. The sequences of all templates are determined in parallel by the addition of complementary nucleotide bases to a sequencing primer coupled with signal detection from this event.
Paired-end sequencing : A technology that obtains sequence reads from both ends of a DNA fragment template. The use of paired-end sequencing can greatly improve de novo sequencing applications by allowing contigs to be joined when they contain read pairs from a single template fragment, even if no reads overlap. Paired-end sequencing can also improve the mapping of reads to a reference genome in regions of repetitive DNA (and detection of sequence variants in those locations). If one read contains repetitive sequence, but the other maps to a unique genome position, then both reads can be mapped.
Paired-end read : See Paired-end sequencing.
Phred score : The Phred software was developed by Phil Green and coworkers working on the Human Genome Project to improve the accuracy of base calling on ABI sequencers (using fluorescent Sanger chemistry). Phred assigns a quality score to each base, which encodes the estimated probability of error for that base. The Phred score is -10 times the log (base 10) of the error probability; thus a base with an accuracy of 99% (an error probability of 0.01) receives a Phred score of 20. Phred scores have been adopted as the measure of sequence quality by all NGS manufacturers, although the estimation of error probability is done in many different ways (in some cases with questionable validity).
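The relationship can be sketched with the conventional formula Q = -10 × log10(P), where P is the error probability. This reproduces the example of 99% accuracy giving a score of 20. The function names are illustrative.

```python
import math


def phred_from_error(error_probability):
    """Convert an error probability into a Phred quality score."""
    return -10 * math.log10(error_probability)


def error_from_phred(quality):
    """Convert a Phred quality score back into an error probability."""
    return 10 ** (-quality / 10)


print(phred_from_error(0.01))  # 99% accuracy corresponds to a score of 20
print(error_from_phred(30))    # a score of 30 is about 1 error in 1000
```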
Poisson distribution : A random probability distribution in which the mean is equal to the variance. This distribution describes rare events that occur with equal probability across an interval of time or space. In NGS, sequence reads obtained from sheared genomic DNA are often assumed to be Poisson-distributed across the genome.
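Under the Poisson assumption, the chance that a base is covered by exactly k reads depends only on the mean coverage. The sketch below uses the standard formula P(k) = e^(-λ) λ^k / k!, with λ taken to be the mean coverage. The function names are hypothetical, and real coverage is usually more variable than this model predicts.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k events when the mean is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def fraction_uncovered(mean_coverage):
    """Expected fraction of bases with zero reads at a given mean coverage."""
    return poisson_pmf(0, mean_coverage)


for coverage in (1, 5, 10):
    print(coverage, fraction_uncovered(coverage))
```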
Pyrosequencing : A method of DNA sequencing developed in 1996 by Nyrén and colleagues that directly detects the addition of each nucleotide base as a template is copied. The method detects light emitted by a chemiluminescent reaction driven by the pyrophosphate that is released as the nucleotide triphosphate is covalently linked to the growing copy strand. Each type of base is added in a separate reaction mix, but terminators are not used; thus a series of identical bases (a homopolymer) creates multiple covalent linkages and a brighter light emission. This chemistry is used in the Roche 454 sequencing machines.
Reference genome : A curated consensus sequence for all of the DNA in the genome (all of the chromosomes) of a species of organism. Because the reference genome is created as the synthesis of a variety of different data sources, it may occasionally be updated; thus a particular instance of that reference is referred to by a version number.
Reference sequence : The formally recognized, official sequence of a known genome, gene, or artificial DNA construct. A reference sequence is usually stored in a public database and may be referred to by an accession number or other shortcut designation, such as human genome hg19. An experimentally determined sequence produced by an NGS machine may be aligned and compared to a reference sequence (if one exists) in order to assess accuracy and to find mutations.
Repetitive DNA : DNA sequences that are found as identical or nearly identical copies many times in the genome of an organism. Some repetitive DNA elements are found in genomic features such as centromeres and telomeres with important biological properties. Other repetitive elements such as transposons are similar to viruses that copy themselves into many locations on the genome. Simple sequence repeats are another type of repetitive element composed of linear repeats of 1-, 2-, or 3-base patterns such as CAGcagCAGcag… A short sequence read that contains only repetitive sequence may align to many different genomic locations, which creates problems with de novo assembly, mapping of sequence fragments to a reference genome, and many related applications.
Ribosomal DNA (rDNA) : Genes that code for ribosomal RNA (rRNA) are present in multiple copies in the genomes of all eukaryotes. In most eukaryotes, the rDNA genes are present in identical tandem repeats that contain the coding sequences for the 18S, 5.8S, and 28S rRNA genes. In humans, a total of 300–400 rDNA repeats are located in regions on chromosomes 13, 14, 15, 21, and 22. These regions form the nucleolus. Additional tandem repeats of the coding sequence for the 5S rRNA are located separately. rRNA is a structural component of ribosomes and is not translated into protein. The rRNA genes are highly transcribed, contributing >80% of the total RNA found in cells. RNA sequencing methods generally include purification steps to remove rRNA or to enrich mRNA from protein-coding genes.
Ribosomal RNA (rRNA) : See Ribosomal DNA.
RNA-seq : The sequencing of cellular RNA, usually as a method to measure gene expression, but also used to detect sequence variants in transcribed genes, alternative splicing, gene fusions, and allele-specific expression. For novel genomes, RNA-seq can be used as experimental evidence to identify expressed regions (coding sequences) and map exons onto contigs and scaffolds.
Roche 454 Genome Sequencer : DNA sequencers developed in 2004 by 454 Life Sciences (subsequently purchased by Roche) were the first commercially available machines that used massively parallel sequencing of many templates at once. These “next-generation sequencing” (NGS) machines increased the output (and reduced the cost) of DNA sequencing by at least three orders of magnitude over sequencing methods that used Sanger chemistry, but produced shorter sequence reads. 454 machines use beads to isolate individual template molecules and an emulsion PCR system to amplify these templates in situ, then perform the sequencing reactions in a flow cell that contains millions of tiny wells that each fits exactly one bead. 454 uses pyrosequencing chemistry, which has very few base-substitution errors, but a tendency to produce insertion/deletion errors in stretches of homopolymer DNA.
SAM/BAM : See BAM file.
Sanger sequencing method : The dideoxy chain-termination method published by Frederick Sanger and colleagues in 1977 to determine the nucleotide sequence of cloned, purified DNA fragments. The method requires that DNA be denatured into single strands, then a short oligonucleotide sequencing primer is annealed to one strand, and DNA polymerase enzyme extends the primer, adding new complementary deoxynucleotides one at a time, creating a copy of the strand. A small amount of a dideoxynucleotide is included in the reaction, which causes the polymerase to terminate, creating truncated copies. In a reaction with a single type of dideoxynucleotide, all fragments of a specific size will end with the same base. Four separate reactions, each containing a single dideoxynucleotide (ddG, ddA, ddT, or ddC), must be conducted, and then all four reactions are run on four adjacent lanes of a polyacrylamide gel. The actual sequence is determined from the lengths of the fragments, which correspond to the positions where a dideoxynucleotide was incorporated.
Sequence alignment : An algorithmic approach to find the best matching of consecutive letters in one sequence (text symbols that represent the polymer subunits of DNA or protein sequences) with another. Generally sequence alignment methods balance gaps with mismatches, and the relative scoring of these two features can be adjusted by the user.
Sequence assembly : A computational process of finding overlaps of identical (or nearly identical) strings of letters among a set of sequence fragments and iteratively joining them together to form longer sequences.
Sequence fragment : A short string of text that represents a portion of a DNA (or RNA) sequence. NGS machines produce short reads that are sequence fragments that are read from DNA fragments.
Sequence variants : Differences at specific positions between two aligned sequences. Variants include single-nucleotide polymorphisms (SNPs), insertions and deletions, copy number variants, and structural rearrangements. In NGS, variants are found after alignment of sequence reads to a reference genome. A variant may be observed as a single mismatched base in a single sequence read, or it may be confirmed by variant detection software from multiple sources of data.
Sequencing by synthesis : This is the term used by Illumina to describe the chemistry used in its NGS machines (Illumina Genome Analyzer, HiSeq, MiSeq). The biochemistry involves a single-stranded template molecule, a sequencing primer, and DNA polymerase, which adds nucleotides one by one to a DNA strand complementary to the template. Nucleotides are added to the templates in separate reaction mixes for each type of base (GATC), and each synthesis reaction is accompanied by the emission of light, which is detected by a camera. Each nucleotide is modified with a reversible terminator, so that only one nucleotide can be added to each template. After a cycle of four reactions adding just one G, A, T, or C base to each template, the terminators are removed so that another base can be added to all templates. This cycle of synthesis with each of the four bases and removal of terminators is repeated to achieve the desired read length.
Sequencing primer : A short single-stranded oligonucleotide that is complementary to the beginning of a fragment of DNA that will be sequenced (the template). During sequencing, the primer anneals to the template DNA, then DNA polymerase enzyme adds additional nucleotides that extend the primer, forming a new strand of DNA complementary to the template molecule. DNA polymerase cannot synthesize new DNA without a primer. In traditional Sanger sequencing, the sequencing primer is complementary to the plasmid vector used for cloning; in NGS, the primer is complementary to a linker that is ligated to the ends of template DNA fragments.
Sequence read, short read : When DNA sequence is obtained by any experimental method, including both Sanger and next-generation methods, the data are obtained from individual template molecules as a string of nucleotide bases (represented by the letter symbols G, A, T, C). This string of letters is called a sequence read. The length of a sequence read is determined by the technology. Sanger reads are typically 500–800 bases long, Roche 454 reads 200–400 bases, and Illumina reads may be 25–200 bases (depending on the model of machine, reagent kit, and other variables). Sequence reads produced by NGS machines are often called short reads.
SFF file : Standard Flowgram Format is a file type developed by Roche 454 for the sequencing data produced by their NGS machine. The SFF file contains both sequence and quality information about each base. The format was initially proprietary, but has been standardized and made public in collaboration with the international sequence databases. SFF is a binary format and requires custom software to read it or convert it to human-readable text formats.
Shotgun sequencing : A strategy for sequencing novel or unknown DNA. Many copies of the target DNA are sheared into random fragments, then primers are added to the ends of these fragments to create a sequencing library. The library is sequenced by high-throughput methods to generate a large number of DNA sequence reads that are randomly sampled from the original target. The target DNA is reconstructed using a sequence assembly algorithm that finds overlaps between the sequence reads. This method may be applied to small sequences such as cosmid and BAC clones, or to entire genomes.
Smith–Waterman alignment : A rigorous optimal local alignment method for two sequences based on dynamic programming. This method always finds the optimal alignment between two sequences, but it is slow and computationally demanding because it fills a scoring matrix that accounts for all possible alignments with all possible gaps and mismatches. The size of this matrix grows with the product of the lengths of the two sequences (quadratically for sequences of similar length), so aligning genome-sized sequences requires huge amounts of memory and CPU time.
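The dynamic programming idea behind Smith–Waterman can be illustrated with a minimal Python sketch. The scoring values (match = 2, mismatch = -1, linear gap = -2) and the function name are assumptions chosen for illustration; production aligners typically use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions, not a standard.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + sub,  # match or mismatch
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # (6, 'ACG', 'ACG')
```

The matrix `H` has (len(a) + 1) × (len(b) + 1) cells, which is why memory and run time become prohibitive for genome-sized inputs.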
SOLiD sequencing : The Applied Biosystems division of Life Technologies Inc. purchased the SOLiD (Supported Oligo Ligation Detection) technology from the biotech company Agencourt Personal Genomics and released the first commercial version of this NGS machine in 2007. The technology is fundamentally different from any other Sanger or NGS method in that it uses ligation of short fluorescently labeled oligonucleotides to a sequencing primer rather than DNA polymerase to copy a DNA template. Sequences are detected 2 bases at a time, and then base calls are made based on two overlapping oligos. Raw data files use a “color space” system that is different from the base calls produced by all other sequencing systems and requires different informatics software. This system has some interesting built-in error correction algorithms but has failed to show superior overall accuracy in the hands of customers. The yield of the system is similar to that of Illumina NGS machines.
Variant detection : NGS is frequently used to identify mutations in DNA samples from individual patients or experimental organisms. Sequencing can be done at the whole-genome scale; RNA-seq, which targets expressed genes; exome capture, which targets specific exon regions captured by hybridization to probes of known sequence; or amplicons for genes or regions of interest. In all cases, sequence variants are detected by alignment of NGS reads to a reference sequence and then identification of differences between the reads and the reference. Variant detection algorithms must distinguish between random sequencing errors, differences caused by incorrect alignment, and true variants in the genome of the target organism. Various combinations of base quality scores, alignment quality scores, depth of coverage, variant allele frequency, and the presence of nearby sequence variants and indels are all used to differentiate true variants from false positives. Recent algorithms have also made use of machine learning methods based on training sets of genotype data or large sets of samples from different patients/organisms that are sequenced in parallel with the same sample preparation methods on the same NGS machines.
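As a rough illustration of how base quality, depth of coverage, and variant allele frequency can be combined, the following Python sketch evaluates a single reference position. The function name, the input layout (a list of base and quality pairs), and all thresholds are hypothetical assumptions; real variant callers use far more elaborate statistical models.

```python
from collections import Counter


def call_variant(ref_base, pileup, min_depth=10, min_base_qual=20, min_vaf=0.2):
    """Naive single-position variant check (illustrative thresholds only).

    pileup: list of (base, phred_quality) tuples observed at one position
            in reads aligned to the reference.
    Returns a dict describing the candidate variant, or None.
    """
    # Discard low-quality base calls, which are likely sequencing errors.
    usable = [base for base, qual in pileup if qual >= min_base_qual]
    depth = len(usable)
    if depth < min_depth:
        return None  # not enough coverage to make a call

    # Count non-reference bases and pick the most frequent one.
    alt_counts = Counter(base for base in usable if base != ref_base)
    if not alt_counts:
        return None
    alt, alt_count = alt_counts.most_common(1)[0]

    # Variant allele frequency: fraction of usable reads showing the alt base.
    vaf = alt_count / depth
    if vaf < min_vaf:
        return None  # too rare to distinguish from random error
    return {"alt": alt, "depth": depth, "vaf": vaf}


pileup = [("A", 30)] * 12 + [("G", 30)] * 8 + [("T", 5)] * 3
print(call_variant("A", pileup))  # {'alt': 'G', 'depth': 20, 'vaf': 0.4}
```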
SMRT® Cell: Consumable substrates comprising arrays of zero-mode waveguide nanostructures.
Adapters: Exogenous nucleic acids that are ligated to a nucleic acid molecule to be sequenced. For example, SMRTbell™ adapters are hairpin loops that are ligated to both ends of the double stranded DNA insert to produce a SMRTbell™ sequencing template. When adapter sequences are removed from a CCS read, the read is split into multiple subreads.
Movie: Real-time observation of a SMRT® Cell.
zero-mode waveguide (ZMW): A nanophotonic device for confining light to a small observation volume. This can be, for example, a small hole in a conductive layer whose diameter is too small to permit the propagation of light in the wavelength range used for detection. Physically part of a SMRT® Cell.
Sequencing ZMW: A ZMW (zero-mode waveguide) that is expected to be able to produce a sequence if it is populated with a polymerase. ZMWs used for automated SMRT Cell alignment are not considered sequencing ZMWs.
The wells and SMRT Cells to include in the sequencing run.
The collection and analysis protocols to use for the selected wells and cells.
polymerase read (formerly called “read”): A sequence of nucleotides incorporated by the DNA polymerase while reading a template, such as a circular SMRTbell™ template. Polymerase reads are most useful for quality control of the instrument run. Polymerase read metrics primarily reflect movie length and other run parameters rather than insert size distribution. Polymerase reads are trimmed to include only the high quality region; they include sequences from adapters; and can further include sequence from multiple passes around a circular template.
subread: Each polymerase read is partitioned to form one or more subreads, which contain sequence from a single pass of a polymerase on a single strand of an insert within a SMRTbell™ template and no adapter sequences. The subreads contain the full set of quality values and kinetic measurements. Subreads are useful for applications like de novo assembly, resequencing, base modification analysis, and so on.
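The partitioning of a polymerase read into subreads can be sketched in Python as follows. This is a simplified illustration, not the vendor's implementation: it assumes the adapter locations are already known as half-open (start, end) index pairs, and the function name is hypothetical. It also ignores the fact that successive passes around a SMRTbell template read alternating strands.

```python
def split_into_subreads(polymerase_read, adapter_regions):
    """Return the stretches of sequence that lie between adapter regions.

    adapter_regions: iterable of (start, end) half-open index pairs marking
                     where adapter sequence was annotated in the read.
    """
    subreads = []
    cursor = 0  # start of the current non-adapter stretch
    for start, end in sorted(adapter_regions):
        if start > cursor:
            subreads.append(polymerase_read[cursor:start])
        cursor = max(cursor, end)  # skip over the adapter
    if cursor < len(polymerase_read):
        subreads.append(polymerase_read[cursor:])  # trailing partial pass
    return subreads


# "N" stands in for adapter bases in this toy read.
read = "AAAA" + "NNN" + "CCCC" + "NNN" + "GG"
print(split_into_subreads(read, [(4, 7), (11, 14)]))  # ['AAAA', 'CCCC', 'GG']
```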
circular consensus (CCS) read: The consensus sequence determined using subreads taken from a single ZMW. This is not aligned against a reference sequence. In contrast to Reads of Insert, CCS reads require at least two full-pass subreads from the insert.
read of insert: Represents the highest quality single sequence for an insert, regardless of the number of passes. For example, if your template received one-and-a-half subreads, that information will be combined into a Read of Insert. CCS is an example of a special case where at least two full subreads are collected for an insert. Reads of Insert give the most accurate estimate of the length of the insert sequence loaded onto a SMRT® Cell. For long templates, Reads of Insert may be the same as Polymerase Reads.
Read Length Terminology
mapped polymerase read length: The total number of bases along a read from the first adapter or aligned subread to the last adapter or aligned subread. Approximates the sequence produced by a polymerase in a ZMW.
mapped subread length: The length of the subread alignment to a target reference sequence. This does not include the adapter sequence.
polymerase read length: The total number of bases produced from a ZMW after trimming. This may include the adapter sequence.
Primary Analysis Terminology
Primary analysis protocol: Specifies signal processing of the movie, base calling of the traces/pulses, and quality assessment of the base calls. Primary analysis is always performed on the instrument.
Adapter Screening: Annotates adapter read locations. Used to break a read into subreads during secondary analysis mapping and Circular Consensus.
High Quality Region Screening: Annotates the high quality sequencing regions of a read to be used during Raw Read Trimming.
Insert Screening: Annotates insert DNA regions in the Polymerase Read.
Quality Value Assignment: A prediction of the error probability of a basecall.
Quality Value (QV): The total probability that the basecall is an insertion or substitution or is preceded by a deletion.
QV = -10 * log10(p), where p is the probability that the basecall is in error. For example, QV 20 is 99% accurate, QV 30 is 99.9% accurate, and QV 50 is 99.999% accurate.
Insertion QV: The probability that the basecall is an insertion with respect to the true sequence.
Deletion QV: The probability that a deletion error occurred before the current base.
Substitution QV: The probability that the basecall is a substitution.
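The relationship between a quality value and an error probability can be checked with a few lines of Python. The function names below are chosen for illustration only.

```python
import math


def error_prob_to_qv(p):
    """Convert an error probability p (0 < p <= 1) to a quality value."""
    return -10 * math.log10(p)


def qv_to_error_prob(qv):
    """Convert a quality value back to an error probability."""
    return 10 ** (-qv / 10)


def qv_to_accuracy(qv):
    """Accuracy implied by a quality value (1 minus the error probability)."""
    return 1 - qv_to_error_prob(qv)


for qv in (20, 30, 50):
    print(qv, f"{qv_to_accuracy(qv):.5%}")
```

Each increase of 10 in QV corresponds to a tenfold drop in the error probability.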
raw read trimming: Extraction of high quality regions from an unfiltered read. Trimming of an unfiltered read produces a polymerase read.
Read Quality Assignment: A trained prediction of a read’s mapped accuracy based on its pulse and base file characteristics (peak signal-to-noise ratio, average base QV, interpulse distance, and so on). This is used during secondary analysis filtering.
Secondary Analysis Terminology
Secondary analysis protocol: Specifies how to:
Align a group of reads to a reference sequence to produce a consensus sequence.
Assemble a set of reads into contigs to produce a de novo sequence.
Identify insertions, deletions, and SNPs.
Evaluate consensus quality and quality of the instrument run.
Consensus: Generation of a consensus sequence from multiple-sequence alignment.
De Novo Assembly: Assembly of all subreads without a reference sequence.
Filtering: Removes reads that do not meet the Read Quality and Read Length parameters set by the user. The current default filtering parameters defined by Pacific Biosciences are:
Read Quality ≥ 0.75 (as of SMRT Analysis v2.1)
Read Length ≥ 50 bases
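A minimal Python sketch of this filtering step is shown below, using the two default thresholds listed above. The record layout (a dict with "quality" and "sequence" keys) and the function name are assumptions made for illustration and do not reflect the actual SMRT Analysis data structures.

```python
def filter_reads(reads, min_quality=0.75, min_length=50):
    """Keep reads that meet both the Read Quality and Read Length cutoffs.

    reads: list of dicts with a "quality" float (0 to 1) and a "sequence"
           string; this layout is a hypothetical stand-in for real read data.
    """
    return [
        read
        for read in reads
        if read["quality"] >= min_quality and len(read["sequence"]) >= min_length
    ]


reads = [
    {"name": "r1", "quality": 0.85, "sequence": "A" * 120},  # passes both
    {"name": "r2", "quality": 0.60, "sequence": "A" * 120},  # quality too low
    {"name": "r3", "quality": 0.90, "sequence": "A" * 30},   # too short
]
print([read["name"] for read in filter_reads(reads)])  # ['r1']
```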
Mapping: Local alignment of a read or subread to a reference sequence.
circular consensus accuracy: Accuracy based on multiple sequencing passes around a single circular template molecule.
consensus accuracy: Accuracy based on aligning multiple sequencing reads or subreads together, optionally with a reference sequence.
polymerase read quality: A trained prediction of a read’s mapped accuracy based on its pulse and base file characteristics (peak signal-to-noise ratio, average base QV, inter-pulse distance, and so on).
Subread Accuracy: The post-mapping accuracy of the basecalls.
Formula: [1 – (errors/subread length)], where errors = number of deletions + insertions + substitutions.
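The formula above translates directly into Python. The function name and argument order are chosen for illustration.

```python
def subread_accuracy(deletions, insertions, substitutions, subread_length):
    """Compute 1 - (errors / subread length) for one mapped subread."""
    if subread_length <= 0:
        raise ValueError("subread_length must be positive")
    errors = deletions + insertions + substitutions
    return 1 - errors / subread_length


# A 1,000-base subread with 40 deletions, 80 insertions, and 10 substitutions
# has 130 errors in total.
print(round(subread_accuracy(40, 80, 10, 1000), 3))  # 0.87
```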