Glossary of Genomics and Bioinformatics

Adapted from the GEP Glossary and the MGI Glossary.

3' (3-prime)

A term that identifies one end of a single-stranded nucleic acid molecule. The 3' end is the end of the molecule that terminates in a 3' hydroxyl group. The 3' direction is the direction toward the 3' end. Nucleic acid sequences are written with the 5' end to the left and the 3' end to the right, in reference to the direction of DNA synthesis during replication (from 5' to 3'), RNA synthesis during transcription (from 5' to 3'), and the reading of mRNA sequence (from 5' to 3') during translation.

For more detail, please see the entries on nucleosides and nucleic acids at Wikipedia.

3' UTR

3' Untranslated Region. That portion of an mRNA from the 3' end to the position of the last codon used in translation.

For more detail, please see the entry on mRNA at Wikipedia.

5' UTR

5' Untranslated Region. That portion of an mRNA from the 5' end to the position of the first codon used in translation.

For more detail, please see the entry on mRNA at Wikipedia.

See also:

amino acid
carboxy-terminal
central dogma

amplification

An increase in the number of copies of a specific DNA fragment; can be in vivo or in vitro.

See also: PCR (polymerase chain reaction)

annotation

A note added to a document to provide additional needed information. Gene annotation is the process of indicating the location, structure, and identity of genes in a genome. Because annotation may be based on incomplete information, gene annotations change as knowledge improves. Gene annotation databases are updated regularly, and different databases may refer to the same gene or protein by different names, reflecting a changing understanding of protein function.

antisense strand

Also called the negative, template, or non-coding strand. This strand of the DNA sequence of a single gene is the complement of the 5' to 3' DNA strand known as the sense, positive, non-template, or coding strand. The term loses meaning for longer DNA sequences with genes on both strands.

See also:

forward (plus) strand
reverse (minus) strand

Augustus

A gene prediction program used in molecular biology; Augustus is an ab initio, single-isoform gene finder.

BAC

Bacterial Artificial Chromosome, a cloning vector that can accept a large insert of foreign DNA (up to 300,000 bp) and be propagated as a chromosome in bacteria. In addition to the cloning site, the BAC contains selectable markers and a bacterial origin of replication that maintains one copy of the BAC per cell.

base

One of a set of nitrogenous compounds attached to the sugar-phosphate backbone in a nucleic acid. In DNA, the purine bases are adenine (A) and guanine (G), while the pyrimidine bases are cytosine (C) and thymine (T). In RNA, the purine bases are adenine (A) and guanine (G), while the pyrimidine bases are cytosine (C) and uracil (U). Although formally incorrect (the nitrogenous base that gives each nucleotide its name is only part of the nucleotide), the term "base" is often used as a synonym for nucleotide.

For more detail, please see the entries on nucleosides and nucleic acids at Wikipedia.

base pair (base pairing)

The hydrogen bonding of one of the bases (A, C, G, T, U) with another, as dictated by the optimization of hydrogen bond formation in DNA (A-T and C-G) or in RNA (A-U and C-G). Two polynucleotide strands, or regions thereof, in which all the nucleotides form such base pairs are said to be complementary. Because of this complementarity, each strand of DNA can serve as a template for synthesis of its partner strand.

For more detail, please see the entries on nucleosides and nucleic acids at Wikipedia.

BLAST

Basic Local Alignment Search Tool, an algorithm used to detect local similarity between biological sequences (either nucleic acid or protein), typically used to search a large database of sequences with a single query sequence. For one implementation, see BLAST at NCBI.

BLAST2

BLAST version in which two sequences can be compared to each other. The sequences can either be nucleic acid or protein. Go to the BLAST web page and select "Align two or more sequences."

BLASTN

BLAST version in which the query and subject are both nucleotide sequences. Typically used to search a nucleotide database with a nucleotide sequence. Go to the BLAST web page and select "blastn."

BLASTP

BLAST version in which both the query and subject are amino acid sequences. Typically used to search a protein database with a protein sequence. Go to the BLAST web page and select "blastp."

BLASTX

BLAST version in which the query is a nucleotide sequence that is translated in all six reading frames, with each translation compared to a subject amino acid sequence. Typically used to search a protein database with a nucleotide sequence. Go to the BLAST web page and select "blastx."
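
As a rough illustration of what "six reading frames" means, the sketch below splits a nucleotide sequence into codons for three offsets on the given strand and three offsets on its reverse complement. The function names and the frame labels (+1 to -3) are hypothetical choices for this example; this is not how BLASTX itself is implemented.

```python
# Illustrative sketch only; names are hypothetical and not part of BLAST.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Split a DNA sequence into codons for each of the six reading frames.

    Frames +1..+3 start at offsets 0..2 of the given strand; frames -1..-3
    start at offsets 0..2 of the reverse complement. Trailing bases that do
    not fill a whole codon are dropped.
    """
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames

frames = six_frames("ATGGCCTAA")
print(frames["+1"])  # ['ATG', 'GCC', 'TAA']
print(frames["-1"])  # ['TTA', 'GGC', 'CAT']
```

Each list of codons would then be translated with a genetic code table before comparison to the protein subject.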

BLOSUM matrix

This is a matrix that gives a numerical value to the substitution of one amino acid for another when performing protein comparisons. It is based on the observed levels of amino acid substitution when comparing closely related but divergent blocks of protein sequences. The BLOSUM number refers to the percentage of identical amino acids found within the blocks used to generate the matrix. For example, BLOSUM62 uses blocks of protein sequences with no more than 62% identity.

For more information, see the page on BLAST substitution matrices at NCBI.

bioinformatics

The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research, particularly in genomics.

C-terminal

Refers to the end of a protein that contains the carboxylic group -COOH, corresponding to the 3' end of the encoding gene. Also called carboxy-terminal.

For more detail, please see the entry on protein at Wikipedia.

See also:

amino acid
amino-terminal
central dogma

canonical site

In molecular genetics/genomics, this typically refers to intron splicing sites. The vast majority of introns have the nucleotides GT (splice donor site) at their 5' ends and AG (splice acceptor site) at their 3' ends. These are the canonical sites. Variants are rare, and are called non-canonical sites.
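
A minimal sketch of the GT...AG rule, assuming the intron is supplied as a DNA string written 5' to 3' on the coding strand. The function name is hypothetical.

```python
def has_canonical_splice_sites(intron):
    """Return True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    # Require at least four bases so the donor and acceptor sites cannot overlap.
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

print(has_canonical_splice_sites("GTAAGTTTTCTTCAG"))  # True
print(has_canonical_splice_sites("GCAAGTTTTCTTCAG"))  # False
```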

carboxy-terminal

Refers to the end of a protein that contains the carboxylic group -COOH, corresponding to the 3' end of the encoding gene. Also called C-terminal.

For more detail, please see the entry on protein at Wikipedia.

See also:

amino acid
amino-terminal
central dogma

central dogma

The principal statement of the molecular basis of inheritance. In its simplest form:

"DNA makes RNA makes protein."

This means that (generally) genetic information is stored in and transmitted as DNA. Genes are expressed by being copied as RNA (transcription), which in eukaryotes is processed into mRNA via splicing and polyadenylation. The information in mRNA is translated into a protein sequence using a genetic code to interpret three-base codons as instructions to add one of twenty amino acids or to stop translation.

For more detail, please see the entry on genetic code at Wikipedia.
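
The flow from DNA to RNA to protein can be sketched in a few lines, assuming the standard genetic code and an input that is already the coding strand of a spliced message (splicing and polyadenylation are not modeled). The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_strand):
    """Return the mRNA that has the same sequence as the coding strand (T -> U)."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Translate an mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # "*" marks a stop codon
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGGCCAAATAA")
print(mrna)             # AUGGCCAAAUAA
print(translate(mrna))  # MAK
```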

cDNA

"Complementary DNA," a double-stranded DNA molecule prepared in vitro by copying an RNA molecule back into DNA using reverse transcriptase. The RNA component of the resulting RNA-DNA hybrid is then destroyed by alkali, and the complementary strand to the remaining DNA strand synthesized by DNA polymerase. The resulting double-stranded DNA can be used for cloning and analysis.

CDS

Coding sequence, the part of the DNA sequence of a gene that is translated into protein.

For more detail, please see the entry on mRNA at Wikipedia.

Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.

chromosome

One molecule of double-stranded DNA, carrying an arrangement of genes interspersed with other sequences. In prokaryotes the chromosome is often a circle of DNA, while in eukaryotes chromosomes are typically linear, extending from one end, a telomere, through the centromere to the other telomere.

cleavage

An enzymatic or chemical breakage of the covalent bond that joins two nucleotides or two amino acids in their respective polymers.

CLUSTALW

A program that produces global sequence alignments, typically with multiple related sequences.

coding exon

In a gene, any exon that contains some part of the CDS; in contrast, an exon that has no part translated into protein is called a "non-coding exon."

Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.

coding strand

In a gene, the DNA strand that has the sequence found in the RNA molecule. Also called the sense, positive, or non-template strand.

See also:

forward (plus) strand
reverse (minus) strand

codon

The sequence of three nucleotides in DNA or RNA that specifies a particular amino acid. For more detail, please see the entry on genetic code at Wikipedia.

complement

The nucleotide sequence of the nucleic acid strand that would form a double-stranded molecule with the nucleic acid strand in question, using standard base-pairing rules.
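
A short sketch of the standard DNA base-pairing rules (A-T and C-G) applied to a sequence. Because sequences are conventionally written 5' to 3', the complementary strand is usually reported as the reverse complement; both forms are shown. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq):
    """Complement each base, keeping the original left-to-right order."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complementary strand written 5' to 3'."""
    return complement(seq)[::-1]

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```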

Consed

A program used to view and edit the DNA sequence data assembled by Phred/Phrap or other base-calling and assembly programs.

consensus

When comparing multiple sequences, whether by Phred/Phrap during DNA sequence assembly or by algorithms such as CLUSTALW for protein sequence comparisons, the sequence that reflects the most commonly seen base (or amino acid) at each position.
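
A minimal sketch of taking the most common symbol in each column, assuming the sequences are already aligned to the same length. The function name and the tie-breaking rule (first symbol seen in the column wins) are assumptions made for this example; real assemblers also weigh base quality.

```python
from collections import Counter

def consensus(aligned):
    """Return the most common symbol in each column of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*aligned) walks the alignment column by column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

reads = ["ACGTA", "ACGTT", "ACCTA"]
print(consensus(reads))  # ACGTA
```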

conserved synteny

The occurrence of synteny of orthologous genes in two different organisms.

contig

A contiguous assembly, without gaps, of overlapping sequences based on regions of shared sequence.

coordinates

Numerical position within a biological sequence, e.g. the first base in a DNA sequence would have the coordinate 1.
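
Because the first base has coordinate 1, sequence coordinates need a small shift when used with 0-based array indexing. The sketch below assumes that a coordinate range includes both its start and end positions, which is a common convention but is not stated above; the function name is hypothetical.

```python
def subsequence(seq, start, end):
    """Return bases start..end, where both coordinates are 1-based and inclusive."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates out of range")
    # Python slices are 0-based and end-exclusive, so only the start shifts by one.
    return seq[start - 1:end]

seq = "ATGGCCTAA"
print(subsequence(seq, 1, 3))  # ATG
print(subsequence(seq, 7, 9))  # TAA
```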

cosmid

A type of hybrid plasmid (often used as a cloning vector) that contains cos sequences, DNA sequences originally from the Lambda phage. Cosmids are maintained in E. coli as multi-copy episomes, and are typically used to build genomic libraries.

See also:

central dogma
nucleic acid
RNA

DNA affinity chromatography

A procedure in which a protein is separated from a mixture of other proteins by its ability to bind specifically to a particular sequence of DNA that has been immobilized on a matrix in a separation column.

DNA binding domain

The region of a protein that can specifically bind to DNA. Several motifs that can bind to DNA have been characterized (e.g. helix-loop-helix, leucine zipper, and zinc-finger domains). In most cases, the structure of the domain has evolved so that a portion of it interacts with a specific sequence of DNA by binding in the major groove of the DNA molecule.

DNase hypersensitive site

A region along the chromatin fiber that is nucleosome-free, and hence much more accessible to cleavage by DNase I or other nucleases. Such sites are usually found at the promoters and enhancers of active or inducible genes.

DNA ligase

An enzyme that joins pieces of DNA by catalyzing the formation of a covalent phosphodiester bond between the 5' phosphate end of one nucleotide and the 3'-OH group of the adjacent nucleotide.

DNA polymerase

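A group of enzymes capable of extending a strand of DNA by adding successive deoxyribonucleotides at the 3' end in the order dictated by an associated template strand. The basic biochemical reaction catalyzed is:

(dNMP)_n + dNTP -> (dNMP)_(n+1) + PPi

where (dNMP)_n is a DNA strand of n nucleotides, dNTP is the incoming deoxyribonucleoside triphosphate, and PPi is pyrophosphate.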

DNase I footprinting

A means of determining the exact binding site of a protein on a specific DNA fragment. When a protein-DNA complex is digested with DNase I, the bound protein protects the DNA from cleavage at the site of the protein-DNA complex. A comparison of the cleavage fragments from DNA with and without the protein bound to it indicates the region of protein binding by the absence of cleavage.

donor site

Splicing donor site. The boundary between an intron and the exon immediately upstream (i.e. on the 5' side of the intron).

For more information, please see the entry on mRNA splicing at Wikipedia.

dot chromosome

The fourth chromosome of Drosophila melanogaster, which is composed principally of heterochromatin; more generally, any very small chromosome that appears as a "dot" during mitosis.

Dot Matrix View

The comparison of two sequences on an X-Y plot where points ("dots") on the graph ("matrix") indicate sequence identity at the corresponding positions. A continuous line with slope 1 indicates high levels of sequence conservation in that region and provides confidence in the proposed gene model.
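
A text-only sketch of a dot matrix, assuming exact single-position matching with no window or threshold (real dot plot tools usually filter noise with a sliding window). The function name is hypothetical. Rows are printed top to bottom, so a conserved region appears as a diagonal running from upper left to lower right.

```python
def dot_matrix(seq_x, seq_y):
    """Return rows of a dot plot: '*' where seq_y[row] == seq_x[col], '.' otherwise."""
    return [
        "".join("*" if x == y else "." for x in seq_x)
        for y in seq_y
    ]

rows = dot_matrix("GATTACA", "GATTACA")
for row in rows:
    print(row)
```

Comparing a sequence to itself always gives an unbroken main diagonal; the scattered off-diagonal marks are chance matches between identical bases at different positions.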

downstream

Toward the 3' end of a single-stranded DNA molecule or gene of interest. "Upstream" similarly refers to something closer to the 5' end.

duplication

The creation of a second copy of a sequence in a genome. A duplicate copy of a gene may be mutated without affecting the viability of the organism, so gene duplication is thought to be a significant factor in the evolution of genomic diversity.

E value

Expected value, a numerical indication of the statistical significance of an alignment. It describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The lower the E value, the more significant the alignment. For small E values, the E value is nearly identical to the probability of seeing an alignment of this quality purely by chance. See the BLAST tutorial at NCBI.
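
The statement that small E values approximate a probability can be checked numerically. The sketch below assumes that chance hits follow a Poisson distribution, in which case the probability of at least one chance hit is P = 1 - exp(-E). The function name is hypothetical.

```python
import math

def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed chance hits."""
    # -expm1(-E) equals 1 - exp(-E) and stays accurate for very small E.
    return -math.expm1(-e_value)

for e in (10.0, 1.0, 0.01, 1e-5):
    print(e, p_value_from_e_value(e))
```

For E = 0.01 the probability is about 0.00995, nearly equal to E, while for E = 10 the probability is close to 1 and the two numbers diverge.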

electrophoresis

A procedure to separate molecules in an electric field. DNA migration takes place through a gel matrix (agarose or polyacrylamide) that acts as a molecular sieve. Since DNA is negatively charged, when placed in an electric field, it is attracted toward the positive electrode. Because the negative charges are uniformly distributed along the DNA, molecules separate in the field based on their sizes.

end labeling

The addition of a radioactively labeled group to one end (5' or 3') of a DNA strand.

endonuclease

An enzyme that cleaves phosphodiester bonds within a nucleic acid chain. A particular endonuclease may be specific for RNA or for single-stranded or double-stranded DNA. Restriction enzymes are endonucleases that cut double-stranded DNA at or near a specific target sequence, such as CGCG.

enhancer

A eukaryotic DNA sequence located outside of the promoter region, where an activator of transcription (a protein) may bind.

ENSEMBL

Joint genome browser maintained by the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Contains searchable genomic information for select model organisms.

environmental containment

Process for ensuring that an organism (such as a host cell) cannot survive long in the environment, based on its requirements for survival and growth, to prevent its uncontrolled proliferation. It is used as a safety precaution when growing recombinant DNA clones with potentially hazardous properties (for example, when cloning a gene encoding a neurotoxin).

enzyme

Enzymes are biological catalysts, most of which are proteins. Enzymes were once thought to be exclusively proteins, but catalytic RNAs called ribozymes are now known.

EST

Expressed sequence tag: a short DNA sequence derived from a single read of a clone from a cDNA library. ESTs are therefore by definition from genes that are expressed in that cell type under the growth conditions used. ESTs typically have high levels of sequence error because they are from single reads, and not the consensus of multiple reads.

ethidium bromide

An intercalating chemical that is used to stain DNA; it is fluorescent under ultraviolet light. Ethidium bromide is a mutagen; use gloves and other appropriate protection when handling this material.

euchromatin

Those regions of the genome that do not remain condensed throughout the cell cycle. Euchromatic regions are typically enriched for genes, and show significantly higher levels of recombination and lower levels of repeats than heterochromatic regions.

See also:

amino acid
carboxy-terminal
central dogma

NCBI

National Center for Biotechnology Information. Please see the NCBI website.

negative strand

Also called the antisense, template, or non-coding strand. This strand of the DNA sequence of a single gene is the complement of the 5' to 3' DNA strand known as the sense, positive, non-template, or coding strand. The term loses meaning for longer DNA sequences with genes on both strands.

See also:

forward (plus) strand
reverse (minus) strand

non-canonical

In molecular genetics/genomics, this typically refers to intron splicing site sequences that are only very rarely used and are never considered by gene prediction algorithms.

See also:

forward (plus) strand
reverse (minus) strand

non-consensus

A base or sequence that does not match the most common element found at a given position.

See also:

forward (plus) strand
reverse (minus) strand

nucleic acid

DNA or RNA. Each of these compounds consists of a backbone of sugar molecules (ribose for RNA and deoxyribose for DNA) linked by single phosphate groups. Attached to the sugars of the backbone are any of four nitrogenous bases: A, T, C, or G for DNA and A, U, C, or G for RNA.

For more detail, please see the entry on nucleic acids at Wikipedia.

nucleoside

A small molecule made up of either a purine or pyrimidine base linked to a pentose (sugar), generally either ribose or deoxyribose.

For more detail, please see the entries on nucleosides and nucleic acids at Wikipedia.

nucleotide

A nucleoside linked to one or more phosphate groups via an ester bond with the pentose. DNA and RNA are polymers of nucleotides, linked through the 5' and 3' carbons of the sugar.

For more detail, please see the entries on nucleosides and nucleic acids at Wikipedia.

ORF

Open Reading Frame, a long stretch of codons in the same reading frame uninterrupted by stop codons; an ORF may reflect the presence of a gene.
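
A minimal sketch of scanning for stop-free codon runs, assuming the standard stop codons (TAA, TAG, TGA), the three frames of the given strand only, and 1-based inclusive coordinates. The function name and the `min_codons` length threshold are hypothetical; gene finders typically also require a start codon and search the reverse strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (frame, start, end) for each stop-free codon run of at least min_codons.

    Frame is 1, 2, or 3; start and end are 1-based inclusive coordinates and
    exclude the stop codon that terminates the run.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        # Index just past the last complete codon in this frame.
        last = frame + 3 * ((len(seq) - frame) // 3)
        for i in range(frame, last + 1, 3):
            at_end = i == last
            # A stop codon, or the end of the frame, closes the current run.
            if at_end or seq[i:i + 3] in STOP_CODONS:
                n_codons = (i - run_start) // 3
                if n_codons >= min_codons:
                    orfs.append((frame + 1, run_start + 1, i))
                run_start = i + 3
    return orfs

print(find_orfs("ATGAAATGACCC"))  # [(1, 1, 6), (2, 5, 10), (3, 3, 11)]
```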

orthologous genes

Genes in different organisms that are direct evolutionary counterparts; that is, they descend from a single gene in the last common ancestor of those organisms and diverged through speciation. Orthologous genes normally have the same cellular function.