Click on a letter above to jump to that section of the Glossary.
- 3' (3-prime)
- A term that identifies one end of a single-stranded
nucleic acid molecule. The 3' end is
that end of the molecule which terminates in a 3' phosphate group. The 3' direction is the
direction toward the 3' end. Nucleic acid sequences are written with the
5' end to the left
and the 3' end to the right, in reference to the direction of
DNA synthesis during
replication
(from 5' to 3'),
RNA synthesis during
transcription (from 5' to 3'), and the reading of
mRNA sequence (from 5' to 3') during
translation.
For more detail, please see the entries on
nucleosides and
nucleic acids at Wikipedia.
See also:
- 3' UTR
- 3' Untranslated Region. That portion of an
mRNA from the
3' end to the position of the last
codon used in
translation.
For more detail, please see the entry on
mRNA at Wikipedia.
See also:
5' UTR
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- 454 sequencing
- A large-scale parallel
pyrosequencing system capable of sequencing roughly 400 - 600
megabases of
DNA per 10-hour run. The technology is known for its relatively unbiased sample
preparation and moderately long, highly accurate sequence reads (~400 base pairs in length).
Please see the
video.
- 5' (5-prime)
- A term that identifies one end of a single-stranded
nucleic acid molecule. The 5' end is that end of the molecule
which terminates in a 5' phosphate group. The 5' direction is the direction toward the 5' end.
Nucleic acid sequences are written with the 5' end to the left and the
3' end to the right, in reference to the direction of
DNA synthesis during
replication (from 5' to 3'),
RNA synthesis during
transcription (from 5' to 3'), and the reading of
mRNA sequence (from 5' to 3') during
translation.
For more detail, please see the entries on
nucleosides and
nucleic acids at Wikipedia.
See also:
- 5' UTR
- 5' Untranslated Region. That portion of an
mRNA from the
5' end to the position of the first
codon used in
translation.
For more detail, please see the entry on
mRNA at Wikipedia.
See also:
3' UTR
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- ab initio
- Formulated without experimental data. Latin. From the beginning. In computing, ab initio is a term used to define computations based soley on theory or
using only fundamental constants. In computational biology, the term refers to
algorithms
that use only sequence information rather than including experimental observations to make
predictions about
gene structure.
- acceptor site
- Splicing acceptor site. The boundary between an
intron and the
exon immediately downstream (i.e. on the
3' side of the intron).
For more information, please see the entry on
mRNA splicing at Wikipedia.
- accession ID
- A unique alphanumeric character string that is used to unambiguously identify a particular record in a
database.
DNA,
RNA, and
protein sequences submitted
to NCBI or equivalent databases all have unique accession IDs. For example, the human leptin receptor's accession number
is P48357 in the SwissProt database.
- algorithm
- A detailed sequence of actions to perform to accomplish some task. Technically, an algorithm
must reach a result after a finite number of steps, thus ruling out brute force search methods
for certain problems, though some might claim that brute force search was also a valid (generic)
algorithm. The term is also used loosely for any sequence of actions (which may or may not terminate).
- alignment
- In bioinformatics, a sequence alignment is a way of arranging two or more sequences of
DNA,
RNA, or
protein to identify regions of similarity; such similarity may be a consequence
of functional, structural, or evolutionary relationships between the sequences.
- alignment score
- An alignment score is a numerical value used in computational biology to quantify the
level of similarity between two aligned sequences. Generally the higher the score, the
more similar the two sequences.
- allele
- One of the variant forms of a
gene, differing from other forms in its
nucleotide sequence.
- alpha-satellite sequence
- A family of tandemly repeated
DNA sequences present in the
chromosome of many higher
eukaryotes, believed to help maintain the structure and function of the centromere.
- alternative splicing
- The production of two or more distinct
mRNAs from
RNA transcripts having the same sequence
via differences in
splicing (by the choice of different
exons).
- ALU sequences
- A set of dispersed, highly abundant related sequences, each about 300 bp long, in primate
genomes; each Alu element is a
retrotransposon. The individual members have a characteristic
AluI restriction enzyme cleavage site.
- amino acid
- A molecule of the general formula NH2-CHR-COOH, where "R" is one of a number
of different side chains, including acidic, basic, or hydrophobic groups. Amino acids are the building blocks of
proteins. The sixty-four
codons of the
genetic code allow the use of twenty different amino acids (the primary amino acids) in the synthesis of
proteins. Other nonprimary amino acids occur in proteins by
enzymatic modification of amino acids in mature proteins, and as metabolic intermediates.
For more detail, please see the entry on
proteinogenic amino acids at Wikipedia.
- amino-terminal
- Refers to the end of a
protein that contains the amino group -NH2, corresponding to the
5' end of the encoding
gene. Also called
N-terminal.
For more detail, please see the entry on
protein at Wikipedia.
See also:
- amplification
- An increase in the number of copies of a specific
DNA fragment; can be
in vivo or
in vitro.
See also:
PCR (polymerase chain reaction)
- annotation
- Note added to a document to provide additional needed information.
Gene annotation is the process of indicating the location, structure, and identity of
genes in a genome. As this may be based on incomplete information, gene annotations are
constantly changing with improved knowledge.
Gene annotation databases change regularly,
and different databases may refer to the same
gene/
protein by different names, reflecting
a changing understanding of
protein function.
- antisense strand
- Also called the
negative,
template, or
non-coding strand. This strand of the
DNA sequence of a single
gene is the complement of the 5' to 3'
DNA strand known as the
sense,
positive,
non-template, or
coding strand. The term loses meaning for longer
DNA sequences with
genes on both strands.
See also:
- Augustus
- A particular program used in molecular biology; Augustus is an ab initio single isoform
gene finder.
- BAC
- Bacterial Artificial Chromosome, a cloning vector that can accept a large insert of foreign
DNA (up to 300,000 bp) and be propagated as a
chromosome in bacteria.
In addition to the cloning site, the BAC contains selectable markers and a bacterial origin of
replication that maintains one copy of the BAC per cell.
- base
- One of a set of nitrogenous compounds attached to the sugar-phosphate backbone in a
nucleic acid. In
DNA, the purine bases are adenine (A) and guanine (G), while the pyrimidine
bases are cytosine (C) and thymine (T). In
RNA, the purine bases are adenine (A) and guanine (G), while the pyrimidine
bases are cytosine (C) and uracil (U). Although formally incorrect (the nitrogenous base that gives each
nucleotide its name is only part of the nucleotide), this is
often used as a synonym for nucleotide.
For more detail, please see the entries on
nucleosides and
nucleic acids at Wikipedia.
- base pair (base pairing)
- The hydrogen bonding of one of the bases (A, C, G, T, U) with another, as dictated by
the optimization of hydrogen bond formation in
DNA (A-T and C-G) or in
RNA (A-U and C-G).
Two polynucleotide strands, or regions thereof, in which all the nucleotides form such base
pairs are said to be complementary. In achieving complementarity, each strand of
DNA can
serve as a template for synthesis of its partner strand.
For more detail, please see the entries on
nucleosides and
nucleic acids at Wikipedia.
- BLAST
- Basic Local Alignment Search Tool, an
algorithm used to detect local similarity between
biological sequences (either
nucleic acid or
protein), typically used to search a large
database of sequences with a single query sequence. For one implementation, see
BLAST at NCBI.
- BLAST2
- BLAST version in which two sequences can be compared to each other. The sequences can
either be
nucleic acid or
protein. Go to the
BLAST web page and select
"Align two or more sequences."
- BLASTN
- BLAST version in which the query and subject are both nucleotide sequences. Typically used to search a
nucleotide database with a nucleotide sequence. Go to the
BLAST web page and select
"blastn."
- BLASTP
- BLAST version in which both the query and subject are
amino acid sequences. Typically
used to search a
protein database with a protein sequence. Go to the
BLAST web page and select "blastp."
- BLASTX
- BLAST version in which the query is
nucleotide sequence and all 6 frames are translated
and compared to the subject that is an
amino acid sequence. Typically used to search a
protein database with a
nucleotide sequence. Go to the
BLAST web page and select "blastx."
- Blosum matrix
- This is a matrix that gives a numerical value to the substitution of one
amino acid for another when performing
protein comparisons. It is based on the observed levels of amino acid
substitution when comparing closely related but divergent blocks of protein sequences. The
BLOSUM number refers to the percentage of identical amino acids found within the blocks used
to generate the matrix. For example, BLOSUM62 uses blocks of
protein sequences with no more
than 62% identity.
For more information, see the page on
BLAST substitution matrices at NCBI.
- bioinformatics
- The application of computer technology to the management of biological information.
Specifically, it is the science of developing computer
databases and
algorithms to facilitate and expedite biological research, particularly in
genomics.
- C-terminal
- Refers to the end of a
protein that contains the carboxylic group -COOH, corresponding
to the 3' end of the encoding
gene. Also called
carboxy-terminal.
For more detail, please see the entry on
protein at Wikipedia.
See also:
- canonical site
- In molecular genetics/genomics, this typically refers to
intron splicing sites. The vast
majority of introns have the nucleotides GT
(splice donor site) at their
5' ends and AG
(splice acceptor site) at their
3' ends. These are the canonical sites. Variants are rare, and are called
non-canonical sites.
- carboxy-terminal
- Refers to the end of a
protein that contains the carboxylic group -COOH, corresponding
to the 3' end of the encoding
gene. Also called
C-terminal.
For more detail, please see the entry on
protein at Wikipedia.
See also:
- central dogma
- The principal statement of the molecular basis of inheritance. In its simplest form:
"DNA makes
RNA makes
protein."
This means that (generally) genetic information is stored in and transmitted as DNA.
Genes are expressed by being copied as RNA (
transcription), which in eukaryotes is processed into
mRNA via
splicing and
polyadenylation. The information in mRNA is
translated into a protein sequence using a
genetic code to interpret three-base
codons as instructions to add one of twenty
amino acids or to stop translation.
For more detail, please see the entry on
genetic code at Wikipedia.
- cDNA
- "Complementary DNA," a double-stranded
DNA molecule prepared in vitro by copying an RNA molecule back into
DNA using reverse transcriptase. The RNA component of the resulting
RNA-DNA hybrid is then destroyed by alkali, and the complementary strand to the remaining
DNA strand synthesized by DNA polymerase. The resulting double-stranded
DNA can be used for
cloning and analysis.
- CDS
- Coding sequence, the part of the
DNA sequence of a
gene that is translated into
protein.
For more detail, please see the entry on
mRNA at Wikipedia.
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- chromosome
- One molecule of double-stranded
DNA, carrying an arrangements of
genes interspersed with
other sequences. In prokaryotes the
chromosome is often a circle of
DNA, while in eukaryotes
chromosomes are typically linear, extending from one end, a telomere, through the centromere
to the other telomere.
- cleavage
- An enzymatic or chemical breakage of the covalent bond that joins two
nucleotides or two
amino acids in their respective polymers.
- CLUSTALW
- A program that produces global sequences aligments, typically with multiple related sequences.
- coding exon
- In a
gene, any
exon that contains some part of the CDS; in contrast, an exon that has
no part translated into
protein is called a "non-coding exon."
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- coding strand
- In a
gene, the
DNA strand that has the sequence found in the
RNA molecule. Also called the
sense,
positive, or
non-template strand.
See also:
- codon
- The sequence of three
nucleotides in
DNA or
RNA that specifies a particular
amino acid.
For more detail, please see the entry on
genetic code at Wikipedia.
- complement
- The
nucleotide sequence of the
nucleic acid strand that would form a double-stranded
molecule with the nucleic acid strand in question, using standard base-pairing rules.
- Consed
- A program used to view and edit the
DNA sequence data assembled by Phred/Phrad or other
base-calling and assembly programs.
- consensus
- When comparing multiple sequences, whether by Phred/Phrap during
DNA sequence assembly or
algorithms such as
CLUSTALW for
protein sequence comparisons, the sequence that reflects the most commonly seen base at each position.
- conserved synteny
- The occurrence of
synteny of
orthologous genes in two different organisms.
- contig
- A contiguous assembly, without gaps, of overlapping sequences based on regions of shared sequence.
- coordinates
- Numerical position within a biological sequence, e.g. the first base in a
DNA sequence
would have the coordinate 1.
- cosmid
- A type of hybrid plasmid (often used as a cloning vector) that contains cos sequences,
DNA sequences originally from the Lambda phage. Cosmids are maintained in E. coli as multi-copy
episomes, and are typically used to build genomic libraries.
See also:
fosmid
- CpG island
- A region of
nucleotides, typically 300-3000 bp in length, that has a higher than expected
frequency of the sequence C followed by G, usually found in or near promoter sequences in
organisms with significant amounts of
DNA methylation. Exact mathematically-based definitions
vary among published studies.
- cryptic site
- A site in a nucleic acid sequence that does not match an authentic
splice site, whether
donor or
acceptor, but
which upon mutation to a consensus splice site is used by the cell's splicing machinery
and thereby causes incorrectly-spliced
mRNA products to be made.
- database
- A data structure that stores metadata, i.e. data about data. More generally, an organized collection of information.
- de novo prediction
- Analysis of a
DNA sequence to predict the location of
genes (
exons and
introns) using
only the sequence itself and known characteristics of genes (e.g. consensus splice site
sequences in eukaryotic genomes). No knowledge of experimentally-confirmed structure or
function is used in de novo prediction.
- degenerate sequence
- A sequence in which one symbol can represent multiple possibilities. The genetic code is said to be degenerate because most
amino acids are encoded by multiple
codons.
For more detail, please see the entry on
genetic code at Wikipedia.
In
DNA
sequence a degenerate code allows for a single symbol to designate more than one possible
base, e.g. B stands for C, G or T.
- deletion
- The removal of some part of a
nucleic acid or
protein sequence.
- differential gene expression
- Pattern of
gene expression in multi-cellular organisms, where distinct patterns of
transcription
are shown by different cell types, or at different times during development, or under different
environmental conditions.
- DINE
- A particular family of transposable elements found in Drosophila genomes.
- dissociation
- The separation of double stranded
DNA into single strands by raising the temperature or the pH.
- DNA
- Deoxyribonucleic acid. The
nucleic acid of which
genes are made.
For more detail, please see the entry on
DNA at Wikipedia.
See also:
- DNA affinity chromatography
- A procedure in which a
protein is separated from a mixture of other
proteins by its ability to bind specifically to a particular sequence of
DNA that has been immobilized on
a matrix in a separation column.
- DNA binding domain
- The region of a
protein that can specifically bind to
DNA. Several motifs that can bind to
DNA have been characterized (e.g. helix-loop-helix, leucine zipper, and zinc-finger domains).
In most cases, the structure of the domain has evolved so that a portion of it interacts with
a specific sequence of
DNA by binding in the major groove of the
DNA molecule.
- DNase hypersensitive site
- A region along the chromatin fiber that is nucleosome-free, and hence much more accessible
to cleavage by DNase I or other nucleases. Such sites are usually found at the promoters
and enhancers of active or inducible
genes.
- DNA ligase
- An enzyme that joins pieces of
DNA by catalyzing the formation of a covalent phosphodiester
bond between the
5' phosphate end of one
nucleotide and the
3'- OH group of the adjacent nucleotide.
- DNA polymerase
- A group of enzymes capable of extending a strand of
DNA by adding successive deoxyribonucleotides
at the 3' end in the order dictated by an associated template strand; the basic biochemical
reaction it catalyzes is: (dNMP)n + dNTP ----> (dNMP)n+1 + PPi
- DNAase I footprinting
- A means of determining the exact binding site of a
protein on a specific DNA fragment.
When a
protein-
DNA complex is digested with DNase I, the bound
protein will protect the DNA
from cleavage at the site of the
protein-
DNA complex. A comparison of the cleavage fragments from
DNA with and without the
protein bound to it will indicate the region of
protein binding by the absence of cleavage.
- donor site
- Splicing donor site. The boundary between an
intron and the
exon immediately upstream (i.e. on the 5' side
of the intron).
For more information, please see the entry on
mRNA splicing at Wikipedia.
- dot chromosome
- The fourth
chromosome of Drosophila melanogaster, which is comprised principally of heterochromatin;
any very small
chromosome that appears as a "dot" during mitosis.
- Dot Matrix View
- The comparison of two sequences on an X-Y plot where points ("dots") on the graph ("matrix")
indicate sequence identity at the corresponding positions. A continuous line with slope 1
indicates high levels of sequence conservation in that region and provides confidence in
the proposed
gene model.
- downstream
- Toward the
3' end of a single stranded
DNA molecule or
gene of interest. "Upstream"
similarly refers to something closer to the
5' end.
- duplication
- The creation of a second copy of a sequence in a genome. A duplicate copy of a
gene may
be mutated without affecting the viability of the organism, so
gene duplication is thought
to be a significant factor in the evolution of genomic diversity.
- E value
- Expected value, a numerical indication of the statistical significance of an alignment.
Describes the number of hits one can "expect" to see by chance when searching a database
of a particular size. The lower the E value, the more significant the alignment. For small
E-values, will be nearly identical to the probability of seeing an alignment with this
quality purely by chance. See the
BLAST tutorial at NCBI.
- electrophoresis
- A procedure to separate molecules in an electric field.
DNA migration takes place through
a gel matrix (agarose or polyacrylamide) that acts as a molecular sieve. Since
DNA is
negatively charged, when placed in an electric field, it is attracted toward the positive
electrode. Because the negative charges are uniformly distributed along the
DNA, molecules
separate in the field based on their sizes.
- end labeling
- The addition of a radioactively labeled group to one end (5' or 3') of a
DNA strand.
- endonuclease
- An enzyme that cleaves phosphodiester bonds within a
nucleic acid chain. A particular
endonuclease may be specific for
RNA or for single-stranded or double-stranded
DNA. Restriction
enzymes are endonucleases that cut double-stranded
DNA at or near a specific target sequence,
such as CGCG.
- enhancer
- A eukaryotic
DNA sequence located outside of the promoter region, where an activator of
transcription (
protein) may bind.
- ENSEMBL
- Joint genome browser maintained by the European Bioinformatics Institute and the Wellcome
Trust Sanger Institute. Contains searchable genomic information for select model organisms.
- environmental containment
- Process for ensuring that an organism (such as a host cell) cannot survive long in the
environment, based on its requirements for survival and growth, to prevent its uncontrolled
proliferation. It is used as a safety precaution when growing recombinant
DNA clones with
potentially hazardous properties (for example, when cloning a
gene encoding a neurotoxin).
- enzyme
- Enzymes are
proteins that serve as biological catalysts. Only
proteins were once thought
to be enzymes; now, catalytic RNAs called ribozymes are known.
- EST
- Expressed sequence tag: a short
DNA sequence derived from a single read of a clone from a
cDNA library. They are therefore by definition from
genes that are expressed in that cell
type under the growth conditions used. Typically ESTs have high levels of sequence error
because they are from single reads, and not the consensus of multiple reads.
- ethidium bromide
- An intercalating chemical that is used to stain
DNA; it is fluorescent under ultraviolet
light. Ethidium bromide is a mutagen; use gloves and other appropriate protection when handling
this material.
- euchromatin
- Those regions of the genome that do not remain condensed throughout the cell cycle.
Euchromatic regions are typically enriched for
genes, show significantly higher levels of
recombination and lower levels of repeats than heterochromatic regions.
See also:
heterochromatin.
- eukaryote
- The class of organisms composed of one or more cells, each of which contains a membrane-enclosed
nucleus and packages its
DNA with histones in a nucleosome array. Eukaryotic cells typically
have other complex organelles, such as mitochondria.
- exon
- An exon is a contiguous segment of eukaryotic
DNA that corresponds to a portion of the mature (processed)
RNA product of that
gene. Exons are found only in eukaryotic genomes,
and are separated by
introns. Although the introns are transcribed with the exons, the
latter are spliced out and discarded during
RNA processing.
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- exonuclease
- An enzyme that cleave
nucleotides one at a time from the end of a polynucleotide chain.
A particular exonuclease may be specific for either the
5' or
3' end of either
DNA or
RNA.
The 3'-to-5' exonuclease activity of DNA pol I and DNA pol III allows these enzymes to excise
the nucleotide that they just added if it base pairs incorrectly ("editing").
- FASTA
- A text format used to represent
nucleotide or
amino acid sequences. FASTA-formatted sequences
begin with a rightward-pointing angle bracket or chevron (>) after which follows information
about the sequence on the same line. Line two begins the actual sequence of interest.
Variations of FASTA format (e.g. Pearson FASTA and NCBI FASTA) exist, but may not be handled
gracefully or even correctly by a given program, so care should be taken to use the desired
version of FASTA as appropriate.
- feature
- Part of a GenBank entry containing information about a
nucleotide sequence of interest.
Features include but are not limited to the length of the sequence, predicted positions of
promoters, ribosome binding sites,
protein coding regions (CDS), and
translation products.
- FlyBase
- A database of Drosophila
genes and genomes; please see the
FlyBase website.
- forward strand
- In a display of a double-stranded
DNA sequence, which may be as long as an entire chromosome,
the strand that is read from
5' to
3' from left to right is called the forward or
plus strand.
The strand that is read from 5' to 3' from right to left is called the
reverse or
minus stand.
- fosmid
- Cloning vectors based on bacterial F-plasmids. Fosmids are used in cloning genomic libraries,
typically with insert sized of ~40 kb and are maintained in E. coli as a stable single copy
episome.
See also:
cosmid
- frame
- A frame is a single series of adjacent
nucleotide triplets in
DNA or
RNA: one frame
would have bases at positions 1, 4, 7, etc. as the first base of sequential
codons. There are 3 possible reading frames in an
mRNA strand and six in a double stranded DNA molecule
due to the two strands from which
transcription is possible. Different computer programs
number these frames differently, particularly for frames of the negative strand, so care
should be taken when comparing designated frames from different programs.
- gander
- Server at GEP that houses a copy of the University of California San Diego genome
browser filled with GEP-specific projects.
- gaps
- When comparing or aligning two or more
protein or
nucleic acid sequences, spaces ("gaps")
may be introduced in the final alignment in order to maximize matches and minimize mismatches.
Assignment of gaps is dependent on the alignment program and model parameters chosen.
- GC-rich region
- A region of
DNA with higher than expected fraction of CG basepairs. In some organisms
GC-rich regions are found in or near promoters.
See also:
CpG island.
- gel shift assay
- Also known as electrophoretic mobility shift assay (EMSA), a gel shift assay is a means of detecting
DNA-
protein interactions. A complex between a fragment of
DNA and a
protein moves more slowly during gel electrophoresis than does the
DNA fragment alone, resulting in
a "shift" in the position of the
DNA fragment in the presence of the
protein.
- gene
- The basic unit of heredity; a portion of
DNA (in most organisms) or
RNA (in some viruses) that usually encodes a
protein product.
See also:
central dogma
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- Gene Model Checker
- A program developed by GEP to test the validity of a proposed
gene model after annotation.
This programs checks that the model complies with easily defined characteristics of a
gene
including the presence of start and stop
codons and proper sequences at
intron/
exon splice
junctions. It does not check for other important characteristics like overlap with evidence for
transcription or degree of sequence conservation. Please see the
Gene Model Checker
at the GEP website.
- Gene Record Finder
- A GEP-developed tool to provide information on Drosophila melanogaster genes, organized
to describe how isoforms are related to each other (common and unique
exons) and provide
protein sequences corresponding to each exon.
- GeneID
- A number that identifies a
protein with a known function at JCSG (Joint Center for Structural Genomics).
- genetic code
- The relationship of the sixty-four
nucleic acid
codons to the twenty primary
amino acids. For more detail, please see the entry on
genetic code at Wikipedia.
- genome
- All of the
nucleotides, and their order, found in a single cell, or organelle. The entire
complement of genetic material of an organism. "Haploid genome" refers to one copy of each
chromosome in a diploid organism.
- genomics
- The comprehensive study of whole sets of
genes and their interactions rather than single genes or
proteins.
- genscan
- An ab initio gene prediction program developed at MIT. Please see
GENSCAN at MIT.
- GEP
- Genomics Education Partnership, based at Washington University, St. Louis. Please see the
GEP website.
- GFF
- Gene Feature Format; a particular format for text files used to describe genomic annotations.
For specifications see
gff specs
at the Sanger site. Many genome browsers allow data to be imported using data encoded in GFF format.
- global alignment
- An alignment of two sequences over their entire lengths (note that the alignment need not be perfect).
- golden tiling path
- A method created by the Human Genome Project (HGP) sequencing labs that uses mapping
markers to choose the minimum number of slightly overlapping clones that completely span a
genomic region of interest.
- handedness
- A helix is characterized as right-handed if it is turning clockwise as it moves away
from you; if it turns counter-clockwise, it is left-handed.
- helicase
- An enzyme that uses the energy of ATP hydrolysis to disrupt the hydrogen bonds that hold
the two strands of
DNA together, allowing the double helix to unwind.
- heterochromatin
- DNA that remains condensed throughout the cell cycle, heterochromatin is thought to be
tightly bound to
proteins and other molecules. Heterochromatic regions tend to have a high
content of repetitious
DNA (satellite DNA, middle repetitious sequences), are gene-poor,
show little or no
transcriptional activity, and replicate late in S-phase. Blocks of
heterochromatin are generally found around the centromeres and telomeres.
See also:
euchromatin.
- histones
- The small, basic
proteins used to package the
DNA in chromatin in eukaryotes. The core
histones (H2A, H2B, H3, and H4) are highly conserved throughout all eucharyotes, while
histone H1 is more variable.
- HLH (Helix-Loop-Helix or helix-turn-helix)
- A ca. 20-
amino acid
protein motif that forms two alpha-helices that cross at an angle
of ~120°. This motif occurs frequently in
DNA binding proteins, with one of the helices
(stabilized by the other) forming contacts with the DNA in the major groove.
- homologous
- Nucleic acid and
protein sequences are homologous if they have evolved from a common
ancestor.
Please see the figure showing
homology.
- homolog (also homologue)
- A specific member of a group of homologous sequences or molecules.
Please see the figure showing
homology.
- homology
- Homology is the state of being homologous.
Algorithms such as BLAST identify similarity
that is evidence for, but not necessarily proof of, homology.
Please see the figure showing
homology.
- host-vector system
- The combination of a particular plasmid, phage, or virus and the bacterium or other
host cell in which it is propagated. It is essential that the host cell be transformed by
the vector
DNA at a reasonable frequency, and that the vector have an appropriate origin of
replication to function in the host cell.
- hybridization
- The process whereby complementary single strands of
DNA come together to form a double
helix. If one strand is labeled, it can be used to identify the second, unlabeled strand.
For example, in filter hybridization, the labeled probe will hybridize to the unlabeled
complementary
DNA that is stuck to the filter.
- hydrogen bond
- A noncovalent bond between an electronegative atom (such as N or O) and a hydrogen atom
covalently bonded to another electronegative atom. Hydrogen bonds are individually weak,
but collectively contribute significantly to the stabilization of the
DNA double helix.
The pairing required to form optimal hydrogen bonds between DNA
nucleotides underlies the
principle of complimentarity.
- identity
- Two elements at comparable positions in an alignment (a
nucleotide or an
amino acid) that are
the same are said to be identical; the fraction of two sequences that consist of such
elements is expressed as "percent identity."
- in-frame
- Referring to something (e.g., a deletion) that does not alter the coding
frame of a gene.
- in silico
- Performed on a computer.
- in situ hybridization
- A technique performed by denaturing the
DNA of cells or tissue sections and adding a single-stranded
DNA probe. The probe is labeled so that the site of hybridization can be
detected by autoradiography or other appropriate detection protocols.
- in vitro
- Performed in the absence of intact cells; "in vitro" literally means "in glass."
See also:
in vivo.
- in vivo
- Occurring in living cells.
See also:
in vitro.
- indel
- Shorthand term to designate a gap in an alignment that designates "either an insertion
or deletion." Typically used when the historical event that created the difference between
two sequences cannot be determined.
- induced mutation
- A change in a
DNA sequence caused by exposure of the
DNA to a mutagen.
- initiation
- The process in which
DNA or
RNA polymerase binds to a
DNA strand to begin copying it.
- initiation codon
- The first
codon of a
coding sequence. In eukaryotes this is almost always ATG, which
codes for methionine.
- initiator
- A weak consensus sequence [PyPyAN(T/A)PyPyPy] in eukaryotes, found with the A at
position +1 of the
gene, that serves as a recognition sequence for RNA polymerase II.
- insertion
- The addition of
DNA within a given sequence; this may occur as a result of duplication
or insertion of foreign sequences such as transposable elements or viral
DNA.
- intron
- Non-coding sections of a eukaryotic
nucleic acid sequence found between
exons. Introns
are removed ("spliced out") of
mRNA after
transcription and before the molecule is exported
to the cytoplasm for
translation.
See also:
exon
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- isoform
- Alternate forms of a specific
protein with slightly different
amino acid sequences. Often
different isoforms are produced by alternative splicing of a particular
mRNA.
- LINE
- Long Interspersed Nuclear Elements are a class of retrotransposon commonly found in
eukaryotic genomes.
- local alignment
- An alignment where short, highly similar sequences are displayed.
See also:
global alignment.
- low complexity DNA
- DNA segments that have particularly simple sequences, such as mononucleotide runs
(AAAAAAAA) or dinucleotide repeats (ATATATATATAT).
- LTR
- Long Terminal Repeats;
DNA sequences present in many contiguous copies (up to several
kilobases in total) found at the end of a class of retrotransposons.
- mature mRNA
- Messenger
RNA that has been completely processed; it has a 7-methylguanosine cap at its
5' end, a poly (A) tail at its 3' end, and has all its
introns spliced out from it.
- megabase
- One million bases or base pairs.
- minus strand
- In a display of a double-stranded
DNA sequence, which may be as long as an entire
chromosome, the strand that is read from
5' to
3' from left to right is called the
forward or
plus strand.
The strand that is read from 5' to 3' from right to left is called the
reverse or minus stand.
- mRNA
- Messenger
RNA, a kind of
RNA that is translated into a polypeptide.
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- N-terminal
- Refers to the end of a
protein that contains the amino group -NH2, corresponding to the
5' end of the encoding
gene. Also called amino-terminal.
For more detail, please see the entry on
protein at Wikipedia.
See also:
- NCBI
- National Center for Biotechnology Information. Please see the
NCBI website.
- negative strand
- Also called the
antisense,
template, or
non-coding strand. This strand of the
DNA sequence of a single
gene is the complement of the 5' to 3'
DNA strand known as the
sense,
positive,
non-template, or
coding strand. The term loses meaning for longer
DNA sequences with
genes on both strands.
See also:
- non-canonical
- In molecular genetics/genomics, this typically refers to
intron splicing site sequences
that are only very rarely used and are never considered by
gene prediction
algorithms;
See also:
canonical site
- non-coding RNA
- An
RNA molecule that functions without being translated into
protein for example,
transfer RNA and
ribosomal RNA; note that this is not the same thing as the "non-coding strand."
- non-coding strand
- Also called the
negative,
template, or
anti-sense strand. This strand of the
DNA sequence of a single
gene is the complement of the 5' to 3'
DNA strand known as the
sense,
positive,
non-template, or
coding strand. The term loses meaning for longer
DNA sequences with
genes on both strands.
See also:
- non-consensus
- A base or sequence that does not match the most common element found at a given position.
See also:
consensus sequence.
- non-redundant
- Refers to the absence of identical components; many databases have the same sequences
present multiple times, but non-redundant versions are searched to save time and computing resources.
- non-template strand
- In a
gene, the
DNA strand that has the sequence found in the
RNA molecule. Also called the
coding,
sense, or
positive strand.
See also:
- nucleic acid
- DNA or
RNA. Each of these compounds consists of a backbone of sugar molecules ribose for
RNA and deoxyribose for
DNA linked by single phosphate groups. Attached to the sugars of the
backbone are any of four nitrogenous bases, A, T, C or G for
DNA and A, U, C or G for
RNA.
For more detail, please see the entry on
nucleic acids at Wikipedia.
- nucleoside
- A small molecule made up of either a purine or pyrimidine base linked to a pentose (sugar),
generally either ribose or deoxyribose.
For more detail, please see the entries on
nucleosides and
nucleic acids at Wikipedia.
- nucleotide
- A
nucleoside linked to one or more phosphate groups via an ester bond with the pentose.
DNA and
RNA are polymers of nucleotides, linked through the
5' and
3' carbons of the sugar.
For more detail, please see the entries on
nucleosides and
nucleic acids at Wikipedia.
- ORF
- Open Reading Frame, a long stretch of
codons in the same reading
frame uninterrupted by stop codons; an ORF may reflect the presence of a
gene.
- orthologous genes
- Genes in different organisms that are direct evolutionary counterparts; that is, they
are related by descent from a common ancestor. Orthologous
genes normally have the same cellular function.
Please see the figure showing
homology.
- ortholog (also orthologue)
- A specific member of a group of orthologous sequences or molecules.
Please see the figure showing
homology.
- orthology
- Orthology is the state of being orthologous.
Please see the figure showing
homology.
- P element
- A Drosophila transposable element that has been used as a tool for insertion mutagenesis
and for germline transformation.
- paralogous genes
- Genes at different chromosomal locations in the same organism that have structural similarities indicating
that they derived from a common ancestral
gene and have since diverged from the parent copy by mutation and selection or drift.
Please see the figure showing
homology.
- paralog (also paralogue)
- A specific member of a group of paralogous sequences or molecules.
Please see the figure showing
homology.
- paralogy
- Paralogy is the state of being paralogous.
Please see the figure showing
homology.
- PCR
- Polymerase Chain Reaction. A method of amplifying specific
DNA segments based on hybridization to a primer pair. A DNA sample is
denatured by heating in the presence of a vast molar excess of short single-stranded DNA
primers (around 20
nucleotides) whose sequence is chosen based on the
target sequence. The reaction mixture also contains a thermostable
DNA polymerase,
dNTPs, and buffer. The primer sequences are selected so that they:
- are derived from opposite strands of the target sequence,
- have their 3' ends facing each other, and
- are separated by a length of DNA that can be reliably synthesized in vitro.
The sample is then cooled to a temperature that allows primer annealing and
in vitro replication. The sample is subjected to multiple cycles
of denaturation and cooling to allow multiple rounds of
replication. The quantity of the target sequence doubles during
each cycle, causing the target sequence to be amplified, while other DNA sequences in the
sample remain unamplified.
- pep
- The suffix of a
protein/peptide sequence file.
- phase
- The phase describes the relationship between the
translation
frame of an
exon and the
position of a splice junction. In the GEP we define the term to describe the number of
bases between the end of the exon (defined by the
splice site) and the full
codon nearest that splice
site. The number of bases between the adjacent full codon at an exon/
splice site junction can be
either 0, 1 or 2. The phase of an exon/splice-donor junction will determine which
frame is
translated in the downstream exon as it will indicate how many bases are used after the
acceptor splice site to create a full codon of 3 bases.
Please see the figure showing
phase.
- phosphodiester bond
- A covalent bond in which two hydroxyl groups form ester linkages to the same phosphate
group; joins successive
nucleotides in
DNA or
RNA.
- Phred/Phrap
- Phred is a base-calling
algorithm and Phrap is an assembly
algorithm used to align the
output of multiple sequencing reactions.
- plasmid
- A small circular
DNA molecule that carries genetic elements permitting its autonomous extra-chromosomal
replication in bacteria or other single-cell organisms. A plasmid can be
used as a recombinant DNA vector, to propagate foreign
DNA in a bacterial cell. In addition to the essential origin for
replication, plasmids generally carry a variety of marker genes,
enabling easy identification of cells harboring the recombinant
DNA.
- plus strand
- In a display of a double-stranded
DNA sequence, which may be as long as an entire chromosome,
the strand that is read from
5' to
3' from left to right is called the
forward or plus strand.
The strand that is read from 5' to 3' from right to left is called the
reverse or
minus stand.
- poly(A) tail
- The segment of adenylate residues that is posttranscriptionally added to the
3' end of eukaryotic
mRNA. About 250
nucleotides of (A) are added by poly (A) polymerase following
cleavage of the newly synthesized
RNA about 20
nucleotides downstream of an AAUAAA signal sequence.
- polyadenylation
- The process by which a series of adenosine (A) ribonucleotides is added to the
3' end of a
spliced RNA to make a mature
mRNA. This addition to the RNA is sometimes referred to as a
poly-A tail, and commonly contains several hundred
bases.
- polypeptide
- An
amino acid chain containing hundreds to thousands of amino acids joined together by
peptide (amide) bonds.
- positive strand
- In a
gene, the
DNA strand that has the sequence found in the
RNA molecule. Also called the
coding,
sense, or
non-template strand.
See also:
- pre-mRNA
- The initial transcript from a
protein-coding
gene is often called a pre-mRNA and contains both
introns and
exons. Pre-mRNA requires processing (addition of 5' cap and 3' poly (A) tail,
removal of introns) to produce the final
mRNA molecule containing only exons.
- primary transcript
- The immediate product of
transcription of a
gene, which is often modified before becoming fully functional.
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- primer
- A single-stranded
nucleic acid that can "prime" replication of a
template. More specifically, a single-stranded
nucleic acid capable of hybridizing to a template single-stranded nucleic acid in
such a way as to leave part of the template to the
3' end of the primer single-stranded.
DNA polymerase can then synthesize a new strand starting from the 3' end of the primer and adding
nucleotides to the growing strand by base complementarity to the template.
See also:
PCR
- prokaryotes
- The class of single-cell organisms, including the eubacteria and archaea, that lack
membrane-limited organelles, including a nucleus.
- promoter
- A segment of
DNA to which
RNA polymerase binds to initiate
transcription of the
downstream
gene(s).
- protein
- A molecule composed of one or more chains of
amino acids in a specific order; the order
is determined by the base sequence of
nucleotides in the
gene that codes for the protein.
Proteins are required for the structure, function, and regulation of cells, tissues,
and organs; each protein has unique functions. Examples are enzymes, hormones, receptors,
antibodies, and structural proteins.
For more detail, please see the entry on
protein at Wikipedia.
See also:
polypeptide
- protein-coding gene
- Any
gene whose ultimate biologically functional product is a
protein, as opposed to an
RNA molecule such as
tRNA or
rRNA.
- pseudogene
- A sequence of
DNA similar to a
gene but nonfunctional, probably the remnant of a once functional
gene that accumulated mutations.
- purine (Pu)
- Adenine (A) and guanine (G) are purines, two of the four nitrogenous bases found in
DNA.
- pyrimidine (Py)
- Cytosine (C) and thymine (T) are pyrimidines, two of the four nitrogenous bases found in
DNA.
In RNA, thymine is replaced by uracil (U).
- pyrosequencing
- A sequencing technology based on sequencing by synthesis; bases are identified on the
basis of the release of pyrophosphate as they are incorporated into the growing
DNA chain.
Please see the
video.
- query
- The input sequence (or other type of search term) to which all of the entries in a
database are to be compared.
- raw sequence
- Sequence that has been neither finished nor curated, and therefore not ready for annotation.
- reading frame
- See
frame.
- regulatory elements
- DNA sequences that control expression of a
gene by binding to
proteins that increase
or decrease synthesis of the gene product.
- RepeatMasker
- A program that screens
DNA sequences for interspersed repeats and low complexity DNA
sequences. Please see the Repeatmasker website.
- repetitious DNA
- See
repetitive DNA sequence.
- repetitive DNA sequence
- A
DNA sequence that is repeated multiple times in the genome; such sequences can vary
considerably in length and number of copies per genome.
- replication
- The process of producing two
DNA molecules from one. During replication, the two strands
of the parent helix separate and DNA polymerase synthesizes a new, complementary strand for
each parental strand, following the rules of base pairing (A-T and G-C).
- retrotransposon
- These are transposable DNA elements (transposons) that employ retroviral-like reverse
transcription during the process of transposition: retrotransposon
DNA is first transcribed into an
RNA template which is then reverse-transcribed into a DNA copy that is inserted
into a new genomic site.
- reverse strand
- In a display of a double-stranded
DNA sequence, which may be as long as an entire chromosome,
the strand that is read from
5' to
3' from left to right is called the
forward or
plus strand.
The strand that is read from 5' to 3' from right to left is called the
reverse or
minus stand.
- RNA
- Ribonucleic acid. A
nucleic acid that is the primary product of
gene expression. Chemically, it differs from
DNA by the substitution of ribose for deoxyribose in the sugar-phosphate
backbone and by the substitution of the base uracil for thymine.
See also:
- RNA polymerase
- An enzyme that synthesizes a strand of RNA by adding successive ribonucleotides in
the order dictated by a template strand of
DNA.
- rRNA
- Ribosomal RNA, RNA molecules that are components of the ribosome. rRNA forms the
structural scaffold for assembly of the ribosome, and plays a critical role in catalyzing
peptide bond formation.
- satellite DNA/simple sequence DNA
- Highly repetitious
DNA sequence; generally based on a short sequence (7-20
nucleotides) repeated
up to a million times in the haploid genome. Usually found in heterochromatic regions, often
associated with the centromere.
- sense strand
- In a
gene, the
DNA strand that has the sequence found in the
RNA molecule. Also called the
coding,
positive, or
non-template strand.
See also:
- shotgun sequencing
- A strategy for sequencing whole genomes, it was pioneered by the for-profit company Celera.
Genomes are cut into very small pieces, cloned into plasmids, sequenced, and then assembled into whole
chromosomes or genomes. This method is faster than hierarchical shotgun sequencing
but more prone to assembly errors.
- simple repeat
- A
nucleotide repeat with one or a small number of bases, such as AAAAAAAAAAAA or
CACACACACA.
- SINE
- Short Interspersed Nuclear Elements are a class of
DNA segments derived from reverse-transcribed
genes and commonly found in eukaryotic genomes.
- SNP
- Single-nucleotide polymorphism; a difference in
DNA sequence at a single base between
two sequences.
- splicing
- The process by which
introns are removed and
exons are joined to produce a mature, functional
RNA from a primary transcript. Some RNAs are self-splicing, but most require a specific
ribonucleoprotein complex to catalyze the reaction.
For more information, please see the entry on
mRNA splicing at Wikipedia.
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- splicing acceptor site
- The boundary between an
intron and the
exon immediately downstream (i.e. on the
3' side of the intron).
For more information, please see the entry on
mRNA splicing at Wikipedia.
See also:
splicing donor site.
- splicing donor site
- The boundary between an
intron and the
exon immediately upstream (i.e. on the
5' side of the intron).
For more information, please see the entry on
mRNA splicing at Wikipedia.
See also:
splicing acceptor site.
- splicing junction
- Also "splice junction." Either a
splicing acceptor site or a
splicing donor site.
- splicing transesterification mechanism
- A chemical reaction that joins the
5' phosphate of the first
nucleotide located at the
5' end of the downstream
exon with the
3' hydroxyl group of the last
nucleotide of the upstream
exon forming a phosphodiester bond.
- start codon
- The first
codon of a
coding sequence. In eukaryotes this is almost always ATG, which
codes for methionine.
- start site
- The
nucleotide at which
transcription starts, usually denoted as position +1 in reference to the
gene being transcribed.
- stop codon
- A
codon that specifies the termination of peptide synthesis; sometimes called "nonsense
codons," since they do not specify any
amino acid.
- strand plus/minus
- See "forward strand."
- STRs
- Short tandem repeats. At many places in genomes, there are short sequences (~5- 35 bp)
of bases that are not transcribed and that are repeated several times in a row (a tandem array).
Different individuals will often have a different number of repeats and populations usually
have a wide range of copy numbers at a given site. The number of repeats can therefore be
convenient genetic markers for determining genetic relationships.
- subject
- The sequence, typically retrieved from a database, to which the sequence of interest (the
query) is being compared.
- synteny
-
The state of being on the same
chromosome. A gene is also said to be syntenic to a particular chromosome
if it is known to be located on that chromosome but is otherwise unmapped.
See also:
conserved synteny
- tandem array
- The same sequence, repeated multiple times, where each copy of the repeat immediately
follows the previous copy. Genes encoding
rRNA and the histones are usually in tandem arrays.
Repetitious sequences that are NOT in a tandem array are referred to as "dispersed".
- tblastn
- Blast search tool in which the
query is an
amino acid sequence and the
subjects are the
six amino acid sequences translated from the six frames found in double stranded
DNA. Typically used when using a
protein sequence to search a
nucleotide database. Go to the
BLAST web page and select
"tblastn."
- tblastx
- BLAST version in which the query is all six possible
amino acid sequences derived from
translation of all six frames and the subjects are the six possible
amino acid sequences derived from
translation of all six frames of another
nucleotide sequence. Not surprisingly, this is
computationally very expensive.
- template
- The starting material in a
PCR reaction. When referring to a
DNA strand, it is also
called the
negative,
antisense, or
non-coding strand. This strand of the
DNA sequence of a single
gene is the complement of the 5' to 3'
DNA strand known as the
sense,
positive,
non-template, or
coding strand. The term loses meaning for longer
DNA sequences with genes on both strands.
See also:
- termination codon
- A
codon that specifies the termination of peptide synthesis; sometimes called "nonsense
codons," since they do not specify any
amino acid.
- tiling path
- A set of overlapping clones that cover (ideally) the entire sequence being assembled.
See also:
golden tiling path
- transcript
- An
RNA molecule (or species of RNA molecule) that is the product of
transcription.
- transcription
- The process of copying one strand of a
DNA double helix by RNA polymerase, creating a
complimentary strand of RNA called the transcript.
- transcription terminator
- Also called simply a terminator, it is a section of genomic
DNA that marks the end of
gene or operon, where
transcription should stop.
- translation
- The process by which
codons in an
mRNA are used by the ribosome to direct
protein synthesis. For more detail, please see the entry on
translation at Wikipedia.
- translational start
- The first
codon of a
coding sequence. In eukaryotes this is almost always ATG, which
codes for methionine.
- translocation
- Literally "a change in location." In translocations part of a
chromosome is
transferred to another position in the genome. In a reciprocal translocation, two nonhomologous
chromosomes exchange
chromosome segments ending in a telomere. In an insertional translocation,
a segment of one
chromosome not ending in a telomere is inserted into a location on a nonhomologous
chromosome.
- transposable genetic element
- A genetic element that can insert into (and exit from) a
chromosome, and may therefore
relocate in the genome; this class includes insertion sequences, transposons, retrotransposons,
some phages, and controlling elements. Much of the middle repetitious
DNA in eukaryotic genomes is made up of damaged transposable elements.
- transposons
- Segments of
DNA that can move around to different positions in the genome of a single
cell. In the process, they may cause mutations or increase (or decrease) the amount of
DNA in the genome.
- tRNA
- Transfer RNA, small (~75 bp) L-shaped
RNA molecules that deliver specific
amino acids
to ribosomes according to the sequence of a bound
mRNA. The proper tRNA is selected through
the complementary base pairing of its three-nucleotide anticodon with the mRNA's
codon, and its
amino acid group is transferred to the growing polypeptide.
- TwinScan
- A
gene prediction
algorithm that uses conservation to a second "informant" genome to
assist in the prediction of genes.
- UCSC
- University of California Santa Cruz, host to a popular genome browser. Please see the
UCSC Genome Browser.
- upstream
- Toward the
5' end of a single stranded lenth of
DNA or
gene of interest.
See also:
downstream
- UTR
- Untranslated region; a segment of
DNA (or
RNA) which is transcribed and present in the
mature mRNA, but not translated into
protein. UTRs may occur at either or both the
5' and
3' ends of a
gene or transcript.
Please see the visual guide to terms associated with parts of a typical eukaryotic gene, below.
- vector
- A plasmid, phage, or other
DNA that is used to maintain and propagate inserted foreign
DNA in a host cell.
- VNTRs
- Variable Number of Tandem Repeats.
See also:
STRs.