An identifier supplied by the curators of the major biological databases upon submission
of a novel entry that uniquely identifies that sequence (or other) entry.
The amino acid residues at the catalytic site of an enzyme. These residues provide
the binding and activation energy needed to place the substrate into its transition
state and bridge the energy barrier of the reaction undergoing catalysis
A purine base found in DNA and RNA
Independent, autonomous, software modules that can search the Internet for data
or content pertinent to a particular application, such as a gene, protein, or biological
Agricultural biotechnology (AgBio)
The application of rDNA technology to agriculturally important plants and organisms.
A series of steps defining a procedure or formula for solving a problem, that can
be coded into a programming language and executed. Bioinformatics algorithms typically
are used to process, store, analyze, visualize and make predictions from biological
The result of a comparison of two or more gene or protein sequences in order to
determine their degree of base or amino acid similarity. Sequence alignments are
used to determine the similarity, homology, function or other degree of relatedness
between two or more genes or gene products.
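As a rough illustration of the "degree of similarity" an alignment reports, the sketch below computes percent identity over the columns of two sequences that have already been aligned. The function name and the choice to skip gapped columns are assumptions made for this example; alignment programs differ in how they count gaps.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped columns are ignored in this convention
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Six of the seven ungapped columns match.
print(round(percent_identity("ACGT-ACGT", "ACCTTAC-T"), 1))  # 85.7
```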
A given form of a gene that occupies a specific position or locus on a chromosome.
Variant forms of genes occurring at the same locus are said to be alleles of one
One of the alternative combinations of exons that can be joined during mRNA splicing
in higher organisms, so that a single gene can give rise to more than one mature
mRNA and hence more than one protein.
A common set of dispersed DNA sequences found throughout the human genome; each
is about 300 bases long, and they are repeated at least 500,000 times. Alu sequences
are thought to derive from the 7SL RNA gene and to have spread through primate
genomes by retrotransposition over tens of millions of years.
One of the 20 chemical building blocks that are joined by amide (peptide) linkages
to form the polypeptide chain of a protein.
Reasoning by which the function of a novel gene or protein sequence may be deduced
from comparisons with other gene or protein sequences of known function. Identifying
analogous or homologous genes via similarity searching and alignment is one of the
chief uses of Bioinformatics. (See also alignment, similarity search.)
A combination of comments, notations, references, and citations, either in free
format or utilizing a controlled vocabulary, that together describe all the experimental
and inferred information about a gene or protein. Annotations can also be applied
to the description of other biological systems. Batch, automated annotation of bulk
biological sequence is one of the key uses of Bioinformatics tools.
The triplet of contiguous bases on tRNA that binds to the codon sequence of nucleotides
on mRNA. Example: the anticodon CCC pairs with the codon GGG, which codes for glycine.
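The codon-anticodon relationship can be sketched as a reverse complement. This minimal example assumes strict Watson-Crick pairing and writes both triplets 5' to 3'; it ignores wobble pairing and modified tRNA bases, and the function name is hypothetical.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon_for(codon: str) -> str:
    """Return the anticodon (5'->3') that pairs with an mRNA codon (5'->3')."""
    # The two strands pair antiparallel, so complement the bases and reverse.
    return "".join(RNA_COMPLEMENT[base] for base in reversed(codon.upper()))


print(anticodon_for("GGG"))  # CCC (pairs with the glycine codon GGG)
print(anticodon_for("AUG"))  # CAU (pairs with the methionine codon AUG)
```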
Any foreign molecule that stimulates an immune response in a vertebrate organism.
Many antigens are proteins such as the surface proteins of foreign organisms.
DNA or RNA composed of the complementary sequence to the target DNA/RNA. Also used
to describe a therapeutic strategy that uses antisense DNA or RNA sequences to bind
specific gene DNA sequences or mRNAs implicated in disease and so physically block
their expression.
A method for measuring a biological activity. This may be enzyme activity, binding
affinity, or protein turnover. Most assays utilize a measurable parameter such as
color, fluorescence or radioactivity to correlate with the biological activity.
Compilation of overlapping sequences from one or more related genes that have been
clustered together based on their degree of sequence identity or similarity. Sequence
assembly may be used to piece together "shotgun" sequencing fragments (see shotgun
sequencing) based upon overlapping restriction enzyme digests, or may be used to
identify and index novel genes from "single-pass" cDNA sequencing efforts.
A method used to locate radioisotope-labeled materials which have been separated
in gels or are present in blots. The location of the radiolabeled material is determined
by overlaying the test material with a photographic film that is sensitive to the
Bacterial artificial chromosome (BAC)
Cloning vector that can incorporate large fragments of DNA. (See also YACs.)
A virus that infects bacteria. The bacteriophage DNA has served as a basis for cloning
vectors, and is also utilized to create phage libraries containing human or other
An insect virus which forms the basis of a protein expression system
A pair of nitrogenous bases (a purine and a pyrimidine), held together by hydrogen
bonds, that form the core of DNA and RNA i.e. the A:T, G:C and A:U interactions.
A three dimensional arrangement taken up by polypeptide chains that consists of
alternating strands linked by hydrogen bonds. The alternating strands together form
a sheet that is frequently twisted. One of the secondary structural elements characteristic
The field of endeavor that relates to the collection, organization and analysis
of large amounts of biological data using networks of computers and databases (usually
with reference to the genome project and DNA sequence information)
Having two binding sites; having 2 free electrons available for binding.
The joining of DNA fragments that contain no overhang at either end and consequently
no DNA bases available for hybridization (cf. sticky-end ligation).
The -COOH functional group, acidic in nature, found in all amino acids
cDNA (complementary DNA)
A DNA strand copied from mRNA using reverse transcriptase. A cDNA library represents
all of the expressed DNA in a cell.
A set of DNA fragments prepared from the total mRNA obtained from a selected cell,
tissue or organism.
The basic unit of any living organism.
The life cycle of a cell, which culminates in cell division and is separated into
four phases: G1, S, G2, and M. DNA replication is confined to the S (synthesis) phase,
and chromosomal separation to the M (mitotic) phase.
A cloning artifact created by a foreign gene being inserted into a vector in an
incorrect orientation resulting in the expression of a protein consisting of a fusion
of two different gene products.
Data file output from most popular DNA sequencers. Chromat files consist of the
fluorescent traces generated by the sequencer for each of the four chemical bases,
A, C, G, and T, together with the sequence and measures of the error in the traces
at each sequence position.
The chromosome as it appears in its condensed state, composed of DNA and associated
proteins (mainly histones).
The structure in the cell nucleus that contains all of the cellular DNA together
with a number of proteins that compact and package the DNA.
Research studies that involve patients. Biotechnology companies typically use clinical
trials to assess the efficacy and safety of new therapies and to answer scientific
questions. Typically, there are 3 phases during a clinical trial. Phase I is designed
to evaluate the safety of the product in humans; phase II analyses the effects of
dose escalation, and phase III definitively evaluates the clinical efficacy of the
A population of genetically identical cells or DNA molecules.
The formation of clones or exact genetic replicas.
The grouping of similar objects in a multidimensional space. Clustering is used
for constructing new features which are abstractions of the existing features of
those objects. The quality of the clustering depends crucially on the distance metric
in the space. In bioinformatics, clustering is performed on sequences, high-throughput
expression and other experimental data. Clusters of partial or complete gene sequences
can be used to identify the complete (contiguous) sequence and to better identify
its function. Clustering expression data enables the researcher to discern patterns
of co-regulation in groups of genes.
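To show how the distance metric drives the grouping, here is a minimal single-linkage sketch that clusters equal-length sequences by Hamming distance. The function names, the threshold, and the restriction to equal-length strings are all assumptions for illustration; real sequence or expression clustering uses richer metrics.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage_clusters(seqs, max_distance):
    """Group sequences so that each member is within max_distance of
    at least one other member of its cluster (single linkage)."""
    clusters = []
    for seq in seqs:
        linked, rest = [], []
        for cluster in clusters:
            near = any(hamming(seq, member) <= max_distance for member in cluster)
            (linked if near else rest).append(cluster)
        # The new sequence joins, and thereby merges, every cluster it touches.
        merged = [seq] + [member for cluster in linked for member in cluster]
        clusters = rest + [merged]
    return clusters


reads = ["ACGT", "ACGA", "TTTT", "TTTA", "ACCA"]
for cluster in single_linkage_clusters(reads, max_distance=1):
    print(sorted(cluster))
```

Raising `max_distance` to 3 would merge both groups into one, which is the sense in which cluster quality depends on the metric and threshold chosen.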
Coding regions (CDS)
The portion of a genomic sequence bounded by start and stop codons that identifies
the sequence of the protein being coded for by a particular gene.
A sequence of three adjacent nucleotides that designates a specific amino acid or
start/stop site for translation.
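The sketch below reads an mRNA three bases at a time and looks each codon up in a table. Only a handful of entries from the standard genetic code are included to keep the example short, so the table is deliberately incomplete; unknown codons are reported as "X", which is a convention chosen for this example.

```python
# Partial table: a few entries of the standard genetic code ("*" marks a stop).
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "UUU": "F",
    "GGG": "G",
    "GGC": "G",
    "AAA": "K",
    "UAA": "*",
    "UAG": "*",
    "UGA": "*",
}


def translate(mrna: str) -> str:
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGUUUGGGAAAUAA"))  # MFGK
```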
The use of chemical methods to generate all possible combinations of chemicals starting
with a subset of compounds. The building blocks may be peptides, nucleic acids or
small molecules. The libraries of compounds formed by
this methodology are used to probe for new pharmaceutical reagents (see high-throughput
Complementarity determining region (CDR)
The hypervariable regions of an antibody molecule, consisting of three loops from
the heavy chain and three from the light chain, that together form the antigen-binding
Complexity (of gene sequence)
The term "low complexity sequence" may be thought of as synonymous with regions
of locally biased amino acid composition. In these regions, the sequence composition
deviates from the random model that underlies the calculation of the statistical
significance (P-value) of an alignment. Such alignments among low complexity sequences
are statistically but not biologically significant, i.e., one cannot infer homology
(common ancestry) or functional similarity.
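One simple way to flag locally biased composition is the Shannon entropy of a sequence window: a window dominated by one or two residues scores low. This is only a sketch of the idea, and the function name is an assumption; the filters used by search programs are more elaborate.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    n = len(window)
    if n == 0:
        return 0.0
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(shannon_entropy("ACGTACGT"))            # 2.0, all four bases equally common
print(round(shannon_entropy("AAAAAAAT"), 3))  # 0.544, strongly biased composition
```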
(in software) The complete ordering and description of all parts of a software or
database system. Configuration management is the use of software to identify, inventory
and maintain the component modules that together comprise one or more systems or
The precise three-dimensional arrangement of atoms and bonds in a molecule describing
its geometry and hence its molecular function.
A single sequence delineated from an alignment of multiple constituent sequences
that represents a "best fit" for all those sequences. A "voting" or other selection
procedure is used to determine which residue (nucleotide or amino acid) is placed
at a given position in the event that not all of the constituent sequences have
the identical residue at that position.
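The "voting" procedure can be sketched as a per-column majority vote over sequences that are already aligned. Breaking ties alphabetically is an arbitrary assumption made here so the result is deterministic; real tools may instead emit an ambiguity code.

```python
from collections import Counter


def consensus(aligned):
    """Majority-vote consensus of equal-length aligned sequences."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        top = max(counts.values())
        # Tie-break alphabetically (an assumption for this sketch).
        result.append(min(r for r, c in counts.items() if c == top))
    return "".join(result)


print(consensus(["ACGT", "ACGA", "TCGT"]))  # ACGT
```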
Constitutive synthesis (expression)
Synthesis of mRNA and protein at an unchanging or constant rate regardless of a
cell's requirements (see housekeeping genes).
A length of contiguous sequence assembled from partial, overlapping sequences, generated
from a "shotgun" sequencing project. Contigs are typically created computationally,
by comparing the overlapping ends of several sequencing reads generated by restriction
enzyme digestion of a segment of genomic DNA. The creation of contigs in the presence
of sequencing errors, ambiguities and the presence of repeats is one of the most
computationally challenging aspects of the role of Bioinformatics in genome analysis.
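The core step of contig building, joining two reads whose ends overlap, can be sketched as follows. This toy version requires an exact suffix-prefix match and ignores sequencing errors, ambiguities and repeats, which are precisely what make real assembly hard; the function names and the `min_overlap` default are assumptions.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Merge two reads into one contig, or return None if they do not overlap."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None
    return left + right[k:]


print(merge_reads("ATGCGTAC", "GTACCTTA"))  # ATGCGTACCTTA
```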
The end-point of any algorithm that uses iteration or recursion to guide a series
of data processing steps. An algorithm is usually said to have reached convergence
when the difference between the computed and observed steps falls below a pre-defined
DNA vectors that allow the insertion of long fragments of DNA (up to about 50 kilobases).
Term used to describe the high-resolution molecular structure derived by X-ray
crystallographic analysis of protein or other biomolecular crystals.
The medium of the cell between the nucleus and the cell membrane.
A pyrimidine base found in DNA and RNA.
A process whereby automated or semi-automated algorithms are used to process experimental
data, including noise, experimental errors and other artifacts, in order to generate
and store high-quality data for use in subsequent analysis. Data cleaning is typically
required in high-throughput sequencing where compression or other experimental artifacts
limit the amount of sequence data generated from each sequencing run or "read."
The ability to query very large databases in order to satisfy a hypothesis ("top-down"
data mining); or to interrogate a database in order to generate new hypotheses based
on rigorous statistical correlations ("bottom-up" data mining).
Data processing is defined as the systematic performance of operations upon data
such as handling, merging, sorting, and computing. The semantic content of the original
data should not be changed, but the semantic content of the processed data may be
Vast arrays of heterogeneous (biological) data, stored within a single logical data
repository, that are accessible to different querying and manipulation methods.
Any file system by which data gets stored following a logical process. (see also
Mathematical procedure to separate out the overlapping effects of molecules such
as mixtures of compounds in a high-throughput screen, or mixtures of cDNAs in a
high density array.
A chromosomal alteration in which a portion of the chromosome or the underlying
DNA is lost.
Process in which different deletions in a region of DNA are created and used to
map the functionally critical areas of that DNA, e.g., the minimal region of DNA required
for a test promoter can be ascertained by systematic deletions in the region of
A graphical procedure for representing the output of a hierarchical clustering method.
A dendrogram is strictly defined as a binary tree with a distinguished root, that
has all the data items at its leaves. Conventionally, all the leaves are shown at
the same level of the drawing. The ordering of the leaves is arbitrary, as is their
horizontal position. The heights of the internal nodes may be arbitrary, or may
be related to the metric information used to form the clustering.
A composite molecule formed by the binding of two molecules (see homo and heterodimers).
Covalent link formed between the sulphur atoms of two different cysteine residues
in a protein. Important in maintaining the folded structure of a protein, and also
for linking different proteins in a complex.
DNA (deoxyribonucleic acid)
The chemical that forms the basis of the genetic material in virtually all organisms.
DNA is composed of the four nitrogenous bases Adenine, Cytosine, Guanine, and Thymine,
which are covalently bonded to a backbone of deoxyribose - phosphate to form a DNA
strand. Two complementary strands (where all Gs pair with Cs and As with Ts) form
a double helical structure which is held together by hydrogen bonding between the
A technique for identifying human individuals based on a restriction enzyme digest
of tandemly repeated DNA sequences that are scattered throughout the human genome,
but are unique to each individual.
The deposition of oligonucleotides or cDNAs onto an inert substrate such as glass
or silicon. Thousands of molecules may be organized spatially into a high-density
matrix. These DNA chips may be probed to allow expression monitoring of many thousands
of genes simultaneously. Uses include study of polymorphisms in genes, de novo sequencing
or molecular diagnosis of disease.
An enzyme that catalyzes the synthesis of DNA from a DNA template given the deoxyribonucleotide
Short single stranded DNA molecules of specific base sequence, labeled either radioactively
or immunologically, that are used to detect and identify the complementary base
sequence in a gene or genome by hybridizing specifically to that gene or sequence.
The technique in which the specific sequence of bases forming a particular DNA region
One of a series of enzymes that can digest DNA.
A region of special biological interest within a single protein sequence. However,
a domain may also be defined as a region within the three-dimensional structure
of a protein that may encompass regions of several distinct protein sequences that
accomplishes a specific function. A domain class is a group of domains that share
a common set of well-defined properties or characteristics.
An agent that affects a biological process. Specifically, a molecule whose molecular
structure can be correlated with its pharmacological activity.
Drug discovery cycle
The cycle of events required to develop a new drug. Typically this involves research,
preclinical testing and clinical development, and can take from 5 to 12 years.
The use of an electronic database of cDNA sequences (or probes derived from them)
in order to measure the relative levels of mRNAs expressed in different cells or
tissues. An example of the use of an electronic Northern might be to identify the
differences in the genes expressed in prostate cancer and those in benign prostate
hyperplasia, by subtracting the database of one from the other and seeing which
The use of an external electric field to separate large biomolecules on the basis
of their charge by running them through acrylamide or agarose gels.
DNA sequences that can greatly increase the transcription rates of genes even though
they may be far upstream or downstream from the promoter they stimulate.
A class of proteins that are capable of catalyzing chemical reactions (the making
or breaking of chemical bonds). They do so by orienting their substrates into a
suitable geometry in a particular location (the active site) where electrophilic
or nucleophilic amino acid residues can participate in the reaction. Enzymes are
protein catalysts that speed up chemical reactions that would otherwise be prohibitively
slow under physiological conditions.
The study of complex expression networks or linkages both spatially (within the
body) and temporally (at different times in development).
Value that describes the equilibrium state of the reversible reaction between two
A cell or organism with a distinct membrane-bound nucleus as well as specialized
membrane-based organelles (see also prokaryote).
The region of DNA within a gene that codes for a polypeptide chain or domain. Typically
a mature protein is composed of several domains coded by different exons within
a single gene.
Expressed Sequence Tags (ESTs)
A small sequence from an expressed gene that can be amplified by PCR. ESTs act as
physical markers for cloning and full length sequencing of the cDNAs of expressed
genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing
a portion of the cDNAs.
Expression (gene or protein)
A measure of the presence, amount, and time-course of one or more gene products
in a particular cell or tissue. Expression studies are typically performed at the
RNA (mRNA) or protein level in order to determine the number, type, and level of
genes that may be up-regulated or down-regulated during a cellular process, in response
to an external stimulus, or in sickness or disease. Gene chips and proteomics now
allow the study of expression profiles of sets of genes or even entire genomes.
The level and duration of expression of one or more genes, selected from a particular
cell or tissue type, generally obtained by a variety of high-throughput methods,
such as sample sequencing, serial analysis, or microarray-based detection.
A cloning vector that is engineered to allow the expression of protein from a cDNA.
The expression vector provides an appropriate promoter and restriction sites that
allow insertion of cDNA.
A fingerprint is a set of motifs used to predict the occurrence of similar motifs,
in either an individual sequence or in a database. Fingerprints are refined by iterative
scanning of a composite protein sequence database. A composite or multiple-motif
fingerprint contains a number of aligned motifs taken from different parts of a
multiple alignment. True family members are then easy to identify by virtue of possessing
all elements of the fingerprint, while subfamily members may be identified by possessing
only part of it.
An insertion (including a duplication) or deletion of a number of bases that is not
a multiple of three, which causes the reading frame of a structural gene to shift
from the normal series of triplets.
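The effect on the reading frame is easiest to see by splitting a sequence into triplets before and after the change. The short sequence below is invented for illustration.

```python
def codons(seq: str):
    """Split a sequence into complete triplets, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


original = "ATGGCCTTTAAG"
# Duplicate the single base at index 3: every downstream triplet changes.
duplicated = original[:4] + original[3] + original[4:]
# Delete the single base at index 4: the frame shifts the other way.
deleted = original[:4] + original[5:]

print(codons(original))    # ['ATG', 'GCC', 'TTT', 'AAG']
print(codons(duplicated))  # ['ATG', 'GGC', 'CTT', 'TAA']
print(codons(deleted))     # ['ATG', 'GCT', 'TTA']
```

In this example the duplication also brings a TAA stop triplet into frame, a common consequence of such shifts.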
The use of genomic information to delineate protein structure, function, pathways
and networks. Function may be determined by "knocking out" or "knocking in" expressed
genes in model organisms such as worm, fruit fly, yeast or mouse.
The protein resulting from the genetic joining and expression of 2 different genes
Gaps (affine gaps)
A gap is defined as any maximal, consecutive run of spaces in a single string of
a given alignment. Gaps help create alignments that better conform to underlying
biological models and more closely fit patterns that one expects to find in meaningful
alignment. The idea is to take into account the number of continuous gaps and not
only the number of spaces when calculating an alignment. Affine gaps contain a component
for gap insertion and a component for gap extension, where the extension penalty
is usually much lower than the insertion penalty. This mimics biological reality
as multiple gaps would imply multiple mutations, but a single mutation can lead
to a long gap quite easily.
The penalty applied to a similarity score for the introduction of an insertion or
deletion gap, the extension of a gap, or both. Gap penalties are usually subtracted
from a cumulative score being determined for the comparison of two or more sequences
via an optimization algorithm that attempts to maximize that score.
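A minimal sketch of how an affine gap penalty enters a similarity score is given below. It scores an alignment that has already been produced, rather than searching for the optimal one. The parameter values and the convention that the opening penalty covers the first gap position are assumptions; programs differ on both.

```python
from itertools import groupby


def affine_gap_penalty(length, gap_open=5.0, gap_extend=1.0):
    """Penalty for one gap: opening cost plus a smaller cost per extra position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def gap_runs(aligned, gap="-"):
    """Lengths of each maximal consecutive run of gap characters."""
    return [len(list(group))
            for is_gap, group in groupby(aligned, key=lambda ch: ch == gap)
            if is_gap]


def score_alignment(a, b, match=1.0, mismatch=-1.0, gap="-"):
    """Similarity score of a given alignment with affine gap penalties subtracted."""
    column_score = sum(match if x == y else mismatch
                       for x, y in zip(a, b) if gap not in (x, y))
    penalty = sum(affine_gap_penalty(n) for n in gap_runs(a) + gap_runs(b))
    return column_score - penalty


# Five matching columns (+5) and one gap of length three (5 + 2*1 = 7).
print(score_alignment("ACGTTTAC", "AC---TAC"))  # -2.0
```

Under the same parameters, three separate one-base gaps would cost 15 instead of 7, which is how the scheme favors a single long gap over many short ones.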
A technique by which molecules are separated by size or charge by passing them through
a gel under the influence of an external electric field.
A listing of the number, type, label and sequence of all the genes identified within
the genome of a given organism. Gene indices are usually created by assembling overlapping
EST sequences into clusters, and then determining if each cluster corresponds to
a unique gene. Methods by which a cluster can be identified as representing a unique
gene include identification of long open reading frames (ORFs), comparison to genomic
sequence, and detection of SNPs or other features in the cluster that are known
to exist in the gene.
Data bank of genetic sequences operated by a division of the National Institutes
Classically, a unit of inheritance. In practice, a gene is a segment of DNA on a
chromosome that encodes a protein and all the regulatory sequences (promoter) required
to control expression of that protein.
Gene chips (also Gene arrays)
The covalent attachment of oligonucleotides or cDNA directly onto a small glass
or silicon chip in organized arrays. Over 50,000 different DNA fragments can be
presented on a single chip providing a high throughput parallel method of probing
gene expression, genotype or gene function.
The conversion of information from gene to protein via transcription and translation.
Subsets of genes containing homologous sequences which usually correlate with a
A collection of cloned DNA fragments created by restriction endonuclease digestion
that represent part or all of an organism's genome.
The product, either RNA or protein, that results from expression of a gene. The
amount of gene product reflects the activity of the gene.
The use of genetic material for therapeutic purposes. The therapeutic gene is typically
delivered using recombinant virus or liposome based delivery systems.
The mapping of all possible codons into the 20 amino acids including the start and
Genetic engineering (Recombinant DNA technology)
The procedures used to isolate, splice and manipulate DNA outside the cell. Genetic
Engineering allows a recombinantly engineered DNA segment to be introduced into
a foreign cell or organism, and be able to replicate and function normally.
Any gene that can be readily recognized by its phenotypic effect, and which can
be used as a marker for a cell, chromosome, or individual carrying that gene. Also,
any detectable polymorphism used to identify a specific gene.
The complete genetic content of an organism.
Genomic DNA (sequence)
DNA sequence typically obtained from mammalian or other higher-order species, which
includes both intron and exon sequence (coding sequence), as well as non-coding
regulatory sequences such as promoter, and enhancer sequences.
The analysis of the entire genome of a chosen organism.
Strictly, all of the genes possessed by an individual. In practice, the particular
alleles present in a specific genetic locus.
The addition of carbohydrate groups (sugars) e.g. to polypeptide chains
One of the nitrogenous purine bases found in DNA and RNA
A double-helical region in a single DNA or RNA strand formed by the hydrogen-bonding
between adjacent inverse complementary sequences to form a hairpin shaped structure.
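A hairpin-forming region can be found by looking for a short segment whose reverse complement occurs a few bases downstream on the same strand. This sketch handles DNA only and demands perfect complementarity in the stem; the function names and the default stem and loop lengths are assumptions for illustration.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(DNA_COMPLEMENT)[::-1]


def find_hairpin(seq: str, stem: int = 4, min_loop: int = 3):
    """Return (stem_start, loop_length) of the first perfect hairpin, else None."""
    n = len(seq)
    for i in range(n - 2 * stem - min_loop + 1):
        # The 3' arm must be the inverse complement of the 5' arm.
        target = reverse_complement(seq[i:i + stem])
        for j in range(i + stem + min_loop, n - stem + 1):
            if seq[j:j + stem] == target:
                return i, j - (i + stem)
    return None


# Stem GGGC pairs with GCCC around a four-base AAAA loop.
print(find_hairpin("GGGCAAAAGCCC"))  # (0, 4)
```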
A cell or organism containing only one set of chromosomes without the homologous
pairs. (cf. diploid)
Protein composed of 2 different chains or subunits.
Hybrid structure formed by the annealing of two DNA strands (or an RNA and DNA)
that have sufficient complementarity in their sequence to allow hydrogen bonding.
Hidden Markov model (HMM)
A joint statistical model for an ordered sequence of variables. The result of stochastically
perturbing the variables in a Markov chain (the original variables are thus "hidden"),
where the Markov chain has discrete variables which select the "state" of the HMM
at each step. The perturbed values can be continuous and are the "outputs" of the
HMM. A Hidden Markov Model is equivalently a coupled mixture model where the joint
distribution over states is a Markov chain. Hidden Markov models are valuable in
bioinformatics because they allow a search or alignment algorithm to be trained
using unaligned or unweighted input sequences; and because they allow position-dependent
scoring parameters such as gap penalties, thus more accurately modeling the consequences
of evolutionary events on sequence families.
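The following sketch makes the hidden-state idea concrete with a two-state HMM over DNA bases and the forward algorithm, which sums over all hidden state paths to give the probability of an observed sequence. The state names and every probability below are invented for illustration and are not taken from any published model.

```python
from itertools import product

STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def forward_probability(observed: str) -> float:
    """P(observed) under the toy HMM, summed over all hidden state paths."""
    if not observed:
        return 1.0
    # alpha[s] = P(symbols so far, current hidden state is s)
    alpha = {s: START[s] * EMIT[s][observed[0]] for s in STATES}
    for symbol in observed[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


print(forward_probability("A"))  # 0.25
# Sanity check: probabilities of all length-3 sequences should sum to about 1.
print(sum(forward_probability("".join(seq)) for seq in product("ACGT", repeat=3)))
```

Profile HMMs used for sequence families extend this idea with match, insert and delete states per alignment column, which is where the position-dependent gap scoring comes from.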
The method by which very large numbers of compounds are screened against a putative
drug target in either cell-free or whole-cell assays. Typically, these screenings
are carried out in 96-well plates using automated, robotic station-based technologies
or in higher-density array ("chip") formats.
Another name for the MHC in humans; refers to the "Human Leukocyte Antigen" complex
located on chromosome 6.
A highly conserved region in a homeotic gene composed of 180 bases (60 amino acids)
that specifies a protein domain (the homeodomain) that serves as a master genetic
regulatory element in cell differentiation during development in species as diverse
as worms, fruit flies, and humans.
A 60 amino-acid protein domain coded for by the homeobox region of a homeotic gene.
A gene that controls the activity of other genes involved in the development of
a body plan. Homeotic genes have been found in organisms ranging from plants to
(strict) Two or more biological species, systems or molecules that share a common
evolutionary ancestor. (general) Two or more gene or protein sequences that share
a significant degree of similarity, typically measured by the amount of identity
(in the case of DNA), or conservative replacements (in the case of protein), that
they register along their lengths. Sequence "homology" searches are typically performed
with a query DNA or protein sequence to identify known genes or gene products that
share significant similarity and hence might inform on the ancestry, heritage and
possible function of the query gene.
Genes that are always expressed (i.e. they are said to be constitutively expressed)
due to their constant requirement by the cell.
Human Anti-Murine Antibody Response (HAMA)
An immune response generated in humans to antibodies raised in murine (e.g. mouse
or rat) cells.
The interaction of complementary nucleic acid strands. This can occur between two
DNA strands or between DNA and RNA strands, and is the basis of many techniques
such as Southern and northern blots.
A weak chemical interaction between an electronegative atom (e.g. nitrogen or oxygen)
and a hydrogen atom that is covalently attached to another atom. Hydrogen bonds hold
the two strands of the DNA double helix together and are also the primary interaction between water
(lit. water-loving) The degree to which a molecule is soluble in water. Hydrophilicity
depends to a large degree on the charge and polarizability of the molecule and its
ability to form transient hydrogen-bonds with (polar) water molecules.
(lit. water-hating) The degree to which a molecule is insoluble in water, and hence
is soluble in lipids. If a molecule lacking polar groups is placed in water, it
will be entropically driven to finding a hydrophobic environment (such as the interior
of a protein or a membrane).
Antibody variants localized to the variable portion of an immunoglobulin that are
recognized by their antigenic determinants. The determinants are composed from the
antigen-combining site or CDRs. Every unique antigenic determinant has a specific
antibody with its own unique idiotype.
A member of the globulin protein family consisting of two light and two heavy chains
linked by disulfide bonds. All antibodies are immunoglobulins.
in silico (biology)
(Lit. "in silicon", i.e. computer mediated.) The use of computers to simulate, process, or analyze
a biological experiment.
in situ hybridization
A variation of the DNA/RNA hybridization procedure in which the denatured DNA is
in place in the cell and is then challenged with RNA or DNA extracted from another
source. (See also fluorescence in situ hybridization).
The physical insertion of DNA into the host cell genome. The process is used by
retroviruses where a specific enzyme catalyses the process or can occur at random
sites with other DNA (e.g. transposons).
The communication of a molecular message from the surface of the cell to the nucleus
via the participation of a series of molecules, including receptors, enzymes, proteins,
and small-molecules. The end result of the signaling process is the up- or down-regulation
of a particular series of genes that may be involved in cell growth, division or
Nucleotide sequences found in the structural genes of eukaryotes that are non-coding
and interrupt the sequences containing information that codes for polypeptide chains.
Intron sequences are spliced out of their RNA transcripts before maturation and
protein synthesis. (cf. Exons)
Two different restriction enzymes which recognize and cut DNA at the same recognition
site, e.g., Sma I and Xma I both recognize the sequence CCCGGG (although they cleave
it at different positions within the site).
Two or more enzymes capable of catalyzing the same reaction but varying in their
specificity due to differences in their structures and hence their efficiencies
under different environmental conditions.
A series of steps in an algorithm whereby the processing of data is performed repetitively
until the result exceeds a particular threshold. Iteration is often used in multiple
sequence alignments whereby each set of pairwise alignments are compared with every
other, starting with the most similar pairs and progressing to the least similar,
until there are no longer any sequence-pairs remaining to be aligned.
Term used to describe the excess DNA that is present in the genome beyond that required
to encode proteins. A misleading term since these regions are likely to be involved
in gene regulation, and other as yet unidentified functions.
The constitution (typically number and size) of chromosomes in a cell or individual.
Knockout mice (gene targeting)
Mice which have been engineered to lack a chosen gene. The gene is inactivated in
so called embryonic stem cells using the technique of homologous recombination.
These cells are then introduced into an early-stage embryo (blastocyst) and this
is then transplanted into a recipient mouse. The subsequent progeny lack the targeted
gene in some cells. This technique is used to determine the function of the chosen
"Lab on a chip"
Term describing microdevices that allow rapid microanalysis of DNA or
protein in a single, fully integrated system. Typically, these devices are miniature
surfaces, made of silicon, glass or plastic, which carry the necessary microdevices
(pumps, valves, microfluidic controllers, and detectors) that allow sample separation
and analysis. These devices are used in drug discovery, genetic testing and separation
A candidate compound identified as the best "hit" (tight binder) after screening
of a combinatorial (or other) compound library, that is then taken into further
rounds of screening to determine its suitability as a drug.
The process of converting a putative lead compound ("hit") into a therapeutic drug
with maximal activity and minimal side effects, typically using a combination of
computer-based drug design, medicinal chemistry and pharmacology.
Protein motif which binds DNA in which 4-5 Leucines are found at 7 amino acid intervals.
This motif is present typically in transcription factors and other proteins that
In Bioinformatics, a lexicon refers to a pre-defined list of terms that together
completely define the contents of a particular database. (strict.) The component
in the grammar which is in bare form a list of words or lexical entries.
A large collection of compounds, peptides
or genes which may be screened in order to isolate cognate molecules.
Any small molecule that binds to a protein or receptor; the cognate partner of many
cellular proteins, enzymes, and receptors.
The association of genes (or genetic loci) on the same chromosome. Genes that are
linked together tend to be transmitted together.
A genetic map of a chromosome or genome delineated by mapping the positions of genes
to their chromosomes by their linkage to readily identifiable genetic loci.
The specific position occupied by a gene on a chromosome. At a given locus, any
one of the variant forms of a gene may be present. The variants are said to be alleles
of that gene.
A measure of genetic distance between two linked genes that corresponds to a recombination
frequency of 1%.
Any multivariate probability density whose independence diagram is a chain. The variables
are ordered, and each variable "depends" only on its neighbors in the sense of being
conditionally independent of the others. Markov chains are an integral component
of hidden Markov models.
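The chain structure described above can be sketched for a DNA sequence, assuming a first-order chain in which each base depends only on the base before it. The function name and the toy probabilities below are hypothetical and chosen only for illustration.

```python
import math

def markov_log_prob(seq, initial, transition):
    """Log-probability of seq under a first-order Markov chain.

    initial[b] is the probability of starting with base b;
    transition[a][b] is the probability of b following a.
    """
    logp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transition[prev][cur])
    return logp

# Hypothetical toy parameters: uniform, except that C is usually followed by G.
initial = {b: 0.25 for b in "ACGT"}
transition = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
transition["C"] = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}

print(markov_log_prob("ACG", initial, transition))
```

In a hidden Markov model the same chain runs over unobserved states, and each state emits the observed symbols with its own probabilities.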
A process within the cell nucleus that results in the reduction of the chromosome
number from diploid (two copies of each chromosome) to haploid (a single copy) through
two reductive divisions in germ cells.
Melting (of DNA)
The denaturation of double-stranded DNA into two single strands by the application
of heat. (Denaturation breaks the hydrogen bonds holding the double-stranded DNA
Messenger RNA (mRNA)
The complementary RNA copy of DNA formed from a single-stranded DNA template during
transcription. It is processed in the nucleus and then migrates to the cytoplasm,
where it carries the information to code for a polypeptide.
The addition of -CH3 (methyl) groups to a target site. Typically such addition occurs
on to the cytosine bases of DNA. (see maternal imprinting).
A 2D array, typically on a glass, filter, or silicon wafer, upon which genes or
gene fragments are deposited or synthesized in a predetermined spatial order allowing
them to be made available as probes in a high-throughput, parallel manner.
The miniaturization of chemical reactions or pharmacological assays into microscopic
tubes or vessels in order to greatly increase their throughput, by placing many
of them side-by-side in an array.
Compounds that mimic the function of other molecules via their high degree of structural
(conformational) similarity, and hence physicochemical properties.
A point mutation in which one codon (triplet of bases) is changed into another designating
a different amino acid.
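As an illustration of the definition above, the following sketch classifies a single-base change within a codon using the standard genetic code. The function name and return labels are hypothetical; the table is built from the usual TCAG ordering of the standard code.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_point_mutation(codon, position, new_base):
    """Classify a single-base substitution at 0-based position within a codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        return "silent"      # same amino acid (or stop) as before
    if before == "*":
        return "stop-loss"   # a stop codon now encodes an amino acid
    if after == "*":
        return "nonsense"    # amino acid codon converted to a stop codon
    return "missense"        # codon now designates a different amino acid

print(classify_point_mutation("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
```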
The nuclear division that results in the replication of the genetic material and
its redistribution into each of the daughter cells during cell division.
In bioinformatics, modeling usually refers to molecular modeling, a process whereby
the three-dimensional architecture of biological molecules is interpreted (or predicted),
visually represented, and manipulated in order to determine their molecular properties.
(general) A series of mathematical equations or procedures which simulate a real-life
process, given a set of assumptions, boundary parameters, and initial conditions.
A single unit of any biological molecule or macromolecule, such as an amino acid,
nucleotide, polypeptide domain, or protein.
Having one binding site; strictly, an atom with only one free electron available
for binding in its highest energy shell.
A conserved element of a protein sequence alignment that usually correlates with
a particular function. Motifs are generated from a local multiple protein sequence
alignment corresponding to a region whose function or structure is known. It is
sufficient that it is conserved, and is hence likely to be predictive of any subsequent
occurrence of such a structural/functional region in any other novel protein sequence.
A set of genes derived by duplication of an ancestral gene, followed by independent
mutational events resulting in a series of independent genes either clustered together
on a chromosome or dispersed throughout the genome.
Multiple (sequence) alignment
A Multiple Alignment of k sequences is a rectangular array, consisting of characters
taken from the alphabet A, that satisfies the following conditions: There are exactly
k rows; ignoring the gap character, row number i is exactly the sequence s_i; and
each column contains at least one character different from "-". In practice multiple
sequence alignments include a cost/weight function, that defines the penalty for
the insertion of gaps (the "-" character) and weights identities and conservative
substitutions accordingly. Multiple alignment algorithms attempt to create the optimal
alignment defined as the one with the lowest cost/weight score.
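The three conditions in the definition above, together with a simple cost function, can be sketched as follows. The sum-of-pairs scheme and the mismatch and gap penalties are illustrative assumptions, not a scoring scheme prescribed by the glossary.

```python
from itertools import combinations

GAP = "-"

def is_valid_alignment(rows, sequences):
    """Check the defining conditions of a multiple alignment of k sequences."""
    if len(rows) != len(sequences):            # exactly k rows
        return False
    if len({len(r) for r in rows}) != 1:       # rectangular array
        return False
    if any(r.replace(GAP, "") != s for r, s in zip(rows, sequences)):
        return False                           # row i without gaps must equal s_i
    # every column must contain at least one non-gap character
    return all(any(ch != GAP for ch in col) for col in zip(*rows))

def sum_of_pairs_cost(rows, mismatch=1, gap=2):
    """Sum-of-pairs cost with assumed penalties; lower is better."""
    cost = 0
    for col in zip(*rows):
        for a, b in combinations(col, 2):
            if a == GAP and b == GAP:
                continue                       # gap against gap costs nothing
            if a == GAP or b == GAP:
                cost += gap
            elif a != b:
                cost += mismatch
    return cost

rows = ["AC-GT", "ACAGT", "A--GT"]
print(is_valid_alignment(rows, ["ACGT", "ACAGT", "AGT"]), sum_of_pairs_cost(rows))
```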
Approach to high-throughput sequencing that uses several pooled DNA samples run
through gels simultaneously and then separated and analyzed.
Any agent that can cause an increase in the rate of mutations in an organism.
An inheritable alteration to the genome that includes genetic (point or single base)
changes, or larger scale alterations such as chromosomal deletions or rearrangements.
Pure, isolated DNA devoid of any proteins that may bind to it.
NCEs (New Chemical Entity)
Compounds identified as potential drugs that are sent from research and development
into clinical trials to determine their suitability.
The second round amplification of an already PCR-amplified sequence using a new
pair of primers which are internal to the original primers. Typically done when
a single PCR reaction generates insufficient amounts of product.
A neural net is an interconnected assembly of simple processing elements, units
or nodes, whose functionality is loosely based on the animal brain. The processing
ability of the network is stored in the inter-unit connection strengths, or weights,
obtained by a process of adaptation to, or learning from, a set of training patterns.
Neural nets are used in bioinformatics to map data and make predictions, such as
taking a multiple alignment of a protein family as a training set in order to identify
novel members of the family from their sequence data alone.
A point mutation in which a codon specific for an amino-acid is converted into a
A technique to identify RNA molecules by hybridization that is analogous to Southern
blotting (see Southern blotting).
Any enzyme that can cleave the phosphodiester bonds of nucleic acid backbones.
A five-carbon sugar covalently attached to a nitrogen base.
A nucleic acid unit composed of a five carbon sugar joined to a phosphate group
and a nitrogen base.
Object databases combine the elements of object orientation and object-oriented
programming languages with database capabilities. They provide more than persistent
storage of programming language objects. Object databases extend the functionality
of object programming languages (e.g., C++, Smalltalk, or Java) to provide full-featured
database programming capability. The result is a high level of congruence between
the data model for the application and the data model of the database. Object-relational
databases are used in Bioinformatics to map molecular biological objects (such as
sequences, structures, maps and pathways) to their underlying representations (typically
within the rows and columns of relational database tables.) This enables the user
to deal with the biological objects in a more intuitive manner, as they would in
the laboratory, without having to worry about the underlying data model of their
A short molecule consisting of several linked nucleotides (typically between 10
and 60) covalently attached by phosphodiester bonds.
Open reading frame (ORF)
Any stretch of DNA that potentially encodes a protein. Open reading frames start
with a start codon, and end with a termination codon. No termination codons may
be present internally. The identification of an ORF is the first indication that
a segment of DNA may be part of a functional gene.
A segment of DNA that interacts with the products of regulatory genes and facilitates
the transcription of one or more structural genes.
A unit of transcription consisting of one or more structural genes, an operator,
and a promoter.
Orthologs are genes in different species that evolved from a common ancestral gene
by speciation. Normally, orthologs retain the same function in the course of evolution.
Identification of orthologs is critical for reliable prediction of gene function
in newly sequenced genomes. (See also Paralogs.)
Collection of cloned sequences made by generating randomly overlapping DNA fragments
with infrequently cutting restriction enzymes.
A region of DNA with a symmetrical arrangement of bases occurring about a single
point such that the base sequences on either side of that point are identical (if
the strands are both read in the same direction), e.g. 5' GAATTC 3', whose complementary
sequence is 3' CTTAAG 5'.
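The symmetry in this definition amounts to a sequence being equal to its own reverse complement, which can be checked with a short sketch (function names are hypothetical):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base, then reverse so the result reads 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def is_palindromic(dna):
    """True if both strands read the same in the 5' to 3' direction."""
    return dna.upper() == reverse_complement(dna)

print(is_palindromic("GAATTC"), is_palindromic("GATTACA"))
```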
Molecular biological patterns usually occur at the level of the characters making
up the gene or protein sequence. A pattern language must be defined in order to
apply different criteria to different positions of a sequence. In order to have
position-specific comparison done by a computer, a pattern-matching algorithm must
allow alternative residues at a given position, repetitions of a residue, exclusion
of alternative residues, weighting, and ideally, combinatorial representation.
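Regular expressions are one possible pattern language covering several of the requirements listed above (alternative residues, excluded residues, and repetitions); weighting is not covered. The sketch below is illustrative: the helper name is hypothetical, and the example pattern is the familiar N-glycosylation-style motif (N, then any residue except P, then S or T, then any residue except P).

```python
import re

def find_pattern(protein, pattern):
    """Return (position, match) pairs, including overlapping matches.

    pattern must be a compiled regex of the form (?=(...)) so that
    the lookahead lets matches overlap.
    """
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein)]

# [ST] = alternative residues; [^P] = exclusion of a residue
GLYCO = re.compile(r"(?=(N[^P][ST][^P]))")
# .{2,4} = repetition: two cysteines separated by 2 to 4 residues
CYS_PAIR = re.compile(r"(?=(C.{2,4}C))")

print(find_pattern("MKNGSAANPTQNNTSA", GLYCO))
print(find_pattern("ACAAC", CYS_PAIR))
```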
Bioinformatics strives to define representations of key biological data types, algorithms
and inference procedures, including sequences, structures, biological pathways and
reactions. Representing and computing with biological pathways requires ontologies
for representing pathway knowledge; user interfaces to these databases; physicochemical
properties of enzymes and their substrates in pathways; and pathway analysis of
whole genomes, including identifying common patterns across species and species differences.
Paralogs are genes related by duplication within a genome. Orthologs retain the
same function in the course of evolution, whereas paralogs evolve new functions,
even if these are related to the original one.
Parameters are user-selectable values, typically experimentally determined, that
govern the boundaries of an algorithm or program. For instance, selection of the
appropriate input parameters governs the success of a search algorithm. Some of
the most common search parameters in bioinformatics tools include the stringency
of an alignment search tool, and the weights (penalties) provided for mismatches
A short stretch of amino acids each covalently coupled by a peptide (amide) bond.
Peptide bond (amide bond)
A covalent bond formed between two amino acids when the amino group of one is linked
to the carboxyl group of another (resulting in the elimination of one water molecule).
A virus that infects bacterial cells and serves as a useful vector for introducing
genes into bacteria for a number of purposes.
A technique in which phage are engineered to fuse a foreign peptide or protein with
their capsid (surface) proteins and hence display it on their surfaces. The
immobilized phage may then be used as a screen to see what ligands bind to the expressed
fusion protein exhibited (displayed) on the phage surface.
The use of (DNA-based) genotyping in order to target pharmaceutical agents to specific
patient populations. Genetic differences are known to affect responses to many types
of drug therapy, and pharmacogenomics analysis serves to customize the use of pharmaceuticals
for specific subgroups of patients. The rationale for this approach is that observed
gene expression differences may correlate with, and explain, the differences in
side effects and efficacy to drugs in humans.
The three dimensional spatial arrangement of atoms, substituents, functional groups,
or chemical features that together are sufficient to describe the pharmacologically
active components of a drug molecule or molecule series.
Any observable feature of an organism that is the result of one or more genes.
The segmentation of the animal kingdom into about 30 major groups collectively known
as phyla. The members of each phylum share the same basic structure and organization.
For instance, fish, birds, and human beings belong to one phylum - the Chordata
- because all have a notochord at some stage of their development.
A physical map consists of a linearly ordered set of DNA fragments encompassing
the genome or region of interest. Physical maps are of two types, macro-restriction
maps and ordered clone maps. The former consists of an ordered set of large DNA
fragments generated by using restriction enzymes whose recognition sequences are
infrequently represented in the genome. An ordered clone map consists of an overlapping
collection of cloned DNA fragments. The DNA may be cloned into any one of the available
vector systems -- YACs, cosmids, phage, or even plasmids. Major advantages of ordered
clone maps are that they are of high resolution and directly provide the clones
for further study.
Any replicating DNA element that can exist in the cell independently of the chromosomes.
Synthetic plasmids are used for DNA cloning. Most commonly found in bacterial cells.
The multiple effects on an organism's phenotype due to a single gene or allele, e.g.
the cytokines, which can bind to multiple cellular receptors and affect growth and
multiple immune pathways.
A mutation in which a single nucleotide in a DNA sequence is substituted by another
The stretch of Adenine (A) residues at the 3' end of eukaryotic mRNA that is added
to the pre-mRNA as it is processed, before its transport from the nucleus to the
cytoplasm and subsequent translation at the ribosome.
A site on the 3' end of messenger RNA (mRNA) that signals the addition of a series
of Adenines during the RNA processing step and before the mRNA migrates to the cytoplasm.
These so-called poly(A) "tails" increase mRNA stability and allow one to isolate
mRNA from cells by PCR-amplification using poly(T) primers.
Inheritance involving alleles at many genetic loci.
Polymerase chain reaction (PCR)
Technique used to amplify or generate large amounts of replica DNA of a segment
of any DNA whose "flanking" sequences are known. Oligonucleotide primers which bind
these flanking sequences are used by an enzyme (Taq polymerase) to copy the sequence
in between the primers. Cycles of heat to break apart the DNA strands, cooling to
allow the primers to bind, and heating again to allow the enzyme to copy the intervening
sequence lead to a doubling of DNA at each cycle. The reactions are typically carried
out on a regulated heating block and consist of 30-35 cycles of repeated amplification
of all the DNA present. Single molecules of "target" DNA can be amplified to microgram
amounts of DNA. The target DNA can be of any origin.
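The doubling at each cycle described above implies exponential growth in copy number. A small sketch of that arithmetic follows; the `efficiency` parameter is an added assumption (1.0 corresponds to perfect doubling), and real reactions plateau well before the ideal figure is reached.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a number of PCR cycles.

    efficiency is the assumed fraction of templates copied per cycle;
    1.0 means the amount of target DNA doubles every cycle.
    """
    return initial_copies * (1 + efficiency) ** cycles

# A single target molecule after 30 ideal cycles: 2**30, about a billion copies.
print(pcr_copies(1, 30))
```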
(lit. many forms) The existence of a gene in a population in at least two different
forms at a frequency far higher than that attributable to recurrent mutation alone.
Variations in a population may be measured by determining the rate of mutation in
polymorphic genes (see SNPs).
A single chain of covalently attached amino acids joined by peptide bonds. Polypeptide
chains usually fold into a compact, stable form (a domain) that is part (or all)
of the final protein.
Method used to define the location of a gene on a chromosome and use this information
to identify and clone the gene. The location of the gene is determined by linkage
analysis of DNA from a large family containing afflicted and normal members to identify
linkages between the transmission of the disease gene and observable genetic markers.
This information is then used to screen (by chromosomal jumping and walking) the
location for putative genes. The disease gene must be compared between the afflicted
and normal family members and be shown to be different in the two groups. The full
sequencing of the gene will then provide information regarding the characteristics
and function of the gene product, and a potential explanation for the cause of the
Alterations made to pre-mRNA before it leaves the nucleus and becomes mature mRNA.
Alterations made to a protein after its synthesis at the ribosome. These modifications,
such as the addition of carbohydrate or fatty acid chains, may be critical to the
function of the protein.
Primary sequence (protein)
The linear sequence of a polypeptide or protein.
Primary structure (protein)
see primary sequence.
A short oligonucleotide that provides a free 3' hydroxyl for DNA or RNA synthesis
by the appropriate polymerase (DNA polymerase or RNA polymerase).
Any biochemical that is labeled or tagged in some way so that it can be used to
identify or isolate a gene, RNA, or protein.
Sequence profiles are usually derived from multiple alignments of sequences with
a known relationship, and consist of tables of position-specific scores and gap-penalties.
Each position in the profile contains scores for all of the possible amino acids,
as well as one penalty score for opening and one for continuing a gap at the specified
position. Attempts have been made to further improve the sensitivity of the profile
by refining the procedures to construct a profile starting from a given multiple
alignment. Other representations for sequence domains or motifs do not necessarily
require the presence of a correct and complete multiple alignment, such as hidden
An organism or cell that lacks a membrane-bounded nucleus. Bacteria (including the
blue-green algae, or cyanobacteria) and archaea are the surviving prokaryotes (cf. Eukaryote).
A promoter site is defined by its recognition by eukaryotic RNA polymerase II; its
activity in a higher eukaryote; by experimental evidence, or homology and sufficient
similarity to an experimentally defined promoter; and by observed biological function.
Sets of proteins that share a common evolutionary origin reflected by their relatedness
in function which is usually reflected by similarities in sequence, or in primary,
secondary or tertiary structure. Subsets of proteins with related structure and
The entire protein complement of a given organism.
The study of the proteome. Typically, the cataloging of all the expressed proteins
in a particular cell or tissue type, obtained by identifying the proteins from cell
extracts using a combination of 2D gel electrophoresis and mass spectrometry. The
large-scale analysis of protein composition and function. (cf. genomics)
A nitrogen-containing compound with a double-ring structure. The parent compound
of Adenine and Guanine.
A nitrogen-containing compound with a single six-membered ring structure. The parent
compound of Thymine, Cytosine, and Uracil.
A DNA, RNA or protein sequence used to search a sequence database in order to identify
close or remote family members (homologs) of known function, or sequences with similar
active sites or regions (analogs), from which the function of the query may be deduced.
Rational drug design (Structure based drug design)
The development of drugs based on the 3-dimensional molecular structure of a particular
A sequence of codons beginning with an initiation codon and ending with a termination
codon, typically of at least 150 bases (50 amino acids) coding for a polypeptide
or protein chain (see ORF and URF).
Sources of biological or chemical material that can be used as the starting blocks
in laboratory experiments. Reagents can range from chemicals needed to perform a
particular chemical reaction, constituents of a laboratory protocol, or clones to
be used in a large-scale gene expression study.
Any trait that is expressed phenotypically only when present on both alleles of
a gene (cf. dominant).
Recombinant DNA (rDNA)
DNA molecules resulting from the fusion of DNA from different sources. The technology
employed for splicing DNA from different sources and for amplifying the resultant
A new combination of alleles resulting from the rearrangement occurring by crossing-over
or by independent assortment (see crossing over).
An algorithmic procedure whereby an algorithm calls on itself to perform a calculation
until the result exceeds a threshold, in which case the algorithm exits. Recursion
is a powerful procedure with which to process data and is computationally quite
A DNA sequence that functions to control the expression of other genes by producing
a protein that modulates the synthesis of their products (typically by binding to
the gene promoter). (cf. Structural gene).
A database that follows E. F. Codd's 12 rules, a series of mathematical and logical
steps for the organization and systemization of data into a software system that
allows easy retrieval, updating, and expansion. An RDBMS stores data in a database
consisting of one or more tables of rows and columns. The rows correspond to a record
(tuple); the columns correspond to attributes (fields) in the record. In an RDBMS,
a view, defined as a subset of the database that is the result of the evaluation
of a query, is a table. RDBMSs use Structured Query Language (SQL) for data definition,
data management, and data access and retrieval. Relational and object-relational
databases are used extensively in bioinformatics to store sequence and other biological
Relational Database Management Systems (RDBMS)
A software system that includes a database architecture, query language, and data
loading and updating tools and other ancillary software that together allow the
creation of a relational database application.
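The table, row, column, and view concepts described above can be sketched with Python's built-in `sqlite3` module. The table name, column names, and records are hypothetical and exist only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# One table: each row is a record (tuple), each column an attribute (field).
conn.execute(
    "CREATE TABLE sequence (accession TEXT PRIMARY KEY, organism TEXT, residues TEXT)"
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [
        ("X0001", "Homo sapiens", "ATGGCC"),     # hypothetical records
        ("X0002", "Mus musculus", "ATGAAATAA"),
        ("X0003", "Homo sapiens", "GAATTC"),
    ],
)

# A view is the result of a query and can itself be queried like a table.
conn.execute(
    "CREATE VIEW human_sequence AS "
    "SELECT accession, length(residues) AS length "
    "FROM sequence WHERE organism = 'Homo sapiens'"
)

rows = conn.execute(
    "SELECT accession, length FROM human_sequence ORDER BY accession"
).fetchall()
print(rows)
```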
Repeats (repeat sequences)
Repeat sequences and approximate repeats occur throughout the DNA of higher organisms
(mammals). For example, the Alu sequences, about 300 characters long, appear
hundreds of thousands of times in human DNA with about 87% homology to a consensus
Alu string. Some short substrings such as TATA-boxes, poly-A and (TG)* also appear
more often than by chance. Repeat sequences may also occur within genes, as mutations
or alterations to those genes. Repetitive sequences, especially mobile elements,
have many applications in genetic research. DNA transposons and retroposons are
routinely used for insertional mutagenesis, gene mapping, gene tagging, and gene
transfer in several model systems.
Repetitive elements provide important clues about chromosome dynamics, evolutionary
forces, and mechanisms for exchange of genetic information between organisms. The
most ubiquitous class of repetitive elements in the DNA sequence in primate genomes
is the Alu family of interspersed repeats, which have arisen in the last 65 million
years of evolution. Alu repeats belong to a class of sequences defined as short interspersed
elements (SINEs). Approximately 500,000 Alu SINEs exist within the human genome,
representing about 5% of the genome by mass.
The synthesis of an informationally identical macromolecule (e.g. DNA) from a template
The protein product of a regulatory gene that combines with a specific operator
(regulatory DNA sequence) and hence blocks the transcription of genes in an operon.
Restriction enzyme (restriction endonuclease)
A type of enzyme that recognizes specific DNA sequences (usually palindromic sequences
4, 6, 8 or 16 base pairs in length) and produces cuts on both strands of DNA containing
those sequences only. The "molecular scissors" of rDNA technology.
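The "molecular scissors" behaviour can be sketched as a search for the recognition sequence followed by a cut at a fixed offset within each site. The sketch assumes linear DNA and models only the top strand; function names are hypothetical. EcoRI, which recognizes GAATTC and cuts after the first G, is used as the example.

```python
def find_sites(dna, site):
    """Return the 0-based start positions of every occurrence of site."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions

def digest(dna, site, cut_offset):
    """Cut linear DNA cut_offset bases into each site; return the fragments."""
    cuts = [p + cut_offset for p in find_sites(dna, site)]
    bounds = [0] + cuts + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]

# EcoRI: G^AATTC, so the cut falls one base into the site.
print(digest("AAGAATTCTTGAATTCGG", "GAATTC", 1))
```

Comparing the fragment lengths obtained from different individuals is the basis of the RFLP analysis described in the next entry.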
Restriction fragment length polymorphisms (RFLPs)
Variation within the DNA sequences of organisms of a given species that can be identified
by fragmenting the sequences using restriction enzymes, since the variation lies
within the restriction site. RFLPs can be used to measure the diversity of a gene
in a population.
A physical map or depiction of a gene (or genome) derived by ordering overlapping
restriction fragments produced by digestion of the DNA with a number of restriction
The use of protein information to elucidate the genetic sequence encoding that protein.
Used to describe the process of gene isolation starting with a panel of afflicted
patients (see positional cloning).
A DNA polymerase that can synthesize a complementary DNA (cDNA) strand using RNA
as a template - a so-called RNA-dependent DNA polymerase.
Reverse transcriptase-PCR (RT-PCR)
Procedure in which PCR amplification is carried out on DNA that is first generated
by the conversion of mRNA to cDNA using reverse transcriptase.
Ribonucleic acid (RNA)
A category of nucleic acids in which the component sugar is ribose and consisting
of the four nucleotides Cytosine, Uracil, Guanine, and Adenine. The three types
of RNA are messenger RNA (mRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA).
Secondary structure (protein)
The organization of the peptide backbone of a protein that occurs as a result of
hydrogen bonds, e.g. alpha helix, beta-pleated sheet.
Selectivity of bioinformatics similarity search algorithms is defined as the significance
threshold for reporting database sequence matches. As an example, for BLAST searches,
the parameter E is interpreted as the upper bound on the expected frequency of chance
occurrence of a match within the context of the entire database search. E may be
thought of as the number of matches one expects to observe by chance alone during
the database search.
The strand of double-stranded DNA whose sequence corresponds to that of the RNA transcript; the complementary (antisense) strand acts as the template for RNA synthesis.
Typically only one gene product is produced per gene, reading from the sense strand
only. (Some viruses have open reading frames in both the sense and the antisense
Sensitivity of bioinformatics similarity search algorithms centers around two areas:
first, how well can the method detect biologically meaningful relationships between
two related sequences in the presence of mutations and sequencing errors; and second,
how does the heuristic nature of the algorithm affect the probability that a matching
sequence will not be detected. At the user's discretion, the speed of most similarity
search programs can be sacrificed in exchange for greater sensitivity - with an
emphasis on detecting lower scoring matches.
Sequence Tagged Site (STS)
A unique sequence from a known chromosomal location that can be amplified by PCR.
STSs act as physical markers for genomic mapping and cloning.
Sexual PCR (Molecular Diversity)
Sexual PCR is a form of PCR in which similar, but not identical, DNA sequences are
reassembled to obtain novel juxtapositions, simulating the result of genetic recombination.
The result is the creation of an array of related genes which may possess improved
characteristics. By repeated rounds of recombination, selection and PCR-based amplification
vastly improved gene-products, such as enzymes with greater activity, may be generated
The cloning of an entire gene segment or genome by generating a random set of fragments
using restriction endonucleases to create a gene library that can be subsequently
mapped and sequenced to reconstruct the entire genome.
Similarity (homology) search
Given a newly sequenced gene, there are two main approaches to the prediction of
structure and function from the amino acid sequence. Homology methods are the most
powerful and are based on the detection of significant extended sequence similarity
to a protein of known structure, or of a sequence pattern characteristic of a protein
family. Statistical methods are less successful but more general and are based on
the derivation of structural preference values for single residues, pairs of residues,
short oligopeptides or short sequence patterns. The transfer of structure/function
information to a potentially homologous protein is straightforward when the sequence
similarity is high and extended in length, but the assessment of the structural
significance of sequence similarity can be difficult when sequence similarity is
weak or restricted to a short region.
Signal sequence (leader sequence)
A short sequence added to the amino-terminal end of a polypeptide chain that forms
an amphipathic helix allowing the nascent polypeptide to migrate through membranes
such as the endoplasmic reticulum or the cell membrane. It is cleaved from the polypeptide
after the protein has crossed the membrane.
Single nucleotide polymorphisms (SNPs)
Variations of single base pairs scattered throughout the human genome that serve
as measures of the genetic diversity in humans. About 1 million SNPs are estimated
to be present in the human genome, and SNPs are useful markers for gene mapping
Rapid sequencing of large segments of the genome of an organism by isolating as
many expressed (cDNA) sequences as possible and performing single sequencer runs
on their 5' or 3' ends. Single-pass sequencing typically results in individual,
error-prone sequencing reads of 400-700 bases, depending on the type of sequencer
used. However, if many of these are generated from numerous clones from different
tissues, they may be overlapped and assembled to remove the errors and generate
a contiguous sequence for the entire expressed gene.
Sites in sequences can be located either in DNA (e.g. binding sites, cleavage sites)
or in proteins. In order to identify a site in DNA, ambiguity symbols are used to
allow several different symbols at one position. Proteins, however, need a different
mechanism (see Pattern). Restriction enzyme cleavage sites, for instance, have the
following properties: limited length (typically, less than 20 base pairs); definition
of the cleavage site and its appearance (3', 5' overhang or blunt); definition of
the binding site.
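The use of ambiguity symbols for DNA sites can be sketched by translating each IUPAC nucleotide code into a regular-expression character class. Function names are hypothetical; the example site GTYRAC (Y = C or T, R = A or G) is the HincII recognition sequence.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def site_to_regex(site):
    """Compile an ambiguous site into a regex; lookahead allows overlaps."""
    body = "".join(IUPAC[c] for c in site.upper())
    return re.compile("(?=(" + body + "))")

def find_site(dna, site):
    """Return the 0-based start positions at which the site matches."""
    return [m.start() for m in site_to_regex(site).finditer(dna.upper())]

print(find_site("AAGTCGACTTGTTAACGG", "GTYRAC"))
```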
A procedure for the identification of DNA by transmitting a fragment isolated on
an agarose gel to a nitrocellulose filter where it can be hybridized with a complementary
The sequence found at the 5' and 3' region of exon/intron boundaries, usually defined
by a consensus sequence:
5' (C/A)AGGT(A/G)AGT---------(T/C)N(C/T)AGG 3'
N represents any nucleotide; parentheses enclose the alternative nucleotides
at the indicated positions.
By using alternative splicing, a single message precursor from DNA can generate
an entire family of mRNAs and proteins. This can be utilized to create specificity
in cell-cell or cell-ligand interactions. A cell may produce a given protein, but
it will be a different splice-form of the protein than that produced by an adjacent
cell. In this manner, the two cells have the potential to interact differently with
other cells or molecules. Two places where this has been extremely important are
in the production of cell-surface specificity proteins in the immune and nervous
The joining together of separate DNA or RNA component parts. For example, RNA splicing
in eukaryotes involves the removal of introns and the stitching together of the
exons from the pre-mRNA transcript before maturation.
The surface area (typically measured in square angstroms) of a biological molecule,
usually a protein, that is exposed to solvent in its native, folded form. Determining
the solvent accessibility of a protein helps define which amino acids in its molecular
sequence are on the exterior of the molecule, and thus available to participate
in interactions with other molecules.
Gene which encodes a structural protein (cf. Regulatory gene).
Algorithms that predict the secondary, tertiary and sometimes even quaternary structure
of proteins from their sequences. Determining protein structure from sequence has
been dubbed "the second half of the Genetic Code" since it is the folded tertiary
structure of a protein that governs how it functions as a gene product. As yet most
structure prediction methods are only partially successful, and typically work best
for certain well-defined classes of proteins.
A model of protein evolution at the sequence level resulting in the development
of a set of widely used substitution matrices. These are frequently called Dayhoff,
MDM (Mutation Data Matrix), or PAM (Percent Accepted Mutation) matrices.
They are derived from global alignments of closely related sequences. Matrices for
greater evolutionary distances are extrapolated from those for lesser ones.
BLOSUM matrices, by contrast, are built directly from conserved local alignment
blocks rather than by extrapolation.
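
The extrapolation to greater evolutionary distances is commonly described as
repeated application of a one-step mutation probability matrix, that is, raising
that matrix to a power. The following is a minimal sketch of that idea only. The
three-letter alphabet, the values in PAM1_TOY, and the function names are all
hypothetical and are not taken from any published matrix.

```python
# Toy one-step mutation probability matrix over a hypothetical 3-letter
# alphabet. Rows are the original residue, columns the replacement.
# The numbers are invented for illustration; each row sums to 1.
PAM1_TOY = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.003, 0.007, 0.990],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def extrapolate(matrix, steps):
    """Return the matrix for `steps` evolutionary steps (matrix ** steps)."""
    size = len(matrix)
    # Start from the identity matrix (zero steps: nothing has mutated).
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(steps):
        result = mat_mul(result, matrix)
    return result


pam250_toy = extrapolate(PAM1_TOY, 250)
```

In the extrapolated matrix each row still sums to 1, while the probability of a
residue remaining unchanged drops well below its one-step value.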
A cDNA library that only contains cDNAs uniquely expressed in a given cell or tissue.
For example, T cells and B cells will express many common RNAs, as well as a very small percentage
which will be unique for T cells and B cells respectively. To make a T cell subtraction
library, the cDNA from a T cell library is hybridized with a vast excess of B cell
RNA. The commonly expressed genes will result in RNA-cDNA hybrids which can be removed
(or subtracted) to leave only T cell specific cDNAs.
Tentative Consensus (TC)
The identification of a sequence from an EST cluster that represents part or all
of a complete gene. TCs are usually determined by clustering ESTs allowing for sequencing
errors, artifacts such as chimeric clones, and naturally occurring biological phenomena
such as alternative splicing. Creation of a cluster allows one to generate a consensus
sequence and then identify a long open reading frame which would suggest the possibility
of that consensus representing a bona fide gene.
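
The last step, looking for a long open reading frame in the consensus, can be
sketched as follows. This is a simplified illustration, not the procedure used by
any particular TC pipeline: it assumes the standard stop codons, requires an ATG
start, and scans only the three forward-strand frames. The function name
longest_orf is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG-to-stop ORF on the forward strand, or ''."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


consensus = "CCATGAAATTTGGGTAACC"  # made-up example sequence
print(longest_orf(consensus))
```

A fuller version would also scan the reverse complement and tolerate ORFs that
run off the end of a partial consensus.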
Tentative Human Consensus sequences (THCs)
A consensus sequence generated from human EST fragments. THCs may be validated by
comparison against databases of known human gene sequences, human genomic sequences,
or by identification of the ORFs or other sequence features contained within the
consensus as belonging to a known human gene product.
Folding of a protein chain via interactions of its side chains, including
formation of disulphide bonds between cysteine residues.
A pyrimidine base found in DNA but not in RNA.
Section of an organ that consists of a largely homogenous population of cell types.
Since many organs are multifunctional, they have developed highly specialized cell
types to perform different functions. Identifying the section of an organ that is
homogenous for a particular cell type ensures that the gene expression profiles
extracted from those cells will accurately resemble the class of cells that make
up the tissue.
The single-stranded mRNA chain that is assembled from a gene template.
The assembly of complementary single-stranded RNA on a DNA template.
A group of regulatory proteins that are required for transcription in eukaryotes.
Transcription factors bind to the promoter region of a gene and facilitate transcription
by RNA polymerase.
Transfer RNA (tRNA)
A small RNA molecule that recognizes a specific amino acid, transports it to a specific
codon in the mRNA, and positions it properly in the nascent polypeptide chain.
A genetic alteration to a cell as a result of the incorporation of DNA from a genetically
different cell or virus; can also refer to the introduction of DNA into bacterial
cells for genetic manipulation.
A foreign gene that is introduced into a cell or whole organism (e.g. transgenic mice)
for therapeutic or experimental purposes.
The process of converting RNA to protein by the assembly of a polypeptide chain
from an mRNA molecule at the ribosome.
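
As an illustration of the codon-by-codon reading this implies, here is a minimal
sketch that translates an mRNA string using the standard genetic code. The
function name translate and the choice to begin at the first AUG are assumptions
made for the example; real translation initiation is more involved.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    seq = mrna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        # Unknown codons (for example, containing N) are reported as X.
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("GGAUGUUUAAAUGGUAACC"))
```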
The region of a transmembrane protein that actually spans the membrane. Transmembrane
regions are usually hydrophobic in order to be thermodynamically compatible with
the lipid bilayer portion of the membrane. They may consist of either alpha-helical
or beta-strand secondary structure elements, but in either case the external residues
(the ones facing the membrane) are invariably hydrophobic while the internal residues
may be hydrophilic (as in the case of a pore or channel) or polar. One common transmembrane
structural domain is the seven-helix bundle seen in G protein-coupled receptors.
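
Because membrane-spanning segments are hydrophobic, a simple way to look for
candidate regions is a sliding-window hydropathy average. The sketch below uses
the Kyte-Doolittle hydropathy scale. The window of 19 residues and the cutoff of
1.6 are commonly quoted settings that are treated here as assumptions, and the
function names are hypothetical. It is a rough screen, not a substitute for a
dedicated transmembrane predictor.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of `window` consecutive residues."""
    scores = [KYTE_DOOLITTLE[residue] for residue in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Start positions (0-based) of windows whose mean meets the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, mean in enumerate(profile) if mean >= threshold]
```

Overlapping windows above the cutoff would normally be merged into a single
candidate segment before further analysis.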
Unidentified reading frame (URF)
An open reading frame encoding a protein of undefined function.
Nitrogenous pyrimidine base found in RNA but not DNA.
Variable numbers of tandem repeats (VNTRs)
DNA sequence blocks of 2-60 base pairs which are repeated from two to more than
20 times in different individuals. This polymorphism makes VNTRs very useful DNA
markers used in genomic mapping, linkage analysis and also DNA fingerprinting.
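
The polymorphism lies in the copy number, so comparing individuals comes down to
counting consecutive copies of the repeat unit. A minimal sketch follows; the
sequences, the GATA repeat unit, and the function name are made up for
illustration, and exact (error-free) repeats are assumed.

```python
def max_tandem_copies(seq, unit):
    """Largest number of back-to-back exact copies of `unit` found in `seq`."""
    if not unit:
        raise ValueError("repeat unit must not be empty")
    seq, unit = seq.upper(), unit.upper()
    best = 0
    for start in range(len(seq)):
        copies = 0
        pos = start
        # Count consecutive copies of the unit beginning at this position.
        while seq.startswith(unit, pos):
            copies += 1
            pos += len(unit)
        best = max(best, copies)
    return best


allele_a = "AAGATAGATAGATACC"          # hypothetical allele with 3 copies
allele_b = "TTGATAGATAGATAGATAGATATT"  # hypothetical allele with 5 copies
print(max_tandem_copies(allele_a, "GATA"), max_tandem_copies(allele_b, "GATA"))
```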
Variation in genetic sequences and the detection of DNA sequence variants genome-wide
allow studies relating the distribution of sequence variation to a population history.
This in turn allows one to determine the density of SNPs or other markers needed
for gene mapping studies. Quantitation of these variations together with analytical
tools for studying sequence variation also relate genetic variations to phenotype.
Any agent that transfers material (typically DNA) from one host to another. Typically,
DNA vectors are either autonomous DNA elements (such as plasmids) that can be manipulated
and integrated into a host's DNA, or recombinant viruses.
The creation and storage of vast collections of molecular structures in an electronic
database. These databases may be queried for subsets that exhibit specific physicochemical
features, or may be "virtually screened" for their ability to bind a drug target.
This process may be performed prior to the synthesis and testing of the molecules
Visualization is the process of representing abstract scientific data as images
that can aid in understanding the meaning of the data.
The density of binding sites in a gene or sequence can be used to derive a ratio
of density for each element in a pattern of interest. The combined individual density
ratios of all elements are then collectively used to build a scoring profile known
as a weight matrix. This profile can be used to test the prediction of the identification
of the selected pattern and the ability of the algorithm to discriminate them from
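
One common way to realize such a scoring profile is a log-odds position weight
matrix built from aligned example sites. The sketch below is an illustration
under stated assumptions, not the specific method described above: it assumes
equal-length DNA sites, a uniform background frequency of 0.25 per base, and a
pseudocount of 1. The example sites and function names are hypothetical.

```python
import math


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2 odds} dict per position of the aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def score(matrix, candidate):
    """Sum the per-position log-odds for a candidate of the same length."""
    return sum(column[base] for column, base in zip(matrix, candidate))


sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]  # made-up aligned sites
pwm = build_weight_matrix(sites)
print(score(pwm, "TATAAT"), score(pwm, "GCGCGC"))
```

Sequences resembling the training sites score higher than unrelated ones, which
is what allows a threshold on the score to separate the two.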
Technique in which specific antibodies are used to identify their antigens from
a mixture of proteins. Typically, these protein mixtures are first separated by
electrophoresis and then transferred onto nylon sheets by electrotransfer. Radiolabeled
or enzyme-linked antibodies are incubated with the sheets and unbound antibodies
washed away allowing the position of the bound antibody to be revealed by autoradiography
or color which is formed upon addition of a substrate.
Form of a gene or allele that is considered the "standard" or most common.
In mammals, the sex chromosome that is found in two copies in the homogametic sex
(female in humans) and one copy in the heterogametic sex (male in humans).
Yeast 2-hybrid system
A yeast-based method used to simultaneously identify, and clone the gene for, proteins
interacting with a known protein. The basis of this method is a "transcriptional
reporter assay" (see definition) in which reporter gene expression is dependent
on two domains. The first domain is linked to the known protein. The second domain
is genetically linked to a library. If the library is screened against the known
protein the two domains will interact only if a protein from the library binds the
known protein, resulting in transcription activation of the reporter gene, and a
blue color. The "blue yeast clone" will contain the gene encoding the newly identified
A conformation of DNA existing as a left-handed double helix (the phosphate-sugar
backbone forms a left-handed zigzag course), which may play a role in gene regulation.
A protein motif formed by the interaction of repeated cysteine and histidine residues
with a zinc ion. The spacing of the repeats produces finger-like loops in the
protein, and these loops interact with DNA. These motifs are typically found in
transcription factors.