BIOINFORMATICS
PROTOCOLS AND TOOLS ON THE INTERNET
Compiled by
B. Vijay B.
Reddy and
Based on Current Protocols by Wiley InterScience
(Updated in May 2005)
1. Using Biological Databases
|
1 |
Biological databases play a central role
in bioinformatics. They offer scientists the opportunity to access sequence
and structure data for tens of thousands of sequences from a broad range of
organisms. This unit provides a brief overview of some existing sequence
databases, such as GenBank, UCSC Genome Browser,
and Ensembl. It also discusses non-sequence-centric
databases, such as OMIM. Entrez, the life sciences
search engine, covers the Nucleotide, Protein, Structure, Genome, Taxonomy,
PubMed Central, Journals, and Books databases at NCBI. |
|
|
2 |
Online Mendelian
Inheritance in Man (OMIM) is a non-sequence-based information resource that
can be of tremendous use to genomics researchers, physicians, and patients.
OMIM is the electronic version of the catalog of human genes and genetic
disorders. It provides concise textual information from the literature on
most human conditions having a genetic basis, as well as pictures
illustrating the condition or disorder (where appropriate) and full citation
information. |
|
|
3 |
The whole genomes of over 1000 viruses
and over 100 microbes (as of 4/2004) can be found in Entrez
Genome. See also: MapViewer. |
|
|
|
National Library of Medicine:
MEDLINE
and additional life science journals. |
|
|
4 |
UCSC Genome Browser:
This site contains the reference sequence for the human and C. elegans
genomes and working drafts for the chimpanzee, mouse, rat, chicken, Fugu,
Drosophila, C. briggsae,
yeast, and SARS genomes. It also shows the CFTR (cystic fibrosis) region in
13 species. |
|
|
5 |
The Map Viewer supports
search and display of genomic information by chromosomal position. Regions of
interest can be retrieved by text queries (e.g. gene or marker name) or by
sequence alignment (BLAST). View results at the whole genome level, and
select what to display in more detail. Multiple options exist to configure your
display, download data, navigate to related data, and analyze supporting
information using the tools provided. |
|
|
6 |
The TIGR Gene Indices provide access to
analyses of ESTs and gene sequences for over 60
species, as well as a number of resources derived from these. |
|
|
7 |
Mouse Genome Informatics (MGI) provides integrated access to data on
the genetics, genomics, and biology of the laboratory mouse. |
|
|
8 |
WormBase is a major public database for the
nematode Caenorhabditis elegans.
It contains genomic sequence, genes and their products, high-level traits, expression
patterns, and neural connectivity. It is also linked to C. briggsae, a closely related worm. |
|
|
9 |
Protein Data
Bank (PDB) is the
world-wide repository for three-dimensional structural data determined using
various experimental methods. Several types of information are associated
with each structure deposition including atomic coordinates of the structure,
experimental data used to solve the structure, sequences of all
macromolecules that constitute the structures, details about the structure
solution method, images showing different views of the structure, derived
geometric data, and a variety of links to other resources. These data and
resources may be used for understanding and designing biochemical, genetic,
or other experiments to study the stability or function of the molecule. They
can also be used for molecular modeling and drug design. |
|
|
10 |
Human Genome Mutation Data |
|
|
|
Human Mutation databases |
|
|
11 |
The Arabidopsis Information Resource |
|
|
|
UniProt (Universal Protein Resource) is the
world's most comprehensive catalog of information on proteins. It is a
central repository of protein sequence and function created by joining the information
contained in Swiss-Prot, TrEMBL, and PIR. |
2. Recognizing Functional Domains by Sequence Comparison
|
2 |
Blocks are multiply aligned ungapped segments corresponding to the most highly
conserved regions of proteins. Block Searcher, Get Blocks and Block Maker are
aids to detection and verification of protein sequence homology. They compare
a protein or DNA sequence to a database of protein blocks (current
version), retrieve blocks, and create new blocks, respectively. Related tools: COBBLER (COnsensus Biasing By Locally Embedding
Residues), MAST, LAMA. |
|
|
3 |
Clustal W is a general-purpose multiple
sequence alignment program for DNA or proteins. It
produces biologically meaningful multiple sequence alignments of divergent
sequences. It calculates the best match for the selected sequences and lines
them up so that the identities, similarities, and differences can be seen.
Evolutionary relationships can be viewed as cladograms
or phylograms. |
|
|
4 |
MEME (Multiple Em for Motif Elicitation) is a tool for discovering motifs in a group of related
DNA or protein sequences, and MAST (Motif Alignment
Search Tool) is a tool for searching biological sequence databases for
sequences that contain one or more of a group of known motifs. Meta-MEME combines motif models from MEME
into a hidden Markov model framework for use in searching sequence databases. |
|
|
5 |
Pfam is a large collection of multiple
sequence alignments and hidden Markov models covering many common protein
domains and families. For each family in Pfam you can
look at multiple alignments, view protein domain architectures, examine
species distribution, follow links to other databases, and view known protein
structures. |
|
|
6 |
Tool for predicting transcription factor
binding sites in DNA sequences. It can identify binding sites using site or
consensus strings and positional weight matrices from the TRANSFAC, IMD, and
our CBIL-GibbsMat database. You may also include
your own site or consensus strings and/or weight matrices in the search. |
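
As a rough illustration of how a positional weight matrix can be used to scan DNA for candidate binding sites, the sketch below converts a small count matrix into log-odds scores and reports windows above a threshold. The counts, the pseudocount, the uniform background, and the threshold are all invented for this example; they are not taken from TRANSFAC, IMD, or any other database, and the tool described above may score sites differently.

```python
import math

# Hypothetical count matrix for a 4-bp site (one dict per position).
# These numbers are made up for illustration only.
COUNTS = [
    {"A": 1, "C": 1, "G": 0, "T": 8},
    {"A": 9, "C": 0, "G": 0, "T": 1},
    {"A": 2, "C": 0, "G": 0, "T": 8},
    {"A": 9, "C": 0, "G": 1, "T": 0},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous characters
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


if __name__ == "__main__":
    pwm = log_odds_matrix(COUNTS)
    for hit in scan("GGCTATAGGC", pwm, threshold=4.0):
        print(hit)
```

Raising the threshold trades sensitivity for specificity; a real search would also scan the reverse strand.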
|
|
7 |
A database of protein families, domains
and functional sites in which identifiable features found in known proteins
can be applied to unknown protein sequences. |
3. Finding Similarities and Inferring Homologies
|
1 |
|
Sequence similarity searching is an
essential tool for molecular biologists. It is used to support inference of
protein function and for phylogenetic analysis. Every
search procedure requires some understanding of the underlying principles
so that, at the very least, the investigator selects appropriate parameters.
The principles underlying the most commonly used procedures are presented in
this unit. The discussion is intentionally non-mathematical, but does contain
references for those who desire a mathematical and statistical discussion of
these procedures. |
|
2 |
FrameSearch searches a group of protein sequences
for similarity to one or more nucleotide query sequences, or searches a group
of nucleotide sequences for similarity to one or more protein query sequences.
For each sequence comparison, the program finds an optimal alignment between
the protein sequence and all possible codons on
each strand of the nucleotide sequence. Optimal alignments may include
reading frame shifts. GCG |
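
The idea of comparing a protein against every reading frame on both strands can be sketched with a plain six-frame translation. This is a deliberate simplification: it only enumerates the three frames on each strand using the standard genetic code and does not perform the frameshift-tolerant optimal alignment described above. The function names are hypothetical and are not part of FrameSearch or GCG.

```python
from itertools import product

# Standard genetic code, listed in TCAG order with the third base varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations keyed by frame label: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each translated frame could then be compared with a protein query by any protein alignment method.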
|
|
3 4 |
BLAST 2.0 (Basic
Local Alignment Search Tool) provides a method for
rapid searching of nucleotide and protein databases. Sequence alignments
provide a powerful way to compare novel sequences with previously
characterized genes. Both functional and evolutionary information can be
inferred from well designed queries and alignments. |
|
|
5 |
Weight Matrices for Sequence
Similarity Scoring. ScoringMethods,
|
|
|
6 |
PileUp
creates a multiple sequence alignment from a group of related sequences using
progressive pair-wise alignments. It can also plot a tree showing the
clustering relationships used to create the alignment. Related programs LineUp, Pretty, PlotSimilarity,
ProfileMake, ProfileSearch,
ProfileSegments, ProfileGap,
Distance, GrowTree, Diverge, and Gap are available. |
|
|
6 |
SeqLab, an X Windows graphical
user interface to the GCG
Wisconsin Package sequence analysis software, gives access to the Wisconsin Package programs through
mouse-driven pull-down menus. For a more detailed description,
see the
description of SeqLab on the Accelrys
web page. |
|
|
7 |
Multiple sequence alignment using
different methods: DIALIGN,
Match-Box,
MultAlin, SAGA,
DIALIGN2,
ITERALIGN,
MAFFT,
MEME, MULTALIGN,
MULTAL, MUMmer, DCA-OMA, PileUp,
POA, PRALINE, Prrp, T-Coffee, TNB, ClustalW |
|
|
8 |
T-Coffee: A novel method for multiple
sequence alignments. |
|
|
9 |
Provides sequence similarity and
homology searching against nucleotide and protein databases using the Fasta programs. Fasta can be
very specific when identifying long regions of low similarity, especially for
highly diverged sequences. You can also conduct sequence similarity and
homology searching against complete proteome or genome databases using the
Fasta
programs. Related services: FASTA@virginia, MPsrch, Scanps2.3,
Fasta-SNP,
Fasta-WGS,
Fasta-Genome,
Fasta-Proteome,
Fasta
protein, Fasta Nucleotide. |
|
|
10 |
Local installation downloads.
ftp://ftp.virginia.edu/pub/fasta/ |
|
|
11 |
Local installation downloads. |
|
|
|
|
|
4. Finding Genes
|
1 |
List of Gene finding programs and tips
for using these programs with a genomic sequence. |
|
|
|
This server provides access to the
program Genscan for predicting the locations and exon-intron structures of genes in genomic sequences from
a variety of organisms. |
|
|
2 |
This page contains software tools
designed to predict putative internal protein coding exons
in genomic DNA sequences. Human, mouse, and Arabidopsis
exons are predicted by a program called MZEF (developed
by Dr. Michael Zhang). Fission yeast exons are predicted by a program called Pombe. See also: MZEF-SPC. |
|
|
|
MZEF-SPC is an integrated system for exon finding that uses SpliceProximalCheck as a front-end tool for Michael
Zhang's Exon Finder (MZEF) program. The
system checks whether each MZEF-predicted splice site is a proximal
false site or a possibly true site. |
|
|
3 |
This is the web server for geneid, a program to predict genes, exons, splice sites, and other signals along a DNA
sequence. |
|
|
4 |
A gene finder developed
specifically for eukaryotes. It is based on a dynamic programming
algorithm that considers all combinations of possible exons
for inclusion in a gene model and chooses the best of these combinations. |
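
The dynamic programming idea of choosing the best combination of candidate exons can be illustrated, in a much reduced form, as weighted interval scheduling: pick the set of non-overlapping exons with the highest total score. This is only a sketch under simplifying assumptions. The exon coordinates and scores are invented, and a real gene finder would also check reading-frame compatibility, splice signals, and intron lengths.

```python
from bisect import bisect_right


def best_gene_model(exons):
    """Choose non-overlapping exons with maximal total score.

    exons: list of (start, end, score) with inclusive coordinates.
    Returns (total_score, chosen_exons) with exons in genomic order.
    """
    exons = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the best (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_right(ends, start - 1, 0, i)
        with_exon = (best[j][0] + score, best[j][1] + [(start, end, score)])
        # Keep whichever is better: skipping this exon or including it.
        best.append(max(best[i], with_exon, key=lambda item: item[0]))
    return best[-1]


if __name__ == "__main__":
    candidates = [
        (1, 100, 5.0),
        (50, 150, 6.0),
        (120, 200, 4.0),
        (210, 300, 3.0),
    ]
    total, chain = best_gene_model(candidates)
    print(total, chain)
```

Because every prefix of the sorted exon list is solved once, the search over all combinations runs in O(n log n) time rather than enumerating each subset.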
|
|
5 |
The GeneMark
program assesses the coding potential of DNA sequences by using Markov
models of coding and non-coding regions within a sliding window. This local approach
is sensitive to local variations of coding potential and is able to show
details of the coding potential distribution along with gene identification. |
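
A sliding-window coding potential profile can be sketched as a log-likelihood ratio between two Markov models. The example below is a toy under stated assumptions: it uses homogeneous first-order models trained on two tiny invented sequences, whereas GeneMark itself is not reproduced here and its models, orders, and window settings may differ. All function names are hypothetical.

```python
import math


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | previous base)."""
    counts = {a: {b: pseudocount for b in "ACGT"} for a in "ACGT"}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in "ACGT"}
        for a in "ACGT"
    }


def log_likelihood(seq, model):
    """Log-probability of the transitions in seq under a first-order model."""
    return sum(math.log(model[prev][cur]) for prev, cur in zip(seq, seq[1:]))


def coding_potential_profile(seq, coding, noncoding, window=12, step=3):
    """Return (window_start, log-likelihood ratio) pairs along the sequence.

    Positive values mean the window looks more like the coding model.
    """
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        ratio = log_likelihood(chunk, coding) - log_likelihood(chunk, noncoding)
        profile.append((start, ratio))
    return profile


if __name__ == "__main__":
    # Invented training data: GC-rich "coding" and AT-rich "non-coding" toy sets.
    coding_model = train_markov(["GCGGCCGCGGCGGCC"])
    noncoding_model = train_markov(["ATATTTAATATAAAT"])
    query = "ATATTTAATATA" + "GCGGCCGCGGCG"
    for start, ratio in coding_potential_profile(query, coding_model, noncoding_model):
        print(start, round(ratio, 2))
```

Plotting the ratio against window position gives the kind of local profile the text describes, with peaks suggesting candidate coding regions.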
|
|
6 |
Gene prediction by the GeneMark.hmm
program for prokaryotes and eukaryotes. |
|
|
7 |
FirstEF* (First Exon
Finder) is a 5' terminal exon and promoter
prediction program. It consists of different discriminant
functions structured as a decision tree. The probabilistic models are
optimized to find potential first donor sites and CpG-related
and non-CpG-related promoter regions based on discriminant analysis. For every potential first donor
site (GT) and an upstream promoter region, FirstEF
decides whether or not the intermediate region can be a potential first exon, based on a set of quadratic discriminant
functions. |
|
|
8 |
TWINSCAN is
a gene prediction system that models both gene structure and evolutionary
conservation. The scores of features like splice sites and coding regions are
modified using the patterns of divergence between the target genome and a
closely related genome. |
|
|
9 |
GrailEXP is a software package that predicts exons, genes, promoters, poly(A) sites,
CpG islands, EST similarities, and repetitive
elements within DNA sequence. GrailEXP is used by
the Computational Biosciences Section
at Oak Ridge National Laboratory to
annotate the entire known portion of the human genome (including both
finished and draft data). If you are interested in microbial genome
analysis and annotation, you should go to the Generation home page |
|
|
10 |
RepeatMasker is a program that screens DNA sequences
for interspersed repeats and low complexity DNA sequences. The output of the
program is a detailed annotation of the repeats that are present in the query
sequence as well as a modified version of the query sequence in which all the
annotated repeats have been masked (default: replaced by Ns). On average,
almost 50% of a human genomic DNA sequence currently will be masked by the
program. Sequence comparisons in RepeatMasker are
performed by the program cross_match, an efficient
implementation of the Smith-Waterman-Gotoh
algorithm developed by Phil Green. |
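
The masking step itself is simple to picture. The sketch below replaces annotated repeat intervals with Ns and reports the masked fraction. It assumes the repeat coordinates are already known and supplied as hypothetical 0-based, end-exclusive intervals; it does not detect repeats and does not reproduce RepeatMasker's own input or output formats.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Replace every base inside the given intervals with mask_char.

    repeats: iterable of (start, end) pairs, 0-based with end exclusive.
    Overlapping intervals are handled naturally; out-of-range ends are clipped.
    """
    masked = list(sequence)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


def masked_fraction(masked, mask_char="N"):
    """Fraction of the sequence that consists of mask characters."""
    return masked.count(mask_char) / len(masked) if masked else 0.0


if __name__ == "__main__":
    result = mask_repeats("ACGTACGTAC", [(2, 5), (4, 7)])
    print(result, masked_fraction(result))
```

Masked sequence of this kind can be passed to a similarity search so that repeats do not generate spurious hits.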
5. Modeling Structure from Sequence
|
1 |
Comparative Protein
Modeling: Servers useful for comparative protein modeling. MODELER |
|
|
|
A
database of 3D structure models. |
|
|
2 |
The
coordinates of protein 3D structure models built by FAMS, and the
FAMSBASE system, are legally protected. |
|
|
3 |
|
Modeling
membrane proteins utilizing information from silent amino acid substitution.
Crystallography & NMR System (CNS) is the result of an international collaborative effort
among several research groups. The program has been designed to provide a
flexible multi-level hierarchical approach for the most commonly used
algorithms in macromolecular structure determination. Highlights include
heavy atom searching, experimental phasing (including MAD and MIR), density
modification, crystallographic refinement with maximum likelihood targets,
and NMR structure calculation using NOEs,
J-coupling, chemical shift, and dipolar coupling data. |
6. Inferring Evolutionary Relationships
|
1 |
This unit provides a general
introduction to phylogeny. It defines common terms and discusses the issue of
rooting trees, in addition to comparing gene and species trees. Methods for
inferring phylogenies, such as distance methods, parsimony methods, and
maximum likelihood are also presented. The unit concludes with discussion of
how to assess tree confidence. |
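
Distance methods start from a matrix of pairwise distances between aligned sequences. As a minimal sketch, the code below computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), one of several possible substitution models. The alignment and taxon names are invented for the example, and building a tree from the matrix is left out.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; undefined when p >= 0.75."""
    if p >= 0.75:
        raise ValueError("p-distance too large for Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment):
    """Return {(name1, name2): corrected distance} for all unordered pairs."""
    names = sorted(alignment)
    return {
        (n1, n2): jukes_cantor(p_distance(alignment[n1], alignment[n2]))
        for i, n1 in enumerate(names)
        for n2 in names[i + 1:]
    }


if __name__ == "__main__":
    # Hypothetical aligned sequences of equal length.
    toy_alignment = {
        "taxonA": "ACGTACGT",
        "taxonB": "ACGTACGA",
        "taxonC": "ACGAAC-A",
    }
    for pair, dist in distance_matrix(toy_alignment).items():
        print(pair, round(dist, 4))
```

The resulting matrix is the input a distance-based tree-building method, such as neighbor joining, would consume.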
|
|
2 |
TreeView is a simple program for displaying
phylogenies. |
|
|
3 |
PHYLIP is a free
package of programs for inferring phylogenies. It is distributed as source
code, documentation files, and a number of different types of executables.
Fast neighbor joining methods: Weighbor, BIONJ, DNADIST,
PROTDIST, SEQBOOT, CONSENSE. |
|
|
TREE-PUZZLE provides a means to analyze
and reconstruct evolutionary relationships and trees based on quartets, i.e.,
groups of four sequences. This unit explains how to reconstruct trees based on the
maximum-likelihood principle and quartet puzzling,
discusses likelihood mapping, a method to visualize phylogenetic
content in a multiple sequence alignment, and explains how to compare tree
topologies using different tests. |
||
|
A set of aligned character sequences or
a matrix of evolutionary distances often contains a number of different and
sometimes conflicting phylogenetic signals, and
thus does not always support a unique tree. The method of split decomposition
addresses this problem. For ideal data, this method gives rise to a phylogenetic tree, whereas less ideal data are
represented by a tree-like network that may indicate evidence of different
and conflicting phylogenies. The SplitsTree program, described here, implements this approach and can
be used to compute and visualize phylogenetic
networks called splits graphs. It also implements a number of distance
transformations, the computation of parsimony splits, spectral analysis and
bootstrapping. |
||
|
8 |
The PEBBLE (Phylogenetics,
Evolutionary Biology, and Bioinformatics in a moduLar
Environment) application is a relative newcomer to the field of phylogenetic applications. Although designed as a customizable
generalist application, it is aimed at serially sampled sequence data, e.g., rapidly evolving viral genes sampled over the
course of infection, or ancient DNA sequences. The basic protocol describes
the use of PEBBLE to infer a phylogenetic tree
using the sUPGMA algorithm, and the inference of
substitution rate parameters using maximum likelihood. The alternate and
support protocols describe the simulation capabilities of PEBBLE, and general
use of the PEBBLE application, respectively. |
Analyzing Gene Expression Patterns
|
1 |
After providing a brief introduction to microarray chips and experimental details, this overview
discusses analysis techniques. Data analysis from microarray
experiments generally involves two parts: acquiring and normalizing the data,
and interpreting it. This unit focuses mostly on the latter, as it is less
technology-specific. |
|
|
The Gene Ontology (GO) Project: Structured
Vocabularies for Molecular Biology and Their Application to Genome and
Expression Analysis. Related links: Gene Ontology, Ontology?, UMLS.
Scientists wishing to utilize genomic data
have quickly come to realize the benefit of standardizing descriptions of
experimental procedures and results for computer-driven information retrieval
systems. The focus of the Gene Ontology project is three-fold. First, the
project goal is to compile the Gene Ontologies;
structured vocabularies describing domains of molecular biology. Second, the
project supports the use of these structured vocabularies in the annotation
of gene products. Third, the gene product-to-GO annotation sets are provided
by participating groups to the public through open access to the GO database
and Web resource. This unit describes the current ontologies
and what is beyond the scope of the Gene Ontology project. It addresses the
issue of how GO vocabularies are constructed and related to genes and gene
products. It concludes with a discussion of how researchers can access,
browse, and utilize the GO project in the course of their own research. |
||
|
Dragon View
provides a suite of information visualization tools designed to aid in the
analysis of differential gene expression data that has been previously
annotated with biologically relevant information using the Dragon Database.
Presently there are three visualization tools that you can use.
The J-Express package has been designed
to facilitate the analysis of microarray data with
an emphasis on efficiency, usability, and comprehensibility. The J-Express
system provides a powerful and integrated platform for the analysis of microarray gene expression data. It is platform
independent in that it requires only the availability of a Java virtual
machine on the system. The system includes a range of analysis tools and a
project management system supporting the organization and documentation of an
analysis project. This unit describes the J-Express tool emphasizing central
concepts and principles, and gives examples of how it can be used to explore
gene expression data sets. |
||
|
5 |
GenMAPP (Gene MicroArray
Pathway Profiler) is a free, stand-alone computer program designed for
viewing and analyzing gene expression data on MAPPs
representing biological pathways or any other functional grouping of genes. A
MAPP is a special file format produced with the graphics tools in GenMAPP that depicts the biological relationship between
genes or gene products. When a MAPP is linked to an expression dataset, GenMAPP automatically and dynamically color-codes the
genes on the MAPP according to criteria supplied by the user. MAPPFinder is an accessory program that works with GenMAPP and the annotations from the Gene Ontology (GO)
Consortium to identify global biological trends in gene expression data. MAPPFinder relates the microarray
dataset to the GO hierarchy and calculates a percentage and statistical score
for genes meeting the user's criterion for a meaningful gene expression
change for each GO biological process, cellular component, and molecular
function term. |
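
To make the idea of a percentage and a statistical score per GO term concrete, the sketch below computes the percentage of genes in a term that meet the user's criterion, together with a z-score based on the hypergeometric mean and variance. Treat the formula as an assumption chosen for illustration: it is a common way to score over-representation, and the exact statistic used by MAPPFinder should be checked against its own documentation. The gene counts are invented.

```python
import math


def go_term_stats(changed_in_term, genes_in_term, changed_total, genes_total):
    """Return (percent changed, z-score) for one GO term.

    changed_in_term: genes in the term meeting the criterion
    genes_in_term:   genes measured in the term
    changed_total:   genes meeting the criterion in the whole dataset
    genes_total:     genes measured in the whole dataset
    """
    percent = 100.0 * changed_in_term / genes_in_term
    rate = changed_total / genes_total
    expected = genes_in_term * rate
    # Hypergeometric variance, including the finite-population correction.
    variance = (
        genes_in_term
        * rate
        * (1 - rate)
        * (1 - (genes_in_term - 1) / (genes_total - 1))
    )
    z_score = (changed_in_term - expected) / math.sqrt(variance)
    return percent, z_score


if __name__ == "__main__":
    # Hypothetical counts: 10 of 20 genes in the term changed,
    # against 100 changed genes out of 1000 measured overall.
    percent, z = go_term_stats(10, 20, 100, 1000)
    print(f"{percent:.1f}% changed, z = {z:.2f}")
```

A large positive z-score suggests the term contains more changed genes than expected by chance, although scores for many terms tested together need a multiple-testing adjustment before interpretation.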
Analyzing Molecular Interactions
Building Biological databases
Comparing Large Sequence Sets
|
MUMmer is a
system for rapidly aligning entire genomes, whether in complete or draft
form. |
|
A graphical tool for viewing and
editing assembled sequences. |
Analyzing RNA Sequence and Structure
Tools and Algorithms for Sequence/Structure Data Analysis
Pairwise Sequence Alignment Methods
Multiple Sequence Alignment Methods
Major Bioinformatics Data Centers
|
Research Collaboratory
for Structural Bioinformatics - Protein Data Bank |
||