BIOINFORMATICS PROTOCOLS AND TOOLS ON THE INTERNET

Compiled by

B. Vijay B. Reddy and Yiannis Kaznessis
Digital Technology Center
University of Minnesota, Minneapolis, MN 55455

Based on Current Protocols by Wiley InterScience

(Updated in May 2005)

1. Using Biological Databases

1

Entrez

Biological databases play a central role in bioinformatics. They offer scientists the opportunity to access sequence and structure data for tens of thousands of sequences from a broad range of organisms. This unit provides a brief overview of some existing sequence databases, such as GenBank, UCSC Genome Browser, and Ensembl. It also discusses non-sequence-centric databases, such as OMIM. Entrez is the life sciences search engine for the Nucleotide, Protein, Structure, Genome, Taxonomy, PubMed Central, Journals, and Books databases at NCBI.

2

OMIM

Online Mendelian Inheritance in Man (OMIM) is a non-sequence-based information resource that can be of tremendous use to genomics researchers, physicians, and patients. OMIM is the electronic version of the catalog of human genes and genetic disorders. It provides concise textual information from the literature on most human conditions having a genetic basis, as well as pictures illustrating the condition or disorder (where appropriate) and full citation information.

3

Genome

The whole genomes of over 1000 viruses and over 100 microbes (as of 4/2004) can be found in Entrez Genome. See also: MapViewer.

 

PubMed

National Library of Medicine: MEDLINE and additional life science journals.

4

UCSC-GB

UCSC Genome Browser: This site contains the reference sequence for the human and C. elegans genomes and working drafts for the chimpanzee, mouse, rat, chicken, Fugu, Drosophila, C. briggsae, yeast, and SARS genomes. It also shows the CFTR (cystic fibrosis) region in 13 species.

5

NCBI-Maps

The Map Viewer supports search and display of genomic information by chromosomal position. Regions of interest can be retrieved by text queries (e.g. gene or marker name) or by sequence alignment (BLAST). View results at the whole genome level, and select what to display in more detail. Multiple options exist to configure your display, download data, navigate to related data, and analyze supporting information using the tools provided.

6

TIGR-GI

The TIGR Gene Indices provide access to analyses of ESTs and gene sequences for over 60 species, as well as a number of resources derived from these.

7

MGI

Mouse Genome Informatics (MGI) provides integrated access to data on the genetics, genomics, and biology of the laboratory mouse.

8

WormBase

WormBase is a major public database for the nematode Caenorhabditis elegans. It contains genomic sequence, genes and their products, high-level traits, expression patterns, and neural connectivity. It is also connected to data for C. briggsae, a closely related worm.

9

PDB

Protein Data Bank (PDB) is the world-wide repository for three-dimensional structural data determined using various experimental methods. Several types of information are associated with each structure deposition including atomic coordinates of the structure, experimental data used to solve the structure, sequences of all macromolecules that constitute the structures, details about the structure solution method, images showing different views of the structure, derived geometric data, and a variety of links to other resources. These data and resources may be used for understanding and designing biochemical, genetic, or other experiments to study the stability or function of the molecule. They can also be used for molecular modeling and drug design.

10

HGMD

Human Gene Mutation Database

 

HMD

Human Mutation databases

11

TAIR

The Arabidopsis Information Resource

 

UniProt

UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.


2. Recognizing Functional Domains By Sequence Comparison

2

Blocks

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Block Searcher, Get Blocks, and Block Maker are aids to detection and verification of protein sequence homology. They compare a protein or DNA sequence to a database of protein blocks (current version), retrieve blocks, and create new blocks, respectively. Related tools: COBBLER (COnsensus Biasing By Locally Embedding Residues), MAST, LAMA.

3

ClustalW

Clustal W is a general-purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities, and differences can be seen. Evolutionary relationships can be examined by viewing cladograms or phylograms.

4

MEME

MEME (Multiple EM for Motif Elicitation) is a tool for discovering motifs in a group of related DNA or protein sequences. MAST (Motif Alignment Search Tool) is a tool for searching biological sequence databases for sequences that contain one or more of a group of known motifs. Meta-MEME combines motif models from MEME into a hidden Markov model framework for use in searching sequence databases.

5

Pfam

Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. For each family in Pfam you can look at multiple alignments, view protein domain architectures, examine species distribution, follow links to other databases, and view known protein structures.

6

TESS

TESS is a tool for predicting transcription factor binding sites in DNA sequences. It can identify binding sites using site or consensus strings and positional weight matrices from the TRANSFAC, IMD, and CBIL-GibbsMat databases. You may also include your own site or consensus strings and/or weight matrices in the search.
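As a rough illustration of how a positional weight matrix can be used to locate candidate binding sites, the sketch below scores every window of a DNA sequence against a small log-odds matrix and keeps the windows that reach a threshold. The matrix values, the threshold, and the function name scan_pwm are invented for illustration; they are not taken from TRANSFAC, IMD, or CBIL-GibbsMat, and TESS itself may score sites differently.

```python
# Toy position weight matrix: one dict of log-odds scores per motif position.
# The numbers are made up for illustration only.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]


def scan_pwm(sequence, pwm, threshold):
    """Return (start, score) for each window whose summed score >= threshold.

    Assumes an uppercase sequence containing only A, C, G, and T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the per-position scores of the bases seen in this window.
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


hits = scan_pwm("TTACGACG", TOY_PWM, threshold=3.0)
print(hits)
```

Start positions are 0-based. Lowering the threshold returns weaker matches as well, which is the usual trade-off between sensitivity and false positives in this kind of scan.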

7

InterPro

A database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.



3. Finding Similarities and Inferring Homologies

1

 

Sequence similarity searching is an essential tool for molecular biologists. It is used to support inference of protein function and for phylogenetic analysis. Every searching procedure requires some understanding of the underlying principles so that, at the very least, the investigator selects parameters correctly. The principles underlying the most commonly used procedures are presented in this unit. The discussion is intentionally non-mathematical, but it does contain references for those who desire a mathematical and statistical discussion of these procedures.

2

FrameSearch

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts. Available in GCG.
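To make "all possible codons on each strand" concrete, the sketch below lists the codons in the three reading frames of a DNA sequence and of its reverse complement. It is only an illustration of the six reading frames; it does not align anything and does not model frame shifts within an alignment. The function names are hypothetical and are not part of GCG.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Split dna into complete codons starting at the given offset (0, 1, or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to codon lists."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = codons(seq, offset)
    return frames


frames = six_frames("ATGGCCTAA")
for label, frame_codons in frames.items():
    print(label, frame_codons)
```

A frame-shift-aware aligner can be thought of as one that is allowed to switch between these frames partway through an alignment, at a cost.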

3

4

BLAST-NCBI
BLAST-EBI

BLAST 2.0 (Basic Local Alignment Search Tool) provides a method for rapid searching of nucleotide and protein databases. Sequence alignments provide a powerful way to compare novel sequences with previously characterized genes. Both functional and evolutionary information can be inferred from well-designed queries and alignments.

5

ScorMatrix

Weight matrices for sequence similarity scoring. See also: ScoringMethods.

6

PileUp

PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. The related programs LineUp, Pretty, PlotSimilarity, ProfileMake, ProfileSearch, ProfileSegments, ProfileGap, Distance, GrowTree, Diverge, and Gap are available.

6

SeqLab

SeqLab, an X Window System interface to the GCG Wisconsin Package sequence analysis software, provides access to the Wisconsin Package programs through a mouse-driven graphical user interface with pull-down menus. For a more detailed description, see the description of SeqLab on Accelrys's web page.

7

MSA

Multiple sequence alignment using different methods: DIALIGN, Match-Box, MultAlin, SAGA, DIALIGN2, ITERALIGN, MAFFT, MEME, MULTALIGN, MULTAL, MUMmer, DCA-OMA, PileUp, POA, PRALINE, Prrp, T-Coffee, TNB, ClustalW.

8

T-Coffee

T-Coffee: A novel method for multiple sequence alignments.

9

FASTA

Provides sequence similarity and homology searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. You can also conduct sequence similarity and homology searching against complete proteome or genome databases using the Fasta programs. Related services: FASTA@virginia, MPsrch, Scanps2.3, Fasta-SNP, Fasta-WGS, Fasta-Genome, Fasta-Proteome, Fasta protein, Fasta nucleotide.

10

Ssearch

Local installation downloads. ftp://ftp.virginia.edu/pub/fasta/

11

BLAST

Local installation downloads. 

 

 

 


4. Finding Genes

1

GFPs, GFs

List of Gene finding programs and tips for using these programs with a genomic sequence.

 

GENSCAN

This server provides access to the program Genscan for predicting the locations and exon-intron structures of genes in genomic sequences from a variety of organisms.

2

MZEF

This page contains software tools designed to predict putative internal protein-coding exons in genomic DNA sequences. Human, mouse, and Arabidopsis exons are predicted by a program called MZEF (developed by Dr. Michael Zhang). Fission yeast exons are predicted by a program called Pombe. See also: MZEF-SPC.

 

MZEF-SPC

MZEF-SPC is an integrated system for exon finding that uses SpliceProximalCheck as a front-end tool for Michael Zhang's Exon Finder (MZEF) program. The system checks whether each MZEF-predicted splice site is a proximal false site or a possibly true site.

3

GENEID

geneid is the web server for geneid, a program to predict genes, exons, splice sites, and other signals along a DNA sequence. Visit the geneid homepage for more information about this program.

4

GlimmerM

A gene finder developed specifically for eukaryotes. It is based on a dynamic programming algorithm that considers all combinations of possible exons for inclusion in a gene model and chooses the best of these combinations.

5

GeneMark

The GeneMark program assesses the coding potential of DNA sequences by using Markov models of coding and non-coding regions within a sliding window. This local approach is sensitive to local variations of coding potential and is able to show details of the coding potential distribution along with gene identification.
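The sliding-window idea can be sketched as follows: for each window, compare how well a "coding" Markov model and a "non-coding" Markov model explain the observed base-to-base transitions, and report the log-odds. Everything here is a toy assumption: the first-order models, their probabilities, the window size, and the function names are invented for illustration and are far simpler than what the GeneMark program uses.

```python
import math

BASES = "ACGT"


def uniform_model():
    """Toy non-coding model: every transition is equally likely."""
    return {prev: {nxt: 0.25 for nxt in BASES} for prev in BASES}


def gc_rich_model():
    """Toy coding model: after any base, G or C is more likely than A or T."""
    return {prev: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1} for prev in BASES}


def window_log_odds(window, coding, noncoding):
    """Sum log(P_coding / P_noncoding) over consecutive base pairs in the window."""
    total = 0.0
    for prev, nxt in zip(window, window[1:]):
        total += math.log(coding[prev][nxt] / noncoding[prev][nxt])
    return total


def sliding_scores(sequence, size, step, coding, noncoding):
    """Return (start, log_odds) for each full window along the sequence."""
    return [
        (start, window_log_odds(sequence[start:start + size], coding, noncoding))
        for start in range(0, len(sequence) - size + 1, step)
    ]


scores = sliding_scores(
    "ATATATGCGCGC", size=6, step=3,
    coding=gc_rich_model(), noncoding=uniform_model(),
)
for start, value in scores:
    print(start, round(value, 3))
```

Positive values mean the window looks more like the toy coding model, negative values more like the toy non-coding model. Plotting the values against the window start gives a coding-potential profile along the sequence.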

6

GM.hmm

Gene predictions by the GeneMark.hmm program for prokaryotes and eukaryotes.

7

FirstEF

FirstEF (First Exon Finder) is a 5' terminal exon and promoter prediction program. It consists of different discriminant functions structured as a decision tree. The probabilistic models are optimized to find potential first donor sites and CpG-related and non-CpG-related promoter regions based on discriminant analysis. For every potential first donor site (GT) and an upstream promoter region, FirstEF decides whether or not the intermediate region can be a potential first exon, based on a set of quadratic discriminant functions.

8

TWINSCAN

TWINSCAN is a gene prediction system that models both gene structure and evolutionary conservation. The scores of features like splice sites and coding regions are modified using the patterns of divergence between the target genome and a closely related genome.

9

GrailEXP

GrailEXP is a software package that predicts exons, genes, promoters, poly(A) sites, CpG islands, EST similarities, and repetitive elements within DNA sequence. GrailEXP is used by the Computational Biosciences Section at Oak Ridge National Laboratory to annotate the entire known portion of the human genome (including both finished and draft data). If you are interested in microbial genome analysis and annotation, you should go to the Generation home page.

10

RepeatMask

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence currently will be masked by the program. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green.
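The masked output described above can be pictured with a small sketch: given the positions of annotated repeats, replace those bases with N and report what fraction of the sequence was masked. The interval format (0-based, half-open) and the function names are assumptions made for this illustration; they are not RepeatMasker's own file formats or options.

```python
def mask_repeats(sequence, intervals, mask_char="N"):
    """Replace each 0-based, half-open [start, end) interval with mask_char."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


def masked_fraction(masked, mask_char="N"):
    """Fraction of positions equal to mask_char (0.0 for an empty sequence).

    Note: Ns already present in the input are counted as masked too.
    """
    return masked.count(mask_char) / len(masked) if masked else 0.0


masked = mask_repeats("ACGTACGTAC", [(2, 5), (8, 10)])
print(masked, masked_fraction(masked))
```

Masking before a database search keeps repeats from producing large numbers of uninformative hits, which is why the masked sequence is often the one passed to downstream tools.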


5. Modeling Structure from Sequence

1

ServersCPM

Comparative protein modeling: servers useful for comparative protein modeling. See also: MODELLER.

 

ModBase

A database of 3D structure models.

2

FAMSBASE

The coordinates of the protein 3D structure models built by FAMS, and the FAMSBASE system itself, are legally protected.

3

CNSsolve

 

Modeling membrane proteins utilizing information from silent amino acid substitution.

Crystallography & NMR System (CNS) is the result of an international collaborative effort among several research groups. The program has been designed to provide a flexible multi-level hierarchical approach for the most commonly used algorithms in macromolecular structure determination. Highlights include heavy atom searching, experimental phasing (including MAD and MIR), density modification, crystallographic refinement with maximum likelihood targets, and NMR structure calculation using NOEs, J-coupling, chemical shift, and dipolar coupling data.


6. Inferring Evolutionary Relationships

1

PhylProg

This unit provides a general introduction to phylogeny. It defines common terms and discusses the issue of rooting trees, in addition to comparing gene and species trees. Methods for inferring phylogenies, such as distance methods, parsimony methods, and maximum likelihood are also presented. The unit concludes with discussion of how to assess tree confidence.
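As a small example of the distance methods mentioned above, the sketch below computes the proportion of differing sites (p-distance) between two aligned DNA sequences and applies the Jukes-Cantor correction for multiple substitutions at the same site. The choice of the Jukes-Cantor model and the function names are assumptions for illustration; the unit itself may discuss other distance corrections.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - 4p/3), defined for p < 0.75."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGT", "ACGTACGA")
d = jukes_cantor(p)
print(p, d)
```

The corrected distance is always at least as large as the raw proportion, because some substitutions are hidden by later changes at the same site. A matrix of such pairwise distances is the input to distance-based tree-building methods.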

2

TreeView

TreeView is a simple program for displaying phylogenies.

3

PHYLIP

PHYLIP is a free package of programs for inferring phylogenies. It is distributed as source code, documentation files, and a number of different types of executables. Fast neighbor-joining methods: Weighbor, BIONJ. Related programs: DNADIST, PROTDIST, SEQBOOT, CONSENSE.

4

PAUP

PAUP (Phylogenetic Analysis Using Parsimony) is a widely used software package for the inference of evolutionary trees.

5

MODELTEST

Modeltest helps users choose the model of DNA substitution that best fits their data from among 56 possible models. This is accomplished through an implementation of hierarchical likelihood ratio tests and the Akaike information criterion (AIC).
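The two criteria named above reduce to simple arithmetic once each model's maximized log-likelihood is known. The sketch below computes AIC = 2k - 2 ln L for a few candidate models and the likelihood ratio statistic 2(ln L1 - ln L0) for one nested pair. The log-likelihood values are invented, only three of the candidate models are shown, and the parameter counts cover the substitution model alone (branch lengths are left out as a simplifying assumption). This is not Modeltest's own code or output.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic for a simpler (null) model nested in a richer one."""
    return 2 * (lnl_alt - lnl_null)


# Hypothetical fits: model name -> (maximized log-likelihood, free parameters).
candidates = {
    "JC69": (-1250.0, 0),
    "HKY85": (-1230.0, 4),
    "GTR": (-1228.5, 8),
}

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best_model = min(aic_scores, key=aic_scores.get)
delta = lrt_statistic(candidates["JC69"][0], candidates["HKY85"][0])

print(aic_scores)
print("best by AIC:", best_model)
print("LRT statistic, JC69 vs HKY85:", delta)
```

In a hierarchical test, the statistic would be compared against a chi-square distribution whose degrees of freedom equal the difference in free parameters (4 for this pair), moving on to the richer model only when the improvement is significant.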

6

TREE-PUZZLE

TREE-PUZZLE provides means to analyze and reconstruct evolutionary relationships and trees based on quartets, i.e., groups of four sequences. This unit explains how to reconstruct trees based on the maximum-likelihood principle and quartet puzzling; discusses likelihood mapping, a method to visualize phylogenetic content in a multiple sequence alignment; and explains how to compare tree topologies using different tests.

7

SplitsTree

A set of aligned character sequences or a matrix of evolutionary distances often contains a number of different and sometimes conflicting phylogenetic signals, and thus does not always support a unique tree. The method of split decomposition addresses this problem. For ideal data, this method gives rise to a phylogenetic tree, whereas less ideal data are represented by a tree-like network that may indicate evidence of different and conflicting phylogenies. The SplitsTree program, described here, implements this approach and can be used to compute and visualize phylogenetic networks called splits graphs. It also implements a number of distance transformations, the computation of parsimony splits, spectral analysis and bootstrapping.

8

PEBBLE

The PEBBLE (Phylogenetics, Evolutionary Biology, and Bioinformatics in a moduLar Environment) application is a relative newcomer to the field of phylogenetic applications. Although designed as a customizable generalist application, it is oriented toward serially sampled sequence data, e.g., rapidly evolving viral genes sampled over the course of infection, or ancient DNA sequences. The basic protocol describes the use of PEBBLE to infer a phylogenetic tree using the sUPGMA algorithm, and the inference of substitution rate parameters using maximum likelihood. The alternate and support protocols describe the simulation capabilities of PEBBLE and general use of the PEBBLE application, respectively.


Analyzing Gene Expression Patterns

1

GE_MA

After providing a brief introduction to microarray chips and experimental details, this overview discusses analysis techniques. Data analysis from microarray experiments generally involves two parts: acquiring and normalizing the data, and interpreting it. This unit focuses mostly on the latter, as it is less technology-specific.

2

GO

The Gene Ontology (GO) Project: Structured Vocabularies for Molecular Biology and Their Application to Genome and Expression Analysis. Related links: Gene Ontology, Ontology, UMLS. Scientists wishing to utilize genomic data have quickly come to realize the benefit of standardizing descriptions of experimental procedures and results for computer-driven information retrieval systems. The focus of the Gene Ontology project is threefold. First, the project goal is to compile the Gene Ontologies: structured vocabularies describing domains of molecular biology. Second, the project supports the use of these structured vocabularies in the annotation of gene products. Third, the gene product-to-GO annotation sets are provided by participating groups to the public through open access to the GO database and Web resource. This unit describes the current ontologies and what is beyond the scope of the Gene Ontology project. It addresses the issue of how GO vocabularies are constructed and related to genes and gene products. It concludes with a discussion of how researchers can access, browse, and utilize the GO project in the course of their own research.

3

J-Express

J-Express is a Java application for the analysis of gene expression data provided by microarray experiments. Methods included in the application are hierarchical clustering, self-organizing maps, principal component analysis, K-means, profile search, and integrated tools for visualization. The J-Express package has been designed to facilitate the analysis of microarray data with an emphasis on efficiency, usability, and comprehensibility. The J-Express system provides a powerful and integrated platform for the analysis of microarray gene expression data. It is platform independent in that it requires only the availability of a Java virtual machine on the system. The system includes a range of analysis tools and a project management system supporting the organization and documentation of an analysis project. This unit describes the J-Express tool, emphasizing central concepts and principles, and gives examples of how it can be used to explore gene expression data sets.

4

DragonView

Dragon View provides a suite of information visualization tools designed to aid in the analysis of differential gene expression data that has been previously annotated with biologically relevant information using the Dragon Database. Presently there are three visualization tools that you can use.

5

GenMAPP

GenMAPP (Gene MicroArray Pathway Profiler) is a free, stand-alone computer program designed for viewing and analyzing gene expression data on MAPPs representing biological pathways or any other functional grouping of genes. A MAPP is a special file format produced with the graphics tools in GenMAPP that depicts the biological relationship between genes or gene products. When a MAPP is linked to an expression dataset, GenMAPP automatically and dynamically color-codes the genes on the MAPP according to criteria supplied by the user. MAPPFinder is an accessory program that works with GenMAPP and the annotations from the Gene Ontology (GO) Consortium to identify global biological trends in gene expression data. MAPPFinder relates the microarray dataset to the GO hierarchy and calculates a percentage and statistical score for genes meeting the user's criterion for a meaningful gene expression change for each GO biological process, cellular component, and molecular function term.
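To give a feel for the kind of "percentage and statistical score" described above, the sketch below computes, for one GO term, the percentage of measured genes that meet the user's criterion and a z-score based on the hypergeometric distribution. Using this particular z-score is an assumption made for illustration; the text above does not say which statistic MAPPFinder computes. The counts and function names are also invented.

```python
import math


def percent_changed(changed_in_term, measured_in_term):
    """Percentage of the term's measured genes that meet the criterion."""
    return 100.0 * changed_in_term / measured_in_term


def hypergeometric_z(changed_in_term, measured_in_term, changed_total, measured_total):
    """z-score of the observed count against its hypergeometric expectation.

    changed_in_term  (r): genes in the term meeting the criterion
    measured_in_term (n): genes in the term that were measured
    changed_total    (R): all measured genes meeting the criterion
    measured_total   (N): all measured genes
    """
    r, n, R, N = changed_in_term, measured_in_term, changed_total, measured_total
    expected = n * R / N
    # Variance of a hypergeometric count, including the finite-population term.
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)


pct = percent_changed(10, 20)
z = hypergeometric_z(10, 20, 100, 1000)
print(pct, round(z, 2))
```

In this made-up case, 10 of the 20 measured genes in the term changed while only 2 would be expected by chance, so the z-score is large and the term would stand out when ranking GO terms.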


Analyzing Molecular Interactions

1

Prediction of protein-protein interaction networks.

2

Evaluation of interaction networks.

3

DelPhi

DelPhi provides numerical solutions to the Poisson-Boltzmann equation (in both linear and nonlinear form) for molecules of arbitrary shape and charge distribution. See also: Accelrys.

4

PID_all

Protein Interaction Databases: MINT, DIP, PIM, PathCalling, GRID, InterPreTS, STRING, PPI, InterDom, FusionDB, IntAct, HPID.


Building Biological Databases

Comparing Large Sequence Sets

1

PipMaker

PipMaker computes alignments of similar regions in two DNA sequences. MultiPipMaker allows the user to see relationships among more than two sequences.

2

MUMmer

MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form.


Assembling Sequences

1

Consed

A graphical tool for viewing and editing assembled sequences.


Analyzing RNA Sequence and Structure

1

Vienna

RNA secondary structure prediction and comparison package.
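For readers new to secondary structure prediction, the sketch below shows the classic base-pair maximization recursion (often attributed to Nussinov) on a short RNA string. This is a deliberately simplified stand-in: it counts base pairs rather than minimizing free energy, so it is not the method the Vienna package implements, and the function name and minimum loop length are assumptions for illustration.

```python
# Base pairs allowed in this toy model: Watson-Crick pairs plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs, with at least min_loop
    unpaired bases between the two partners of any pair."""
    n = len(rna)
    # best[i][j] holds the maximum pair count for the subsequence rna[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


print(max_base_pairs("GGGAAAUCC"))
```

The same table-filling pattern underlies energy-based folding, where the "+1 per pair" reward is replaced by experimentally derived energy contributions for stacks and loops.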


Tools and Algorithms for Sequence/Structure Data Analysis

1

ExPasyTools

The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures.

2

BiBiServ

Bielefeld University Bioinformatics Server


Pairwise Sequence Alignment Methods

ALIGN

SIM

BLAST2SEQ


Multiple Sequence Alignment Methods

DIALIGN

MultAlin

ClustalW


Major Bioinformatics Data Centers

1

NIH-NCBI

National Center for Biotechnology Information

2

EMBL-EBI

European Bioinformatics Institute

3

RCSB-PDB

Research Collaboratory for Structural Bioinformatics - Protein Data Bank

4

TIGR

The Institute for Genomic Research

5

PIR

Protein Information Resource


Bioinformatics Groups

Michael J. E. Sternberg

Tom Blundell


References:
Current Protocols in Bioinformatics

Bioinformatics Toolbox