The rapid increase in the flow of information on the Internet resulting from the bandwidth explosion can truly be likened to an avalanche. Not only has the rate of information flow increased phenomenally, but so have the spread of connectivity and the storage capacity and processing power at the various distributed centers. In fact, the volume of data available to bioinformatics specialists is so large that it defies presentation using the conventional print medium. An appropriate method is to make a Web index or "Home Page" document based on the HyperText Markup Language (HTML) with hot links or pointers to various resources. This will be the mode of the actual presentation of this article during the Workshop. What follows here will be a selective coverage of the Index document.
As the title of this article suggests, bioinformatics comprises various types of resources. Firstly, there are the databases. Then there are the tools (computer programs) to analyse and draw inferences from the contents of the databases. Of more recent origin are automated alert services and Java applets that tap the client-side computing resources in addition to those on the server side.
In the following we will give a selected listing of the resources of these various categories and provide brief descriptions of their contents and functionalities.
By far the largest databases are concerned with nucleotide sequences of DNA and amino acid sequences of proteins.
An annotated collection of all publicly available DNA sequences is maintained in a database called the GenBank by the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH), USA. As of December 1996 the total holding was approximately 730,500,000 bases in 1,115,000 sequences. The database available for public access is updated once every two months. There are two other major DNA databases in the world, one maintained by the European Molecular Biology Laboratory (EMBL) and the other by the DNA DataBank of Japan (DDBJ). The GenBank, EMBL and DDBJ have joined hands to form the International Nucleotide Sequence Database Collaboration, under which they exchange data on a daily basis.
The Division of Biomedical Information Sciences of the Johns Hopkins University, School of Medicine hosts the Human Genome Database (GDB). The ambitious long term objective of the Human Genome Project is to map all the genes of the entire human genome encompassing all variants and diversities.
The two principal databases of protein sequences are SWISS-PROT and the Protein Information Resource (PIR), the latter maintained at the Georgetown University Medical Center and supported by the Division of Research Resources of the NIH.
SWISS-PROT contains sequences translated from the EMBL Nucleotide Sequence Database. A small part of the information in SWISS-PROT was originally adapted from information contained in the PIR.
A characteristic feature of SWISS-PROT is extensive annotation of the various sequences held. Data are stored in a format similar to that of the EMBL Nucleotide Sequence Database and all the data are easily retrievable by computer programs.
Some of the other protein sequence databases are:
The Mendelian Inheritance in Man data bank (MIM) prepared under the supervision of Victor McKusick at Johns Hopkins University.
The PROSITE dictionary of sites and patterns in proteins prepared by Amos Bairoch at the University of Geneva.
The restriction enzymes database (REBASE) prepared by Richard Roberts and Dana Macelis at New England BioLabs.
The G-protein--coupled receptor database (GCRDb) prepared by Lee Frank Kolakowski at the Massachusetts General Hospital Renal Unit.
The EcoGene section of the EcoSeq/EcoMap integrated Escherichia coli K12 database and the StyGene section of StySeq/StyMap integrated Salmonella typhimurium LT2 database, both prepared by Ken Rudd at the NCBI.
The gene-protein database of Escherichia coli K12 (2D-gel spots)(ECO2DBASE).
The SubtiList relational database for the Bacillus subtilis 168 genome prepared under the supervision of Ivan Moszer at the Pasteur Institute.
The LISTA database of yeast (Saccharomyces cerevisiae) genes coding for proteins prepared under the supervision of Patrick Linder at the University of Geneva.
The human keratinocyte 2D gel protein database from the universities of Aarhus and Ghent.
The human 2D gel protein database (SWISS-2DPAGE) of the Faculty of Medicine of the University of Geneva.
The Yeast Electrophoresis Protein Database (YEPD) prepared under the supervision of Jim Garrells from the Quest Protein Database Center of the Cold Spring Harbor Laboratory.
The Drosophila genome database (FlyBase) prepared under the supervision of Michael Ashburner at the Department of Genetics, University of Cambridge. (http://flybase.bio.indiana.edu:82/)
The Maize genome database (MaizeDB) developed by the USDA-ARS Maize Genome Project as part of the National Agricultural Library's Plant Genome Research Program.
The WormPep database prepared by Richard Durbin and Erik Sonnhammer from the MRC Laboratory of Molecular Biology and Sanger Center at Hinxton Hall, Cambridge.
The DictyDb database prepared by Douglas W. Smith and Bill Loomis from the University of California, San Diego (UCSD).
The Human Retroviruses and AIDS compilation of nucleic and amino acid sequences (HIV Sequence Database) edited by G. Myers, A.B. Rabson, S.F. Josephs, T.F. Smith, J.A. Berzofsky, F. Wong-Staal; published by the Theoretical Biology and Biophysics Group T-10 at Los Alamos National Laboratory; and funded by the AIDS program of the National Institute of Allergy and Infectious Diseases through an interagency agreement with the United States Department of Energy.
The database of Homology-derived Secondary Structure of Proteins (HSSP) prepared under the supervision of Chris Sander at the EMBL.
The transcription factor database (Transfac) developed by Edgar Wingender and Rainer Knueppel from the Gesellschaft fuer Biotechnologische Forschung mbH in Braunschweig.
The Protein Data Bank (PDB), maintained by the Brookhaven National Laboratory and supported by the United States National Science Foundation, the Division of Research Resources of the NIH and the United States Department of Energy, is quite distinct from the sequence databases and deserves special mention. It contains detailed three-dimensional structural information on proteins and some nucleic acids whose structures have been solved using experimental techniques such as X-ray crystallography and nuclear magnetic resonance.
A Nucleic Acid Database (NDB) emulating the PDB for nucleic acids is maintained at Rutgers University. The NDB server is located at URL: (http://ndbserver.rutgers.edu/interface/)
NRL_3D contains entries for which an X-ray crystal structure exists in the Brookhaven database. The codes for these entries start with NRL_ followed by the Brookhaven database code.
The voluminous data contained in the libraries will be of no use unless convenient techniques are provided to browse them and retrieve data from them selectively. Hence many of the database keepers themselves provide various browsing and retrieval tools which are computer programs. In this section we will describe briefly some of the prominent tools.
Two of the most widely used sequence matching programs are FASTA and the Basic Local Alignment Search Tool (BLAST).
FASTA was developed by Lipman and Pearson in 1985. FASTA considers exact matches between short substrings of two sequences. If a significant number of such exact matches is found, FASTA uses the dynamic programming algorithm to compute optimal alignments.
This approach allows one to trade precision for speed: the larger the substring length chosen, the smaller the number of exact matches. This makes the program faster, but it loses precision: it becomes less likely that the optimal alignment contains enough exact matches of the given length, and the procedure may find nothing. Nevertheless, experience shows that with sensibly chosen parameters, FASTA misses very few cases of significant homology. FASTA is available from: (ftp.virginia.edu in pub/fasta)
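The word-matching stage described above can be illustrated with a short Python sketch. It is not the FASTA implementation: the function name `ktup_diagonals`, the parameter name `ktup` and the two toy sequences are assumptions made for this example, and the real program goes on to rescore and align the best regions by dynamic programming.

```python
from collections import defaultdict

def ktup_diagonals(query, target, ktup=2):
    """Count exact k-tuple matches per diagonal (target offset minus query offset)."""
    # Index every k-tuple of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Each shared k-tuple votes for the diagonal it lies on.
    diagonals = defaultdict(int)
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[j - i] += 1
    return dict(diagonals)

# Toy sequences (made up for illustration).
hits_short = ktup_diagonals("ACGTACGT", "TTACGTAC", ktup=2)
hits_long = ktup_diagonals("ACGTACGT", "TTACGTAC", ktup=4)
```

With the longer word length fewer exact matches are found, which is the speed versus precision trade-off discussed above; a diagonal that collects many matches is a candidate region for the subsequent alignment step.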
BLAST, developed by Altschul et al. in 1990, is another heuristic based on a similar idea. BLAST focusses on no-gap alignments of (again) a certain, fixed length. Rather than requiring exact matches, BLAST uses a scoring function to measure similarity (rather than distance). In particular for proteins, one can argue that segment pairs with no gaps and a high similarity score indicate regions of functional similarity. For a given threshold score BLAST reports to the user all database entries which have a segment pair with the query sequence that scores higher than the prescribed score. If the scoring function used has a probabilistic interpretation, BLAST can also give an assessment of the statistical significance of the matches it reports.
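The notion of a high-scoring segment pair without gaps can be sketched as follows. This is a simplified illustration under stated assumptions, not the BLAST algorithm itself: it scores +1 for a match and -1 for a mismatch instead of using a substitution matrix, it examines every diagonal exhaustively rather than seeding from short words, and the function names `best_segment_pair` and `report_hits` are hypothetical.

```python
def best_segment_pair(query, subject, match=1, mismatch=-1):
    """Return (score, query_start, subject_start, length) of the best ungapped segment pair."""
    best = (0, 0, 0, 0)
    # A diagonal fixes the offset between the two sequences; no gaps are allowed.
    for offset in range(-(len(query) - 1), len(subject)):
        i = max(0, -offset)  # current position in the query
        j = max(0, offset)   # current position in the subject
        score, start_i, start_j = 0, i, j
        while i < len(query) and j < len(subject):
            score += match if query[i] == subject[j] else mismatch
            if score <= 0:
                # A non-positive running score cannot help: restart the segment.
                score, start_i, start_j = 0, i + 1, j + 1
            elif score > best[0]:
                best = (score, start_i, start_j, i - start_i + 1)
            i += 1
            j += 1
    return best

def report_hits(query, database, threshold):
    """List database entries whose best segment pair scores higher than `threshold`."""
    return [name for name, seq in database.items()
            if best_segment_pair(query, seq)[0] > threshold]
```

For example, `report_hits("HEAGAWGHEE", {"a": "PAWHEAE", "b": "HEAGAWGHEE", "c": "TTTT"}, 2)` keeps entries `a` and `b` and drops `c`, which shares no residue with the query.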
A BLAST search can be carried out interactively through a link from the NCBI home page. NCBI also provides a BLAST server that can be accessed through e-mail. The BLAST home page provides three links: BLAST help, Basic BLAST search, and Advanced BLAST search. The Basic search provides a search with default parameters, including filtering for low complexity regions. The Advanced search allows a user to specify a number of BLAST parameters. There are five different BLAST modules that perform the following searches:
BLASTP compares an amino acid query sequence against a protein sequence database; BLASTN compares a nucleotide query sequence against a nucleotide sequence database; BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database; TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands). TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
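The six-frame conceptual translation used by BLASTX, TBLASTN and TBLASTX can be sketched in a few lines of Python. The sketch assumes the standard genetic code and an unambiguous DNA alphabet; the names `translate` and `six_frame` and the frame labels are choices made for this example and are not taken from the BLAST programs.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order ('*' marks a stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate one reading frame; an incomplete trailing codon is dropped."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for unknown codons
                   for i in range(0, len(dna) - 2, 3))

def six_frame(dna):
    """Return the conceptual translations of all six reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    frames = {}
    for shift in range(3):
        frames[f"+{shift + 1}"] = translate(dna[shift:])
        frames[f"-{shift + 1}"] = translate(reverse[shift:])
    return frames

frames = six_frame("ATGGCCTAA")
```

Each of the six translations can then be compared against a protein database (as in BLASTX) or against the translations of another nucleotide sequence (as in TBLASTX).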
NCBI also provides an associated search tool called ENTREZ for searching the comment fields of the sequence databases. To search with ENTREZ, select the database you want to search (e.g., GenBank) from the ENTREZ home page. A Search Page is opened, where the search query (keyword, term, phrase or partial word) is entered. The number of hits is reported. The search may be refined by progressively decreasing (or increasing) the number of hits by introducing additional search terms that are ANDed (or ORed) with the original term. ENTREZ is a powerful tool which enables one to search not only the comments field of sequence entries but also bibliographic entries in a section of MEDLINE. Online help is provided.
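The effect of ANDing and ORing search terms on the number of hits can be shown with a toy example. This sketch does not use or imitate the ENTREZ interface: the keyword index, the entry identifiers and the function names are all hypothetical.

```python
# Hypothetical keyword index: term -> set of entry identifiers (all made up).
INDEX = {
    "kinase": {"E1", "E2", "E3", "E4"},
    "human": {"E2", "E3", "E5"},
    "tyrosine": {"E3", "E6"},
}

def lookup(term, index=INDEX):
    """Return the set of entries whose comment field contains `term`."""
    return set(index.get(term.lower(), ()))

def refine_and(hits, term, index=INDEX):
    """Narrow the hit list: keep only entries that also match `term`."""
    return hits & lookup(term, index)

def refine_or(hits, term, index=INDEX):
    """Widen the hit list: add entries that match `term`."""
    return hits | lookup(term, index)

hits = lookup("kinase")
narrowed = refine_and(hits, "human")
widened = refine_or(narrowed, "tyrosine")
```

An ANDed term can only shrink the hit list and an ORed term can only grow it, which is why the two operations are used to home in on a manageable number of entries.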
Most journals which publish articles relating to protein or nucleic acid sequence determination require the author(s) to submit the sequence to one of the appropriate major databases and obtain an accession number as a precondition to the publication of the article. One of the typical uses of ENTREZ is to obtain the actual sequence of a protein or nucleic acid whose sequence determination is reported in a journal article. In this situation the accession number would be chosen as a search term after choosing the concerned database.
The Human Genome Project provides a number of biologist-friendly tools to search the entries in the database. These tools are accessible from the home page of the GDB: (http://gdbwww.gdb.org/).
Nowadays it is even possible to be alerted automatically if a new relative of your favoured sequence appears in any of the protein sequence databases. For example, the EMBL has one such alert service: (http://swan.embl-heidelberg.de:8080/register_sequence.html). The user can customise the homology search parameters that are used to check the daily updates of the major protein sequence databases and will be informed immediately by e-mail.
DBWatcher (Frédéric Plewniak, Bioinformatics, I.G.B.M.C., Strasbourg, France) is a program handling periodic BLAST searches to find similarities to your own sequences. It keeps track of the previous searches and only performs new ones when necessary. Only novel similarities are reported, thus saving the time of browsing through bulky result files. When executed daily (as a cron job) it ensures that you are informed as soon as new sequences similar to yours are incorporated into a given database. Results are sent by electronic mail to one or several addresses. DBWatcher can now be run as a client remotely. Sources are available for download from: (ftp://ftp-igbmc.u-strasbg.fr/pub/DBWatcher/dbwatcher.tar.Z)
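The idea of reporting only novel similarities between successive runs can be sketched as below. This is not DBWatcher's code: the JSON state file, the function name `novel_hits` and the use of plain identifiers for hits are assumptions made for illustration.

```python
import json
from pathlib import Path

def novel_hits(current_hits, state_file):
    """Return hits not seen in earlier runs and record them for the next run."""
    path = Path(state_file)
    # Load the identifiers reported by previous runs, if any.
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    new = sorted(set(current_hits) - seen)
    # Remember everything seen so far, including the hits of this run.
    path.write_text(json.dumps(sorted(seen | set(current_hits))))
    return new
```

A daily job would run the similarity search, pass the identifiers of the hits to `novel_hits`, and mail the returned list only if it is not empty.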
A number of utility programs which assist sequence analysis based research are available as either freeware or shareware.
The SEQIO package is a C/C++ package (or library) developed by James Knight at the University of California, Davis, which makes reading and writing sequences and biological databases easy. The following file formats are supported: Raw/Plain, GenBank, PIR (CODATA), EMBL, Swiss-Prot, FASTA, NBRF, IG/Stanford, ASN.1 text files, GCG, MSF, PHYLIP, ClustalW, and output from the FASTA and BLAST suites of programs.
SSEARCH in Bill Pearson's FASTA package is a C program for a Smith-Waterman search. It searches and aligns a query sequence against an entire database using the Smith-Waterman algorithm.
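For readers unfamiliar with the Smith-Waterman algorithm, a minimal Python sketch of local alignment is given here. It is not the SSEARCH code: it assumes a simple match/mismatch score and a linear gap penalty (the values 2, -1 and -2 are arbitrary choices for the example), whereas a real search would use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

A database search in this style simply applies the function to the query and every database sequence in turn and ranks the entries by score, which is rigorous but much slower than the FASTA and BLAST heuristics.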
The sim.c algorithm (and others) is located at: (http://globin.cse.psu.edu/ftp/dist/sim/)
SorFind, RepFind, and PromFind are three programs to analyse DNA sequences, developed by Dr. Gordon B. Hutchinson, Department of Medical Genetics, University of British Columbia, Canada. SorFind predicts coding exons in vertebrate genomic DNA. RepFind identifies common repetitive elements in DNA sequence. PromFind predicts promoter regions in vertebrate DNA sequence.
AutoGene (AUG) is a nucleotide sequence analysis program developed by Andrew Ptitsyn and collaborators. AUG contains programs for FASTA/GenBank/EMBL/AUG sequence format conversion, ALU and L1HS search, polyadenylation site recognition, vertebrate promoter site recognition, vertebrate exon/intron structure recognition, and some others. AUG can be obtained through ftp from: (ftp://ftp.bionet.nsc.ru). AutoGene also includes an exon-finding subsystem at: (ftp://ftp.bionet.nsc.ru/pub/biology/autogene).
Microbe Software has published a program called Plasmid Toolkit. It is an intuitive Windows program for producing publication quality plasmid maps with or without sequence data. Microbesoft's address on the Web is: (http://ourworld.compuserve.com/homepages/microbesoft)
There is a program called Gene Construction Kit that can be used to generate plasmid drawings. The program also does other useful things such as generating restriction maps of imported DNA sequences. The Gene Construction Kit is available at URL: (http://www.textco.com/).
ShadyBox is a drawing program which enables you to box and shade regular and irregular shaped segments of aligned multiple sequences. It was designed with the intention of producing PostScript output suitable for use in publications. It is also possible to colour regions of sequences, individual residues or residues of specified frequency. ShadyBox can be obtained from: ANGIS- The Australian National Genomic Information Service. (http://www.angis.su.oz.au) or (ftp://ftp.angis.su.oz.au/pub/unix)
ProMSED (Protein Multiple Sequences EDitor) for Windows is an easy-to-use application for automatic and manual multiple protein sequence alignment, alignment editing, analysis and printing. The interface and the main functions are similar to those of Microsoft Word. ProMSED can align a complete set of sequences, a subset of it, or any selected block, thus providing a flexible tool for sequence analysis, visualization, editing and preparation of illustrations. ProMSED has been developed by Dr. Alexey Eroshkin of the Institute of Molecular Biology, State Research Center of Virology and Biotechnology, "Vector", Koltsovo, Novosibirsk Region 633159, Russia, and is available at (ftp://ftp.ebi.ac.uk/pub/software/dos/promsed/)
GeneDoc is a full-featured multiple sequence alignment editor and shading utility with phylogenetic tree support. GeneDoc is also intended to help with the publication aspects of genetics research work by providing features such as shading, page and font layout. GeneDoc can read .MSF multiple sequence alignment files, or it can import FASTA format files to be saved as a .MSF file project.
Phylogenetic software PIWE and NONA written by Pablo Goloboff are available from the Willi Hennig Society's software pages: (http://www.vims.edu/~mes/hennig/software.html)
ANTHEPROT (ANalyze THE PROTeins) is available from the Institut de Biologie et Chimie des Proteines, UPR 412-CNRS, Lyon Cedex, France. Well known sequence formats are supported. It includes a helical wheel diagram that can be moved along the sequence, coupled in real time with a 3D view of the helix in alpha carbon representation. It can be obtained through ftp from: (ftp://ftp.ibcp.fr).
SCOP, the Structural Classification of Proteins database, hierarchically organizes all proteins of known structure according to their structural and evolutionary relationships. The database can be accessed at URL: (http://scop.mrc-lmb.cam.ac.uk/scop/)
ALSCRIPT takes an alignment and produces PostScript. Download instructions are at: (ftp://geoff.biop.ox.ac.uk/README).
The AMAS server allows a multiple alignment to be analysed for interesting conservation patterns. PostScript output with boxing and shading of the alignment is provided. The AMAS server site on the Web is: (http://geoff.biop.ox.ac.uk/servers/amas_server.html).
GELPICTURE reads a contig from the Fragment Assembly database and displays a diagram of the gel alignments and a printout of the aligned gel sequences and consensus. GELPICTURE has been modified to include the sequence direction in both sections of the output, and to mark with '=======' any consensus sequence that is correct (agrees with every fragment) and has been sequenced in both directions.
GELFIGURE produces a graphical summary of a contig in a fragment assembly project. The output is in four sections: a redundancy plot, a diagram of the directions and orientations of sequence fragments, a restriction map and a plot of open reading frames.
The plot is intended both as a quality guide during the course of a sequencing project, and as a final report for a completed assembly.
SeqPup developed by Don Gilbert (Biocomputing, University of Indiana, Bloomington) is a versatile biological sequence editor and analysis program usable on the common computer systems of Macintosh, MS-Windows and X-Windows. SeqPup can be obtained through ftp: (ftp://iubio.bio.indiana.edu/molbio/seqpup/) or from the Web at: (http://iubio.bio.indiana.edu/1/IUBio-Software%2bData/molbio/seqpup/)
The sequencing of the entire genome of S. cerevisiae, completed recently, marks a major event in the history of biology. The analysis of these gene products will provide powerful tools for reading the genomes of other eukaryotes, particularly those of higher eukaryotes. The analysis of the yeast genome has provided a useful framework for the annotation of many of the complete genome projects currently nearing completion, as well as the upcoming human genome. A yeast web page has been set up by Jim Freeman at the Bio-Molecular Engineering Center at BU. The yeast sequence information on this web page was obtained from the GeneQuiz Consortium and the MIPS Genome Commission, and an attempt has been made to integrate these two data structures as well as to supplement their annotation with that obtained from a set of functionally diagnostic patterns (Adams, R. M., et al. Protein Science 5, 1240-49, 1996). The yeast web page hosts the following search tools:
User sequence as query (via BLAST): (http://bmerc-www.bu.edu/protein-seq/wwwblast.html)
User keyword as query: (http://bmerc-www.bu.edu/protein-seq/yeast-keyword-search.html)
Unix egrep regular expression as a sequence query: (http://bmerc-www.bu.edu/protein-seq/yeast-egrep-search.html)
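A regular expression used as a sequence query can be illustrated with Python's `re` module, whose syntax overlaps with, but is not identical to, that of egrep. The function name `regex_search` and the two toy protein sequences are made up for this sketch; the pattern is a simplified form of the nucleotide-binding P-loop motif.

```python
import re

def regex_search(pattern, sequences):
    """Return (name, start, matched_text) for every match of `pattern`."""
    compiled = re.compile(pattern)
    return [(name, m.start(), m.group())
            for name, seq in sequences.items()
            for m in compiled.finditer(seq)]

# Toy sequences (hypothetical, not taken from the yeast database).
toy_db = {"prot1": "MAGSPTAGKSTLL", "prot2": "MKLVVV"}
matches = regex_search("G....GK[ST]", toy_db)
```

Here `.` stands for any residue and `[ST]` for serine or threonine, so the query finds a glycine, four arbitrary residues, and then G-K followed by S or T.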
ASSET (Aligned Segment Statistical Evaluation Tool) includes 3 programs:
asset, purge and scan. The PURGE program removes closely related sequences
from an input file prior to running asset. This is important in order to
reduce input sequence redundancy. The command syntax for purge is:
purge
A program for the prediction of transmembrane helices using neural networks
(PHDhtm) is available at EMBL:
(http://www.embl-heidelberg.de/predictprotein/predictprotein.html)
Cutter is a web-based service that analyzes a given sequence for restriction
enzyme sites and gives an easy-to-follow analysis.
(http://www.ccsi.com/firstmarket/cutter/cutter%2b.html)
The Genome Sequencing Facility at Brookhaven National Laboratory also hosts
a restriction enzyme analysis service for DNA sequences at:
(http://genome1.bio.bnl.gov/cgi-bin/bbq?para=rea&MODE=0)
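What such restriction analysis services compute can be sketched in a few lines of Python. The sketch is not the code behind either server: it knows only three well-known enzymes, reports the 1-based start of each recognition site rather than the exact cut position, and the function name `find_sites` is an assumption.

```python
# Recognition sequences of three common enzymes (all palindromic,
# so scanning one strand is sufficient).
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_sites(dna, enzymes=ENZYMES):
    """Return {enzyme: [1-based start positions of its recognition site]}."""
    dna = dna.upper()
    sites = {}
    for name, site in enzymes.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start + 1)        # convert to 1-based coordinates
            start = dna.find(site, start + 1)  # continue after this hit
        sites[name] = positions
    return sites

sites = find_sites("ccGAATTCaaGGATCCttGAATTC")
```

From the site positions a fragment table or a simple map can be derived; enzymes with no site in the sequence are returned with an empty list.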
DNA2Prot XFCN by Jared Roach (c) August 1996 translates DNA sequences.
PROMOTER SCAN II is a program developed to recognize and predict POL II
promoters in genomic DNA sequences. Presently it is limited to mammalian
promoter sequences, and is set to find approximately 60-70% of promoter
sequences never before seen by the program, with an expected false positive
rate of less than 1 in 30,000 single-stranded bases (based upon cross
validation tests). The program is accessible on the Web at URL:
(http://biosci.umn.edu/software/proscan/promoterscan.htm)
A comprehensive package of DNA/protein sequence analysis programs can be
accessed from: (http://www.webgenetics.com).
ConsInspector is a program to scan nucleic acid sequences for matches to a
precompiled library of transcription factor binding sites. The program
carries out an extensive examination of binding site candidates: the real
sequence is compared with randomly shuffled versions and sequence regions
surrounding the conserved binding site are included in the analysis.
ConsInspector is available for UNIX and VAX/VMS for ftp at:
(ftp://ariane.gsf.de/pub/unix/) and (ftp://ariane.gsf.de/pub/vax/),
respectively, or through the Web site:
(http://www.gsf.de/biodv/consinspector.html).
XPound is an exon-predicting program.
BLOCK (pedigree and linkage analysis in large complex pedigrees) written
by Claus Skaanning Jensen, Aalborg University, Denmark, implements a
method called blocking Gibbs sampling. The method is based on the Markov
chain Monte Carlo method, Gibbs sampling, but combines this stochastic
method with exact local computations to get a method that can successfully
handle very large and complex (e.g., inbred) pedigrees (thousands of
individuals). The method allows the user to test the presence of linkage
between two genes. BLOCK is on the Web at:
(http://www.cs.auc.dk/~claus/block.html)
MatInd and MatInspector are tools for the definition and detection of
consensus matches in DNA sequences. MatInspector uses a large library of
predefined matrix descriptions of transcription factor binding sites to
locate matches in nucleotide sequences of unlimited length. This library
is based on TRANSFAC database:
(http://transfac.gbf-braunschweig.de/TRANSFAC).
MatInd and MatInspector, together with the library, are available for UNIX
and DOS at: (ftp://ariane.gsf.de/pub/). MatInspector can also be used
interactively at (http://www.gsf.de./biodv/matinspector.html)
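Scanning a sequence with a matrix description of a binding site can be sketched as follows. This is a generic weight-matrix scan and not MatInspector's scoring scheme; the matrix values, the threshold and the function name `scan_matrix` are invented for the example.

```python
def scan_matrix(dna, matrix, threshold):
    """Slide a weight matrix along `dna`; return (start, window, score) hits.

    `matrix` is a list with one dict per site position, mapping base -> score.
    """
    width = len(matrix)
    hits = []
    for start in range(len(dna) - width + 1):
        window = dna[start:start + width]
        # Sum the per-position scores; bases absent from a column score 0.
        score = sum(column.get(base, 0) for column, base in zip(matrix, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical four-position matrix favouring the word TATA.
toy_matrix = [{"T": 3}, {"A": 3}, {"T": 3}, {"A": 3}]
toy_hits = scan_matrix("GGTATAGG", toy_matrix, threshold=9)
```

Because the scan keeps only one window in memory at a time, the length of the sequence that can be searched is limited only by the input itself.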
GENET is an online searchable database. The database presents known gene
network organizations and includes maps of gene-gene interactions,
sequences and structure of known regulatory elements, and links to GenBank
and Medline references. GENET is on the Web at:
(http://www.iephb.ru/~spirov/genet00.html)
Many popular computational biology programs are available for the popular
Linux (PC Unix freeware) platform. These include: clustalw (alignment),
readseq (conversion of sequence files), phylip (phylogenetic analysis),
GDE (Genetic Data Environment), ACeDB (C. elegans genome database),
xbbtools (visual sequence analysis), seaview (visual alignment editor),
phylo_win (visual phylogenetic analysis), blast (database search), and fasta.
ProAnalyst is an easy-to-use, state-of-the-art MS-DOS application designed
to solve traditional and new tasks of protein science. Developed by
Vladimir Ivanisenko and Alexey Eroshkin at the State Research Center of
Virology and Biotechnology, Koltsovo, Russia, ProAnalyst is available from
the EBI software library: (ftp://ftp.ebi.ac.uk/pub/software/dos/proanalyst/).
ProAnalyst is basically an advanced statistical analysis program which,
for instance, relates experimental data to protein primary and tertiary
structure, finds relationships between protein sites' characteristics
(hydrophobicity, amphipathicity, etc.) and protein activities, investigates
differences between proteins divided by functional, evolutionary or other
criteria (for example, relates genotype to phenotype), etc.
ProAnWin is a similar program for use in the Windows environment. In the EBI
software library ProAnWin is at:
(ftp://ftp.ebi.ac.uk/pub/software/dos/proanwin)
PUZZLE is a maximum likelihood program for reconstructing phylogenetic trees
from nucleotide and amino acid sequence data. It is available free of charge
over the Internet and runs on all popular systems. It is distributed by the
European Bioinformatics Institute:
(ftp://ftp.ebi.ac.uk/pub/software/dos/puzzle) (DOS version)
(ftp://ftp.ebi.ac.uk/pub/software/mac/puzzle) (MacOS version)
(ftp://ftp.ebi.ac.uk/pub/software/unix/puzzle) (UNIX version)
(ftp://ftp.ebi.ac.uk/pub/software/vms/puzzle) (VMS version)
Emmanuel Skoufos of the Yale University School of Medicine has set up a
new gene discovery page. The purpose of this page is to serve as a "desktop"
area, primarily for the bench scientist with little biocomputing background.
It organizes existing search engines in a coherent, stepwise fashion
providing one of the many strategies that may lead to gene discovery.
Questions that this page helps to answer are of the type: "Does a particular
sequence of DNA code for proteins and what may their function be?" or "Is
there a protein in organism A homologous to protein X of organism B?", etc.
The principal site and a European mirror site are available at:
(http://www.geocities.com/CapeCanaveral/1915/gdp.html)
(http://konops.imbb.forth.gr/~topalis/mirror/gdp.html)
SSPAL - prediction of protein secondary structure by using local alignments
has been published by Victor V. Solovyev:
(http://dot.imgen.bcm.tmc.edu:9331/pssprediction/pssp.html)
FGEBEHB - search for Mammalian gene structure with exons assembling by
dynamic programming and using similarity information with known proteins
by database scanning with fasta.
FEXHB - search for Mammalian coding exons using exon recognition functions
and similarity information with known proteins by database scanning with
fasta. Some additional information about Gene-Finder programs can be
obtained from:
(http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html)
Other utilities available at this site are:
FGENEH - search for Mammalian gene structure with exons assembling by
dynamic programming
FEXH - search for 5'-, internal and 3'-exons
HEXON - search for internal exons
HSPL - search for splice sites
RNASPL - prediction of exon-exon junctions in cDNA sequences
CDSB - prediction of Bacterial coding regions
HBR - recognition of human and bacterial sequences to test a library for
E. coli contamination by sequencing example clones
TSSG - recognition of human promoter regions (Ghosh/Prestridge motif data)
TSSW - recognition of human promoter regions (Wingender motif database)
POLYAH - recognition of 3'-end cleavage and polyadenylation region of
human mRNA precursors
FGENED - search for Drosophila gene structure with exons assembling by
dynamic programming
FEXD - search for Drosophila 5'-, internal and 3'-exons
DSPL - search for Drosophila splice sites
FGENEN - search for Nematode gene structure with exons assembling by dynamic
programming
FEXN - search for Nematode 5'-, internal and 3'-exons
NSPL - search for Nematode splice sites
FGENEA - search for Plant gene structure with exons assembling by dynamic
programming
FEXA - search for Plant 5'-, internal and 3'-exons
ASPL - search for Plant splice sites
SSP - prediction of alpha-helix and beta-strand in globular proteins by
segment-oriented approach.
NSSP - prediction of alpha-helix and beta-strand segments in globular
proteins by nearest-neighbor algorithm.
PSITE - search for PROSITE patterns with statistics
DynaClip is a program designed to trim a little bit off of the 5' and 3'
ends of DNA sequence reads. DynaClip can be found at:
(http://weber.u.washington.edu/~roach/Programs/)
Lasergene, a modular program package by the company DNASTAR, has a decent
sequence alignment module that uses the Clustal method. It also has a PCR
primer selection module. (mailto:sales@dnastar.com)
CLONE is a program that can identify RE sites, ORFs, cut, ligate etc.
It has a companion ENHANCE, which gives a decent number of ways to present
your cloning strategy and another companion, PRIMER, which helps in primer
design and is rather flexible.
The WWW site at EMBL (www.ebi.ac.uk) has a number of programs for DNA and
protein analysis, e.g., BBSEQ for sequence conversion, MACAW for sequence
similarity for primer design, Clustal and Phylip for sequence alignment,
etc.
Mac emulator Executor is a program that enables Apple Mac programs to be
run on Wintel (Windows/ Intel) machines. Executor runs under Linux, DOS,
and Windows95.
BioOnline Store hosts a text based search engine.
(http://synapse.bio.com/cgi-bin/bio)
ClustalW documentation and on-line help are available in HTML format at:
(http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/)
The BioComputing Division of the Virtual School of Natural Sciences,
a member school of the Globewide Network Academy, conducted a Course
on BioComputing in 1996, using the electronic conferencing system BioMOO.
An account of the course, and access to its hypertext book and related
materials, are available at:
(URL:http://www.techfak.uni-bielefeld.de/bcd/welcome.html).
The edited transcript includes a link to the main FastA v 3.0 FTP
distribution site and to previous lectures, and is available at the
following locations (WWW/hypertext):
(http://www.techfak.uni-bielefeld.de/bcd/Lectures/pearson3.html)
(http://merlin.mbcr.bcm.tmc.edu:8001/bcd/Lectures/pearson3.html)
(http://www.biotech.ist.unige.it/bcd/Lectures/pearson3.html)
PROSITE is a program that enables long amino acid sequence patterns to be
searched in sequence databases. PROSITE can be found at:
(http://www.genome.ad.jp/SIT/MOTIF.html)
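PROSITE-style patterns map almost directly onto regular expressions, which is one simple way a pattern search can be carried out. The converter below is a minimal sketch and not the code of any of the programs mentioned here; the function name `prosite_to_regex` is an assumption, and rarer constructs (such as anchors placed inside square brackets) are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}' to a Python regex."""
    pattern = pattern.rstrip(".")  # patterns are often written with a final period
    parts = []
    for element in pattern.split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) and (n,m) are repetition counts.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase x means "any residue".
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- and C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)

glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
ploop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
```

The first example is the N-glycosylation site pattern; `re.search(glyco, "MKNGSAL")` finds it starting at the asparagine in the third position of that made-up sequence.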
Visual sequence editor (VISED) is a multiple sequence editor for Microsoft
Windows platforms. It features viewing and editing of up to 200 sequences
simultaneously, boxed output, a pattern search function using PROSITE syntax,
translation and other simple nucleic acid functions. VISED is available at:
(ftp.bio.indiana.edu/molbio/ibmpc)
Plasmid Processor is a simple tool for plasmid presentation for scientific
and educational purposes. It features both circular and linear DNA, user
defined restriction sites, genes and multiple cloning sites. In addition you
can manipulate the plasmid by inserting and deleting fragments. Created
drawings can be copied to the clipboard or saved to disk for later use.
Printing from within the program is also supported. Plasmid Processor was developed by
T. Kivirauma, P. Oikari and J.Saarela of the Department of Computer Sciences
and Applied Mathematics, University of Kuopio, Finland. It can be obtained
from URL: (http://www.uku.fi/~kiviraum/plasmid/plasmid.html)
Wentian Li, Laboratory of Statistical Genetics, Rockefeller University
maintains a number of bibliographies for the benefit of computational
biologists. A bibliography on Computational Gene Recognition is available at:
(http://linkage.rockefeller.edu/wli/gene)
Another bibliography on long-range correlations in DNA sequences is at:
(http://linkage.rockefeller.edu/wli/dna_corr)
Papers on short-range and middle-range correlations in DNA sequences will
also be included. A comprehensive list of computer software for genetic
linkage analysis and genetic map construction can be found at:
(http://linkage.rockefeller.edu/soft/list.html)
For each program, the following information is provided whenever possible:
description, authors, web or ftp site, source code language, operating
systems the program runs on, references.
TACG is a character-based, command line tool for the restriction enzyme
analysis of DNA for Unix-like operating systems. Written by Harry Mangalam,
UC Irvine, it can be obtained from: (ftp://mamba.bio.uci.edu/pub/tacg/)
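The core of a restriction analysis can be sketched in a few lines of Python. This is only an illustration of the idea and not TACG's own method; the function names and the test sequence are hypothetical. The sketch assumes a linear molecule and scans one strand only, which is sufficient for a palindromic site such as that of EcoRI (GAATTC, cut after the first G).

```python
def find_sites(sequence, site):
    """Return the 1-based start positions of every occurrence of site."""
    sequence = sequence.upper()
    site = site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)
        # Resume one base further on so overlapping sites are not missed.
        start = sequence.find(site, start + 1)
    return positions

def fragment_lengths(sequence, site, cut_offset):
    """Fragment lengths after cutting a linear molecule at every site.

    cut_offset is the number of bases of the site left of the cut,
    e.g. 1 for G^AATTC.
    """
    cuts = [p - 1 + cut_offset for p in find_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]
```

With the made-up sequence "AAGAATTCTTTTGAATTCGG", find_sites reports EcoRI sites at positions 3 and 13, and fragment_lengths with a cut offset of 1 gives fragments of 3, 10 and 7 bases.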
NRSub (the Non-Redundant Bacillus subtilis data base) is available through
anonymous FTP at: (ftp://biom3.univ-lyon1.fr/pub/nrsub/) or
(ftp://ftp.nig.ac.jp/pub/db/nrsub/)
It is also possible to access NRSub through two World Wide Web servers at:
(http://acnuc.univ-lyon1.fr/nrsub/nrsub.html) or
(http://ddbjs4h.genes.nig.ac.jp/)
A Protein Secondary Structure Prediction (DSC) program has been developed
by Ross D. King of the Biomolecular Modelling Laboratory, Imperial Cancer
Research Fund, London. Two prediction modes are available:
1) Given a single sequence, a multiple sequence alignment will be formed
and DSC used to predict secondary structure. This mode can be accessed at:
(http://www.icnet.uk/bmm/dsc/dsc_form_align.html)
2) Given a multiple sequence alignment, DSC will use this alignment to
predict secondary structure. This mode can be accessed at:
(http://www.icnet.uk/bmm/dsc/dsc_read_align.html)
The C source code of DSC is available and can be obtained through ftp:
(ftp://ftp.icnet.uk/icrf-public/bmm/king/dsc/dsc.tar.z)
Ross Overbeek, et al. at the Argonne National Laboratory have developed
the WIT/PUMA2 system that supports metabolic reconstructions and integration
of sequence and phylogenetic and metabolic information in a coherent
interactive environment. It consists of two parts, WIT and PUMA.
WIT is an interactive tool for an expert biologist, which allows one to
develop a metabolic reconstruction for an organism (from a complete, or
partial genome). WIT is available at: (http://www.cme.msu.edu/WIT/)
PUMA is a growing repository of the metabolic models for the organisms,
developed in WIT.
Several kinds of bioinformatics related searches (of GenBank, the MEDLINE
molecular biology subset, OMIM, Entrez, the BCM Search Launcher, PDB, BLAST,
GDB, etc.) can be executed from the Biological Data Transport Web resource:
(http://www.data-transport.com)
Drawtree is a program written by Joe Felsenstein that produces tree pictures.
It is included in the PHYLIP software package. The 386 DOS executables for
PHYLIP are available at:
(http://evolution.genetics.washington.edu/phylip.html)
An alternative to DRAWTREE is the program TreeView written by Rod Page,
which runs under Windows and supports a range of tree file formats. The
tree pictures can be cut and pasted into other applications, as well as
saved as a Windows metafile (recognised by most drawing and word processing
programs). For more information, visit the site:
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
Don Gilbert of Indiana University has produced TreeDraw Deck.
(ftp://iubio.bio.indiana.edu/)
EditBase is a program developed by Purdue Research Foundation & USDA/ARS.
It is useful for DNA cloning, etc.
Bootscanning (c) Mika Salminen, Wayne Cobb, Henry M. Jackson Foundation
is a method for analysis of viral recombination. It can be used to compare
an unknown, suspected recombinant sequence, to a set of predefined potential
parental sequences. It should be independent of organism, but, based on
tests on HIV-1 it appears to work only for sufficiently variable genes.
GDE and Phylip are required to run the package, and at the current time,
only SUN executables are available. However, the source-code is also
included. Bootscanning is available at: (http://hivgenome.hjf.org/)
Hyperchem models molecules and does very good minimization via molecular
mechanics and quantum mechanics. Hyperchem is available at:
(http://www.ppgsoft.com/)
MolScript is a molecular graphics program written by Per Kraulis. A
description and a pointer to an alternative program is available at:
(http://www.bocklabs.wisc.edu/Molscript.html)
NAMD is a high-performance molecular mechanics program for simulating large
biomolecular systems on parallel and distributed computers developed by the
Theoretical Biophysics group at the University of Illinois and the Beckman
Institute. This software is made available to the molecular modeling
community free of charge, and includes commented source code, documentation
for users and programmers, and precompiled binaries for HP and SGI
workstations. Detailed documentation and the software are at:
(http://www.ks.uiuc.edu/Research/namd/)
(ftp://ftp.ks.uiuc.edu/pub/mdscope/namd/)
Kevin Shreder, University of California, San Diego has constructed the
Antibody Resource Page. The webpage contains educational links about
antibodies (some with incredible graphics), links to on-line journals
that cover antibody-related topics, an essay on the study of antibody
molecular recognition, links to on-line antibody sequencing and hybridoma
databases, and a miscellaneous section. There is also a large section
designed to help those looking for an antibody. This latter section
contains more than 60 links to on-line companies that sell antibodies,
many of which have searchable catalogues. This section also contains useful
tips on how to find antibodies using the internet or otherwise. The
Antibody Resource Page is at:
(http://www-chem.ucsd.edu/Faculty/goodman/antibody.html/abpage.html)
Several shareware programs are available for common molecular applications,
e.g., MOPAC for energy minimization, Babel to convert output to PDB format,
and Molden for molecular model building.
The Prolysis server contains a lot of information on proteases and a link to
the MAGE software, which could be useful for any teaching program on enzymes
and proteins. The Prolysis server is at URL:
(http://prolysis.phys.univ-tours.fr/Prolysis)
Kinemage, RasMol and Linus!Lite are three programs which are particularly
useful for teaching purposes. RasMol is particularly useful when used in
combination with "ChemScape Chime". Linus!Lite can produce "ray trace"
images very efficiently and the resultant images can be manipulated
(rotation). You can also generate movies with Linus!Lite with little
effort. Unfortunately, Linus!Lite is not available for Wintel.
PDB files can be downloaded (xxx.full) as text files and then directly
viewed through RasMol. For viewing through Mage, the file must first be
read in with RasMol as PDB files and written out as kinemage files by
issuing the command, "write kinemage filename.kin" in the command window,
where "filename.kin" is the name of the output file.
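When many structures have to be converted, the two RasMol commands can be written to a script file instead of being typed in the command window. The Python sketch below only builds the text of such a script; the function name and the file names are hypothetical, and how the script is then passed to RasMol depends on the installed version.

```python
def kinemage_script(pdb_path, kin_path):
    """Build a RasMol script that loads a PDB file and writes a kinemage file."""
    lines = [
        "load pdb " + pdb_path,        # read the structure as a PDB file
        "write kinemage " + kin_path,  # write it out in kinemage format
        "quit",                        # leave RasMol when done
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    print(kinemage_script("xxx.full", "xxx.kin"), end="")
```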
The URLs of RasMol, ChemScape and Linus!Lite are, respectively,
(http://www.umass.edu/microbio/rasmol/getras.htm)
(http://www.mdli.com/mdlhome.html)
(http://www.blc.arizona.edu/linus/linus.html)
A modified version of Rasmol is available at U.C. Berkeley; it enables
fiddling and twiddling, i.e., fragments can be rotated about a bond
as an axis. The first version was based on RasMac 2.5, and supports the
PPC chip; Beta test versions of Rasmol2.6-ucb are available for Mac,
Windows, Linux, Ultrix, and HP-UX. This version of RasMol is available at:
(http://hydrogen.cchem.berkeley.edu:8080/Rasmol/)
Three-dimensional structural information for S. cerevisiae proteins is
now available through the Saccharomyces Genome Database web site at the
following URL: (http://genome-www.stanford.edu/Sacch3D)
The following programs can be used to measure the nucleic acid parameters
and/or construct a new nucleic acid from scratch based on a new set of
parameters:
1) BIOSYM/MSI InsightII and MacroModel for DNA.
2) MC-SYM (http://www.iro.umontreal.ca/people/major/mcsym.html) for RNA
3) NAB for DNA (http://scripps.edu/case)
4) newhel93 for DNA available in PDB (http://www.gdb.org/hopkins.html)
5) Rasmol (http://www.gdb.org/hopkins.html)
Nemesis from Oxford Molecular allows modeling, manipulations, as well as
energy calculations. The Web Address of Oxford Molecular is:
(http://www.oxmol.co.uk/PRODUCTS/nemesis_top.html)
Modeller by Andrej Sali at Rockefeller University, is a homology based
modeling program for proteins. It is available at the URL:
(http://guitar.rockefeller.edu)
Swiss-Model is another such program, which gives good results if the sequence
is highly homologous to a known structure. Swiss-Model is at:
(http://expasy.hcuge.ch/swissmod/SWISS-MODEL.html)
Raster3D runs on Unix derivatives and gives nice protein ribbon drawings,
especially if used together with Molscript.
NIH-Imager can be found at http://rsb.info.nih.gov/nih-image/ and Image
Tool (from University of Texas in San Antonio) can be found at:
(ftp://maxrad6.uthscsa.edu/pub/it).
NIH hosts an impressive site with Molecular Modeling resources at:
(http://cmm.info.nih.gov/modeling/quick_finder.html)
Particularly noteworthy are "Molecules R Us", a search engine, graphic
display of molecules in different forms and a PDB viewer helper application.
The UCLA-DOE Protein Fold Recognition Automated Server takes an amino acid
sequence and searches the known structures to find a compatible fold. In
addition, it automatically provides the results from other sequence analysis
programs. The server is at: (http://www.mbi.ucla.edu/people/frsvr/frsvr.html)
NCBI's Entrez provides a means to visualize molecular structure data. The
viewer, Cn3D, is part of Network Entrez client programs and can also be
used as a helper application for the Web version of Entrez. A full
description of the Cn3D program may be found at the NCBI's Structure Group
home page: (http://www.ncbi.nlm.nih.gov/Structure)
3DBbrowse is a Web based browser, that makes it easy to search and retrieve
data from the Protein Data Bank (PDB). It allows the user to rapidly search
through the contents of the entire PDB Archive for entries obeying certain
constraints. A full text search (based on the Glimpse indexing and query
system) can be made for any string appearing in the text of a PDB entry.
3DBbrowse is available at: (http://pdb.pdb.bnl.gov/3DBbrowser.html)
PovChem is a program that takes pdb files as input, and uses the ray-
tracing program PovRay to produce graphics. PovChem is available at:
(http://ludwig.scs.uiuc.edu/~paul/PovChem.html)
NAOMI - a program for studying 3-D structures of proteins is available from
the Web site: (http://www.ocms.ox.ac.uk/~smb/Software/N_details/naomi.html)
or via anonymous ftp: (ftp://nmrz.ocms.ox.ac.uk/pub/smb/naomi)
SETOR solid model macromolecules program can be found at:
(http://scsg9.unige.ch/fln/setorlic.html)
Gamess-UK is a general purpose ab initio quantum chemistry package
distributed by Computing for Science Ltd. The program can be used to study a
wide range of chemical phenomena (including biological problems, such as drug
design and enzyme catalysis). The program is available free to UK academics.
Non-UK academics pay a nominal fee to cover administration and installation
costs. More information on GAMESS-UK is available on the World Wide Web
(http://www.dl.ac.uk/CFS)
ProFit is a least squares fitting program, written by Andrew Martin of
University College, London. It performs the basic function of fitting one
protein structure to another. One can specify subsets of atoms to be
considered, and zones to be fitted by number, sequence, or by sequence
alignment. The program will output an RMS deviation and optionally the
fitted coordinates. Zones for calculating the RMS can be different from
those used for fitting. ProFit is available from:
(http://www.biochem.ucl.ac.uk/~martin/#programs)
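The RMS deviation reported by such a program can be illustrated with a short Python sketch. It is not ProFit's code: the least-squares superposition itself is not shown, the two coordinate sets are assumed to be already fitted and to list equivalent atoms in the same order, and the function name and the zone argument are hypothetical. The optional zone mirrors the idea that the atoms used for the RMS may be a subset chosen independently of those used for fitting.

```python
import math

def rms_deviation(coords_a, coords_b, zone=None):
    """RMS deviation between two already superposed coordinate sets.

    coords_a and coords_b are equal-length sequences of (x, y, z) tuples.
    zone is an optional list of atom indices (0-based) to restrict the
    calculation to; by default all atoms are used.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets differ in length")
    indices = list(range(len(coords_a))) if zone is None else list(zone)
    if not indices:
        raise ValueError("no atoms selected")
    total = 0.0
    for i in indices:
        # Squared distance between the equivalent atoms of the two structures.
        total += sum((p - q) ** 2 for p, q in zip(coords_a[i], coords_b[i]))
    return math.sqrt(total / len(indices))
```

For three atoms displaced by 0, 1 and 2 units, the function gives the square root of 5/3 over all atoms, and the square root of 1/2 when the zone is limited to the first two.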
GRAMM (Global RAnge Molecular Matching) is a program for protein docking
written by Ilya Vakser. To predict the structure of a complex, it requires
only the atomic coordinates of the two molecules (no information about the
binding sites is needed). The program performs an exhaustive 6-dimensional
search through the relative translations and rotations of the molecules.
The molecular pairs may be: two proteins, a protein and a smaller compound,
two transmembrane (TM) helices, etc. GRAMM may be used for high-resolution
molecules, for inaccurate structures (where only the gross structural
features are known), in cases of large conformational changes, etc.
The Global Range Molecular Matching (GRAMM) methodology is an empirical
approach to smoothing the intermolecular energy function by changing the
range of the atom-atom potentials. The technique makes it possible to locate the
area of the global minimum of intermolecular energy for structures of
different accuracy. The quality of the prediction depends on the
accuracy of the structures. Thus, the docking of high-resolution
structures with small conformational changes yields an accurate
prediction, while the docking of ultralow-resolution structures will
give only the gross features of the complex.
The GRAMM site on the Web is (http://guitar.rockefeller.edu/).
Presently, GRAMM is compiled on SGI R4000, SGI R4400, SGI R8000, and SGI
R10000 Unix workstations. The author plans to expand this list in the near
future, so check the GRAMM site for updates. Interestingly, GRAMM also works
on a PC platform under Windows95 (the performance on a P5-120 with 16 MB RAM
is only two times slower than on an SGI 250 MHz Indigo2 R4400).
SnB (Shake-and-Bake) is a simulated annealing program hosted at
Roswell Park Memorial Institute, Buffalo. SnB can be obtained from:
(http://www.hwi.buffalo.edu/SnB/).
A web server for prediction of protein secondary structure percentages
from UV circular dichroism spectra has been established by Merelo and
Andrade of the University of Granada. 41 CD values ranging from 200 nm
to 240 nm are to be submitted (given in deg cm^2 dmol^-1 multiplied by
0.001) and the server gives back the estimated percentages of helix, beta
and rest of secondary structure of your protein plus an estimate of the
accuracy of the prediction. The prediction is done using a Kohonen neural
network with a 2-dimensional output layer. The http address of the k2d
server is: (http://kal-el.ugr.es/k2d/spectra.html)
The program can be downloaded from:
http://www.embl-heidelberg.de/~andrade/k2d.html
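Preparing the 41 input values can be sketched as follows in Python. The sketch assumes a spacing of 1 nm (which is what 41 values between 200 nm and 240 nm imply) and lists the values from 200 nm upwards; the order expected by the server is not stated here and should be checked on the k2d page. The function name and the dictionary input are hypothetical.

```python
def k2d_input(ellipticity_by_nm):
    """Select and scale the CD values for submission.

    ellipticity_by_nm maps a wavelength in nm (integer) to the measured
    value in deg cm^2 dmol^-1. The result holds the 41 values from 200 nm
    to 240 nm, each multiplied by 0.001.
    """
    wavelengths = range(200, 241)  # 200..240 nm inclusive: 41 points
    missing = [w for w in wavelengths if w not in ellipticity_by_nm]
    if missing:
        raise ValueError("no CD value for wavelengths: %s" % missing)
    return [ellipticity_by_nm[w] * 0.001 for w in wavelengths]
```

A spectrum recorded over a wider range, say 190 nm to 260 nm, can be passed as it is; only the 200 nm to 240 nm window is kept.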
Several programs are available to support the voltage clamp technique of
membrane biophysics, e.g., PClamp, a DOS program, and the commercial package
MicroCal Origin, which has a PClamp module. Lars Thomsen has written a program,
PROFILE, that generates a voltage profile as used when doing whole cell
recordings. It is a true 32-bit WIN95 program and thus provides good graphics
support and transferability. It requires a minimum screen resolution of
800x600 pixels. Whenever a setting is changed the picture is updated and
copied automatically to the clipboard. PROFILE is available at:
(http://home.interlynx.net/~lthomsen/index.htm).
NASA's Ames Research Center hosts a good home page pertaining to 3D
reconstruction from 2D images.
(http://biocomp.arc.nasa.gov:80/3dreconstruction/)
There is a package called SwaN-MR written by Dr. Balacco which does all
NMR processing on a Macintosh. It can be downloaded from sfdzuma.usc.es,
in the directory /pub/NMR.
Hanqing Wu has created a homepage of "Online EPR Spectrum Simulation through
CGIEMAIL" at:
(http://www.uwm.edu/~hanqing/watoc/oleprsm.htm)
LEE (Latent Energy Environments) is an artificial life simulator developed
by Richard Belew and Filippo Menczer of the University of California,
San Diego. LEE can be obtained through anonymous FTP:
(ftp://cs.ucsd.edu/pub/LEE), or through a link from the URL:
(http://www.cs.ucsd.edu/users/fil/lee/lee.html)
Foster Findlay Associates, Newcastle Upon Tyne, have developed PC_Image
for Windows 95 and Windows 3.1 and several other software packages for image
processing and analysis. These can be obtained through URL:
(http://www.demon.co.uk/ffaltd/)
There are nice pages on search engines at:
(http://www.unige.ch/crystal/w3vlc/int.index.html)
(http://scsg9.unige.ch/fln/setorlic.html)
MathPad is freeware and available at info-mac or at
(http://pubpages.unh.edu/~whd/MathPad/).
David Mathog, Manager, sequence analysis facility, biology division,
Caltech, has prepared a very informative and popular comparative table of
various molecular modeling/molecular display/related programs with respect
to portability. The table, listed alphabetically, is reproduced below.
The table is periodically updated by David Mathog and can be seen at:
(http://seqaxp.bio.caltech.edu:8000/www/molec_model_progs.html)
            MIPS   ---DEC Alpha---   Intel
What        SGI    DU       WNT      Windows   GraphicsType
Biosym      Yes    No*      No*      No*       GL
Grasp       Yes    No       No       No        GL
MidasPlus   Yes    No       No       No        GL
molmol      Yes    Yes      No       No        X11
O           Yes    No       No       No        GL
rasmol      Yes    Yes      ?        Yes       X11/Windows
Setor       Yes    No       No       No        GL
VMD         Yes    No       No       No        GL
XtalView    Yes    Yes      No       No        GL/X11
* Separately licensed "Axxess" product lets Biosym run as an X11 client.
"PHASES: A Program Package for the Processing and Analysis of Diffraction Data from Macromolecules" is available from Dr. William Furey, Biocrystallography Laboratory, VA Medical Center, Pittsburgh. Phase extension using partial structure information, MAD phasing, the use of molecular replacement models, and NC symmetry averaging (which can be done with multiple crystal forms) are well supported. Support for SGI's R8000 series processors and IRIX 6.2 is included, and one can now input SCALEPACK files directly.
M. Capel of Brookhaven National Laboratory has posted a software suite for visualizing and integrating two-dimensional diffraction images. The suite supports Fuji, MAR and multi-wire PSD detector formats, and has a wealth of different operational modes including: Circular, Sectorial, Norms along a central radial, Norms along an arbitrary vector, line/column extraction, Angular dependence of sector, Optimization of detector parameters, etc. The software runs on IRIX and Linux, and ports to SUN, HP and VMS are in progress. It is documented and made available via anonymous ftp at (http://crim12b.nsls.bnl.gov/x12b_downloads.html)
CrystalDesigner, developed by Crystal Structure Design AS, Oslo, Norway, is a tool for building, studying and visualising all kinds of crystal structures on the Macintosh platform. CrystalDesigner is an ideal tool for both teaching and scientific studies. The software is intended to be used by students and teachers at colleges and universities, as well as in industrial research. CrystalDesigner is available at: (http://www.crystaldesigner.no) or through ftp: (ftp://ftp.crystaldesigner.no/).
A major and very popular software package for the refinement of molecular structures using X-ray crystallographic and solution NMR spectroscopic techniques is X-PLOR. It has been developed by Axel T. Brunger of Yale University. Internet access to X-PLOR (online) is now available free of charge to non-profit (academic) users holding a license for X-PLOR version 3.1 from Yale University. Access instructions are available on the X-PLOR home page (http://xplor.csb.yale.edu). The following is a summary of enhanced features introduced in a recent release:
X-ray crystallography:
Solution NMR spectroscopy:
One of the most exciting developments in bioinformatics in recent times has been the application of Java to provide qualitatively enhanced information transfer over the Internet. In simple terms, Java enables a service provider, or server site, to transmit not only a certain set of information but also the related routines, or programs, that manipulate and process the data in response to commands entered interactively by the (client) user. The Java routines (applets) run on the client machine and thus relieve the server site and the network of considerable burden. The latest versions of most popular Web browsers have built-in modules to interpret Java applets. Browsers having such a capability are referred to as Java-enabled browsers.
Listed below are some of the early bioinformatics applications using Java.
MDL's Chemscape Chime plugin for Netscape, which displays interactive rotating 3D models, has made a mark as an excellent teaching tool. The URL for MDL is: (http://www.mdli.com/mdlhome.html). Steve Williams and coworkers have developed some good examples to illustrate the use of Chime in biology teaching. These can be seen at (http://iptunix.bcm.bham.ac.uk/sjwb/models.html) or (http://www.birmingham.ac.uk/biochemistry). The MDL site itself provides links to some excellent examples, e.g., one for the photosynthetic reaction centre from the purple bacterium Rhodopseudomonas viridis.
Frank R. Gorga of Duquesne University has put together a simple (sophomore level) web-based tutorial on isomers of organic molecules at URL: (http://nexus.chemistry.duq.edu/~gorga/stereo/intro.htm)
Dirk Walther of EMBL has implemented a Java based PDB-structure viewer at (http://www.embl-heidelberg.de/~walther/JAVA/pdb.html)
Luca Ida Giovanni of EMBL has written a Java-based restriction map program, as part of a collection of Java based solutions in molecular biology at: (http://www.embl-heidelberg.de/~toldo/JaMBW.html)
Andrei Grigoriev has added a calculator which uses the model of Roach for random fingerprinting to the set of physical mapping calculators. These can be viewed with a Java-enabled browser at: (http://www.mpimg-berlin-dahlem.mpg.de/~andy/calc/mapcalc.html)
QuickPDB is a Sequence/Structure Search and Display Java applet developed by Ilya Shindyalov and Phil Bourne. QuickPDB is a lightweight applet with two major functions:
QuickPDB accesses the most current (nightly updated) version of the PDB database located on San Diego Supercomputer Center (SDSC) servers and uses the new index-based database structure for fast search and retrieval developed at SDSC. The URL to access QuickPDB is: (http://xtal1.sdsc.edu/misha/QuickPDB.html).
Tai Y. Fu of the University of British Columbia has released Java Lattice (A Java applet for viewing crystal packing of Protein Structure Database files). Features include double buffering, colored models, control buttons, zooming, etc. Java Lattice can be visited at: (http://laue.biochem.ubc.ca:8080/cgi-bin/ssis/kelowna/latte.html)
However overwhelming the above compilation may look, it is still only a fraction of the growing list of resources. The present compilation is based on discussions that took place among members of several Bionet News Groups in mid-1996. In fact, the present compilation itself can be converted into an Internet resource, as will be illustrated during the presentation. With a few additions the present document will become an Index page with active links to the various URLs mentioned here. Each paragraph here would virtually explode into a voluminous repertoire of documents, programs and services with the power of the HTML and WWW technologies. Innovative tools, appropriate computer and network resources and cooperative