This is the home page
of the MAO project, a joint effort to define community standards for
data retrieval and exchange in the fields of
DNA/RNA alignment, protein sequence and protein structure alignment.
The post-genomic era is presenting
new challenges for bioinformatics. High-throughput genome sequencing
and
assembly techniques, together with new information resources, such as
structural proteomics, transcriptome data
from microarray analyses, or light
microscopy images of living
cells, have led to a rapid increase in the amount of data available,
ranging
from complete genome sequences to cellular, structure, phenotype and
other
types of biologically relevant information. In the face of this
ever-increasing
volume of complex and constantly evolving data, the integration of
experimental
data with bioinformatic comparative and
predictive
analyses will be crucial to the complete description of protein
function, not
only at the molecular level but also at the higher levels of the
pathways,
macro-molecular complexes, cells or organs a protein belongs to.
As a central concept in molecular
biology, the gene and its related products
represent
an ideal basis for the integration of this mass of biological
information in
the context of the protein family. In
order to fully understand the functions and molecular interactions of
a particular gene, such diverse information as cellular location,
degradation
and modification, 2D/3D structures, mutations and their associated
illnesses,
the evolutionary context and literature references must be assembled,
classified and made available to the biologist. The global multiple
alignment
presents a synthetic view of the variability along the sequence and
among homologous
sequences and thus provides an ideal network for the integration and
visualisation
of the most vital and relevant aspects of all these sequence data.
The central role of multiple alignments in information propagation
Sequence comparisons or alignments
have been used since their introduction in the early seventies in a
wide range
of molecular biology applications. Alignments of two sequences, known
as pairwise alignments, are mainly used to
search the sequence
databases in order to identify potential homologues, i.e. sequences that
have
evolved from a common ancestor. Generally, homologous proteins share
the same
three-dimensional (3D) structure and have similar functions, active
sites or
binding domains. Pairwise alignments can
be naturally
extended to the alignment of more than two sequences. These multiple
sequence
alignments were originally used in the identification of conserved
motifs or
key functional residues in a family of proteins and in evolutionary
studies to
define the phylogenetic relationships
between
organisms. Of course, in the current era of complete genome sequences,
it is
now possible to perform comparative multiple sequence analysis at the
genome
level. Multiple sequence alignments now play a fundamental role in most
of the
computational methods used in proteomics, from gene identification and
validation to the determination of the protein 3D structure and the characterisation of the molecular and cellular
functions of
the protein. For a more
complete review of the role of multiple alignments in modern molecular biology, see (1).
Knowledge structures known as ontologies are now being introduced in biology
as working
models of the entities and interactions in a particular domain. Ontologies offer a mechanism by which
knowledge can be represented in a form suitable for machine processing.
An ontology includes a "vocabulary of terms" and a
"specification of their meaning" including definitions and
inter-relations, which impose a structure on the domain and constrain
the
possible interpretations of terms. The most famous example is the Gene
Ontology
(GO) project, which develops three structured, controlled vocabularies
that
describe gene products in terms of their associated biological
processes,
cellular components and molecular functions.
Many other ontologies
exist, and many of these have been collected together on the OBO (Open Biological Ontologies)
web site (http://obo.sourceforge.net).
Some are generic ontologies that
apply across all organisms, while others are
more restricted in scope, for example to specific taxonomic groups
or to
specific fields of interest. For a review of ontologies
in bioinformatics, see (2).
In response to the challenges of the
post-genomic era, multiple alignment techniques are evolving away from
a single
all-encompassing algorithm towards an
integrated system bringing together knowledge-based or text-mining
systems, which can exploit
the new structural and functional data available. Systematic
organization of
this information is now crucial for several reasons:
The goal of the MAO project is to develop
a controlled, structured vocabulary that will provide a common
language not
only for the construction of multiple alignments, but also for the
numerous
applications that exploit the information available in an integrated,
global
multiple alignment of a protein family. An important issue in the
development
of MAO is the interoperability with existing information sources, in
order to
maximise its applicability and utility. Therefore, links should be
provided to other kinds of information, including databases such as UniProt, PDB or InterPro,
as well
as existing ontologies such as GO or the
HUPO
Proteomics Standards Initiative (PSI).
A vocabulary of terms is provided in
the form of a hierarchical network of concepts, together with precise,
explanatory definitions of each term and the inter-relations between
the
different concepts. The aim is to include the great majority of the
concepts
relevant to multiple sequence alignments, ranging from fundamental
concepts
such as ‘sequence’ or ‘residue’ to more complex
concepts such as structural environment, functional activity or
evolutionary
context.
The MAO is represented as a Directed
Acyclic Graph (DAG), in which each node stands for a concept and the
links
connecting the nodes represent the relationship between them. Two
hierarchical
relations are defined: ‘is_a’ and ‘part_of’. The nodes can also have other
information
attached to them in the form of names (defined by the relation ‘is_name’), annotations (defined by ‘is_annotation’) or other attributes. An
attribute is
generally a simple string or number variable that contains additional
information about the concept.
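The structure described above can be sketched in a few lines of Python. This is only an illustration, not the MAO implementation: the class name `OntologyDAG`, its methods, and the example relations between 'residue', 'sequence' and 'multiple_alignment' are assumptions made for the example. It stores the two hierarchical relations as labelled edges, refuses any edge that would introduce a cycle, and keeps free-form attributes on each node.

```python
from collections import defaultdict


class OntologyDAG:
    """Toy DAG of concepts linked by 'is_a' and 'part_of' (illustrative only)."""

    RELATIONS = ("is_a", "part_of")

    def __init__(self):
        # parents[child] -> list of (relation, parent) edges
        self.parents = defaultdict(list)
        # attributes[concept] -> dict of simple string or number values
        self.attributes = defaultdict(dict)

    def ancestors(self, concept):
        """Return every concept reachable by following edges upwards."""
        seen = set()
        stack = [concept]
        while stack:
            node = stack.pop()
            for _relation, parent in self.parents.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_relation(self, child, relation, parent):
        """Add a 'child relation parent' edge, keeping the graph acyclic."""
        if relation not in self.RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        # If child is already an ancestor of parent, the new edge closes a cycle.
        if child == parent or child in self.ancestors(parent):
            raise ValueError("edge would create a cycle")
        self.parents[child].append((relation, parent))


# Hypothetical concepts and relations, chosen only for illustration.
dag = OntologyDAG()
dag.add_relation("residue", "part_of", "sequence")
dag.add_relation("sequence", "part_of", "multiple_alignment")
dag.attributes["residue"]["is_name"] = "residue"
print(sorted(dag.ancestors("residue")))
```

Storing edges from child to parent makes the common query ("which broader concepts does this term belong to?") a simple upward traversal, and the same traversal doubles as the cycle check.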
MAO is developed using the DAG-Edit Java
application (http://www.godatabase.org/dev/),
which provides an interface to browse, query and edit any vocabulary
that has a
DAG data structure. The DAG-Edit software provides an option for saving
the ontology
in the OBO common file format.
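For readers who want to process such a file without DAG-Edit, the following is a minimal sketch of reading `[Term]` stanzas from OBO flat-file text. It handles only the `id`, `name`, `is_a` and `relationship: part_of` tags and strips `!` comments naively; the function name `parse_obo_terms` and the `EX:` identifiers in the sample are hypothetical and are not taken from the MAO file.

```python
def parse_obo_terms(text):
    """Parse [Term] stanzas from OBO text into a dict keyed by term id.

    Simplified sketch: ignores escapes, other stanza types and other tags.
    """
    terms = {}
    current = None
    for raw in text.splitlines():
        # Naive comment handling: everything after '!' is dropped.
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        if line.startswith("["):
            # Start a new record only for [Term] stanzas; skip the others.
            current = {"is_a": [], "part_of": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header line, or a line inside a skipped stanza
        tag, value = (part.strip() for part in line.split(":", 1))
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
        elif tag == "relationship":
            parts = value.split()
            if len(parts) == 2 and parts[0] == "part_of":
                current["part_of"].append(parts[1])
    return terms


# Hypothetical sample; identifiers and names are invented for illustration.
sample = """format-version: 1.2

[Term]
id: EX:0000001
name: sequence

[Term]
id: EX:0000002
name: residue
relationship: part_of EX:0000001 ! sequence
"""

for term_id, term in parse_obo_terms(sample).items():
    print(term_id, term.get("name"), term["is_a"], term["part_of"])
```

For real work, an established OBO parsing library is a safer choice than this sketch, since the full format includes escapes, qualifiers and additional stanza types.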
Collaborators
UC Davis, Davis, California (Patrice Koehl)
Lawrence Berkeley National Laboratory, California (Steve Holbrook)
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto (Kazutaka Katoh)
MAO is freely available from the MAO Home Page at
http://bips.u-strasbg.fr/LBGI/MAO/mao.html.
You can download the latest version of the ontology in the standard OBO file format:
mao.obo
If you would like to participate in the MAO development, please contact us
at Julie.Thompson@igbmc.u-strasbg.fr
1. Lecompte O, Thompson JD, Plewniak F, Thierry J, Poch O. Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene. 2001; 270(1-2):17-30.
2. Stevens R, Goble CA, Bechhofer S. Ontology-based knowledge representation for bioinformatics. Brief Bioinform. 2000; 1(4):398-414.
3. Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Structure, Function and Bioinformatics, submitted.