



This is the home page of the MAO project, a joint effort to define community standards for data retrieval and exchange in the fields of DNA/RNA alignment, protein sequence and protein structure alignment.


 

Background

The post-genomic era is presenting new challenges for bioinformatics. High-throughput genome sequencing and assembly techniques, together with new information resources, such as structural proteomics, transcriptome data from microarray analyses, or light microscopy images of living cells, have led to a rapid increase in the amount of data available, ranging from complete genome sequences to cellular, structure, phenotype and other types of biologically relevant information. In the face of this ever-increasing volume of complex and constantly evolving data, the integration of experimental data with bioinformatic comparative and predictive analyses will be crucial to the complete description of protein function, not only at the molecular level but also at the higher levels of the pathways, macro-molecular complexes, cells or organs a protein belongs to.

As a central concept in molecular biology, the gene and its related products represent an ideal basis for the integration of this mass of biological information in the context of the protein family. In order to fully understand the functions and molecular interactions of a particular gene, such diverse information as cellular location, degradation and modification, 2D/3D structures, mutations and their associated illnesses, the evolutionary context and literature references must be assembled, classified and made available to the biologist. The global multiple alignment presents a synthetic view of the variability along the sequence and among homologous sequences and thus provides an ideal network for the integration and visualisation of the most vital and relevant aspects of all these sequence data.

 

The central role of multiple alignments in information propagation

Sequence comparisons or alignments have been used since their introduction in the early seventies in a wide range of molecular biology applications. Alignments of two sequences, known as pairwise alignments, are mainly used to search the sequence databases in order to identify potential homologues, i.e. sequences that have evolved from a common ancestor. Generally, homologous proteins share the same three-dimensional (3D) structure and have similar functions, active sites or binding domains. Pairwise alignments can be naturally extended to the alignment of more than two sequences. These multiple sequence alignments were originally used in the identification of conserved motifs or key functional residues in a family of proteins and in evolutionary studies to define the phylogenetic relationships between organisms. In the current era of complete genome sequences, it is now possible to perform comparative multiple sequence analysis at the genome level. Multiple sequence alignments now play a fundamental role in most of the computational methods used in proteomics, from gene identification and validation to the determination of the protein 3D structure and the characterisation of the molecular and cellular functions of the protein. For a more complete review of the role of multiple alignments in modern molecular biology, see (1).

 

Ontologies in biology

Knowledge structures, known as ontologies, are now being introduced in biology as working models of the entities and interactions in a particular domain. Ontologies offer a mechanism by which knowledge can be represented in a form suitable for machine processing. An ontology includes a "vocabulary of terms" and a "specification of their meaning" including definitions and inter-relations, which impose a structure on the domain and constrain the possible interpretations of terms. The most famous example is the Gene Ontology (GO) project, which develops three structured, controlled vocabularies that describe gene products in terms of their associated biological processes, cellular components and molecular functions.

Many other ontologies exist, and many of these have been collected on the OBO (Open Biological Ontologies) web site (http://obo.sourceforge.net). They include generic ontologies that apply across all organisms as well as others that are more restricted in scope, for example to specific taxonomic groups or to specific fields of interest. For a review of ontologies in bioinformatics, see (2).

 

 

Design and Development

 

Motivation

In response to the challenges of the post-genomic era, multiple alignment techniques are evolving away from a single all-encompassing algorithm towards an integrated system bringing together knowledge-based or text-mining systems, which can exploit the new structure and functional data available. Systematic organization of this information is now crucial for several reasons:

Objectives

The goal of the MAO project is to develop a controlled, structured vocabulary that will provide a common language not only for the construction of multiple alignments, but also for the numerous applications that exploit the information available in an integrated, global multiple alignment of a protein family. An important issue in the development of MAO is interoperability with existing information sources, in order to maximise its applicability and utility. Therefore, links should be provided to other kinds of information, including databases such as UniProt, PDB or InterPro, as well as existing ontologies such as GO or the HUPO Proteomics Standards Initiative (PSI).

Scope

A vocabulary of terms is provided in the form of a hierarchical network of concepts, together with precise, explanatory definitions of each term and the inter-relations between the different concepts. The aim is to include the great majority of the concepts relevant to multiple sequence alignments, ranging from fundamental concepts such as ‘sequence’ or ‘residue’ to more complex concepts such as structural environment, functional activity or evolutionary context.

Implementation

The MAO is represented as a Directed Acyclic Graph (DAG), in which each node stands for a concept and the links connecting the nodes represent the relationship between them. Two hierarchical relations are defined: ‘is_a’ and ‘part_of’. The nodes can also have other information attached to them in the form of names (defined by the relation ‘is_name’), annotations (defined by ‘is_annotation’) or other attributes. An attribute is generally a simple string or number variable that contains additional information about the concept. 
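
As a rough illustration of this data structure, the following Python sketch stores concepts as nodes of a DAG connected by ‘is_a’ and ‘part_of’ links, with free-form attributes attached to each node. The class name `OntologyDAG`, the concept names and the attribute `alphabet_size` are hypothetical and chosen only for the example; they are not taken from the MAO vocabulary itself.

```python
from collections import defaultdict

# The two hierarchical relations named in the text.
HIERARCHICAL = {"is_a", "part_of"}


class OntologyDAG:
    """Minimal DAG of concepts (illustrative sketch, not the MAO implementation)."""

    def __init__(self):
        # child concept -> list of (relation, parent concept)
        self.parents = defaultdict(list)
        # concept -> dictionary of simple string or number attributes
        self.attributes = defaultdict(dict)

    def ancestors(self, concept):
        """Return every concept reachable by following parent links."""
        seen = set()
        stack = [concept]
        while stack:
            node = stack.pop()
            for _, parent in self.parents.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_relation(self, child, relation, parent):
        """Add a link child -> parent, refusing links that would create a cycle."""
        if relation not in HIERARCHICAL:
            raise ValueError(f"unknown relation: {relation}")
        if child == parent or child in self.ancestors(parent):
            raise ValueError(f"{child} {relation} {parent} would create a cycle")
        self.parents[child].append((relation, parent))


# Hypothetical concepts, for illustration only.
dag = OntologyDAG()
dag.add_relation("residue", "part_of", "sequence")
dag.add_relation("sequence", "part_of", "multiple_alignment")
dag.add_relation("protein_sequence", "is_a", "sequence")
dag.attributes["protein_sequence"]["alphabet_size"] = 20

print(sorted(dag.ancestors("protein_sequence")))
```

The cycle check in `add_relation` is what keeps the graph acyclic: a link is rejected whenever the proposed child is already an ancestor of the proposed parent.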

MAO is developed using the DAG-Edit Java application (http://www.godatabase.org/dev/), which provides an interface to browse, query and edit any vocabulary that has a DAG data structure. The DAG-Edit software provides an option for saving the ontology in the OBO common file format.
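
For readers who want to inspect an OBO file programmatically, the sketch below reads `[Term]` stanzas and collects each term's name together with its ‘is_a’ and ‘part_of’ links. It handles only a small subset of the OBO flat-file syntax (the `id`, `name`, `is_a` and `relationship` tags and trailing `!` comments). The term identifiers in `SAMPLE` are hypothetical and are not copied from the MAO file.

```python
def parse_obo_terms(text):
    """Return {term_id: {"name": str, "is_a": [ids], "part_of": [ids]}}.

    Illustrative sketch covering a small subset of the OBO format.
    """
    terms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            if line == "[Term]":
                current = {"name": None, "is_a": [], "part_of": []}
            else:
                current = None
            continue
        if current is None or ":" not in line:
            # Header lines, blank lines and non-Term stanzas are skipped.
            continue
        tag, _, value = line.partition(":")
        value = value.split("!", 1)[0].strip()  # drop a trailing "! comment"
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
        elif tag == "relationship":
            rel_type, _, target = value.partition(" ")
            if rel_type == "part_of":
                current["part_of"].append(target.strip())
    return terms


# Hypothetical identifiers and names, for illustration only.
SAMPLE = """\
format-version: 1.2

[Term]
id: MAO:0000001
name: sequence

[Term]
id: MAO:0000002
name: protein_sequence
is_a: MAO:0000001 ! sequence

[Term]
id: MAO:0000003
name: residue
relationship: part_of MAO:0000001 ! sequence
"""

terms = parse_obo_terms(SAMPLE)
for term_id, term in terms.items():
    print(term_id, term["name"], term["is_a"], term["part_of"])
```

A full parser would also need to handle definitions, synonyms, escaped characters and other stanza types; an established OBO library is probably a better choice for anything beyond quick inspection.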

 
Collaborators

Institut de Génétique et de Biologie Moléculaire et Cellulaire, Strasbourg (Dino Moras, Olivier Poch, Julie Thompson)
Institut de Biologie Moléculaire et Cellulaire, Strasbourg (Eric Westhof)
UC Davis, Davis, California (Patrice Koehl)
Lawrence Berkeley National Laboratory, California (Steve Holbrook)
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto (Kazutaka Katoh)

Availability

MAO is freely available from the MAO Home Page at http://bips.u-strasbg.fr/LBGI/MAO/mao.html.

You can download the latest version of the ontology in the standard OBO file format: mao.obo

Older versions

Version 1.0: mao1.obo


Get Involved

If you would like to participate in the MAO development, please contact us at Julie.Thompson@igbmc.u-strasbg.fr

Applications

The MAO vocabulary is currently being used to create annotated multiple alignments in XML format for a number of different projects, including:

The MAO is also part of a collaborative project to study the correlation between protein sequence, structure and function. While a direct relationship between sequence similarity and conservation of protein structure has been clearly established, the relation between fold and function is more controversial. The aim of the project is to study the relations between sequence similarity and structural similarity and the extent to which sequence/structure variations can affect a protein's function. An important part of this project will be the development of a new method that combines the advantages of sequence alignment with 3D structure superposition techniques to improve the quality and reliability of multiple protein alignments.

References

1. Lecompte O, Thompson JD, Plewniak F, Thierry J, Poch O. Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene. 2001; 270(1-2):17-30.

2. Stevens R, Goble CA, Bechhofer S. Ontology-based knowledge representation for bioinformatics. Brief Bioinform. 2000; 1(4):398-414.

3. Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Structure, Function and Bioinformatics, submitted.