22 April 2013
Focus on structural bioinformatics
Priority is given to bioinformatics applied to structural biology in the post-genomic era.
The first version of BAliBASE (version 1.0) was described in Bioinformatics 1999, Vol 15, Issue 1, 87-88. It was dedicated to the evaluation of multiple alignment programs and was divided into five hierarchical reference sets:
Reference 1: equidistant sequences with various levels of conservation,
Reference 2: families aligned with a highly divergent "orphan" sequence,
Reference 3: subgroups with <25% residue identity between groups,
Reference 4: sequences with N/C-terminal extensions,
Reference 5: internal insertions.
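Reference 3 is defined by a residue-identity threshold between subgroups. The source does not say how identity is computed, so the sketch below assumes one common definition: the number of identical residues divided by the number of columns in which both aligned sequences have a residue. The function name and the gap characters are illustrative assumptions.

```python
def percent_identity(a: str, b: str, gap_chars: str = "-.") -> float:
    """Percent identity between two aligned (gapped) sequences.

    Assumption: only columns where neither sequence has a gap are counted.
    Other definitions (e.g. dividing by the shorter sequence length) exist.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in gap_chars or y in gap_chars:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy example: 3 identical residues out of 4 compared columns.
print(percent_identity("ACD-F", "ACE-F"))
```

Under this definition, two subgroups would fall into the Reference 3 regime when every cross-group pair scores below 25.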
For release 2.0 of BAliBASE, these alignments have been verified and corrected by superposition of all known 3-dimensional structures, using the lsqman program.
BAliBASE 2.0 includes three new reference alignment sets (references 6-8) containing 26 protein families in total: 12 families with distinct repeat types, 9 transmembrane families and 5 families with inverted domains, representing more than 1100 sequences. As in references 1-5, core blocks are defined; in the new references they include only the repeated/inverted domains and the transmembrane helices.
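To illustrate how core blocks can be used when evaluating a program, here is a minimal sketch of a sum-of-pairs style score restricted to core-block columns. This page does not specify the scoring procedure, so the measure, the function names and the data layout (a dict mapping sequence name to gapped string, plus a set of reference column indices) are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple

Residue = Tuple[str, int]  # (sequence name, ungapped residue index)


def residue_pairs(aln: Dict[str, str],
                  columns: Optional[Set[int]] = None) -> Set[Tuple[Residue, Residue]]:
    """Collect pairs of residues that share a column of the alignment.

    If `columns` is given, only those column indices contribute pairs,
    but residue counters are still advanced over every column.
    """
    names = sorted(aln)
    length = len(aln[names[0]])
    counters = {n: 0 for n in names}
    pairs = set()
    for col in range(length):
        present = []
        for n in names:
            if aln[n][col] not in "-.":
                present.append((n, counters[n]))
                counters[n] += 1
        if columns is None or col in columns:
            for i in range(len(present)):
                for j in range(i + 1, len(present)):
                    pairs.add((present[i], present[j]))
    return pairs


def core_sp_score(reference: Dict[str, str], test: Dict[str, str],
                  core_columns: Set[int]) -> float:
    """Fraction of reference core-block residue pairs recovered by the test alignment."""
    ref_pairs = residue_pairs(reference, core_columns)
    test_pairs = residue_pairs(test)
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


# Toy data (invented): columns 0, 1 and 3 of the reference form the core block.
reference = {"a": "AC-D", "b": "ACED"}
test = {"a": "ACD-", "b": "ACED"}
score = core_sp_score(reference, test, {0, 1, 3})
```

Restricting the comparison to core columns means a program is not penalised for regions where the reference itself is unreliable.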
The difficulties encountered when detecting or aligning proteins containing repeats are strongly related to the residue similarity of the repeated regions. Therefore, for each of the 12 reference families, a multiple alignment has been constructed by fragmenting the sequences in order to align all the repeated regions. The repeats were then classified into a number of subtypes according to their residue similarity. The number of repeats and the presence of additional domains also affect the ability of a given program to construct an accurate alignment. To address these questions, subsets of each reference alignment are proposed, selected according to repeat subtype and presence of additional domains (Table 2). Each of these 132 subsets contains aligned sequences that satisfy specific criteria.
By combining these alignment subsets, a flexible benchmark test system can be designed to evaluate and compare the various algorithms currently available for the detection and alignment of repeats in protein sequences.
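As a rough sketch of how such subsets might be combined into a custom benchmark, the example below filters a catalogue of subsets by repeat subtype and by the presence of additional domains. The `Subset` record, its field names and the catalogue entries are hypothetical; they do not reflect the actual BAliBASE file layout or naming.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Subset:
    # All field names are hypothetical, chosen only for this illustration.
    family: str
    repeat_subtype: str
    has_extra_domains: bool


def select_subsets(subsets: Iterable[Subset],
                   repeat_subtype: Optional[str] = None,
                   has_extra_domains: Optional[bool] = None) -> List[Subset]:
    """Keep subsets matching every criterion that is not None."""
    chosen = []
    for s in subsets:
        if repeat_subtype is not None and s.repeat_subtype != repeat_subtype:
            continue
        if has_extra_domains is not None and s.has_extra_domains != has_extra_domains:
            continue
        chosen.append(s)
    return chosen


# Invented placeholder catalogue, not real BAliBASE entries.
catalogue = [
    Subset("family_A", "subtype_1", False),
    Subset("family_A", "subtype_2", True),
    Subset("family_B", "subtype_1", True),
]
benchmark = select_subsets(catalogue, repeat_subtype="subtype_1")
```

Leaving a criterion as `None` means "do not filter on this property", so the same function covers both narrow and broad test sets.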
In the case of transmembrane proteins, the problems are similar to those defined previously for repeats, i.e. detection, local alignment and global alignment. Reference 7 consists of 9 families of transmembrane proteins containing approximately 500 aligned protein sequences. A global multiple alignment of each family is available, in which the known transmembrane helices for each sequence are identified. The number and lengths of the sequences and the pairwise sequence similarities are included in a separate text file, allowing the user to define specific subsets of the reference alignments as required.
Reference 8 concerns two different, but related, problems. Inversions in proteins can result from various phenomena, such as the insertion of a complete domain at different sites in a protein, or the transfer of part of the C-terminus of the protein to its N-terminus, thus causing a discontinuity of the terminal domain. Reference 8 consists of five protein families in which the sequential ordering of the domains is not preserved, corresponding to xxx sequences. For each family in this reference, we propose an independent alignment of each permuted domain.
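The C-terminal to N-terminal transfer described above can be pictured as a rotation of the sequence. The toy sketch below checks whether one string is an exact rotation of another and reports the offset. Real permuted proteins also diverge in sequence, so this is only an illustration of the idea and not a detection method; the function name is an assumption.

```python
from typing import Optional


def circular_shift(a: str, b: str) -> Optional[int]:
    """Return k such that b == a[k:] + a[:k], or None if b is not a rotation of a."""
    if len(a) != len(b) or not a:
        return None
    # Every rotation of `a` appears as a substring of `a + a`.
    idx = (a + a).find(b)
    return idx if idx != -1 else None


# Toy example: the last three residues have moved to the front.
print(circular_shift("ABCDEF", "DEFABC"))
```

A linear alignment program cannot align both halves of such a pair at once, which is why each permuted domain is aligned independently in this reference.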
The alignments for all the reference sets are provided in either RSF or MSF format. Each alignment is associated with an annotation file containing a description of the alignment.
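For readers who want to load the MSF files directly, here is a minimal parsing sketch. It assumes the usual GCG MSF layout (a header with `Name:` lines, a `//` separator, then interleaved blocks of name and sequence chunks) and converts the `.` and `~` gap symbols to `-`. It has not been checked against the BAliBASE files themselves, and the function name is hypothetical.

```python
from typing import Dict


def parse_msf(text: str) -> Dict[str, str]:
    """Parse MSF-formatted text into {sequence name: gapped sequence}."""
    seqs: Dict[str, str] = {}
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            if stripped.startswith("Name:"):
                seqs[stripped.split()[1]] = ""  # register declared names
            elif stripped.startswith("//"):
                in_body = True  # alignment blocks start after this line
            continue
        parts = stripped.split()
        # Only lines starting with a declared name carry sequence data;
        # this skips blank lines and coordinate rulers.
        if len(parts) >= 2 and parts[0] in seqs:
            chunk = "".join(parts[1:]).replace(".", "-").replace("~", "-")
            seqs[parts[0]] += chunk
    return seqs


# Invented toy file, not taken from BAliBASE.
sample = """!!AA_MULTIPLE_ALIGNMENT 1.0
 toy.msf  MSF: 12  Type: P  Check: 0 ..

 Name: seqA  Len: 12  Check: 0  Weight: 1.00
 Name: seqB  Len: 12  Check: 0  Weight: 1.00

//

         1        12
seqA  ACDEF GHIKL
seqB  AC.EF GHI~L

seqA  MN
seqB  M.
"""
alignment = parse_msf(sample)
```

The resulting dictionary maps each sequence name to one gapped string of equal length, which is convenient for column-wise comparisons.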
If you have any problems, comments or questions, please e-mail Julie Thompson.