MUSCA: Multiple Sequence Alignment

What Is This Tool For? - Input - Options - Parameters - Mode of Operation - Output - References

What Is This Tool For?

This tool allows the user to align multiple streams of letters so as to reveal salient features that are potentially present in a given collection of event streams.

We define the multiple sequence alignment problem as follows: given a collection of N sequences, insert spaces (gaps) into the sequences or at the beginning/end of them so that the resulting sequences all have the same length. One can typically use a scoring function, that rewards agreements and penalizes mismatches and gaps, in order to quantitatively determine the quality of the resulting alignment. Traditionally, multiple sequence alignment was used as a first step in the path of discovering patterns that are shared among sequences (proteins, dna, etc.). In the early 90's, researchers began exploring the use of pattern discovery as a means to determine and report multiple sequence alignments.

__________________________________________________________________________

__________________________________________________________________________

Input Format

The input to be processed consists of streams of symbols. Each of the streams is to be preceded by a label line. Thus, an input set comprising N data streams should consist of 2*N input lines (N 'label' lines and N 'data' lines).

In the case of amino acid sequences the input can be given in FASTA format: it will be processed automatically and turned into the well-formed input with alternating 'label' lines and 'data' lines. Obviously, 'data' lines can all have different lengths.

For amino acid-based inputs, the amino acids in the final alignment can be colored-coded based on their hydropathy -- this is optional and controlled by selecting the appropriate box in the main page of the multiple sequence alignment tool.

NOTE: All the symbols in the input will be turned into lower case before processing, unless you have selected the 'upper case' option: then, case will be preserved and a symbol in lower case will be considered different from its upper case version.

__________________________________________________________________________

__________________________________________________________________________

Options

The following options are available to the user:

__________________________________________________________________________

__________________________________________________________________________

Parameters

The parameters you can set here are the following:

In the latter case, you need to specify what symbols of the allowed alphabet can be treated as equivalent during the actual pattern discovery process. You define equivalence sets by listing each set on a separate line. Each line corresponds to a single set and should contain the symbols without any spaces belonging to the set. The equivalency sets can overlap (e.g. KR and KRH appearing on two different lines). Also, if a symbol is in a class by itself, you should NOT list it in this window.

NOTE 1: in our extensive work on the analysis of protein sequences, we have discovered that use of scoring matrices is limiting since it may result in restrictive choices: allowing you to define equivalency sets permits to bypass such restrictions and provide for more flexibility during the discovery process.

NOTE 2: If you work with proteins the following equivalence sets may prove useful:

AS, DEN, DEKQ, FLWY, HNQY, ILMV, EKQR, FILMV, DHNS, EHKQR, KQR, ANST, ST, FWY, FHWY

Rules of Thumb for Setting the Parameters

Typical starting choices for K (= support) should be in the interval [N/2, N]. Regarding the choices for L and W our recommendation is to start with values that correspond to 'high-degree' of 'local' conservation and progressively relax the constraints. Here is a typical choice 'schedule' for an input of N=20 sequences: L=6 W=8 K=10, L=6 W=9 K=10, L=6 W=10 K=10, L=6 W=12 K=10, L=5 W=12 K=10, L=5 W=12 K=9, L=5 W=12 K=8, L=5 W=12 K=7, L=5 W=12 K=6, etc. As you try these combinations of values you typically expect to see the results improve and converge toward a stable, "best" alignment -- given the performance of the algorithm you will actually be able to very quickly try these combinations before deciding on the combination that is best.

__________________________________________________________________________

__________________________________________________________________________

Mode Of Operation

MUSCA is a two-phase algorithm for computing the multiple sequence alignment of a set of N sequences. During the first phase, we use Teiresias to discover motifs that are common among a subset of the N sequences. These motifs are used in the second phase to generate and report the multiple sequence alignment. In particular, the motifs are first mapped to vertices of a directed graph: if two motifs pi and pj do not occur simultaneously in any sequence then there is no edge connecting the corresponding vertices of the graph; the vertices corresponding to pi and pj will be connected by an edge with direction from pi to pj if pi occurs before pj in all the sequences where they both appear; the labels of the edges depend on three things: whether pi and pj are pair-wise incompatible, have overlapping instances, or are pair-wise compatible but do not overlap. Those vertices that are joined by incompatible edges or participate in inconsistent cycles form the basic non-feasible sets. After labeling the vertices of the reduced graph with the help of a simple cost function, we use a greedy algorithm to obtain a solution to a weighted set-cover problem that essentially identifies the minimum number of motifs/vertices to be removed. The resulting graph is used to determine blocks that involve overlapping feasible motifs. We obtain the final alignment by properly aligning the blocks and padding up the existing gaps.

The conceptual underpinnings of this method can be traced to earlier work where the multiple alignment is driven by pair-wise comparisons of the processed input sequences. What distinguishes our work is that it is independent of the order in which the input sequences are given and that its starting point is a K-wise alignment (with K being user-defined and generally larger than 2). Two situations where the algorithm excels are: a) inputs where the various sequences share domains that are present in some of the sequences only, and b) inputs that comprise two or more subsets with high conservation within each set but low conservation across the subsets.

The resulting algorithm works very efficiently with large inputs that contain long sequences and a detailed description of it together with results on several benchmark datasets can be found in the references.

__________________________________________________________________________

__________________________________________________________________________

Output Format

After you prepared and entered the input in the provided window, click on the COMPUTE button. Once the processing has completed, the results will be reported as follows:

__________________________________________________________________________

__________________________________________________________________________

References