TEIRESIAS: Pattern Discovery

What Is This Tool For? - Input - Options - Parameters - Mode of Operation - Output - References

What Is This Tool For?

This tool allows the user to carry out pattern discovery on event streams that of alphanumeric characters. Possible alphabet sets include nucleic acids, amino acids, etc.

Conceived and designed in 1996, the Teiresias algorithm is a two-phase, combinatorial algorithm for general purpose pattern discovery. Due to its exceptional speed and its ability to handle very large input datasets and arbitrarily large alphabets (= sets of allowed events) it is applicable to a wide spectrum of problems that include a large number of computational biology applications (e.g. pattern discovery, dna tandem repeat discovery, automated protein functional and structural annotation, gene discovery, etc.), computer security, text mining, market/customer analysis, etc. The algorithm operates equally well on discrete as well continuous inputs. More information on the algorithm can be found in the publications listed below.

__________________________________________________________________________

__________________________________________________________________________

Input Format

If you work with proteins or DNA, then any file in the FASTA format will be acceptable. If your domain of interest is outside of biology, then you must prepare your input file according to the following specifications:

Examples of valid label lines:

>My_2_CENTS 1234

>hba_HUMAN 1

>ThisIsStreamNumberTen 2345

Examples of valid data lines:

abgaJHDJnbadsid

currentuumLUWTuwuzumlwuwtUWZTUWCOPYwtuwtuwtuwtuwTUWZUUMLbrvbar

In other words, your input file should contain twice as many lines as the input streams that you have with label lines preceding actual data lines; e.g.

label_line

data_stream_line

label_line

data_stream_line

label_line

data_stream_line

....

...

..

.

Here is an example valid input file:

>stream 1

ABCDEFGHIJKLMNOPQRSTUV

>stream 2

AXCDXXXHIJKLMNXXQTUV

>stream 3

ABCDEFGHIJKLMNOPUAHIJKLMNOXYZSTUV

__________________________________________________________________________

__________________________________________________________________________

Options

The following options are available to the user:

__________________________________________________________________________

__________________________________________________________________________

Parameters

The parameters you can set here are the following:

In the latter case, you need to specify what symbols of the allowed alphabet can be treated as equivalent during the actual pattern discovery process. You define equivalence sets by listing each set on a separate line. Each line corresponds to a single set and should contain the symbols without any spaces belonging to the set. The equivalency sets can overlap (e.g. KR and KRH appearing on two different lines). Also, if a symbol is in a class by itself, you should NOT list it in this window.

NOTE 1: in our extensive work on the analysis of protein sequences, we have discovered that use of scoring matrices is limiting since it may result in restrictive choices: allowing you to define equivalency sets permits to bypass such restrictions and provide for more flexibility during the discovery process.

NOTE 2: If you work with proteins the following equivalence sets may prove useful:

AS, DEN, DEKQ, FLWY, HNQY, ILMV, EKQR, FILMV, DHNS, EHKQR, KQR, ANST, ST, FWY, FHWY

__________________________________________________________________________

__________________________________________________________________________

Mode Of Operation

Scanning phase

During a first phase, the algorithm compiles a complete collection of elementary patterns (L-tuples of events with the tuples spanning no more than W positions in any event stream - clearly, L <= W).

Convolution phase

During this second phase, Teiresias recursively combines elementary patterns into increasingly longer patterns of decreasing support that have the guaranteed property of being maximal with respect to both length and composition.

Fig 1. The convolution phase of Teiresias.

Figure 1 shows a snapshot of the convolution phase where two arbitrary patterns are combined into a longer one. A left and a right pattern from the pool of active patterns1 each of which carries a position list indicating where exactly it appears in the processed database may be combined if and only if the left pattern ends in a suffix of n non-wild card events that is identical to a prefix of the right pattern: in the presence of the suffix/prefix agreement, the position lists of the two patterns are also examined for agreement and the pattern resulting from the convolution is given a position list that is the intersection of the position lists of the component patterns; to guarantee completeness of results, n must be equal to L-1. The position lists of the two component patterns are updated and returned to the pool of active patterns if their support continues to exceed threshold. The order in which convolutions are to be performed is decided through the use of two partial orderings, prefix-wise and suffix-wise2 and is essential in avoiding the generation of redundant patterns. Finally, the algorithm guarantees that all maximal patterns with support exceeding the predetermined threshold are reported.

It is frequently desirable to carry out the discovery process by permitting events from a small collection to substitute for one another. For example, one may wish to allow for negatively-charged amino acids to replace one another, or allow for structurally similar amino acids to replace one another, etc. We thus further allow for the discovery of patterns in the presence of a collection C of sets of events from the power set 2|E|. The sets in the collection C need not form a partition of E. Typical example collections include but are not limited to: a) C={{A,G}, {C}, {D,E}, {F,Y}, {H}, {I,L,M,V}, {K,R}, {N,Q}, {P}, {S,T}, {W}}, b) C={ {A,G,C,F,Y,H,I,L,M,V,N,Q,P,S,T,W}, {D,E}, {K,R,H}}, c) C={{A,G}, {A,G,P,S,T}, {C}, {D,E}, {D,E,Q,N}, {K,R,H}, {I,L,M,V}, {F,Y,W}}, etc. The features that are under consideration each time will dictate the choice of the collection and the user can determine/define such a collection interactively. The convolution phase in Fig. 1 remains unchanged in the case where the pattern discovery is carried out in the presence of event equivalence. Finally, note that a given event can simultaneously participate in multiple equivalence set, as in case c) above. The algorithm will make use of the set of equivalences that are most specific for the input that is processed.

In summary, if E denotes the set of permissible events, Teiresias can discover patterns that can be captured by the regular expression:

__________________________________________________________________________

__________________________________________________________________________

Output Format

After you prepared and entered the input in the provided window, click on the COMPUTE button. Once the processing has completed, the results will be reported as follows:

Each line corresponds to a discovered pattern with a 'dot' representing a wild-card. The leftmost number in each line is the rank of the pattern. The second and third numbers in each line are the number of instances of the pattern and the number of input sequences that contain these instances respectively.

Note that a number of additional input windows and tools have appeared at the bottom of the results page. They implement the following functionalities:

Evaluate Significance: by clicking on this button, you can compute and attach a measure of significance to each of the discovered patterns. The method used here relies on Bayes theorem in conjunction with a 2nd order Markov chain. What is particularly interesting is that the method is applicable to all types of regular expressions including those that contain brackets. We assume that each of the patterns will be used as a predicate to search a database which has the size and composition of GenPept. The reported values are the logarithms of the estimated probability for the pattern under consideration and for a very large number of test cases, they have been shown to correspond well to biological information. Generally, patterns with probability values that are less than or equal to -30.00 can be used as family predicates.

Select by Log-Probability: this utility is available only after Evaluate Significance is run. It allows you to subselect from the reported list of patterns based on their estimated log-probability. The latter is computed with the help of a simple 2nd order Markov chain that has been used with NCBI's GenPept database. This utility works only if the test input comprises proteins or protein fragments.

Select by Density: this allows you to reduce the output file to a file that contains patterns whose density (i.e. non-dots over total length of the pattern) is between the user-specified limits.

Select by Number of Streams Containing Pattern: by determining a minimum and a maximum value you can SUB-select from the discovered set of patterns those that whose occurrences appear in a number of input sequences that is constrained by the desired interval. Allowed values range from 0 to infinity.

Enforce "Occurrences = Sequences" Constraint: by clicking on this button you will SUB-select those of the patterns that appear exactly once in the sequences that contain them, i.e. those patterns for which the number of times they occur (first column in reported results) is equal to the number of sequences in which they occure (second column in reported results).

Select by Number of Non-Dots in Pattern: by specifying the number of non-don't cares that are present in a pattern one can subselect various types of patterns from the discovered set. If only one argument is given, say m, it is used as a lower bound and the patterns that survive will have a minimum of m symbols, brackets or combination of the two. If a second argument is given, say M, then the surviving patterns will have a minimum of m but not more than M symbols, brackets or combination of the two.

Generate Bracketed Expressions (for protein inputs and "exact discovery" only): here, you need to specify either 1 or 3 arguments, let us call them Left, Center, Right. The value of Left corresponds to the maximum allowed number of alphabet symbols that can be used when dereferencing a wild-card (i.e. don't care character). If you do not specify values for the Center and Right variables, then the code will attempt to dereference and replace each don't care character by a bracketed expression with at most Left many symbols. When Center and Right are specified they should be set respectively to the values of L and W you used during the discovery. When set, the code will attempt to expand each of the discovered patterns by growing them outward while (a) maintaining the = constraint, and (b) never dereferencing a wild card by more than Left many alphabet symbols. You can of course change the values of Center and Right (equiv. 'change the density constraint') if exploring the neighborhood of a given pattern for signals that are weaker than . Note that the time needed to execute this subroutine increases with the value of W.

Once you have generated a collection of patterns using none, one or more of the above utilities, you can click on a pattern to select it, then click on the SEQUENCES button:

This will open a new window that will show the original input sequences with the instances of the selected pattern highlighted.

Alternatively, you can click on a pattern to select it, then click on the SwissProt/TrEMBL button. This will search our on-line release of SwissProt/TrEMBL looking for other sequences in the database that contain the selected pattern. The results are links to the respective SwissProt/TrEMBL database entries and are reported as follows:

__________________________________________________________________________

__________________________________________________________________________

References


1 I.e. patterns that have not been reported and whose support still exceeds threshold - initially the pool comprises only the elementary patterns.

2 Let s1 and s2 be two events from E and x, y two strings from . Then the prefix-wise partial ordering is defined as follows: a) and and b) c) if and only if and d) if and only if . The suffix-wise partial ordering is defined analogously.