Teiresias-based Pattern Discovery On Integers
This tool allows the user to carry out pattern discovery on event streams that consist of positive integers.
Why did we do this? In the papers that described the Teiresias algorithm and its applications, we used small alphabets based on alphanumeric characters in order to simplify the presentation. Examples include nucleotides (= {A, C, G, T}), amino acids (= {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}), the English alphabet (= {A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z}), etc. However, such alphabets do not contain enough distinct symbols to accommodate the needs that frequently arise in many other problems where pattern discovery can be useful. Consequently, we have implemented a version of Teiresias where the permitted "alphabet" is the set of positive integers. In fact, you can use as many as 2^31 - 1 distinct positive integers to form input streams.
There is a very large number of problems that can be solved using this version of the algorithm. Essentially, any pattern discovery problem that can be converted into a stream of positive integers can be solved with this tool.
__________________________________________________________________________
This tool takes as input data lines consisting of space/tab-separated integers. A carriage return indicates the start of a new event stream. The web version of the tool does not require that label lines precede data lines; moreover, it will automatically add the integer "-1" at the end of each data line so that it can be processed by the Teiresias algorithm. However, if you run this version of Teiresias on the command line, you will need to add label lines and also terminate each data line with "-1".
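For command-line use, the preparation step described above can be sketched in a few lines of Python. This is only an illustration: the function name format_for_command_line and the stream_N label text are made up here, and the ">" label prefix is an assumption, since this page does not specify what a label line must look like. Check the documentation of your command-line build before relying on it.

```python
def format_for_command_line(streams, label_prefix=">"):
    """Return text with one label line and one data line per stream.

    Each data line is space-separated and terminated with -1.
    The label format (prefix + "stream_N") is an assumption, not
    something this page specifies.
    """
    lines = []
    for index, stream in enumerate(streams, start=1):
        if any(value <= 0 for value in stream):
            raise ValueError("streams must contain only positive integers")
        # Hypothetical label line preceding the data line.
        lines.append(f"{label_prefix}stream_{index}")
        # Data line: the integers, followed by the -1 terminator.
        tokens = [str(value) for value in stream] + ["-1"]
        lines.append(" ".join(tokens))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_for_command_line([[3, 7, 7], [12]]), end="")
```

The web version does both steps for you, so this is only needed when you run the tool yourself.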
__________________________________________________________________________
The following option is available to the user:
__________________________________________________________________________
The parameters you can set here are the following:
a) they span at least L positions.
b) if you look at any consecutive L literals (note: consecutive does NOT mean contiguous!) in a reported pattern the first and last literal are not more than W positions apart.
c) if a pattern is reported as appearing 10 times in the input, we guarantee that it cannot be made more specific through appending/prepending another string or through dereferencing a wild-card character without either decreasing the number of its occurrences or violating the L/W constraint.
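Property (b) can be made concrete with a small check. The sketch below is not part of the tool: the name satisfies_density_constraint and the list representation (integers for literals, None for wild-cards) are chosen here for illustration. It also assumes that "not more than W positions apart" means that the stretch from the first to the last of the L literals, counted inclusively, covers at most W positions. Properties (a) and (c) are not checked.

```python
def satisfies_density_constraint(pattern, L, W):
    """Check property (b) for a pattern given as a list.

    Literals are integers, wild-cards are None. For every run of L
    consecutive literals, the inclusive span from the first to the
    last literal must not exceed W positions (an interpretation of
    the wording on this page).
    """
    literal_positions = [i for i, symbol in enumerate(pattern) if symbol is not None]
    for start in range(len(literal_positions) - L + 1):
        first = literal_positions[start]
        last = literal_positions[start + L - 1]
        if last - first + 1 > W:
            return False
    return True


if __name__ == "__main__":
    # 5 . 9 12 : three literals within 4 positions
    print(satisfies_density_constraint([5, None, 9, 12], L=3, W=4))
    # 5 . . 9 12 : three literals spread over 5 positions
    print(satisfies_density_constraint([5, None, None, 9, 12], L=3, W=4))
```

Note that the literals in the first example are consecutive but not contiguous, which is the distinction property (b) draws.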
If SEQ_VERSION is set, you are solving the problem: "find all patterns that are supported by at least K (and not more than Q) streams."
If SEQ_VERSION is not set, you are solving the problem: "find all patterns that are supported by at least K (and not more than Q) instances."
Note that the output in this case is typically much bigger than in the previous version of the problem. The default value of Q is 2147483647.
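The difference between the two versions of the problem comes down to what is counted. The brute-force counter below is an illustration only and is not how Teiresias works internally; the name count_support and the list-with-None pattern representation are assumptions made for this example. It returns both the number of instances of a pattern and the number of streams that contain at least one instance.

```python
def count_support(pattern, streams):
    """Return (instances, supporting_streams) for a pattern.

    The pattern is a list where integers are literals and None is a
    wild-card. Overlapping instances within a stream are all counted.
    """
    instances = 0
    supporting_streams = 0
    for stream in streams:
        hits = 0
        for offset in range(len(stream) - len(pattern) + 1):
            # An instance matches when every literal agrees with the stream.
            if all(symbol is None or stream[offset + i] == symbol
                   for i, symbol in enumerate(pattern)):
                hits += 1
        instances += hits
        if hits > 0:
            supporting_streams += 1
    return instances, supporting_streams


if __name__ == "__main__":
    streams = [[4, 8, 15, 4, 9, 15], [4, 1, 15], [16, 23, 42]]
    print(count_support([4, None, 15], streams))  # prints (3, 2)
```

With K = 3, the pattern "4 . 15" in this example would qualify when SEQ_VERSION is not set (3 instances) but not when it is set (only 2 streams), which is one reason the instance-based output tends to be larger.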
__________________________________________________________________________
The input to be processed consists of streams of space/tab-separated positive integers. We assume that you will re-map the original set of "symbols" to a set of positive integers of your choice which we will then process for you. The data lines are permitted to have different lengths in the general case.
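As an illustration of the re-mapping step, the sketch below assigns positive integers to arbitrary symbols in order of first appearance. The function name encode_streams and the event names in the example are hypothetical; any one-to-one mapping onto positive integers would serve equally well.

```python
def encode_streams(streams):
    """Map each distinct symbol to a positive integer (1, 2, 3, ...).

    Returns the encoded streams and the symbol-to-integer mapping so
    that reported patterns can be translated back afterwards.
    """
    symbol_to_int = {}
    encoded = []
    for stream in streams:
        row = []
        for symbol in stream:
            if symbol not in symbol_to_int:
                # Integers start at 1 because the tool expects positive values.
                symbol_to_int[symbol] = len(symbol_to_int) + 1
            row.append(symbol_to_int[symbol])
        encoded.append(row)
    return encoded, symbol_to_int


if __name__ == "__main__":
    events = [["login", "view", "buy"], ["view", "view", "logout"]]
    encoded, mapping = encode_streams(events)
    for row in encoded:
        print(" ".join(str(value) for value in row))
```

Keep the mapping around: inverting it lets you read the integer patterns in the output in terms of your original symbols.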
__________________________________________________________________________
After you have prepared and entered the input in the provided window, click the COMPUTE button. Once the processing has completed, the results will be reported as follows:
Each line corresponds to an integer-based pattern, with a 'dot' representing a wild-card. The leftmost number in each line is the rank of the pattern. The second and third numbers in each line are the number of instances of the pattern and the number of input sequences that contain these instances, respectively.
Next, you can click on a pattern to select it, then click the SEQUENCES button.
This will open a new window that will show the original input sequences with the instances of the selected pattern highlighted.
__________________________________________________________________________