Gene Expression Analysis Using Teiresias
What Is This Tool For? - Input - Options - Parameters - Mode of Operation - Output - References
In the context of gene expression analysis, the types of experiments that can be carried out are limited only by a researcher's imagination. We can nonetheless distinguish two basic categories of experiments. In the first category, one uses DNA array technology to characterize the microscopic state of a cell or tissue and associate it with a macroscopic state that can be observed. In the second category, one deals with dynamic studies that track the induction or repression of genes as a function of time and in response to environmental changes.
Both categories can be thought of as generating datasets that are matrices of N rows, with each row comprising M columns of real numbers. The task is to discover and report relationships involving subsets of the rows and subsets of the columns of this matrix. This is a special case of the problem of association discovery: given a database of N records, each of which comprises 2 or more fields (i.e. columns), determine the set of all possible associations that involve at least two fields (= columns) and that are supported by at least K records.
It is important to establish such associations because they can be turned into either hypotheses about causal relationships between genotype and phenotype (first category), or hypotheses about information flow from/to genes that can give rise to putative network models involving these and other genes (second category). In either case, any hypotheses that are derived in silico can be corroborated or refuted through further study.
This tool is meant for analyzing time series that track the induction or repression of genes as a function of time and in response to environmental changes.
If instead of a time series, your input has M columns that represent "answers" to "different questions" (e.g. the rows are plants or tissues and the columns indicate levels of expression of a given gene of interest) then the ASSOCIATION DISCOVERY tool will give you more flexibility and you should use that tool instead.
__________________________________________________________________________
Input
__________________________________________________________________________
This tool takes inputs that consist of real numbers. In particular, if you want to carry out gene expression analysis of N rows (streams), each comprising M time steps, then you must provide N lines, each consisting of M real numbers separated by spaces or tabs. Since all of the rows contain the same number of columns, there is no need for label lines that precede the data lines.
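The input format described above can be checked before submission with a few lines of Python. This is only a sketch: the function name `parse_streams` is hypothetical, and skipping blank lines is an assumption, not documented behavior of the tool.

```python
def parse_streams(text):
    """Parse N lines of M whitespace-separated real numbers into N rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no data
        # str.split() with no argument splits on spaces and tabs alike
        rows.append([float(tok) for tok in line.split()])
    widths = {len(r) for r in rows}
    if len(widths) > 1:
        raise ValueError(f"rows have differing lengths: {sorted(widths)}")
    return rows


example = "0.1 0.5\t0.9\n1.2\t0.7 0.3\n"
streams = parse_streams(example)
print(len(streams), len(streams[0]))  # N and M
```

A `ValueError` here signals that at least one row has a different number of time steps than the others.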
__________________________________________________________________________
Options
__________________________________________________________________________
The following options are available to the user:
__________________________________________________________________________
Parameters
__________________________________________________________________________
The parameters you can set here are the following:
a) they span at least L positions.
b) if you look at any consecutive L literals (note: consecutive does NOT mean contiguous!) in a reported pattern the first and last literal are not more than W positions apart.
c) if a pattern is reported as appearing 10 times in the input, we guarantee that it cannot be made more specific, by appending/prepending another string or by dereferencing a wild-card character, without either decreasing the number of its occurrences or violating the L/W constraint.
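Conditions (a) and (b) can be tested mechanically on a pattern string. The sketch below assumes that '.' is the wild-card character and that "W positions apart" counts both end positions (an inclusive span); the function name `satisfies_lw` is hypothetical. Condition (c), maximality, depends on the whole input and is not checked here.

```python
def satisfies_lw(pattern, L, W, wildcard="."):
    """Check conditions (a) and (b) for a pattern given as a string."""
    # (a) the pattern spans at least L positions
    if len(pattern) < L:
        return False
    # positions of the literals, i.e. the non-wild-card characters
    literals = [i for i, ch in enumerate(pattern) if ch != wildcard]
    # (b) for every run of L consecutive literals, the first and last
    #     literal must lie within a window of W positions (inclusive span)
    for j in range(len(literals) - L + 1):
        if literals[j + L - 1] - literals[j] + 1 > W:
            return False
    return True


print(satisfies_lw("-.+.+-+--+", L=3, W=5))  # True
print(satisfies_lw("A..B...C", L=3, W=5))    # False
```

In the second call the three literals sit at positions 0, 3 and 7, so they span 8 positions, which exceeds W = 5.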
If SEQ_VERSION is set, you are solving the problem: "find all patterns that are supported by at least K (and not more than Q) streams."
If SEQ_VERSION is not set, you are solving the problem: "find all patterns that are supported by at least K (and not more than Q) instances."
Note that the output in the instance-based case is typically much bigger than in the stream-based version of the problem. The default value of Q is 2147483647 (2^31 - 1, the largest 32-bit signed integer).
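The difference between the two ways of counting support can be illustrated with a small sketch. It assumes streams that have already been turned into symbol strings, '.' as the wild-card, and overlapping matches counted separately; the helper names `support` and `is_reported` are hypothetical and do not reflect how Teiresias is implemented internally.

```python
def support(pattern, streams, wildcard="."):
    """Return (number of instances, number of streams with an instance)."""
    instances = 0
    supporting_streams = 0
    n = len(pattern)
    for s in streams:
        hits = 0
        for i in range(len(s) - n + 1):
            window = s[i:i + n]
            if all(p == wildcard or p == c for p, c in zip(pattern, window)):
                hits += 1
        instances += hits
        if hits > 0:
            supporting_streams += 1
    return instances, supporting_streams


def is_reported(pattern, streams, K, Q=2147483647, seq_version=True):
    """Apply the K/Q thresholds to streams or to instances."""
    instances, supporting_streams = support(pattern, streams)
    count = supporting_streams if seq_version else instances
    return K <= count <= Q


streams = ["+-+-+", "++--+", "-+-++"]
print(support("+.+", streams))                              # (3, 2)
print(is_reported("+.+", streams, K=3, seq_version=True))   # False
print(is_reported("+.+", streams, K=3, seq_version=False))  # True
```

The pattern occurs three times but in only two streams, so with K = 3 it passes the instance-based threshold and fails the stream-based one.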
Rules of thumb for setting the parameters
When selecting the values of L and W, the following rules of thumb apply:
- if W is much bigger than L then you should expect many more patterns in your output set
- if the requested support is a small percentage of the number of streams you should expect many more patterns in your output set
- we recommend that you push the envelope progressively:
  - start with values of W that are close to L and support that is close to 100%
  - continue by slowly increasing W or by slowly decreasing the support
  - if you really know what you are doing, push further by setting large values for W and small values for support
__________________________________________________________________________
Mode of Operation
__________________________________________________________________________
Let us assume that the i-th column (i = 1, ..., M) takes values in a range [min-val_i, max-val_i]. We determine the range of values across all columns, i.e.
[min over all i of {min-val_1, min-val_2, ..., min-val_M}, max over all i of {max-val_1, max-val_2, ..., max-val_M}]
and quantize it using the user-defined number of bins (this is the value of the parameter Bins). We then run Teiresias on the quantized input, discover all <L,W> patterns that are supported by at least K rows and report them after re-mapping to the respective intervals.
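The quantization step can be sketched as follows. Equal-width bins are an assumption here; the text only states that the global range is quantized into the user-defined number of bins. The function name `quantize` is hypothetical.

```python
def quantize(rows, bins):
    """Map every value to a bin index over the global [min, max] range."""
    lo = min(min(r) for r in rows)
    hi = max(max(r) for r in rows)
    width = (hi - lo) / bins  # assumption: equal-width bins

    def bin_of(x):
        if width == 0:
            return 0  # all values identical
        # the global maximum falls into the last bin rather than a new one
        return min(int((x - lo) / width), bins - 1)

    # intervals used to map bin indices back to real-number ranges
    intervals = [(lo + b * width, lo + (b + 1) * width) for b in range(bins)]
    return [[bin_of(x) for x in r] for r in rows], intervals


rows = [[0.0, 1.0, 2.0], [4.0, 3.0, 0.5]]
codes, intervals = quantize(rows, bins=4)
print(codes)      # [[0, 1, 2], [3, 3, 0]]
print(intervals)  # [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
```

Pattern discovery would then run on `codes`, and `intervals` supports the re-mapping of discovered patterns to intervals of real numbers.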
Whenever applicable, the user can select the "inverse regulation" option, which, in addition to the patterns that correspond to co-regulation, also reports patterns involving rows whose column entries have opposite signs (= inverse regulation).
Also, whenever applicable, the user can further smooth the input rows by selecting 'smooth'; the input rows will then be preprocessed by an averaging filter prior to running pattern discovery.
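As an illustration of what an averaging filter does to a row, here is a minimal moving-average sketch. The window size of 3 and the truncated windows at the two ends are assumptions; the text does not specify the filter that the tool applies.

```python
def smooth(row, window=3):
    """Centered moving average; windows are truncated at the row ends."""
    half = window // 2
    out = []
    for i in range(len(row)):
        chunk = row[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


print(smooth([1.0, 4.0, 1.0, 4.0]))  # [2.5, 2.0, 3.0, 2.5]
```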
Finally, one can optionally select 'use derivatives' in order to do pattern discovery on the signs of the derivatives of the input rows (see the reference below for an extensive description and real-world examples). If the 'use derivatives' option is selected, the input is replaced by +'s (= up-regulation), -'s (= down-regulation) and ='s (= no change), and it is on that new representation that we run Teiresias; the output will then consist of +'s, -'s, ='s and, of course, wild-cards.
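The sign representation can be sketched with first differences between consecutive time steps. Using first differences as the derivative, the tolerance `tol` for "no change", and the helper names are all assumptions made for illustration; the `invert` helper shows one plausible reading of inverse regulation, in which every + becomes a - and vice versa.

```python
def derivative_signs(row, tol=0.0):
    """Encode a row of M values as M-1 symbols: '+', '-' or '='."""
    out = []
    for prev, cur in zip(row, row[1:]):
        d = cur - prev  # assumption: derivative taken as a first difference
        if d > tol:
            out.append("+")   # up-regulation
        elif d < -tol:
            out.append("-")   # down-regulation
        else:
            out.append("=")   # no change
    return "".join(out)


def invert(signs):
    """Swap '+' and '-' to obtain the inversely regulated profile."""
    return signs.translate(str.maketrans("+-", "-+"))


signs = derivative_signs([1.0, 2.0, 2.0, 0.5])
print(signs)          # +=-
print(invert(signs))  # -=+
```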
__________________________________________________________________________
Output
__________________________________________________________________________
After you have prepared and entered the input in the provided window, click on the COMPUTE button. Once Teiresias has completed processing your input, the results will be reported as follows:
We use a fixed-width font to report the results in a manner that permits the user to qualitatively determine where (e.g. beginning, middle, end, etc.) in the input stream each discovered pattern begins. We also use the underscore character '_' as a prefix to patterns that do not begin at the leftmost position in a processed input stream: more underscores indicate that the pattern's starting position is further from the beginning of each stream.
where the first two numbers are the number of instances and the number of streams containing those instances, followed by the pattern in the form of intervals on real numbers interspersed with wild-cards.
This last pattern involves time points that are to the right of the ones comprising the pattern
22 22 __________-.+.+-+--+
With the help of the fixed-width font, the user can quickly skim through the discovered patterns and select those that involve a region of interest.
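If the reported lines are saved as text, they can be split into their parts with a short script. The sketch assumes the three-part layout described above (instances, streams, pattern) separated by whitespace; the function name `parse_report_line` is hypothetical. Since the underscores are described only as a qualitative cue, the returned `indent` should be treated as a relative indicator of where the pattern starts rather than as an exact position.

```python
def parse_report_line(line):
    """Split a report line into (instances, streams, indent, pattern)."""
    # split on the first two runs of whitespace; the rest is the pattern
    instances, streams, pattern = line.strip().split(None, 2)
    body = pattern.lstrip("_")
    indent = len(pattern) - len(body)  # number of leading underscores
    return int(instances), int(streams), indent, body


print(parse_report_line("3 2 ____+.-+"))  # (3, 2, 4, '+.-+')
```

Sorting parsed lines by `indent` groups together patterns that begin in the same region of the input streams.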
You can click on a row to select it, then click on the SEQUENCES button:
This will open a new window and display the original input sequences with the selected pattern highlighted, as follows:
Alternatively, you can click on a row to select it, then click on the PLOT button. This will open a new window that will show the plots of the original input sequences with the pattern highlighted. The following plot is an example of an input that has been processed with the options 'use derivatives' and 'inverse regulation' selected:
__________________________________________________________________________
__________________________________________________________________________