Bio-Dictionary(TM)-based Protein Annotation
What Is This Tool For? - Input - Options - Parameters - Mode of Operation - Output - References
This tool automatically annotates a given amino acid sequence (or fragment) with the help of the Bio-Dictionary. It determines and reports local and global similarities between the query and the contents of the SwissProt/TrEMBL database; phylogenetic domain membership as a function of position within the query; all PUBMED references that are relevant to the query; and the location and nature of domains, regions of interest, active sites, post-translationally modified sites, etc. The Bio-Dictionary patterns used to annotate the query are also reported, together with their location within the query and an associated log-probability value.
The Bio-Dictionary is a collection of patterns, referred to as seqlets, that can be shown to account for all of the sequence space of natural proteins as this space is sampled by the currently available public databases. The project began in early 1997, and over the years we have used public databases such as SwissProt and GenPept as our input. Since spring 2000, we have been using SwissProt/TrEMBL as our knowledge base.
By design, the seqlets contained in a given version of the Bio-Dictionary capture both intra-family and inter-family signals that can be determined directly from the amino acid sequences, thus presenting new opportunities for the study and analysis of proteins. By tapping into information contained in public databases we can augment each seqlet with a 'meaning', which can take many forms: for example, the name of the family or families of the proteins containing instances of the seqlet, known secondary/tertiary structure for the sequence regions corresponding to the instances of the seqlet, other known features corresponding to these regions, etc.
The key idea is simple: the query sequence is decomposed in terms of the fixed collection of seqlets contained in the Bio-Dictionary. Using the meanings attached to the seqlets that make up the decomposition of the query, we annotate individual locations as well as regions of the query with the respective meaning (an instance of 'guilt by association').
__________________________________________________________________________
Input
__________________________________________________________________________
The input to be processed consists of an amino acid sequence in FASTA format, i.e. a label line followed by one or more data lines (amino acids).
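To make the expected input shape concrete, here is a minimal Python sketch of reading FASTA text: a label line starting with `>` followed by one or more data lines. The function name `parse_fasta` and the example record are illustrative assumptions and are not part of the tool.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (label, sequence) pairs."""
    records = []
    label, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new label line closes the previous record, if any.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        elif label is None:
            raise ValueError("data line found before any label line")
        else:
            chunks.append(line.upper())  # data lines are concatenated
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# Hypothetical query: one label line and two data lines.
example = """>query1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
"""
records = parse_fasta(example)
```

Here `records` holds a single pair whose sequence is the two data lines joined together.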
__________________________________________________________________________
Options
__________________________________________________________________________
No options are available for this tool.
__________________________________________________________________________
Parameters
__________________________________________________________________________
The parameters you can set here are the following:
Rules of Thumb for Setting the Parameters
The default settings should suffice for most purposes. The meaning of each parameter makes its effect on the generated output straightforward to predict.
__________________________________________________________________________
Mode of Operation
__________________________________________________________________________
Once the query is given, every seqlet in the Bio-Dictionary is looked up within the query. The meanings of those seqlets that have instances in the query are accumulated, and neighboring regions with similar meanings are grown into larger regions. Regions whose associated meanings exceed a threshold value of support are reported.
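The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the tool's actual implementation: the seqlets are written as regular expressions, each carries a single made-up meaning and weight, "growing" regions is reduced to merging contiguous supported positions, and the names `SEQLETS`, `annotate`, and `min_support` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical seqlets: (regex-style pattern, illustrative meaning, weight).
SEQLETS = [
    ("G..G.GK[ST]", "ATP-binding", 1.0),
    ("GK[ST]T", "ATP-binding", 1.0),
    ("L..LL", "helix", 1.0),
]


def annotate(query, seqlets, min_support=2.0):
    """Return (meaning, start, end, peak_support) for supported regions."""
    # support[meaning][i] accumulates the weight covering position i.
    support = defaultdict(lambda: [0.0] * len(query))
    for pattern, meaning, weight in seqlets:
        # A lookahead lets overlapping instances of a seqlet be found.
        for m in re.finditer("(?=(" + pattern + "))", query):
            start = m.start()
            for i in range(start, start + len(m.group(1))):
                support[meaning][i] += weight

    regions = []
    for meaning, per_pos in support.items():
        start = None
        # The trailing 0.0 is a sentinel that closes a region at the end.
        for i, s in enumerate(per_pos + [0.0]):
            if s > 0 and start is None:
                start = i  # a region with this meaning begins
            elif s == 0 and start is not None:
                peak = max(per_pos[start:i])
                if peak >= min_support:  # report only supported regions
                    regions.append((meaning, start, i, peak))
                start = None
    return sorted(regions, key=lambda r: (r[1], r[0]))


regions = annotate("MAGESGVGKSTLLAALLK", SEQLETS)
```

In this toy query the two overlapping "ATP-binding" instances merge into one region spanning positions 2 to 10 (0-based, end exclusive at 11) with peak support 2.0, while the single "helix" instance falls below the threshold and is not reported.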
__________________________________________________________________________
Output
__________________________________________________________________________
The output consists of three frames.
The top frame shows the phylogenetic domain membership of the query as a function of position.
Also provided here is a link that permits the user to see the seqlets that were used to annotate the query.
The middle frame (a.k.a. 'similarities' frame) shows which regions of the query can also be found in which known protein families (or individual proteins if no protein family has been assigned). The plots take values from 0 to 100 that indicate the quality of the region's similarity, and are listed in order of decreasing score. Note that if a narrow region (in terms of extent) is better conserved than a wide region, the narrow region's plot will be ranked higher in the list of plots. The captions of the plots are active links: when one is selected, a query is issued that determines those entries from SwissProt and TrEMBL (and TrEMBL-New when appropriate) that have the same description line. In this frame, we also provide a link that reports those PUBMED references that are relevant to the corresponding reported similarity.
The bottom frame (a.k.a. 'features' frame) shows the location and extent of features that can be identified in the processed query, e.g. binding domains, transmembrane domains, active sites, post-translationally modified sites, signals, etc. Note that these deductions are made using entirely local information. As a result, regions of a globular protein may be automatically marked as 'transmembrane' because they have high sequence-level similarity to known instances of transmembrane helices. Such local/global conflicts are easy to identify and discard, because the additional corroborating features that should otherwise appear in the 'features' frame are absent. As in the case of the 'similarities' frame, the captions of the plots are active links that will help you determine those entries from SwissProt and TrEMBL (and TrEMBL-New when appropriate) that contain the corresponding feature table entry.
NOTE: The minimum extent that a similar region must have before it is reported is set by the user on the tool's entry page. This threshold only controls the results in the 'similarities' frame and does not affect the results reported in the 'features' frame.
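The interaction between the minimum-extent threshold and the ranking of the 'similarities' frame can be illustrated with a short Python sketch. The record layout, the function names `similarities_frame` and `features_frame`, and the sample values are assumptions made for illustration only; they do not reflect the tool's internal data structures.

```python
def similarities_frame(regions, min_extent):
    """Drop regions shorter than min_extent, then rank by decreasing score."""
    kept = [r for r in regions if r["end"] - r["start"] >= min_extent]
    # Ranking uses the score alone, so a narrow, well-conserved region
    # can appear above a wide, less conserved one.
    return sorted(kept, key=lambda r: r["score"], reverse=True)


def features_frame(features, min_extent):
    """The threshold is deliberately ignored: features are never filtered by it."""
    return list(features)


# Hypothetical similar regions with scores on the 0 to 100 scale.
regions = [
    {"caption": "family A", "start": 0, "end": 120, "score": 60},
    {"caption": "family B", "start": 30, "end": 50, "score": 95},
    {"caption": "family C", "start": 5, "end": 10, "score": 99},
]
ranked = [r["caption"] for r in similarities_frame(regions, min_extent=15)]
```

With a minimum extent of 15, the five-residue region is dropped despite its high score, and the narrow "family B" region is ranked above the wide "family A" region.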
References