Signal Search Analysis Tutorial

1. Introduction:

Signal search analysis is a long-established method (published in 1984) for analysing sequence motifs that occur at characteristic distances upstream or downstream from a functional site in a nucleic acid sequence. Note that this problem is different from the standard motif search problem addressed by many other algorithms. There, one wants to find a motif present in all, or in a statistically significant proportion, of the input sequences; the location of the motif within the sequences is irrelevant. In signal search analysis, the input is a list of experimentally defined functional sites, for instance transcription initiation sites, given as pointers to positions in nucleotide database sequences (the sequences themselves are stored elsewhere on the computer). The user specifies on the fly the sequence range around the site that she/he wants to consider. In signal search analysis, not only the structure of a motif, but also its location relative to the functional site and its distance flexibility are of interest.

Let's have a look at a few examples.

A)

A sizeable fraction of eukaryotic promoters contain a so-called TATA-box upstream of the initiation site. Let's suppose that we already know the approximate structure of this element and that we are primarily interested in its location relative to the initiation sites. To answer this question go to the OProf page (OProf stands for occurrence profile) and follow the instructions below: You are now ready to submit the job.

As you can see, there are a number of parameters you can play with in order to make a graphically appealing signal occurrence profile.
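
The idea of a signal occurrence profile can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the OProf implementation: the function names, the toy sequences, and the choice of reporting, for each start position, the fraction of sequences that match the motif with a limited number of mismatches are all hypothetical.

```python
def count_mismatches(window, motif):
    """Number of positions at which window and motif differ."""
    return sum(1 for a, b in zip(window, motif) if a != b)


def occurrence_profile(sequences, motif, site_index, max_mismatches=0):
    """Map each relative position to the fraction of sequences matching there.

    site_index is the 0-based index of the functional site in every sequence,
    so the relative position of a match is its start index minus site_index.
    """
    length = min(len(s) for s in sequences)
    profile = {}
    for start in range(length - len(motif) + 1):
        hits = sum(
            1
            for s in sequences
            if count_mismatches(s[start:start + len(motif)], motif) <= max_mismatches
        )
        profile[start - site_index] = hits / len(sequences)
    return profile


# Toy data (invented): two of three sequences carry a TATA-like motif.
toy = ["GGTATAAAGGCCA", "CCTATATAGGCCA", "GGGGGGGGGGCCA"]
profile = occurrence_profile(toy, "TATAAA", site_index=10, max_mismatches=1)
peak = max(profile, key=profile.get)
print(peak, profile[peak])
```

In this toy set the only non-zero entry is at relative position -8, where two of the three sequences match.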

The CCAAT-box is another signal reported to occur frequently in the upstream region of eukaryotic promoters. Unlike the TATA-box, it appears to function in either orientation. To confirm these claims:

B)

Let's now look at bacterial translation initiation sites.
Prokaryotic messenger RNAs contain a so-called Shine-Dalgarno interaction region upstream of the initiation codon, containing a sequence motif that is complementary to the highly conserved 3' terminus of the 16S ribosomal RNA (see for instance A14565).

To analyse this motif, choose an oligonucleotide near the 3' end of the E. coli 16S ribosomal RNA, e.g. CCUCCU, and produce a signal occurrence profile for its complement (AGGAGG in the proposed example) for translation start site regions of several bacterial species. Note that you can select several species at once (up to four) in order to combine several signal occurrence profiles in one graph. To start this analysis, we propose to analyse and compare the Shine-Dalgarno interaction regions of the extensively studied species E. coli and B. subtilis. Bacteria do not only use ATG as translation initiation codon, but also, at lower frequencies, GTG, CTG, and TTG. Determine the frequencies at which these codons are used in various prokaryotic species using the OProf service.
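
The two small computations in this exercise, taking the complement of an rRNA oligonucleotide and tabulating initiation codon usage, can be sketched in Python. Both helper functions and the toy sequences are hypothetical illustrations and are not part of the OProf service; the complement is taken as the reverse complement written in the DNA alphabet.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an RNA or DNA oligonucleotide, as DNA."""
    dna = seq.upper().replace("U", "T")
    return dna.translate(COMPLEMENT)[::-1]


def start_codon_usage(sequences, site_index):
    """Relative frequency of each triplet starting at site_index (0-based)."""
    counts = Counter(s[site_index:site_index + 3] for s in sequences)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


print(reverse_complement("CCUCCU"))  # AGGAGG

# Toy sequences (invented), each with the initiation codon at index 2.
toy = ["AAATGC", "AAGTGC", "AAATGA", "AAATGG"]
usage = start_codon_usage(toy, site_index=2)
print(usage)
```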

2. FPS-dependent sequence retrieval

As mentioned before, SSA programs typically do not use sequences as input but lists of computer-readable pointers to sequence positions in a database. Such a list of pointers is called a functional position set, or FPS. Each pointer contains a sequence id, a position, and two flags, one indicating the strand (+ or -), the other one the topology (1=linear, 0=circular). The Eukaryotic Promoter Database is an example of a functional position set. To further illustrate this concept, let's now go for a short moment to the EPD pages:
   EPD/
Display an individual promoter entry in text format, for instance
   HS_MYC_1 doc
The computer-readable pointer to the sequence position is contained in the line starting with the line code FP.
Try to identify the four crucial elements: sequence id, topology, orientation, and position.
The FPS files used by the Signal Search Analysis server can also be viewed, for instance:
   /ssa/data/fps/pro/epd_nr.fps
   /ssa/data/fps/bac/Bacillus_subtilis.fps
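
The pointer concept can be made concrete with a small Python sketch. It is a hypothetical data structure, not the FP line format of EPD or the FPS file format: the class, field, and function names are invented, the position is assumed to be 1-based, and bases outside a linear sequence are simply padded with N.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass(frozen=True)
class FunctionalPosition:
    seq_id: str    # identifier of the database sequence
    position: int  # 1-based position of the functional site (assumption)
    strand: str    # "+" or "-"
    linear: bool   # True = linear, False = circular


def extract_region(fp, database, rel_from, rel_to):
    """Return bases rel_from..rel_to around the site; the site is numbered 0."""
    seq = database[fp.seq_id]
    step = 1 if fp.strand == "+" else -1
    bases = []
    for rel in range(rel_from, rel_to + 1):
        idx = fp.position - 1 + step * rel
        if not fp.linear:
            idx %= len(seq)  # wrap around on circular molecules
        bases.append(seq[idx] if 0 <= idx < len(seq) else "N")
    region = "".join(bases)
    # On the minus strand the bases were read backwards; complement them.
    return region.translate(COMPLEMENT) if fp.strand == "-" else region


database = {"toy": "AACCGGTT"}
plus = FunctionalPosition("toy", 4, "+", True)
minus = FunctionalPosition("toy", 4, "-", True)
print(extract_region(plus, database, -2, 2))   # ACCGG
print(extract_region(minus, database, -2, 2))  # CCGGT
```

Keeping only pointers means that the same site list can be reused with any sequence range chosen at analysis time.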

Now, from the EPD home page, on the left menu, follow the link: Download EPD db. This page allows you to extract promoter sequence segments around transcription initiation sites from EPD, and from various predefined subsets of EPD. The user can specify the relative 5' and 3' borders of the sequence regions to extract. Use this page to download promoter sequence files in Fasta format corresponding to the following subsets.

   All promoters
   Plant promoters
   Arthropod promoters
   Vertebrate promoters
Activate the switch "Representative set of not closely related sequences" at the bottom of the page. Specify the sequence region -499 to +100. Note at this point that the base corresponding to the first transcribed base of the RNA is numbered 0. The total length of the extracted sequence fragments is thus exactly 600. The result has to be saved in text format.

These files can be uploaded to the signal search analysis server. Try to reproduce one of the signal occurrence profiles you made before by uploading the promoter sequence file containing the non-redundant subset of all promoter sequences. Note that you have to indicate the relative internal position of the functional site on the OProf form (500 in this case). You can also specify a name for the sequence set (e.g. epd_nr) and a description of the site type (e.g. "Transcription start site") on the form. The contents of these fields will appear in the graphical output produced by the signal search analysis server.
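
The coordinate arithmetic behind the numbers 600 and 500 can be checked with a short Python sketch. The function names are hypothetical; the only assumption taken from the text is that the functional site itself is numbered 0, so there is a position 0 between -1 and +1.

```python
def fragment_length(rel_from, rel_to):
    """Number of bases from rel_from to rel_to inclusive (site is position 0)."""
    return rel_to - rel_from + 1


def internal_site_position(rel_from):
    """1-based index, within the extracted fragment, of the base numbered 0."""
    return 1 - rel_from


print(fragment_length(-499, 100))    # 600
print(internal_site_position(-499))  # 500
```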

3. Constraint profiles.

In the examples studied so far, we already had some idea of what the signal we were interested in looks like. But how should we proceed if we know absolutely nothing in the beginning? The program CPR (for constraint profile) can be of some help in such situations. A constraint profile is a plot of sequence non-randomness as a function of the location relative to a functional site. For instance, eukaryotic promoter sequences show high non-randomness about 30 bp upstream of the transcription start site because of the frequent presence of a TATA-box motif in this region.

Input to a constraint analysis is a functional position set (FPS) and a so-called "signal sequence collection". The latter may consist of a complete set of oligonucleotides of a particular length. As in OProf, the sequences extracted with the FPS are scanned with a sliding window. The frequencies of the elements of the signal sequence collection are determined for each window. This gives rise to a two-dimensional array of numbers called "signal search data". In windows with high sequence constraints, a few oligonucleotides may occur at very high frequencies while most others occur at frequencies slightly below expectation. This would lead to a relatively high variance of "signal frequencies" (original jargon). The constraint index displayed in a constraint profile is in fact based on the variance of the signal frequencies.
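
The following Python sketch illustrates how signal search data and a variance-based constraint measure could be computed. It is a simplification under stated assumptions and not the CPR algorithm: the function names and toy sequences are invented, the signal collection is the complete set of oligonucleotides of length k, and the plain population variance of the counts stands in for the actual constraint index, whose exact formula is not given here.

```python
from itertools import product
from statistics import pvariance


def signal_search_data(sequences, k, window_size):
    """One dict per window: oligonucleotide -> count over all sequences."""
    oligos = ["".join(p) for p in product("ACGT", repeat=k)]
    length = min(len(s) for s in sequences)
    n_starts = length - k + 1
    table = []
    for w in range(n_starts - window_size + 1):
        counts = dict.fromkeys(oligos, 0)
        for s in sequences:
            for start in range(w, w + window_size):
                word = s[start:start + k]
                if word in counts:
                    counts[word] += 1
        table.append(counts)
    return table


def constraint_profile(table):
    """Variance of the signal counts in each window (stand-in index)."""
    return [pvariance(list(row.values())) for row in table]


# Toy sequences (invented) sharing TATA at indices 2 to 5.
toy = ["ACTATAGC", "CGTATATG", "GATATACA", "TTTATAAC"]
profile = constraint_profile(signal_search_data(toy, k=2, window_size=1))
print(profile)
```

The windows that fall inside the shared TATA stretch show a clearly higher variance than the flanking windows, which is the effect a constraint profile is meant to reveal.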

Let's look at an example:

Instead of a complete signal search collection, one can also use a random subset of oligonucleotides of a particular length, for instance 200 hexamers. This allows one to use longer signals without exponentially increasing the computing time. Special collections allow the use of so-called "gapped oligonucleotides". A gapped oligonucleotide is a motif consisting of real bases and unspecific positions represented by the wild-card character N. For instance, ANA is a gapped dinucleotide. A certain type of gapped oligonucleotide is specified by a string consisting of the letters X and N, where X stands for a real base and is automatically expanded to all four bases of the DNA alphabet. For instance, XNX is expanded to:
   XNX -> ANA,ANC,ANG,ANT,CNA,CNC,CNG,CNT,GNA,GNC,GNG,GNT,TNA,TNC,TNG,TNT
Different types can be combined in one collection but they all have to be of the same length.
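
The expansion rule can be sketched in Python as follows. The function names are hypothetical and the code only mirrors the rule as described above: every X is replaced by each of the four bases, every N is kept as a wild card, and all types in one collection must have the same length.

```python
from itertools import product


def expand_signal_type(pattern):
    """Expand a string of X (real base) and N (wild card) into all motifs."""
    if not pattern or set(pattern) - {"X", "N"}:
        raise ValueError("pattern must consist of the letters X and N only")
    slots = ["ACGT" if ch == "X" else "N" for ch in pattern]
    return ["".join(p) for p in product(*slots)]


def build_collection(patterns):
    """Combine several signal types, which must all have the same length."""
    if len({len(p) for p in patterns}) != 1:
        raise ValueError("all signal types must have the same length")
    collection = []
    for pattern in patterns:
        collection.extend(expand_signal_type(pattern))
    return collection


print(",".join(expand_signal_type("XNX")))
print(len(build_collection(["XNX", "XXN"])))  # 32
```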

Further suggestion:

4. Using signal lists to analyse the contents of constraint regions

The program SList (for Signal List) is used to analyse the contents of a constraint region. The input and data processing steps are largely the same as for the constraint analysis. Both programs generate so-called signal search data (lists of oligonucleotide frequencies determined in a sliding window). What is different is the output. SList produces a list of locally over- or under-represented "signals" (oligonucleotide motifs). Over- and under-representation can be assessed in two different ways. "Calculation mode" 1 uses the mean of all signal frequencies in the corresponding window as the reference; mode 2 uses the mean of the frequencies of the corresponding signal in all windows as the reference. The selection mode refers to local and global maxima along a particular signal occurrence profile.
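
The difference between the two calculation modes can be illustrated with a small Python sketch. It is not the SList implementation: the function name and toy numbers are invented, and expressing over- or under-representation as the ratio of the observed frequency to the reference is an assumption made only for illustration.

```python
def representation(table, mode):
    """Observed/reference ratio for every signal in every window.

    table is a list of windows, each a dict mapping signal -> frequency.
    mode 1: reference is the mean of all signals in the same window.
    mode 2: reference is the mean of the same signal over all windows.
    """
    signals = list(table[0])
    if mode == 1:
        refs = [sum(row.values()) / len(row) for row in table]
        return [
            {s: (row[s] / ref if ref else 0.0) for s in signals}
            for row, ref in zip(table, refs)
        ]
    if mode == 2:
        refs = {s: sum(row[s] for row in table) / len(table) for s in signals}
        return [
            {s: (row[s] / refs[s] if refs[s] else 0.0) for s in signals}
            for row in table
        ]
    raise ValueError("mode must be 1 or 2")


# Toy signal search data (invented): two windows, two signals.
toy = [{"TA": 6, "GC": 2}, {"TA": 2, "GC": 2}]
print(representation(toy, mode=1))
print(representation(toy, mode=2))
```

In the second window TA looks unremarkable in mode 1 (ratio 1.0) but under-represented in mode 2 (ratio 0.5), because mode 2 compares it with its own average over all windows.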

Use SList to further investigate the signals corresponding to the constraint regions found in eukaryotic promoters and bacterial translation start regions.

5. Optimizing a weight matrix for a locally over-represented sequence motif

Consensus sequences are not always appropriate descriptors of regulatory sequence motifs. In particular, they cannot distinguish between easily tolerated and severe mismatches. Note that a weight matrix can be viewed as a generalization of a consensus sequence. For instance, the motif TATAAA (1 mismatch) can be represented by the following weight matrix:
   0 0 0 1
   1 0 0 0
   0 0 0 1
   1 0 0 0
   1 0 0 0
   1 0 0 0

Cut-off value: 5
Convince yourself of this equivalence by generating the same signal occurrence profile with this motif for eukaryotic promoters, once with a consensus sequence and once with a weight matrix.
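
The equivalence can also be checked exhaustively with a short Python sketch. It assumes that each row of the matrix corresponds to one motif position and that the four columns are in the order A, C, G, T (the order is not stated above, but it is consistent with the rows shown), and that a word matches when the sum of its weights reaches the cut-off. The function names are hypothetical.

```python
from itertools import product

# Rows = motif positions, columns = A, C, G, T (assumed order).
MATRIX = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
CUTOFF = 5
COLUMN = {"A": 0, "C": 1, "G": 2, "T": 3}


def matches_matrix(word, matrix, cutoff):
    """True if the summed weights of word reach the cut-off value."""
    score = sum(row[COLUMN[base]] for row, base in zip(matrix, word))
    return score >= cutoff


def matches_consensus(word, consensus, max_mismatches):
    """True if word differs from consensus in at most max_mismatches bases."""
    mismatches = sum(1 for a, b in zip(word, consensus) if a != b)
    return mismatches <= max_mismatches


hexamers = ["".join(p) for p in product("ACGT", repeat=6)]
equivalent = all(
    matches_matrix(w, MATRIX, CUTOFF) == matches_consensus(w, "TATAAA", 1)
    for w in hexamers
)
n_matching = sum(matches_matrix(w, MATRIX, CUTOFF) for w in hexamers)
print(equivalent, n_matching)  # True 19
```

Both descriptions accept the same 19 hexamers: TATAAA itself plus the 18 words that differ from it at exactly one position.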

It is not a trivial task to find an optimal weight matrix description for a motif like the TATA-box. The program PatOp (for pattern optimization) implements an iterative procedure which successively optimizes the weight matrix, the cut-off value, and the borders of the preferred region of occurrence, keeping two of these three components constant at a time. PatOp can extend the matrix to the left and right if additional consensus is observed, or drop positions in the opposite case.

Use this program to produce a weight matrix description of the TATA-box motif for the non-redundant insect and plant promoter sets (they are relatively small and thus do not take too much time). Use default parameters for this purpose (a detailed understanding of the parameters of the PatOp algorithm is beyond the scope of this tutorial). Start from the consensus sequence motif TATAAA (one mismatch).

PatOp uses a heuristic algorithm converging to a local optimum. To test convergence, start the iterative refinement process from another initial motif, for instance TATAAT.

Try also to derive a weight matrix for the Shine-Dalgarno interaction region of a completely sequenced bacterium. Are the weights of the matrix compatible with the assumption that G:U pairs can also be formed between the mRNA leader and the 3' end of the 16S rRNA?

6. Analyse a collection of yeast splice acceptor sites

It has been said that the introns from budding yeast contain a special signal near the branchpoint. During the splicing reaction, the 5' end of the intron is covalently linked to the 2'-OH group of an internal nucleotide (called the branchpoint), leading to a so-called lariat structure. The branchpoint is difficult to determine experimentally, but it is known to be located within a limited distance range from the 3' end of the intron, also called the splice acceptor site.

A sequence set of yeast splice acceptor sites can be found here:

   /ssa/data/yeast_ag.seq
The sequences extend from -200 to +100 relative to the 3' end of the intron. Use your skills learned during the previous exercises to characterize the branchpoint consensus sequence of budding yeast.
Last update 9 Jul. 2010