EPD tutorial

Background.
This tutorial describes how to generate a validated promoter collection (represented as transcription start sites, TSS), check its quality and select promoters from it. In section 1 you will learn how to experimentally validate a promoter collection downloaded from public repositories. You will use CAGE data available on our server as the experimental evidence. The flowchart underlying this section is presented in the Figure 1. Section 2 will be focused on describing quality checks that can be applied to promoters lists that rely on studying promoter-specific feature such as distribution of core promoter elements or histone marks around TSS. Finally, section 3 will focus on how to select promoters based on their expression in specific samples.

Tutorial flowchart
Figure 1: flowchart describing how to make a validated promoter collection.
Each step will require the following:
  • Step 1: Download data from the web
  • Step 2: Selection and formating is done in R
  • Step 3 and 4: The data is already present in the MGA database
  • Step 5: Peak calling is done with ChIP-Peak
  • Step 6: Promoters are validated with ChIP-Cor
  • Step 7: Shifting is done first with ChIP-Extract, then R
  • Step 8: Final promoter collection ready to be checked

1. Make a validated promoter collection

1.1 Make a promoter collection from UCSC Genome Browser data

The first step in this tutorial is to download a promoter collection from a publicly available database. Here it is shown how to do it for UCSC gene annotation table but you can do it in BioMart as well. This is your starting point: a promoter collection that is not experimantally validated, with a low accurancy in placing the TSS. You will then use CAGE data to validate the promoters (selecting the promoters that are expressed) and to shift the TSS location to the nearest CAGE peak. This should represent a better estimate of the true TSS location.

Now you have the promoter collection correctly formatted for the ChIP-Seq server. If you are having problems generating it, you can download a version here. Open it with a text editor to familiarize with the SGA format. Note that the 6th colum represents the associated gene name and that often several promoters belong to the same gene.

1.2 Validate the UCSC promoters with CAGE data

To start the validation process (point 1 in the next list), you have to extract CAGE peaks from the genome. To do so, you will use samples from the ENCODE CAGE data . To validate the largest number of promoters, you should use a sample that is the result of merging all longPolyA samples (mRNA enriched samples) that are present in this serie. The merged sample is available from our server in the drop-down menu. The resulting peaks will then be used to validate and select promoters (point 2) for subsequent analysis.

  1. Use ChIP-Peak to find peaks in ENCODE CAGE data using the following parameters. You can try and change them. For example changing the 'Peak Threshold' would affect the sensitivity of your analysis, lower values produce more peaks that can be used to validate more TSSs. Reducing the 'Window Width' and 'Vicinity Range' would affect broad peaks, potentially splitting them in smaller peaks.

    • Select the following sample:
      • Genome: hg19
      • Data Type: ENCODE RNA-seq
      • Series: GSE34448, CAGE data from ...
      • Sample: All samples cell longPolyA

    • Set the following parameters:
      • Strand : oriented
      • Window Width (bp) : 200
      • Vicinity Range (bp) : 200
      • Peak Threshold (nb of reads) : 50
      • Count Cut-off (nb of reads) : 99999999
      • Peak Refinement : on
    • Run the job.
    • Save the SGA file with the peaks.
  2. Use ChIP-Cor to select promoters with CAGE peaks nearby:

    • Use the UCSC human TSS file (generated with R before) as reference feature (using the 'Upload custom Data'). It is an SGA file, mapped on hg19 and is an oriented feature.
    • Use the ENCODE CAGE peaks file you have just saved as a target feature (again with the 'Upload custom Data'). Select the "+" strand only. This way, you will get the distribution of CAGE peaks that are near UCSC annotated TSS that map to the sense strand relative to the TSS orientation.
    • Reduce the range by a factor of 10: from -100 to 100; Window Width to 1 and Count Cut-off to 1.
    • You are now ready to submit the analysis.
    • In the result page use the "Enriched Feature Extraction Option" to extract promoters that have at least one CAGE peak near them (in a window of 200 bp around them). To do so, set the following parameters:
      From -100; To 100; Threshold: 1; Cut-Off : 1; set "Ref Feature Oriented " as ON
    • You can save the SGA file with the validated promoters

1.3 Shift the position of the annotated TSS to a CAGE peak nearby

At this point in the analysis you have a promoter collection that has been validated by CAGE data with the original TSS coordinates from UCSC. These are not very precise. To increase the TSS precision you could shift them to the nearest CAGE peak position or, alternatively, shift their location to the nearest base with the higher CAGE count. These two values do not always coincide, since ChIP-Peak return the 'center of gravity' of the peak area, and, expetially for broad peaks, it might be away from the base with the higher CAGE count. [Remember that CAGE reads represent the 5-end of mRNA molecules and so the transcription start site of a gene.] We found that the second option gives better results in term of quality. In this section you will learn how to do it. This is a two step procedure in which you [1] extract the position of the CAGE tags around each promoter and [2] shift the TSS to the nearest CAGE maximum.

  1. Use ChIP-Extract to extract CAGE tags around these promoters

    • You can used the link to ChIP-Extract from the previous page or alternatively upload the file with validated promoters as reference feature. As usual, promoters are oriented features.
    • Use the ENCODE samples you used at the beginning of this tutorial (All samples cell longPolyA). Select "+" strand only.
    • Set the range from -100 to 100; Window Width 1 and Count Cut-off 9999999999
    • Submit the job
    • Save the "Table (TEXT)" file. This matrix stores the distribution of CAGE tags around each promoter.
  2. Use R to shift promoters to the nearest CAGE tag maximum using the following code. Here ucsc_encodeCageTags.out is the matrix file you just generated and ucscHumanGenesActive.sga is the list of validated promoters.

    # Read the matrix file:
    cage <- read.table("ucsc_encodeCageTags.out")

    # The function getShift calculates the distance (= shift) between the TSS and
    # the CAGE maximum. It accepts a vector (x) and return the shifting distance
    # from the midpoint.
    getShift <- function(x){
      maxCage <- which(x == max(x))
      min(maxCage - ceiling(length(x)/2))
    }

    # Apply the function to each row of the cage matrix: get the shift position
    # for each promoter in your collection
    promoterShift <- apply(cage, 1, getShift)

    # Read the SGA file with the list of active genes (it has the same order as
    # cage matrix)
    ucscGenesActive <- read.table("ucscHumanGenesActive.sga", as.is=T, sep="\t", header=F)

    # Shift promoters to the nearest cage maximum using the shift values you have
    # calculated:
    ucscPromotersShift <- ucscGenesActive
    ucscPromotersShift[,3] <- as.numeric(ucscPromotersShift[,3]) + promoterShift

    # Write out the result as a SGA-formatted file:
    write.table(ucscPromotersShift, "ucscPromotersShift.sga", col.names=F, row.names=F, sep="\t", quote=F, eol="\n")

Now you have an experimentally validated promoter collection with the TSS position defined as the position of CAGE maximum. The next tutorial will show you how to check the quality of your collection and how to compare it to the initial UCSC annotation and EPDnew. Please note that after shifting there is a high probability that your collection is not properly sorted. If you want you can sort it with following bash command:

sort -s -k1,1 -k3,3n -k4,4 ucscPromotersShift.sga > ucscPromotersShiftSorted.sga

Or, alternatively, switch-on the sorting option when uploading your collection on the server.

2. Quality control of your promoter collection

In this second section of the tutorial you will perform some quality controls on your promoter collection to see if it is any better than UCSC original collection and to compare it to EPDnew. First you will check the motif distribution around promoters and then some histone marks.

Motif search around promoters is done using OProf tools (part of the SSA web server). It accept input in several file formats such as SGA, even if the internal file type is called FPS. Start the analysis as follow:

Now check histone marks around promoters collection and compare them:

Considering motif plots and histone marks analysis, how does your database compare to the other two? Write a short document reporting your findings (parameters used, number of promoters you have found, ...) and the figures you just generated.

3. Promoter selection

In this final section you will learn how to select promoters that are expressed in a particular sample. You will check their chromatin status and compare it to non-expressed promoters.

As an example, you will use the ENCODE data for cell line GM12878. This is an ENCODE tier 1 cell line. As a consequence, it has been heavily studied, providing data for almost all conditions / targets used by the consotium.

In a first step you will use ChIP-Cor to select promoters that are expressed in GM12878, then you will study their histone marks distribution and additionally check their Gene Onthology (using the GREAT suite)