ChIP-seq data analysis: from quality check to motif discovery and more

Lausanne, 27 March - 31 March 2017

Data reproduction exercise: Alignment of MNase tags around NF-kB sites

Romain Groux, Sunil Kumar and Philipp Bucher


Introduction

This exercise is based on the following paper:

A. Heatmaps of MNase midpoints (columns 1–2) and DNase I cuts (column 3) surrounding 1000 randomly sampled ChIP-seq peaks for CTCF, NF-kB, Irf4, GABP and C-fos. Heatmap rows are ordered from top to bottom by the nucleosome array log likelihood ratio (LLR). B. Aggregation plot for MNase midpoint and DNase I cutsite depths across all regions and for the subset of regions with LLR>500.

We will focus on only one part of the figure and explore the MNase pattern around NF-kB sites before and after re-alignment, which in our case is done with a simple algorithm.

Data re-alignment method

In this exercise, we will use a simple re-alignment technique. It differs from the one used in the article but accomplishes the same task. Briefly, the algorithm takes as input a count matrix in which each row contains the MNase counts around one TF binding site. It then tries to re-align each row by comparing it individually with the data aggregation pattern. Several different shifts are considered, and for each of them a correlation score is computed (aggregation pattern versus the shifted row). The optimal shift for a row is the one that maximizes the correlation. Finally, the row is re-aligned according to that shift. The code will be provided :-)
In case you are interested, you can check another algorithm developed by the group to discover significant patterns in ChIP-Seq data [Nair et al., 2014].
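
The shift-and-correlate procedure described above can be sketched in a few lines. The block below is an independent illustration in Python, not the R function provided in the course. The function names, the use of Pearson correlation as the score, the single pass over the rows, and the trimming of each row by max_shift bins on both sides are all assumptions made for this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def aggregate(matrix):
    """Column-wise mean of the count matrix, i.e. the aggregation pattern."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


def best_shift(row, reference, max_shift):
    """Shift in [-max_shift, max_shift] whose window correlates best with the reference.

    Assumption of this sketch: only the central part of the reference is
    compared, so that every shifted window stays inside the row.
    """
    core = reference[max_shift:len(reference) - max_shift]
    best, best_score = 0, float("-inf")
    # Try small shifts first so that ties are resolved towards no shift.
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        start = max_shift + shift
        score = pearson(row[start:start + len(core)], core)
        if score > best_score:
            best, best_score = shift, score
    return best


def realign(matrix, max_shift):
    """Re-align every row on the aggregation pattern of the input matrix.

    The output rows are shorter than the input rows by 2 * max_shift bins.
    """
    reference = aggregate(matrix)
    width = len(reference) - 2 * max_shift
    realigned = []
    for row in matrix:
        start = max_shift + best_shift(row, reference, max_shift)
        realigned.append(row[start:start + width])
    return realigned
```

Calling realign(matrix, max_shift) on a list of equal-length rows returns the re-aligned matrix. In this sketch the aggregation pattern is computed once from the unaligned data; recomputing it from the re-aligned rows and repeating would be a natural variation.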

Hints and recipes

To identify patterns in MNase data around the binding sites of a specific transcription factor, we need two datasets. Instead of the MNase data of Gaffney and colleagues, we will use the MNase data generated by the ENCODE Consortium (the data of Gaffney and colleagues are also available on our server). Since the ENCODE data come from GM12878 cells, we will also choose an NF-kB peak list from the same cell line. The two datasets we need are therefore the ENCODE MNase data and an NF-kB peak list, both from GM12878 cells.

We will use the ChIP-Extract Analysis Module to generate a tag count matrix in defined bins around NF-kB sites.
Select the parameters as shown in the picture below and click submit.

[Figure: ChIPExtract instructions]

Download the Ref SGA File and Table (TEXT) and save as mnase_data_encode.txt.
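
To make concrete what a tag count matrix in bins around sites is, here is a minimal Python sketch of the idea. It does not reproduce the ChIP-Extract implementation or its options. The function name and parameters are hypothetical, and the sketch assumes a single chromosome, ignores strand, and treats each tag as a single integer position.

```python
from bisect import bisect_left


def count_matrix(sites, tags, start, end, bin_width):
    """Count tags in consecutive bins covering [start, end) around each site.

    sites, tags : integer positions on one chromosome (hypothetical inputs)
    start, end  : window limits relative to the site, e.g. -1000 and 1000
    bin_width   : bin size in bp; a trailing partial bin is dropped
    """
    tags = sorted(tags)
    n_bins = (end - start) // bin_width
    matrix = []
    for site in sites:
        lo = site + start               # first position of the window
        hi = lo + n_bins * bin_width    # one past the last position
        row = [0] * n_bins
        # Binary search keeps the scan limited to tags inside the window.
        for i in range(bisect_left(tags, lo), bisect_left(tags, hi)):
            row[(tags[i] - lo) // bin_width] += 1
        matrix.append(row)
    return matrix
```

Each row of the returned matrix corresponds to one site and each column to one bin, which is the layout the re-alignment step expects as input.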

Performing the data re-alignment

Navigate into the directory containing all the data and launch R.

Load the data re-alignment function.
Read the data, define the input parameters and perform the re-alignment.
Plot the results.
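
For a quick check of the result without a plotting library, the aggregation profile (mean count per bin) can be computed and printed as text. The Python sketch below is only an illustration and is not the R plotting code of the course. Both function names and the text rendering are assumptions.

```python
def aggregation_profile(matrix):
    """Mean count per bin across all rows of a count matrix."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


def text_profile(profile, width=40):
    """Render a profile as one line per bin: index, value, and a bar of '#'."""
    peak = max(profile)
    lines = []
    for i, value in enumerate(profile):
        # Scale bars so that the highest bin spans `width` characters.
        n_chars = round(width * value / peak) if peak > 0 else 0
        lines.append(f"{i:3d} {value:6.2f} " + "#" * n_chars)
    return "\n".join(lines)
```

Printing text_profile(aggregation_profile(m)) for the matrix before and after re-alignment gives a rough view of whether the aggregate MNase pattern has become sharper.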