ChIP-Seq
News:
2018-10-29 -- A new ChIP-seq dataset for human (hg19) has been added to MGA: Mohammed 2015, Progesterone receptor modulates estrogen receptor-α action in breast cancer

The MGA Data Repository

The Mass Genome Annotation (MGA) Data Repository stores published next-generation sequencing data and other genome annotation data (such as gene start sites, SNPs, etc.) that scientists can access and study in conjunction with the ChIP-Seq and SSA servers. The distinguishing feature of the MGA database is that it stores mapped data (in the form of genomic coordinates of mapped reads) rather than sequence files. Each sample in the database has therefore been pre-processed (for example, sequence reads have been mapped to a genome) and is presented in a standardized text format named SGA (Simple Genome Annotation).

How to cite:
R. Dreos, G. Ambrosini, R. Groux, R. Cavin Perier, P. Bucher; MGA repository: a curated data resource for ChIP-seq and other genome annotated data, Nucleic Acids Research, gkx995, https://doi.org/10.1093/nar/gkx995

Access to the database

The database can be accessed in several ways:

  • By searching for keywords on the MGA-Search page. Links to documentation, relevant publications and analysis tools help in the study and interpretation of published data.
  • By browsing all series and samples on the MGA Data Overview page.
  • By downloading data in SGA format from the FTP site.
  • Through the menus on all input pages of the ChIP-Seq and SSA servers.

Data export and format conversion

The native file format at the back end of the repository is SGA; files in this format can be downloaded from the FTP server. Users who want to use MGA data with tools that do not support SGA can easily convert SGA-formatted data to BED.

Technical information about the SGA file format and the conversion rules can be found here.
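As an illustration of what an SGA-to-BED conversion involves, here is a minimal Python sketch. It is not the conversion rule used by the MGA servers. It assumes that each SGA line has five tab-separated fields (sequence ID, feature name, 1-based position, strand as '+', '-' or '0', and read count), and that the caller supplies a lookup from sequence IDs to chromosome names. The function name `sga_to_bed`, the `chrom_names` dictionary and the sample records are hypothetical.

```python
def sga_to_bed(sga_lines, chrom_names=None):
    """Yield BED6 tuples (chrom, start, end, name, score, strand).

    Assumed SGA layout (tab-separated): sequence ID, feature,
    1-based position, strand ('+', '-' or '0'), count.
    `chrom_names` is a hypothetical lookup from sequence IDs to
    chromosome names; unknown IDs are passed through unchanged.
    """
    chrom_names = chrom_names or {}
    for line in sga_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 fields, got: {line!r}")
        seq_id, feature, pos, strand, count = fields[:5]
        pos = int(pos)
        chrom = chrom_names.get(seq_id, seq_id)
        # BED intervals are 0-based and half-open, so a single
        # 1-based position p becomes the interval [p - 1, p).
        bed_strand = strand if strand in ("+", "-") else "."
        yield (chrom, pos - 1, pos, feature, int(count), bed_strand)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = [
        "NC_000001.10\tCTCF\t1000\t+\t2",
        "NC_000001.10\tCTCF\t1500\t0\t1",
    ]
    for record in sga_to_bed(example, {"NC_000001.10": "chr1"}):
        print("\t".join(str(value) for value in record))
```

The sketch writes the read count into the BED score column and maps unoriented features (strand '0') to '.'. Whether that matches a given downstream tool's expectations should be checked against the conversion rules referenced above.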

Database content

The MGA repository contains the following number of samples (stratified by organism and data type):

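| Data Type               | Human | Mouse | Rat | Rhesus Macaque | Dog | Chicken | Zebrafish | Bee | Fruit Fly | Worm | Baker's Yeast | Fission Yeast | Arabidopsis | Corn | Malaria Parasite | Total |
|-------------------------|-------|-------|-----|----------------|-----|---------|-----------|-----|-----------|------|---------------|---------------|-------------|------|------------------|-------|
| ChIP-seq                |  6707 |   661 |   4 |              5 |  11 |       4 |        34 |   - |       514 |  198 |           527 |           405 |         212 |   12 |                - |  9294 |
| ChIP-seq-invitro        |     - |     - |   - |              - |   - |       - |         - |   - |         - |    - |             - |             - |         931 |    - |                - |   931 |
| ChIP-seq-peak           |  1936 |    28 |   - |              - |   - |       - |         - |   - |         - |    - |             - |             - |           - |    - |                - |  1964 |
| Transcription Profiling |  2429 |  1352 |  13 |             15 |  12 |      32 |        12 |  16 |       347 |   19 |            22 |            16 |          13 |    8 |               13 |  4319 |
| DNase FAIRE etc.        |  1070 |    26 |   - |              - |   - |       - |         4 |   - |        68 |    6 |            58 |             8 |           9 |    3 |                - |  1252 |
| DNA methylation         |    24 |     4 |   - |              - |   - |       - |         - |   - |         - |    - |             - |             - |           - |    - |                - |    28 |
| Genome annotation       |    30 |    21 |   2 |              2 |   2 |      15 |         6 |   2 |        16 |   18 |             4 |             5 |           5 |    3 |                1 |   132 |
| Sequence-derived        |  3617 |  2315 |   - |              - |   - |       1 |        14 |   9 |      1240 |    9 |             9 |             9 |        1531 |    9 |                - |  8764 |
| Total # of Samples      | 15814 |  4407 |  19 |             22 |  25 |      52 |        70 |  27 |      2185 |  250 |           620 |           443 |        2701 |   35 |               14 | 26684 |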

Data types are the following:

  • ChIP-seq: raw data (read mapping coordinates) from classical ChIP-seq experiments targeting transcription factors, protein-DNA interactions, histone variants and modifications, etc.
  • ChIP-seq-invitro: raw data (read mapping coordinates) from in-vitro ChIP-seq experiments such as DAP-seq.
  • ChIP-seq-peak: peak regions provided by the authors of the data.
  • Transcription Profiling: raw data from experiments aimed at profiling transcription initiation, such as CAGE, GRO-cap, GRO-seq, PEAT, etc.
  • DNase FAIRE etc.: raw data from chromatin and chromatin accessibility studies such as MNase-seq, DNase-seq, DNase-hypersensitivity, etc.
  • DNA methylation: raw data from methylation studies.
  • Genome annotation: transcription start sites, transcription end sites, intron-exon boundaries.
  • Sequence-derived: PWM matches, natural variants, conservation scores, etc.

The list of series present in the database can be found on the MGA Data Overview page.

Sample name conventions

Sample names in MGA contain useful information about the samples' biological and technical variables. For example, the sample '* S2|PolII|80mMsalt|contol' contains several pieces of information, which are summarised in the figure below:

[Figure: MGA naming conventions]
Sample names are divided into multiple sections separated by pipes ('|') or sometimes by dashes ('-'). Each section stores information about one important sample variable:
  1. Cell type: the cell in which the experiment was carried out. This can refer to a cell line (for example GM12878), a developmental stage (as in the example of an S2 cell in D. melanogaster) or a mutant strain (for example 'WT', for wild-type cells, or 'anchor-away Abf1', for cells depleted of the Abf1 TF).
  2. Target: the target protein that is the focus of the sample. Examples are transcription factors (CTCF, YY1, etc.), DNA-interacting proteins ('PolII', histones, etc.), histone modifications and variants (H3K4me3, H2A.Z, etc.).
  3. Conditions: important conditions in which the experiment was performed and that characterise one or more samples. Examples are specific growth media or time points in a time course experiment. Note that this field does not list growth conditions that are common to all samples in the series.
  4. Additional Info: other information that characterises the samples, such as the replicate number.
  5. Star: the star symbol ('*') at the beginning of the name indicates that this sample has unoriented features. This is often the case for samples containing peak lists (a peak in the genome is unoriented by definition) or samples derived from paired-end sequencing (the fragment defined by the two paired reads does not have a preferred orientation in the genome).
Note that the first two fields are always present in the sample name, whereas the others can be missing if not relevant.
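The naming convention above can be illustrated with a small Python parser. This is a sketch under stated assumptions, not a tool provided by MGA: it handles only pipe-separated names (dash-separated names are ambiguous because dashes also occur inside field values), and when a name has exactly three fields it treats the third as Conditions, although it could equally be Additional Info. The names `SampleName` and `parse_sample_name` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleName:
    unoriented: bool                       # leading '*' present
    cell_type: str                         # field 1, always present
    target: str                            # field 2, always present
    conditions: Optional[str] = None       # field 3, optional
    additional_info: Optional[str] = None  # field 4 onwards, optional


def parse_sample_name(name: str) -> SampleName:
    """Split a pipe-separated MGA sample name into its sections."""
    name = name.strip()
    unoriented = name.startswith("*")
    if unoriented:
        name = name[1:].strip()
    parts = [part.strip() for part in name.split("|")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"cell type and target are required: {name!r}")
    conditions = parts[2] if len(parts) > 2 else None
    # Anything after the third section is kept together as additional info.
    additional_info = "|".join(parts[3:]) if len(parts) > 3 else None
    return SampleName(unoriented, parts[0], parts[1], conditions, additional_info)


if __name__ == "__main__":
    print(parse_sample_name("* S2|PolII|80mMsalt|contol"))
```

For the example name used in this section, the parser reports an unoriented sample with cell type 'S2', target 'PolII', conditions '80mMsalt' and the final section as additional info.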

Last update October 2018