The MGA Data Repository

The Mass Genome Annotation (MGA) Data Repository stores published next-generation sequencing data and other genome annotation data (such as gene start sites and SNPs) that scientists can access and study in conjunction with the ChIP-Seq and SSA servers. The defining characteristic of the MGA database is that it stores mapped data (in the form of genomic coordinates of mapped reads) rather than sequence files. Each sample in the database has therefore been pre-processed (for example, sequence reads have been mapped to a genome) and is presented in a standardized text format named SGA (Simple Genome Annotation).

How to cite:
R. Dreos, G. Ambrosini, R. Groux, R. Cavin Perier, P. Bucher; MGA repository: a curated data resource for ChIP-seq and other genome annotated data, Nucleic Acids Research, gkx995,

Access to the database

The database can be accessed in several ways:

  • Searching for keywords on the MGA-Search page. Links to documentation, relevant publications and analysis tools help in the study and interpretation of published data.
  • Via the MGA Data Overview page browsing through all series and samples.
  • Via the FTP site for data download in SGA format.
  • Through menus in all input pages of the ChIP-Seq and SSA servers.

Data export and format conversion

The native file format at the back end of the repository is SGA, and files in this format can be accessed via the FTP server. Users who want to use MGA data with tools that do not support SGA can easily convert SGA-formatted data to BED by:

Technical information about the SGA file format and the conversion rules can be found here.
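
As an illustration of what such a conversion involves, the following Python sketch turns one SGA line into one BED line. It is not the repository's own conversion tool, and it rests on assumptions that are not stated on this page: that an SGA line has five tab-separated fields (sequence ID, feature name, position, strand, count), that positions are 1-based, and that every feature spans a fixed number of bases. The function name `sga_line_to_bed` and the parameters `feature_length` and `chrom_names` are hypothetical and exist only for this example; the authoritative conversion rules are those in the technical documentation.

```python
def sga_line_to_bed(line, feature_length=1, chrom_names=None):
    """Convert one SGA line to one BED line (illustrative sketch only).

    Assumptions (not confirmed by this page):
      * SGA fields: sequence ID, feature, position, strand, count (tab-separated)
      * SGA positions are 1-based
      * every feature covers `feature_length` bases starting at its 5' end
    `chrom_names` is an optional, hypothetical mapping from SGA sequence IDs
    to the chromosome names expected by the downstream tool.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-separated fields: {line!r}")
    seq_id, feature, pos_text, strand, count = fields[:5]
    pos = int(pos_text)

    # Translate the sequence ID if a mapping was supplied, else keep it as-is.
    chrom = (chrom_names or {}).get(seq_id, seq_id)

    if strand == "-":
        # Feature extends towards lower coordinates on the reverse strand.
        end = pos
        start = max(0, pos - feature_length)
    else:
        # Forward-strand or unoriented feature: extend towards higher coordinates.
        start = pos - 1
        end = start + feature_length

    # BED uses '.' when a feature has no strand.
    bed_strand = strand if strand in ("+", "-") else "."
    return "\t".join([chrom, str(start), str(end), feature, count, bed_strand])


if __name__ == "__main__":
    # Hypothetical input line, made up for this example.
    print(sga_line_to_bed("NC_000001.10\tCTCF\t1000\t+\t3", feature_length=50))
```

BED intervals are 0-based and half-open, which is why a 1-based position is shifted by one on the forward strand. Note also that the tag count is copied into the BED score column unchanged, so tools that expect scores between 0 and 1000 may need the values rescaled.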

Database content

The MGA repository contains the following number of samples (stratified by organism and data type):

Data Type                Human  Mouse  Fruit Fly  Worm  Zebrafish  Baker's Yeast  Arabidopsis  Fission Yeast  Corn  Bee  Total
ChIP-seq                 6498   621    485        198   16         395            212          349            12    -    8786
ChIP-seq-invitro         -      -      -          -     -          -              931          -              -     -    931
ChIP-seq-peak            1925   28     -          -     -          -              -            -              -     -    1953
Transcription Profiling  2203   385    347        19    12         22             6            1              8     16   3019
DNase FAIRE etc.         1047   18     46         6     4          56             9            8              3     -    1197
DNA methylation          24     4      -          -     -          -              -            -              -     -    28
Genome annotation        27     11     16         18    6          4              3            4              3     2    94
Sequence-derived         3617   2315   1240       9     14         9              1531         9              9     9    8762
Total # of Samples       15341  3382   2134       250   52         486            2692         371            35    27   24771

Data types are the following:

  • ChIP-seq: raw data (read mapping coordinates) from classical ChIP-seq experiments targeting transcription factors, protein-DNA interactions, histone variants and modifications, etc.
  • ChIP-seq-invitro: raw data (read mapping coordinates) from in-vitro ChIP-seq experiments such as DAP-seq.
  • ChIP-seq-peak: peak regions provided by the authors of the data.
  • Transcription Profiling: raw data from experiments aimed at profiling transcription initiation such as CAGE, GRO-cap, GRO-seq, PEAT, etc.
  • DNase FAIRE etc.: raw data from chromatin and chromatin accessibility studies such as MNase-seq, DNase-seq, DNase-hypersensitivity, etc.
  • DNA methylation: raw data from methylation studies.
  • Genome annotation: transcription start sites, transcription end sites, intron-exon boundaries.
  • Sequence-derived: PWM matches, natural variants, conservation scores, etc.

The list of series present in the database can be found in the MGA Data Overview page.

Sample name conventions

Sample names in MGA contain useful information about the samples' biological and technical variables. For example, the sample name '* S2|PolII|80mMsalt|control' encodes several pieces of information, summarised in the figure below:

[Figure: MGA naming conventions]
Sample names are divided into multiple sections separated by pipes ('|') or sometimes by dashes ('-'). Each section stores information about one important sample variable:
  1. Cell type: the cell in which the sample's experiment was carried out. This can refer to a cell line (for example GM12878), a developmental stage (as in the example of an S2 cell in D. melanogaster) or a mutant strain (for example 'WT' for wild-type cells, or 'anchor-away Abf1' for cells depleted of the Abf1 TF).
  2. Target: target protein that is the focus of the sample. Examples are transcription factors (CTCF, YY1, etc.), DNA-interacting proteins ('PolII', histones, etc.), histone modifications and variants (H3K4me3, H2A.Z, etc.).
  3. Conditions: important conditions in which the experiment was performed and that characterise one or more samples. Examples are specific growth media or time points in a time-course experiment. Note that this field does not list growth conditions that are common to all samples in the series.
  4. Additional Info: other information that characterises the samples, such as the replicate number.
  5. Star: the star symbol ('*') at the beginning of the name indicates that this sample has unoriented features. This is often the case for samples containing peak lists (a peak in the genome is unoriented by definition) or samples derived from paired-end sequencing (the fragment defined by the two paired reads does not have a preferred orientation in the genome).
Note that the first two fields are always present in the sample name, whereas the others can be missing if not relevant.
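
The naming convention described above can be sketched as a small Python parser. This is only an illustration under stated assumptions, not code from the MGA servers: the names `ParsedSampleName` and `parse_sample_name` are hypothetical, and the sketch splits on pipes only, because dashes can also occur inside a field value (as in 'anchor-away Abf1'). Since the optional fields may be missing, the sketch does not try to decide whether a third section is a condition or additional info; it returns everything after the target as a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedSampleName:
    """Hypothetical container for the sections of an MGA sample name."""
    cell_type: str
    target: str
    extra: List[str] = field(default_factory=list)  # conditions and/or additional info
    unoriented: bool = False  # True when the name starts with '*'


def parse_sample_name(name: str) -> ParsedSampleName:
    """Split a pipe-separated sample name into its sections (sketch only)."""
    text = name.strip()

    # A leading star marks a sample with unoriented features.
    unoriented = text.startswith("*")
    if unoriented:
        text = text[1:]

    # Assumption: sections are separated by pipes; dash-separated names
    # are not handled here.
    sections = [section.strip() for section in text.split("|")]

    # The first two fields (cell type and target) are always present.
    if len(sections) < 2 or not sections[0] or not sections[1]:
        raise ValueError(f"expected at least 'cell type|target' in {name!r}")

    return ParsedSampleName(
        cell_type=sections[0],
        target=sections[1],
        extra=sections[2:],
        unoriented=unoriented,
    )


if __name__ == "__main__":
    print(parse_sample_name("* S2|PolII|80mMsalt|control"))
```

Keeping the optional sections in a single list avoids guessing their meaning; a caller who knows the conventions of a given series can interpret them further.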

Last update October 2017