SIB Swiss Institute of Bioinformatics

MGA

2019-04-24	New data sets for H. sapiens and G. gallus. The following new data sets have been added to the MGA database:

H. sapiens
Stelloo 2018: ChIP-seq against AR, H3K27ac, H3K4me3 and H3K27me3 in prostate cancer primary tumors
EPDnewNC: EPDnew for non-coding (NC) RNAs
HGNC: TSS collection

G. gallus
Schmidt 2010: mapping of HNF4a and CEBPA in liver

The MGA Data Repository

The Mass Genome Annotation (MGA) Data Repository stores published next-generation sequencing data and other genome annotation data (such as gene start sites, SNPs, etc.) that scientists can access and study in conjunction with the ChIP-Seq and SSA servers. The main characteristic of the MGA database is that it stores mapped data (in the form of genomic coordinates of mapped reads) rather than sequence files. Each sample in the database has therefore been pre-processed (for example, sequence reads have been mapped to a genome) and is presented in a standardized text format named SGA (Simple Genome Annotation).

How to cite:
R. Dreos, G. Ambrosini, R. Groux, R. Cavin Perier, P. Bucher; MGA repository: a curated data resource for ChIP-seq and other genome annotated data, Nucleic Acids Research, gkx995, https://doi.org/10.1093/nar/gkx995

Access to the database

The database can be accessed in several ways:

By searching for keywords on the MGA-Search page. Links to documentation, relevant publications and analysis tools help in the study and interpretation of published data.
Via the MGA Data Overview page, by browsing through all series and samples.
Via the FTP site for data download in SGA format.
Through menus in all input pages of the ChIP-Seq and SSA servers.

Data export and format conversion

The native file format at the back end of the repository is SGA, and files in this format can be downloaded from the FTP server. Users who want to use MGA data with tools that do not support SGA can convert SGA-formatted data to BED by:

Using the on-line tool ChIP-Convert.
Going through the Sample Hub page of the MGA-Search results page.
Using the sga2bed tool (a C program) from the ChIP-Seq toolkit for bulk file conversions.

Technical information about SGA file format and conversion rules can be found here.
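
As a rough illustration of what such a conversion involves, the Python sketch below turns SGA lines into 1 bp BED intervals. It assumes a five-column, tab-separated SGA layout (sequence name, feature, 1-based position, strand given as '+', '-' or '0', and count). This layout, the helper names sga_line_to_bed and sga_to_bed, and the example coordinates are assumptions of the sketch and are not taken from this page. The sketch leaves sequence names unchanged and copies the count into the BED score column. ChIP-Convert and sga2bed apply their own conversion rules and options and should be preferred for real work.

```python
def sga_line_to_bed(line):
    """Convert one SGA line to a BED6 line (hypothetical helper).

    Assumed SGA columns: sequence name, feature, 1-based position,
    strand ('+', '-' or '0' for unoriented), count.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-separated fields: {line!r}")
    seq, feature, pos, strand, count = fields[:5]
    start = int(pos) - 1  # BED start coordinates are 0-based
    end = int(pos)        # BED end coordinates are exclusive
    bed_strand = "." if strand == "0" else strand  # BED uses '.' for no strand
    return "\t".join([seq, str(start), str(end), feature, count, bed_strand])


def sga_to_bed(lines):
    """Convert an iterable of SGA lines, skipping blank lines."""
    return [sga_line_to_bed(line) for line in lines if line.strip()]


# Made-up example records, for illustration only.
example = ["chr1\tH3K4me3\t1000\t+\t2", "chr1\tH3K4me3\t1050\t0\t1"]
for bed_line in sga_to_bed(example):
    print(bed_line)
```

Representing each position as a single-base interval keeps the sketch free of assumptions about read or fragment length.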

Database content

The MGA repository contains the following number of samples (stratified by organism and data type):

Data Type	Human	Mouse	Rat	Rhesus Macaque	Dog	Chicken	Zebra fish	Bee	Fruit Fly	Water Flea	Worm	Baker's Yeast	Fission Yeast	Arabidopsis	Corn	Malaria Parasite	Total
ChIP-seq	8248	758	4	5	11	14	34	-	514	18	198	527	405	212	12	52	11012
ChIP-seq-invitro	-	-	-	-	-	-	-	-	-	-	-	-	931	-	-	-	931
ChIP-seq-peak	8206	28	-	-	-	-	-	-	-	-	-	-	-	-	-	-	8234
Transcription Profiling	2431	1352	13	15	12	33	12	16	371	11	19	22	16	13	8	13	4357
DNase FAIRE etc.	1434	42	-	-	-	-	4	-	68	-	6	58	8	9	3	12	1644
DNA methylation	24	4	-	-	-	-	-	-	-	-	-	-	-	-	-	-	28
Genome annotation	32	23	2	2	2	15	6	2	16	4	18	4	5	5	3	3	179
Sequence-derived	3617	2315	-	-	-	1	14	9	1240	-	9	9	9	1531	9	-	8764
Total # of Samples	27051	4535	19	22	25	63	70	27	2209	33	250	620	443	2701	35	15	38185

The data types are the following:

ChIP-seq: raw data (read mapping coordinates) from classical ChIP-seq experiments targeting transcription factors, protein-DNA interactions, histone variants and modifications, etc.
ChIP-seq-invitro: raw data (read mapping coordinates) from in-vitro ChIP-seq experiments such as DAP-seq.
ChIP-seq-peak: peak regions provided by the authors of the data.
Transcription Profiling: raw data from experiments aimed at profiling transcription initiation, such as CAGE, GRO-cap, GRO-seq, PEAT, etc.
DNase FAIRE etc.: raw data from chromatin and chromatin accessibility studies such as MNase-seq, DNase-seq, DNase-hypersensitivity, etc.
DNA methylation: raw data from methylation studies.
Genome annotation: transcription start sites, transcription end sites, intron-exon boundaries.
Sequence-derived: PWM matches, natural variants, conservation scores, etc.

The list of series present in the database can be found in the MGA Data Overview page.

Sample name conventions

Sample names in MGA carry useful information about the samples' biological and technical variables. For example, the sample name '* S2|PolII|80mMsalt|control' gives, in order, the star flag ('*'), the cell type (S2), the target (PolII), the conditions (80mMsalt) and additional information (control).

Sample names are divided into multiple sections separated by pipes ('|') or sometimes by dashes ('-'). Each section stores information about one important sample variable:

Cell type: the cell in which the sample experiment was carried out. This can refer to a cell line (for example GM12878), a developmental stage (as in the example of an S2 cell in D. melanogaster) or a mutant strain (for example 'WT', for wild-type cells, or 'anchor-away Abf1', for cells depleted of the Abf1 TF).
Target: the target protein that is the focus of the sample. Examples are transcription factors (CTCF, YY1, etc.), DNA-interacting proteins ('PolII', histones, etc.), and histone modifications and variants (H3K4me3, H2A.Z, etc.).
Conditions: important conditions under which the experiment was performed and that characterise one or more samples. Examples are specific growth media or time points in a time-course experiment. Note that this field does not list growth conditions that are common to all samples in the series.
Additional Info: other information that characterises the sample, such as the replicate number.
Star: the star symbol ('*') at the beginning of the name indicates that this sample has unoriented features. This is often the case for samples containing peak lists (a peak in the genome is unoriented by definition) or samples derived from paired-end sequencing (the fragment defined by the two paired reads does not have a preferred orientation in the genome).

Note that the first two fields are always present in the sample name, whereas the others can be missing if they are not relevant.
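
The naming convention above can be sketched in Python as follows. This is a minimal illustration, not part of the MGA tools: the SampleName class and the parse_sample_name function are hypothetical names. The sketch splits on pipes only, because dashes also occur inside field values (as in 'anchor-away Abf1'). It also assumes the fields appear in the order listed above, so a name with only three sections has its third section read as the conditions field.

```python
from dataclasses import dataclass, field


@dataclass
class SampleName:
    """Fields of an MGA sample name (hypothetical container)."""
    cell_type: str
    target: str
    conditions: str = ""
    additional_info: list = field(default_factory=list)
    unoriented: bool = False


def parse_sample_name(name):
    """Split a pipe-separated sample name into its sections."""
    text = name.strip()
    # A leading star marks a sample with unoriented features.
    unoriented = text.startswith("*")
    if unoriented:
        text = text[1:].strip()
    parts = [part.strip() for part in text.split("|")]
    # Cell type and target are always present according to the convention.
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"cell type and target are required: {name!r}")
    return SampleName(
        cell_type=parts[0],
        target=parts[1],
        conditions=parts[2] if len(parts) > 2 else "",
        additional_info=parts[3:],
        unoriented=unoriented,
    )


sample = parse_sample_name("* S2|PolII|80mMsalt|control")
print(sample)
```

Names that use dashes as section separators would need a series-specific rule, which this sketch does not attempt.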

Last update September 2021

SIB Swiss Institute of Bioinformatics | Computational Cancer Genomics
