ENSEMBL61, 5' end collection.

Description

Interspliced sites from the ENSEMBL61 database, downloaded from BioMart.

Source

Samples

From C. elegans (May 2008 WS190/ce6).

Genome Annotation:

    Filename            Description             Feature   GEO-ID
1   Cen_ENSEMBL61.sga   5p-end from ENSEMBL61   5END      -

Technical Notes

Data was downloaded from BioMart selecting the following attributes:

  1. Ensembl Transcript ID
  2. Chromosome Name
  3. Strand
  4. Transcript Start (bp)
  5. Transcript End (bp)
  6. Gene Start (bp)
  7. Gene End (bp)
  8. Status (transcript)
  9. Status (gene)
  10. Associated Gene Name

Transcripts were then filtered according to the following rules:

  1. Transcript length > 0 [Transcript Start different from Transcript End]
  2. Transcript lies on full chromosomes
  3. Gene must have a 5' UTR [Transcript Start different from Gene Start]
  4. Genes must be annotated [Associated Gene Name present]
  5. Gene and transcript status known

This can be achieved using the following awk command:

awk -F '\t' '
# Plus strand: the TSS is the transcript start; it must differ from the gene start
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Minus strand: the TSS is the transcript end; it must differ from the gene end
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt | sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
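
To check the filter stage in isolation, the following is a minimal sketch that runs the same conditions on three made-up rows. The transcript IDs, coordinates and gene names are hypothetical, the function names make_sample and filter_tss are introduced here only for illustration, and the sketch assumes that BioMart exports the status value as KNOWN and that both strands should carry the chr prefix. The sort and compact_sga.pl stages are left out.

```shell
# Hypothetical BioMart rows: tab-separated, 10 columns in the attribute order
# listed in the Technical Notes. All values are made up for illustration.
make_sample() {
  printf 'T1\tX\t1\t100\t900\t50\t900\tKNOWN\tKNOWN\tgeneA\n'       # plus strand, has a 5-prime UTR
  printf 'T2\tX\t-1\t2000\t2500\t2000\t2600\tKNOWN\tKNOWN\tgeneB\n' # minus strand, has a 5-prime UTR
  printf 'T3\tX\t1\t300\t800\t300\t800\tKNOWN\tKNOWN\tgeneC\n'      # transcript start equals gene start
}

# Filter stage only. Assumptions: status value is "KNOWN", chr prefix on both strands.
filter_tss() {
  awk -F '\t' '
  $2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
  $2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
  '
}

make_sample | filter_tss
```

With these rows the sketch prints two tab-separated lines, one for geneA at position 100 on the + strand and one for geneB at position 2500 on the - strand. T3 is dropped by rule 3 because its transcript start equals its gene start.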

The SGA file can then be transformed into an FPS file using sga2fps.pl.

References

Last update: 1 Oct 2018