Description

Transcription start sites (TSS) from the ENSEMBL69 database, downloaded from BioMart.

Overview of samples:

Name               Feature  Feature description
Hum_ENSEMBL69.sga  TSS      TSS from ENSEMBL69

Technical Notes

Data was downloaded from BioMart selecting the following attributes:

  1. Ensembl Transcript ID
  2. Chromosome Name
  3. Strand
  4. Transcript Start (bp)
  5. Transcript End (bp)
  6. Gene Start (bp)
  7. Gene End (bp)
  8. Status (transcript)
  9. Status (gene)
  10. Associated Gene Name
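
Before filtering, it may be worth confirming that every row of the export carries the ten selected attributes as tab-separated columns. The following is a minimal sketch of such a check; the helper name check_columns and the sample rows are hypothetical and not part of the original procedure.

```shell
# check_columns FILE: exit 0 only if every line has exactly 10 tab-separated fields.
check_columns() {
  awk -F '\t' '
    BEGIN { bad = 0 }
    NF != 10 { printf "line %d: %d fields instead of 10\n", NR, NF; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Hypothetical export with one complete row and one truncated row (invented values).
demo=$(mktemp)
printf 'ENST_A\t1\t1\t1000\t5000\t900\t6000\tKNOWN\tKNOWN\tGENE1\n' >  "$demo"
printf 'ENST_B\t1\t1\t900\n'                                        >> "$demo"
check_columns "$demo" || echo "export is malformed"
rm -f "$demo"
```

On the real export this would be run as: check_columns biomart_output.txt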

Then, transcripts were filtered according to the following rules:

  1. Transcript length > 0 [Transcript Start different from Transcript End]
  2. Transcript lies on full chromosomes
  3. Gene must have a 5' UTR [Transcript Start different from Gene Start; on the minus strand, Transcript End different from Gene End]
  4. Genes must be annotated [Associated Gene Name present]
  5. Gene and transcript status known

This can be achieved using the following awk command:

awk -F '\t' '
# Plus strand: the TSS is the transcript start.
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Minus strand: the TSS is the transcript end.
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt | sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
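
To illustrate how the five rules act on individual records, the sketch below applies the same conditions to a handful of invented rows. It is only an illustration under stated assumptions: the function name filter_tss, all identifiers and all coordinates are hypothetical, the status value is assumed to be spelled KNOWN, the "chr" prefix is assumed on both strands, and the sorting and compact_sga.pl steps are left out.

```shell
# filter_tss: reads BioMart rows on stdin, writes SGA-style lines on stdout.
# Assumption: status is spelled KNOWN and both strands get the "chr" prefix.
filter_tss() {
  awk -F '\t' -v OFS='\t' '
    # Rules 2, 4, 5: full chromosome, gene name present, both statuses KNOWN.
    $2 !~ /^[0-9][0-9]?|^[XY]/ || $10 == "" || $8 != "KNOWN" || $9 != "KNOWN" { next }
    # Rule 1: transcript start and end must differ.
    $4 == $5 { next }
    # Rule 3, plus strand: TSS is the transcript start, distinct from the gene start.
    $3 == "1"  && $4 != $6 { print "chr" $2, "TSS", $4, "+", 1, $10 }
    # Rule 3, minus strand: TSS is the transcript end, distinct from the gene end.
    $3 == "-1" && $5 != $7 { print "chr" $2, "TSS", $5, "-", 1, $10 }
  '
}

# Invented rows, 10 columns each, in the attribute order listed above.
rows=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  ENST_A 1          1  1000 5000  900 6000 KNOWN KNOWN GENE1 \
  ENST_B 1          1   900 5000  900 6000 KNOWN KNOWN GENE1 \
  ENST_C X         -1  2000 8000 1500 9000 KNOWN KNOWN GENE2 \
  ENST_D GL000192.1 1   100  500   50  600 KNOWN KNOWN GENE3 \
  ENST_E 2          1   300  700  200  800 NOVEL KNOWN GENE4 \
  ENST_F 2         -1   300  700  200  800 KNOWN KNOWN '')

printf '%s\n' "$rows" | filter_tss
```

Only ENST_A and ENST_C survive: ENST_B starts at the gene start (rule 3), ENST_D is not on a full chromosome (rule 2), ENST_E has a transcript status other than KNOWN (rule 5), and ENST_F has no gene name (rule 4).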

The SGA file can then be converted into an FPS file using sga2fps.pl.

References

Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A.
BioMart Central Portal--unified access to biological data. Nucleic Acids Res. 2009; 37:W23-7. PMID: 19420058

Genome browser viewable files