Description

Interspliced Sites of the ENSEMBL61 database, downloaded from BioMart.

Overview of samples:

Name               Feature  Feature description
Cen_ENSEMBL61.sga  5END     5'-end from ENSEMBL61

Technical Notes

Data was downloaded from BioMart selecting the following attributes:

  1. Ensembl Transcript ID
  2. Chromosome Name
  3. Strand
  4. Transcript Start (bp)
  5. Transcript End (bp)
  6. Gene Start (bp)
  7. Gene End (bp)
  8. Status (transcript)
  9. Status (gene)
  10. Associated Gene Name

Then, transcripts were filtered according to the following rules:

  1. Transcript length > 0 [Transcript Start different from Transcript End]
  2. Transcript lies on full chromosomes
  3. Gene must have a 5' UTR [Transcript Start different from Gene Start]
  4. Genes must be annotated [Associated Gene Name present]
  5. Gene and transcripts status known

This can be achieved using the following awk command:

awk -F \\t '
# Plus strand: the TSS is Transcript Start ($4), which must differ from Gene Start ($6)
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Minus strand: the TSS is Transcript End ($5), which must differ from Gene End ($7)
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt | sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
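
As a rough illustration of the filtering rules, the sketch below runs the same conditions on a small made-up input. The file names toy_biomart.txt and toy_tss.txt, the transcript IDs and the gene names are hypothetical and not part of the real data set. The sketch assumes that the status columns contain the literal value KNOWN, and it leaves out the compact_sga.pl step so that it depends only on awk and sort.

```shell
# Hypothetical three-row BioMart export: tab-separated, columns in the
# attribute order listed above (printf reuses the format for every 10 values).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  ENST_A 1  1 1500 2500 1000 3000 KNOWN KNOWN GENE_A \
  ENST_B X -1 5000 6000 4000 6500 KNOWN KNOWN GENE_B \
  ENST_C 2  1 7000 8000 7000 9000 KNOWN KNOWN GENE_C \
  > toy_biomart.txt

# Same filter as above, without the final compaction step.
awk -F '\t' '
# Plus strand: report Transcript Start ($4)
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Minus strand: report Transcript End ($5)
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' toy_biomart.txt | sort -s -k1,1 -k3,3n -k4,4 > toy_tss.txt

cat toy_tss.txt
```

With this input, two rows survive: ENST_A is reported at position 1500 on the + strand of chr1, and ENST_B at position 6000 on the - strand of chrX. ENST_C is dropped because its Transcript Start equals its Gene Start, which rule 3 treats as a missing 5' UTR. Each output line has six tab-separated fields: sequence name, feature (TSS), position, strand, count (1) and gene name.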

The SGA file can then be transformed into an FPS file using sga2fps.pl.

References

Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A.
BioMart Central Portal--unified access to biological data. Nucleic Acids Res. 37:W23-7. PMID: 19420058

Genome browser viewable files