ENSEMBL67, ORF starts.

Description

Transcription start sites (TSS) from the ENSEMBL database (release 67), downloaded from BioMart.

Source

Samples

From S. cerevisiae (Apr 2011 R64/sacCer3).

Genome Annotation:

#  Filename               Description               Feature  GEO-ID
1  sacCer3_ENSEMBL67.sga  ORF start from ENSEMBL67  TSS      -

Technical Notes

The following attributes have been selected:

  1. Ensembl Transcript ID
  2. Chromosome Name
  3. Strand
  4. Transcript Start (bp)
  5. Transcript End (bp)
  6. Gene Start (bp)
  7. Gene End (bp)
  8. Status (transcript)
  9. Status (gene)
  10. Associated Gene Name

Then, transcripts have been filtered according to the following rules:

  1. Transcript length > 0 [Transcript Start different from Transcript End]
  2. Transcript lies on full chromosomes

This can be achieved using the following awk command:

awk -F \\t '
$3 == "1" && $4 != $5 {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
$3 == "-1" && $4 != $5 {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt | sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
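
As a rough illustration of what the filter does, the sketch below runs the same strand and length logic on three made-up rows. The transcript IDs, coordinates, and gene names are hypothetical, the input is assumed to be tab-separated without a header line, the "chr" prefix is applied on both strands, and compact_sga.pl is left out so that the example needs only awk and sort.

```shell
# Hypothetical BioMart rows (tab-separated, no header line). The ten columns
# follow the attribute order listed above; all values are illustrative only.
example_rows() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    TX_A I 1 335 649 335 649 KNOWN KNOWN GENE_A \
    TX_B I -1 1807 2169 1807 2169 KNOWN KNOWN GENE_B \
    TX_C II 1 500 500 500 500 KNOWN KNOWN GENE_C
}

# Keep transcripts with start != end. Plus strand reports Transcript Start,
# minus strand reports Transcript End. Output is sorted by sequence name,
# then position, then strand. compact_sga.pl is deliberately not called here.
to_sga_lines() {
  awk -F '\t' '
  $3 == "1"  && $4 != $5 {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
  $3 == "-1" && $4 != $5 {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
  ' | sort -s -k1,1 -k3,3n -k4,4
}

example_rows | to_sga_lines
```

With these rows, TX_C is dropped because its start equals its end, TX_A is reported at position 335 on the + strand, and TX_B is reported at position 2169 (its Transcript End) on the - strand.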

The SGA file can then be transformed into an FPS file using sga2fps.pl.

References

Last update: 1 Oct 2018