ENSEMBL, TSS collection downloaded from ENSEMBL.

Description

TSS collection provided by ENSEMBL.

Source

Samples

From D. melanogaster (Apr 2006 BDGP R5/dm3).

Genome Annotation:

#  Filename          Description         Feature  GEO-ID
1  Dm_ENSEMBL70.sga  TSS from ENSEMBL70  TSS      -
2  Dm_ENSEMBL66.sga  TSS from ENSEMBL66  TSS      -
3  Dm_ENSEMBL64.sga  TSS from ENSEMBL64  TSS      -
4  Dm_ENSEMBL86.sga  TSS from ENSEMBL86  TSS      -

Technical Notes

The following attributes have been selected:
  1. Ensembl Transcript ID
  2. Chromosome Name
  3. Strand
  4. Transcript Start (bp)
  5. Transcript End (bp)
  6. Gene Start (bp)
  7. Gene End (bp)
  8. Status (transcript)
  9. Status (gene)
  10. Associated Gene Name
Then, transcripts have been filtered according to the following rules:
  1. Transcript length > 0 [Transcript Start different from Transcript End]
  2. Transcript lies on full chromosomes
  3. Gene must have a 5' UTR [Transcript Start different from Gene Start]
  4. Genes must be annotated [Associated Gene Name present]
  5. Gene and transcripts status known
This can be achieved using the following awk command:
awk -F \\t '
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt | sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
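
As an illustration of the filtering rules, the following minimal sketch applies them to three made-up records. It rests on several assumptions: the function name filter_tss is hypothetical, the transcript IDs, coordinates and gene names are invented, the status value is taken to be the string KNOWN, and the "chr" prefix is added on both strands. The final compact_sga.pl step is left out, so the output is sorted but not compacted.

```shell
# Hypothetical helper: read BioMart TSV (the ten attributes listed above) on
# stdin and print unsorted SGA lines: chromosome, feature, position, strand,
# count, gene name.
filter_tss() {
  awk -F '\t' '
    # Rules shared by both strands: chromosome name pattern, non-zero
    # transcript length, gene name present, transcript and gene status KNOWN
    # (assumed status string).
    $2 !~ /^[0-9][0-9]?|^[XY]/ || $4 == $5 || $10 == "" || $8 != "KNOWN" || $9 != "KNOWN" { next }
    # Plus strand: the TSS is the transcript start and must differ from the gene start.
    $3 == "1"  && $4 != $6 { print "chr" $2 "\tTSS\t" $4 "\t+\t1\t" $10 }
    # Minus strand: the TSS is the transcript end and must differ from the gene end.
    $3 == "-1" && $5 != $7 { print "chr" $2 "\tTSS\t" $5 "\t-\t1\t" $10 }
  '
}

# Three invented records; the third is dropped because its transcript start
# equals its gene start.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  FBtr0000001 2L  1 1000 2000  900 2500 KNOWN KNOWN geneA \
  FBtr0000002 X  -1 5000 6000 4800 6200 KNOWN KNOWN geneB \
  FBtr0000003 3R  1 7000 8000 7000 8500 KNOWN KNOWN geneC \
  | filter_tss | sort -s -k1,1 -k3,3n -k4,4
```

With this input, the plus-strand record yields a TSS at position 1000 on chr2L and the minus-strand record yields a TSS at position 6000 on chrX. Position 6000 is the transcript end, which is the 5' end on the minus strand.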

References

Last update: 1 Oct 2018