CCG

Frequently asked questions: The SNP2TFBS Web Site

General Questions

How to cite SNP2TFBS

To properly cite the SNP2TFBS project, please use the following citation:

  • Kumar S., Ambrosini G., Bucher P., SNP2TFBS - a database of regulatory SNPs affecting predicted transcription factor binding site affinity, Nucleic Acids Res. 2017 Jan 4. PMCID: PMC5210548

SNPSelect

Does a positive score imply a larger PWM score in the reference allele?

The scorediff column reports the difference in PWM scores between alternate and reference allele. Therefore, a positive score implies a larger PWM score in the alternate allele.

Are score differences comparable across PWMs?

No. Large score differences for a PWM do not necessarily imply large effects on TF binding. One could use ChIP-seq data to relate score differences to effects on binding site occupancy, but this is not trivial. We have not yet implemented an easy way to link score differences with effects on actual binding.

How is the TF enrichment calculation done? And what does it represent?

The TF enrichment values are calculated as the ratio of the observed SNP hits (column #3 of the TF enrichment statistics output file) over the expected ones (for each TF).

For each TF, the expected number of SNP hits is calculated as the product of the probability p of observing a SNP hit for that particular TF and the number of selected matching variants from your initial SNP set.

The probability p is, in turn, computed as the ratio of the total number of SNPs affecting the given TF genome-wide according to the SNP2TFBS database (column #2 of the TF enrichment statistics file) over the total number of SNPs overlapping TFBSs according to SNP2TFBS (about 3.5 million SNPs).

As an example, we take Example1 on the Web site (SNPSelect) that uses the diabetes-related set of variants from the NHGRI-EBI catalog.

One can see that out of the initial set of 817 variants, 188 variants match our database. These are the selected matching variants.

The TF-enrichment plot shows the SRF TF as the top-ranked factor with an enrichment value of about 8.9. If you click on the TF Enrichment statistics link at the bottom of the output page, you will see that, for SRF, the number of observed SNP hits is 5, out of 10323 SNPs affecting SRF genome-wide (according to SNP2TFBS). The probability of getting a hit by chance is given by:

  • p = 10323/3500000 ~ 0.003
where 3'500'000 is the total number of SNPs in our database.

The expected number of SNP hits for SRF is therefore p*188 = 0.003*188 = 0.564 (using the rounded value of p), and the enrichment value is given by the following calculation:

  • TF-enrich(SRF) = 5/0.564 = 8.9
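
The same calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the code used by the Web site; the function and argument names are hypothetical, and the total of 3,500,000 SNPs is the approximate figure quoted in this answer.

```python
def tf_enrichment(observed_hits, tf_snps_genome_wide, total_tfbs_snps, matching_variants):
    """Return (p, expected, enrichment) for one TF.

    p          = SNPs affecting this TF genome-wide / all SNPs overlapping TFBSs
    expected   = p * number of selected matching variants
    enrichment = observed hits / expected hits
    """
    p = tf_snps_genome_wide / total_tfbs_snps
    expected = p * matching_variants
    return p, expected, observed_hits / expected


# SRF values from Example1: 5 observed hits, 10323 SRF SNPs genome-wide,
# about 3.5 million SNPs in total (approximate), 188 matching variants.
p, expected, enrichment = tf_enrichment(5, 10323, 3_500_000, 188)
print(f"p = {p:.5f}, expected = {expected:.3f}, enrichment = {enrichment:.2f}")
```

Without rounding p to 0.003, the enrichment comes out at about 9.0 rather than 8.9. The small gap comes from the rounding of p in the worked example and from the approximate total of 3.5 million SNPs.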

Why is the score difference between reported SNPs sometimes zero?

The scorediff column reports the difference in PWM scores between alternate and reference allele. A positive score implies a larger PWM score in the alternate allele.

The scoring system based on PWMs (integer log-odds position weight matrices) has, of course, its own limitations.

First of all, given that we use PWM raw binding scores, which are computed as the sum of the position-specific weights over all bases of the binding site, these numbers have no absolute meaning and are not comparable across different PWMs (or TFs).
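
As a minimal sketch of what a raw score is, the snippet below sums the position-specific weights of a made-up 4-position integer PWM over a reference and an alternate site. The matrix values, the site sequences and the function name are all invented for illustration and do not come from SNP2TFBS.

```python
# Hypothetical integer log-odds PWM: one dict of base weights per position.
PWM = [
    {"A": 10, "C": -8, "G": -5, "T": -12},
    {"A": -7, "C": 12, "G": -10, "T": -3},
    {"A": -9, "C": -6, "G": 11, "T": 2},
    {"A": -4, "C": -11, "G": -8, "T": 9},
]


def raw_score(pwm, site):
    """Sum the position-specific weights over all bases of the site."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal PWM length")
    return sum(column[base] for column, base in zip(pwm, site.upper()))


ref_site = "ACGT"  # reference allele G at the third position
alt_site = "ACTT"  # alternate allele T at the third position

# Difference between alternate and reference allele, as in the scorediff column.
scorediff = raw_score(PWM, alt_site) - raw_score(PWM, ref_site)
print(raw_score(PWM, ref_site), raw_score(PWM, alt_site), scorediff)
```

The numbers depend entirely on the scale and length of the matrix, which is why a difference of a given size for one PWM says nothing about the same difference for another.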

Secondly, predicted TF binding sites match the PWM with a p-value threshold of 10⁻⁵. The p-value of a PWM raw score x is defined as the probability that a random k-mer sequence has a binding score ≥ x given the base composition of the human genome. This is a quite stringent cut-off and, depending on the base composition and length of the PWM model, it may result in a very restricted raw score space. As a matter of fact, for several TFs, such as BRCA1, ARID2A, CREB1 and many others, all raw scores above the 10⁻⁵ cut-off have the same value. If you look at the cumulative PWM score distribution function, you will see that the explored score space is actually the flat right-end tail of the curve. We are thinking of lowering the cut-off threshold or adding new PWM score-based filtering options on the Web interface so as to have a more suitable scoring system.
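
The p-value definition above can be illustrated with a small dynamic-programming sketch that builds the exact score distribution of a random k-mer, one PWM position at a time. Everything here is an assumption made for the example: the toy 2-position matrix, the background base frequencies, the cut-off of 0.1 and the function names. It is not necessarily how SNP2TFBS computes its thresholds.

```python
from collections import defaultdict

# Assumed background base composition (illustrative values only).
BACKGROUND = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}

# Toy 2-position integer PWM (invented values).
TOY_PWM = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
]


def score_distribution(pwm, background):
    """Map each reachable raw score to its probability for a random k-mer."""
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, weight in column.items():
                nxt[score + weight] += prob * background[base]
        dist = dict(nxt)
    return dist


def p_value(dist, x):
    """Probability that a random k-mer has a raw score >= x."""
    return sum(prob for score, prob in dist.items() if score >= x)


def passing_scores(dist, alpha):
    """Distinct raw scores whose p-value is <= alpha, highest first."""
    return [s for s in sorted(dist, reverse=True) if p_value(dist, s) <= alpha]


dist = score_distribution(TOY_PWM, BACKGROUND)
print(p_value(dist, 4))           # p-value of the maximum score
print(passing_scores(dist, 0.1))  # scores that survive a cut-off of 0.1
```

In this toy case only the maximum score passes the cut-off, so every predicted site has the same raw score. This is the situation described above for short or low-information PWMs at the 10⁻⁵ threshold.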

More specifically with regard to this question: when computing the score difference alt-ref for each overlapping variant, if the PWM match is missing in one of the two genomes (ref or alt), we use the raw score at the cut-off threshold to score the missing hit. For those PWMs that assign the same raw score to all matches above the p-value threshold of 10⁻⁵, we may therefore get a score difference of 0 for variants that appear to disrupt or create a binding site. For this reason, in such cases, we indicate the score difference as '0' if the SNP disrupts an existing binding site or '>0' if it creates a new binding site. This is not very intuitive and should be changed as well.
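
The rule described in this answer can be sketched as follows. This is a hedged illustration only: the function names, the use of None for a missing match and the numeric scores are hypothetical, and the labelling logic is one possible reading of the '0' / '>0' convention.

```python
from typing import Optional


def score_diff(ref_score: Optional[int], alt_score: Optional[int], threshold: int) -> int:
    """alt - ref, scoring a missing match (None) with the threshold score."""
    ref = threshold if ref_score is None else ref_score
    alt = threshold if alt_score is None else alt_score
    return alt - ref


def display_diff(ref_score: Optional[int], alt_score: Optional[int], threshold: int) -> str:
    """Label a zero difference as '>0' when the SNP creates a new site."""
    diff = score_diff(ref_score, alt_score, threshold)
    if diff == 0 and ref_score is None and alt_score is not None:
        return ">0"
    return str(diff)


# Hypothetical PWM whose matches all score exactly at the threshold (1200).
print(display_diff(1200, None, 1200))  # site disrupted by the SNP
print(display_diff(None, 1200, 1200))  # site created by the SNP

# Hypothetical PWM with a wider score range: the difference is informative.
print(display_diff(1500, None, 1200))
```

For a PWM whose matches all sit at the threshold, the numeric difference is 0 in both directions, which is why a separate label is needed to tell a created site from a disrupted one.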

Last update July 2017