General Questions

How to cite SNP2TFBS

To properly cite the SNP2TFBS project, please use the following citation:

SNPSelect

Does a positive score imply a larger PWM score in the reference allele?

No. The scorediff column reports the difference in PWM scores between the alternate and the reference allele. Therefore, a positive score implies a larger PWM score in the alternate allele.

Are score differences comparable across PWMs?

No, large score differences for a PWM do not necessarily imply large effects on TF binding. One could use ChIP-seq data to compare score differences with effects on binding site occupancy, but this is not trivial. Currently, we have not implemented an easy way to link score differences with effects on real binding.

How is the TF enrichment calculation done? And what does it represent?

The TF enrichment values are calculated as the ratio of the observed SNP hits over the expected ones (for each TF). For each TF, the enrichment value is computed on the basis of the following numbers:
As a demonstration, we consider Example1 on the Web entry page of SNPSelect. In this example, we use the diabetes-related set of variants from the NHGRI-EBI catalog. SNPSelect extracts 188 variants matching SNP2TFBS (Number 3) out of the initial set of 817 variants. The TF-enrichment plot shows the SRF TF as the top-ranked factor with an enrichment value of about 8.9. Clicking the TF Enrichment statistics link at the bottom of the output page displays the results file from SNPSelect. For the SRF TF, the number of observed SNP hits is 5 (third field) out of 10323 SNPs genome-wide (second field). The probability p of getting a hit by chance (success) works out to p = 0.003.
The expected number of SNP hits for SRF is therefore p × 188 = 0.564, and the enrichment value is the ratio of observed to expected hits: 5 / 0.564 ≈ 8.9.
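The arithmetic of this example can be retraced with a short Python sketch. The function and variable names are illustrative and are not part of SNPSelect; the chance probability of 0.003 is taken directly from the example rather than recomputed from genome-wide counts.

```python
def tf_enrichment(observed_hits, n_selected, p_hit):
    """Return (expected_hits, enrichment) for one TF.

    observed_hits: selected variants that hit the TF (5 for SRF)
    n_selected:    selected variants matching SNP2TFBS (188)
    p_hit:         probability of a hit by chance (0.003 in the example)
    """
    expected_hits = p_hit * n_selected
    enrichment = observed_hits / expected_hits
    return expected_hits, enrichment


expected, enrichment = tf_enrichment(observed_hits=5, n_selected=188, p_hit=0.003)
print(f"expected = {expected:.3f}, enrichment = {enrichment:.1f}")
# prints: expected = 0.564, enrichment = 8.9
```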
How do we compute the P-value for TF enrichment?

To compute the P-values, we use a binomial-based test, and different color codes display different levels of significance. We use the R function pbinom, the cumulative distribution function of the binomial distribution with parameters size and prob. The binomial variable is conventionally interpreted as the number of successes in size trials, each with success probability prob.
So, taking the example above, the number of observed SNP hits is 5, the number of selected variants (size) is 188, and the probability of getting a hit by chance is prob = 0.003, as shown before; the P-value is obtained by evaluating pbinom with these values.
No correction for multiple testing is applied.
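As a rough illustration of the binomial test, the Python sketch below reimplements the binomial cumulative distribution with the standard library and evaluates an upper-tail probability for the SRF example. The helper names are hypothetical. Which tail convention SNPSelect uses is an assumption: the sketch computes P(X ≥ observed), which in R corresponds to pbinom(observed - 1, size, prob, lower.tail = FALSE).

```python
from math import comb


def binom_cdf(q, size, prob):
    """P(X <= q) for X ~ Binomial(size, prob), the quantity pbinom(q, size, prob) returns in R."""
    return sum(comb(size, k) * prob**k * (1 - prob) ** (size - k) for k in range(q + 1))


def upper_tail(observed, size, prob):
    """P(X >= observed): chance of seeing at least `observed` hits."""
    return 1.0 - binom_cdf(observed - 1, size, prob)


# Values from the SRF example; using the ">=" tail is an assumption.
p_value = upper_tail(observed=5, size=188, prob=0.003)
print(f"P-value = {p_value:.2e}")
```

With an expected count of only 0.564 hits, observing 5 or more is a rare event, so the resulting probability is small.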
Why is the score difference between reported SNPs sometimes zero?

The scorediff column reports the difference in PWM scores between the alternate and the reference allele. A positive score implies a larger PWM score in the alternate allele.

The scoring system based on PWMs (integer log-odds position weight matrices) has, of course, its own limitations. First, we use PWM raw binding scores, which are computed as the sum of the position-specific weights over all bases of the binding site; these numbers have no absolute meaning and are not comparable across different PWMs (or TFs). Second, predicted TF binding sites match the PWM with a P-value threshold of 10^-5. The P-value of a PWM raw score x is defined as the probability that a random k-mer sequence has a binding score ≥ x given the base composition of the human genome. This is quite a stringent cutoff and, depending on the base composition and length of the PWM model, it may result in a very restricted raw score space. In fact, for several TFs, such as BRCA1, ARID2A, CREB1 and many others, all raw scores above the 10^-5 cutoff have the same value. If you look at the cumulative PWM score distribution function, you will see that the explored score space is actually the flat right-end tail of the curve. We are considering lowering the low cutoff threshold or adding new PWM score-based filtering options to the Web interface so as to have a more suitable scoring system.

With regard to this question specifically, when computing the score difference (alt - ref) for each overlapping variant, if the PWM match is missing in one of the two genomes (ref or alt), we use the raw low score threshold to score the missing hit. For those PWMs that assign the same raw score to all matches above the P-value threshold of 10^-5, we may therefore get a score difference of 0 for variants that appear to disrupt or create a binding site.
For this reason, in such cases, we indicate the score difference as '0' if the SNP disrupts an existing binding site or '>0' if it creates a new binding site. This is not very intuitive and should be changed as well. 
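The substitution rule described in this answer can be sketched in a few lines of Python. This is not the SNP2TFBS implementation: the function name, the use of None for a missing match, and the integer scores are all hypothetical, chosen only to show how a zero difference arises.

```python
def score_diff(ref_score, alt_score, threshold_score):
    """Return alt - ref, substituting the threshold score for a missing match.

    A score of None means the PWM does not match that allele above the cutoff.
    """
    ref = threshold_score if ref_score is None else ref_score
    alt = threshold_score if alt_score is None else alt_score
    return alt - ref


# Hypothetical PWM whose matches all share one raw score, equal to the threshold.
THRESHOLD = 1200
lost = score_diff(ref_score=1200, alt_score=None, threshold_score=THRESHOLD)
gained = score_diff(ref_score=None, alt_score=1200, threshold_score=THRESHOLD)
# Hypothetical PWM with a wider score range above the same threshold.
informative = score_diff(ref_score=1350, alt_score=None, threshold_score=THRESHOLD)
print(lost, gained, informative)
# prints: 0 0 -150
```

In the first two calls the variant removes or creates a site, yet the numeric difference is 0 because the only attainable score coincides with the threshold.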