General Questions

How to cite SNP2TFBS

To properly cite the SNP2TFBS project, please use the following citation:

SNPSelect

Does a positive score imply a larger PWM score in the reference allele?

No. The scorediff column reports the difference in PWM scores between the alternate and reference alleles. Therefore, a positive score implies a larger PWM score in the alternate allele.

Are score differences comparable across PWMs?

No. A large score difference for a PWM does not necessarily imply a large effect on TF binding. One could use ChIP-seq data to relate score differences to changes in binding site occupancy, but this is not trivial. We have not yet implemented an easy way to link score differences with effects on real binding.

How is the TF enrichment calculation done, and what does it represent?

For each TF, the enrichment value is the ratio of the observed number of SNP hits (column #3 of the TF enrichment statistics output file) to the expected number. The expected number of SNP hits is the product of the probability p of observing a SNP hit for that particular TF and the number of selected matching variants from your initial SNP set. The probability p is, in turn, the ratio of the total number of SNPs affecting the given TF genome-wide according to the SNP2TFBS database (column #2 of the TF enrichment statistics file) to the total number of SNPs overlapping TFBSs according to SNP2TFBS (about 3.5 million SNPs).

As an example, take Example1 on the Web site (SNPSelect), which uses the diabetes-related set of variants from the NHGRI-EBI catalog. Out of the initial set of 817 variants, 188 match our database. These are the selected matching variants. The TF enrichment plot shows SRF as the top-ranked factor, with an enrichment value of about 8.9. If you click on the TF Enrichment statistics link at the bottom of the output page, you see that, for SRF, the number of observed SNP hits is 5, out of 10323 SNPs genome-wide (according to SNP2TFBS). The probability of getting a hit by chance is given by:

p = 10323 / 3,500,000 ≈ 0.003
The expected number of SNP hits for SRF is therefore p * 188 = 0.564, and the enrichment value is given by the following calculation:

enrichment = observed / expected = 5 / 0.564 ≈ 8.9
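The calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code SNP2TFBS runs: the function name `tf_enrichment` and its parameter names are hypothetical, and the denominator uses the rounded figure of 3.5 million SNPs quoted in the text, so the result lands close to, but not exactly on, the value of about 8.9 reported by the Web site.

```python
def tf_enrichment(observed_hits, tf_snps_genomewide, total_tfbs_snps, n_matching_variants):
    """Return (p, expected, enrichment) for one TF.

    p          : chance that a matching variant hits this TF
    expected   : expected number of hits among the matching variants
    enrichment : observed hits divided by expected hits
    """
    p = tf_snps_genomewide / total_tfbs_snps
    expected = p * n_matching_variants
    enrichment = observed_hits / expected
    return p, expected, enrichment


# SRF in Example1: 5 observed hits, 10323 SRF-affecting SNPs genome-wide,
# 188 selected matching variants. 3_500_000 is the rounded total from the
# text (an assumption; the exact database count is not given here).
p, expected, enrichment = tf_enrichment(5, 10323, 3_500_000, 188)
print(f"p={p:.4f} expected={expected:.3f} enrichment={enrichment:.2f}")
```

An enrichment value above 1 means the TF is hit more often in your variant set than its genome-wide share of TFBS-overlapping SNPs would predict.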
Why is the score difference between reported SNPs sometimes zero?

The scorediff column reports the difference in PWM scores between the alternate and reference alleles. A positive score implies a larger PWM score in the alternate allele.

The scoring system based on PWMs (integer log-odds position weight matrices) has, of course, its own limitations. First, we use raw PWM binding scores, which are computed as the sum of the position-specific weights over all bases of the binding site, so these numbers have no absolute meaning and are not comparable across different PWMs (or TFs). Second, predicted TF binding sites match the PWM with a p-value threshold of 10^-5. The p-value of a raw PWM score x is defined as the probability that a random k-mer sequence has a binding score ≥ x, given the base composition of the human genome. This is quite a stringent cutoff and, depending on the base composition and length of the PWM model, it may result in a very restricted raw score space. In fact, for several TFs, such as BRCA1, ARID2A, CREB1 and many others, all raw scores above the 10^-5 cutoff have the same value. If you look at the cumulative PWM score distribution function, you will see that the explored score space is actually the flat right-end tail of the curve. We are considering lowering the low score cutoff or adding new PWM score-based filtering options to the Web interface in order to provide a more suitable scoring system.

More specifically regarding this question: when computing the score difference (alt - ref) for each overlapping variant, if the PWM match is missing in one of the two genomes (ref or alt), we use the raw low score threshold to score the missing hit. For PWMs that assign the same raw score to all matches above the p-value threshold of 10^-5, this can yield a score difference of 0 for variants that appear to disrupt or create a binding site.
For this reason, in such cases, we indicate the score difference as '<0' if the SNP disrupts an existing binding site or '>0' if it creates a new binding site. This is not very intuitive and should be changed as well.
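To see how a zero difference can arise, the missing-hit rule described in this answer can be sketched as follows. This is a minimal illustration under stated assumptions, not the SNP2TFBS implementation: the function name `score_diff`, the use of `None` to mark a missing PWM match, and all numeric scores are invented for the example.

```python
from typing import Optional


def score_diff(ref_score: Optional[int], alt_score: Optional[int], threshold: int) -> int:
    """Return alt - ref, scoring a missing PWM match with the raw score threshold."""
    ref = threshold if ref_score is None else ref_score
    alt = threshold if alt_score is None else alt_score
    return alt - ref


# Hypothetical PWM whose matches above the cutoff all score exactly the threshold.
threshold = 1200
disrupted = score_diff(ref_score=1200, alt_score=None, threshold=threshold)
created = score_diff(ref_score=None, alt_score=1200, threshold=threshold)

# Hypothetical PWM with a wider score range above the same threshold.
wider = score_diff(ref_score=1350, alt_score=None, threshold=threshold)

print(disrupted, created, wider)
```

In the first two calls the only possible match score equals the threshold, so substituting the threshold for the missing hit hides the change. In the third call the match scores above the threshold, so the loss of the site shows up as a negative difference.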