Thu, 04/17/2014 - 11:57

Thanks for the questions. To answer them:
1. We used VQSR slightly differently than it is normally used, to avoid the problem of false negative (FN) calls that you point out. We'll have a pre-print of our manuscript describing our methods on arXiv soon (likely tomorrow), but briefly, I used the VQSR tranche as an indication of bias when arbitrating between datasets that disagree with each other. For example, if some datasets had strand bias or lots of clipped reads at a site, then I would use the genotype from the other datasets. If almost all the datasets had evidence of bias, then the site is excluded from the bed file. This improves upon a simple voting scheme, in which the datasets with more evidence of bias might win the vote, and it also allows us to make arbitrated homozygous reference calls. There are likely some true variants outside the bed file, but the bed file lets you restrict your comparison to confident variant and reference calls.

2. We don't count different callers on the same dataset as multiple votes; we combine calls within each dataset before arbitrating between datasets. However, calls made in 3+ Illumina datasets without evidence of bias are included, as long as no more than one of the other datasets disagrees. The reasoning behind this is that Illumina-specific errors are unlikely to make it through this arbitration. However, if you want to be even more conservative, we have included an annotation "platforms", so that calls with platforms>1 have support from at least 2 different platforms. Our bed file is pretty conservative at this point, though most of the excluded bases are excluded because of structural variants in dbVar, segmental duplications, and regions of low mapping quality or coverage, rather than because of our arbitration process (more details about this are in our manuscript). Calls in these regions should at least be questioned when made with short-read technologies, even if they appear to be good.

3. Some datasets were mapped with the decoy sequences from the 1000 Genomes Project, and others weren't. Most of the regions with decoy sequence are already excluded because they fall in segmental duplications or CNVs in dbVar, but I haven't systematically analyzed this yet. The decoy sequences are somewhat ad hoc, so it would be nice to do a systematic analysis of them.
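
To make the arbitration idea in answers 1 and 2 more concrete, here is a minimal Python sketch of it. This is only an illustration under stated assumptions, not our actual pipeline: the names DatasetCall and arbitrate, the boolean "biased" flag (standing in for evidence such as a poor VQSR tranche), and the thresholds min_unbiased and max_disagree are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetCall:
    """One dataset's call at a single site (hypothetical structure)."""
    dataset: str    # dataset label, after combining callers within the dataset
    platform: str   # sequencing platform, e.g. "Illumina"
    genotype: str   # e.g. "0/0" (homozygous reference), "0/1", "1/1"
    biased: bool    # True if this dataset shows evidence of bias at the site


def arbitrate(
    calls: List[DatasetCall],
    min_unbiased: int = 2,   # assumed threshold, not taken from the manuscript
    max_disagree: int = 1,   # assumed threshold, not taken from the manuscript
) -> Optional[Tuple[str, int]]:
    """Return (genotype, number of supporting platforms), or None to exclude the site."""
    # Ignore datasets with evidence of bias instead of letting them vote.
    unbiased = [c for c in calls if not c.biased]

    # If almost every dataset looks biased, the site is treated as uncertain.
    if len(unbiased) < min_unbiased:
        return None

    ranked = Counter(c.genotype for c in unbiased).most_common(2)
    genotype, votes = ranked[0]

    # A tie between the top two genotypes cannot be arbitrated.
    if len(ranked) > 1 and ranked[1][1] == votes:
        return None

    # Too many unbiased datasets disagreeing also makes the site uncertain.
    if len(unbiased) - votes > max_disagree:
        return None

    # Count distinct platforms supporting the call, like a "platforms" annotation.
    platforms = len({c.platform for c in unbiased if c.genotype == genotype})
    return genotype, platforms
```

In this sketch, a site where three biased datasets report 0/0 and two unbiased datasets from different platforms report 0/1 comes out as 0/1 with platforms equal to 2, whereas a simple majority vote would have picked 0/0. Keeping only results with platforms > 1 corresponds to the more conservative filter mentioned in answer 2.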