We found it quite helpful to have the NIST high-confidence call set as an outside standard in our recent exome analysis. http://www.personalis.com/assets/files/posters/ashg2013/Towards_a_medica...
It is very good news that the PGP has started to collect blood for cell line development for all trios that are currently participating. Please add some more details about it in the coming blog post. We are really carrying out an amazing job here. All the support. Regards
Developing methods to integrate multiple datasets and a bed file is very helpful for obtaining sets of reference genotypes. It will make it very easy to track locations and determine their descriptions. Easy work awaits if this is fully developed.
Putting data in the public domain stops people from suspecting one another. It gives clear visibility into how the data is processed, because many people can monitor what is going on, so the data will never undergo an illegal process.
Thanks for the questions. To answer them:
1. We used VQSR slightly differently than it is normally used, to avoid the problem of FN calls you point out. We'll have a pre-print of our manuscript on arXiv describing our methods soon (likely tomorrow), but briefly, I used the VQSR tranche as an indication of bias when arbitrating between datasets that disagree with each other. For example, if some datasets had strand bias or lots of clipped reads, then I would use the genotype from the other datasets. If almost all the datasets had evidence of bias, then the site is excluded from the bed file. This improves upon a simple voting scheme, in which the datasets with more evidence of bias might win a vote, and it also allows us to make arbitrated homozygous reference calls. There are likely some true variants outside the bed file, but the bed file allows you to compare only to confident variant and reference calls.
2. We don't count different callers on the same dataset as multiple votes - we combine calls before arbitrating between datasets. However, calls made in 3+ Illumina datasets without evidence of bias are included in the dataset, as long as no more than one of the other datasets disagrees. The reasoning behind this is that Illumina-specific errors are unlikely to make it through this arbitration. However, if you want to be even more conservative, we have included an annotation "platforms", so that calls with platforms>1 have support from at least 2 different platforms. Our bed file is pretty conservative at this point, though most of the bases excluded are due to structural variants in dbVar, segmental duplications, and low mapping quality/coverage regions, rather than due to our arbitration process (more details about this in our manuscript). These regions should at least be questioned with short-read technologies, even if the calls appear to be good.
3. Some datasets used the decoy from the 1000 Genomes, and others didn't. Most of the regions with decoy sequence are excluded in segmental duplications or CNVs in dbVar, but I haven't systematically analyzed this yet. The decoy sequences are somewhat ad hoc, so it would be nice to do a systematic analysis of these.
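For readers who find code easier to follow, the arbitration idea from point 1 and the "platforms" filter from point 2 can be sketched roughly as below. This is only a minimal Python illustration, not the actual pipeline. The `DatasetCall` class, the function names, the `min_unbiased` threshold of 2, and the rule of excluding a site when the unbiased datasets disagree with each other are all assumptions made for the example. The `platform_count` helper also assumes the annotation appears as a `platforms=N` entry in a semicolon-separated VCF INFO string.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCall:
    """One dataset's call at a site (hypothetical structure for illustration)."""
    dataset: str    # label for the dataset
    genotype: str   # e.g. "0/0", "0/1", "1/1"
    biased: bool    # True if the dataset shows evidence of bias at this site


def simple_vote(calls: List[DatasetCall]) -> str:
    """Naive majority vote that ignores evidence of bias."""
    return Counter(c.genotype for c in calls).most_common(1)[0][0]


def arbitrate(calls: List[DatasetCall], min_unbiased: int = 2) -> Optional[str]:
    """Use only datasets without evidence of bias.

    Returns None when the site should be left out of the bed file.
    The threshold and the disagreement rule are assumptions.
    """
    unbiased = [c for c in calls if not c.biased]
    # Almost every dataset shows bias: exclude the site.
    if len(unbiased) < min_unbiased:
        return None
    genotypes = {c.genotype for c in unbiased}
    # Assumed rule: unbiased datasets must agree, otherwise exclude.
    if len(genotypes) != 1:
        return None
    return genotypes.pop()


def platform_count(info: str) -> int:
    """Read an assumed 'platforms=N' entry from a VCF INFO string."""
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "platforms":
            return int(value)
    return 0


# Three datasets with evidence of bias call a het; two clean ones call hom-ref.
calls = [
    DatasetCall("A", "0/1", biased=True),
    DatasetCall("B", "0/1", biased=True),
    DatasetCall("C", "0/1", biased=True),
    DatasetCall("D", "0/0", biased=False),
    DatasetCall("E", "0/0", biased=False),
]
print(simple_vote(calls))  # 0/1 (the biased datasets win the vote)
print(arbitrate(calls))    # 0/0 (an arbitrated homozygous reference call)
```

In this toy example the simple vote follows the three datasets with evidence of bias, while the arbitration falls back to the two datasets without it. A more conservative user could additionally keep only records where `platform_count(info) > 1`.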