Post-ASHG Update on Genome in a Bottle

Consortium members met at the recent ASHG meeting, and we want to share what was discussed and our plans moving forward so that anyone can comment on them. NIST has started the process to gain Institutional Review Board (IRB) approval for use of the NA12878 pedigree and Personal Genome Project (PGP) samples as NIST Reference Materials. Because recent research has produced methods that can identify individuals from their DNA sequence, NIST is also looking into the possibility of more explicitly reconsenting the living children of NA12878 for possible reidentification and commercial use. Coriell has started growing a large batch of NA12878, which will be used to understand heterogeneity between cell expansions and is a prospective NIST Reference Material, pending IRB approval.

PGP has started collecting blood for cell line development from all trios that are current participants; these trios will be listed soon in a separate blog post. PGP has also identified a Caucasian trio for which the parents have cell lines in QC at Coriell and the son (huAA53E0) is in process for cell line development. Blood from this trio would be available for sequencing now, but our current plan is to wait until cell lines are established, unless any consortium members would like to begin sequencing sooner. Until then, the bioinformatics working group can do methods development using the large amount of existing public data for NA12878.

Chris Mason has performed Nextera mate-pair sequencing on NA12878 with Illumina, and is working with PacBio and others to prepare 20kb fragment libraries and perform PacBio sequencing; these data will be shared with the consortium. We are also talking with Complete Genomics about using their preliminary Long Fragment Read sequencing of NA12878, which has some interesting attributes in terms of phasing and error rates. Illumina has performed paired-end and mate-pair sequencing on the entire NA12878 pedigree and is in the process of uploading these data to EBI/NCBI. OpGen and BioNano Genomics have also started to use their optical mapping platforms to analyze the NA12878 pedigree, which should be useful for de novo assembly and large-scale SV detection.

NIST and NCBI have been collecting publicly available data sets for NA12878, and we will post this list on the consortium website so that others can add data sets we have missed or that have since been released. Consortium members are encouraged to start exploring methods to analyze these data. NIST has also been working on forming consensus SNP calls by integrating 9 whole genome and 2 exome datasets for NA12878, and has submitted a manuscript based on this work. If any consortium members are interested in comparing results to help improve and develop bioinformatics methods for integrating data, please contact Justin Zook.
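For readers new to the idea of integrating calls across datasets, here is a minimal sketch of one simple approach: a majority vote per site. This is not the method used in the NIST manuscript, which is not described in this post; the function name, the `min_fraction` threshold, and the example sites are all made up for illustration.

```python
from collections import Counter


def consensus_genotype(calls, min_fraction=0.75):
    """Return the majority genotype, or None if agreement is too low.

    calls: list of genotype strings such as "0/1", one per dataset;
           None marks a dataset with no call at this site.
    min_fraction: hypothetical agreement threshold among datasets that
           made a call (an assumption, not a consortium value).
    """
    observed = [g for g in calls if g is not None]
    if not observed:
        return None
    genotype, count = Counter(observed).most_common(1)[0]
    if count / len(observed) >= min_fraction:
        return genotype
    return None


# Made-up example: 11 datasets (e.g., 9 whole genome and 2 exome) at two sites.
site_calls = {
    ("chr1", 1000): ["0/1"] * 10 + [None],      # all callers agree
    ("chr1", 2000): ["0/1"] * 6 + ["1/1"] * 5,  # callers disagree
}
consensus = {site: consensus_genotype(c) for site, c in site_calls.items()}
```

Sites where the datasets disagree come back as `None` rather than being forced to a call, which is one way to flag positions that may need closer inspection.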

We also discussed possible bioinformatics problems for members of the consortium to solve:
- How should we represent homozygous reference genotype calls (e.g., gVCF)?
- What should be done with the unmapped reads?
- How should we identify and characterize the “difficult” regions of the genome? How much effort should be placed here, given rapidly improving technologies? Should we focus on characterizing some examples of each type of difficult region that could be used for performance assessment?
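
To make the first question above more concrete, the sketch below shows the basic idea behind gVCF-style representation: runs of adjacent homozygous reference positions are collapsed into a single block with a start and an end, so that reference calls can be reported without one record per base. This is only an illustration under simplified assumptions (genotypes as plain strings, no quality values); the function name and record layout are hypothetical and are not a proposal for the consortium format.

```python
def compress_hom_ref(records):
    """Merge adjacent homozygous-reference positions into blocks.

    records: iterable of (chrom, pos, genotype), sorted by chrom and pos.
    Yields (chrom, start, end, genotype); variant sites are passed
    through as single-position records.
    """
    block = None
    for chrom, pos, gt in records:
        # Extend the open block if this site is hom-ref and directly adjacent.
        if gt == "0/0" and block and block[0] == chrom and block[2] == pos - 1:
            block[2] = pos
            continue
        # Otherwise close any open block before handling this site.
        if block:
            yield tuple(block)
            block = None
        if gt == "0/0":
            block = [chrom, pos, pos, gt]
        else:
            yield (chrom, pos, pos, gt)
    if block:
        yield tuple(block)


# Made-up example input.
calls = [
    ("chr1", 100, "0/0"),
    ("chr1", 101, "0/0"),
    ("chr1", 102, "0/1"),
    ("chr1", 103, "0/0"),
    ("chr1", 105, "0/0"),  # position 104 has no call, so a new block starts
]
blocks = list(compress_hom_ref(calls))
```

A practical representation would likely also need per-block confidence information, so that a confidently called reference region can be distinguished from one with little or no coverage.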