Dear Genome in a Bottle Bioinformatics Working Group members,
We are collecting all of the data available for NA12878 in one place at NCBI so that consortium members can start experimenting with methods to process and integrate it. Steve Sherry at NCBI is also looking into having the data hosted at AWS. There are a couple of options for doing this, so we would like to gauge interest in using AWS to process consortium data.

NCBI is releasing a new version of the SRA toolkit next month that supports direct access from AWS to the SRA copy of the data, and is testing its performance right now. That gives us two access modes from AWS: (1) compute on the read-only version stored at NCBI, and (2) create a separate AWS copy of the data.

Storing 20 TB in AWS is not that expensive (around $18K for the year) if we want to use this as a test of computing in AWS, but we’d want to be sure it’s actually going to be used. NCBI would therefore like to know whether we can line up a group of researchers who commit to working on the data in AWS (option 2). This could be in addition to using the data with remote I/O (option 1).

Please leave a comment on this blog or email Justin Zook at NIST if you are interested either in possibly using AWS or in committing to using AWS for processing consortium data. We are starting with existing data, but we hope to add new data collected for NA12878 and the PGP trios selected by the consortium.