Dear Genome in a Bottle Bioinformatics Working Group members,
We are collecting all of the data available for NA12878 in one place at NCBI so that consortium members can start experimenting with methods to process and integrate the data. Steve Sherry at NCBI is also looking into the possibility of having the data hosted at AWS. There are a couple of options for doing this, so we would like to gauge interest in using AWS to process consortium data.
NCBI is releasing a new version of the SRA toolkit next month that supports access from AWS directly to the SRA copy of the data, and is testing performance right now. So we can have two access modes from AWS: (1) compute on the read-only version stored at NCBI, and (2) create a separate AWS copy of the data. Storing 20TB in AWS is not that expensive (around $18K for the year) if we want to use this as a useful test for computing in AWS. However, we would want to be sure it is actually going to be used. Therefore, NCBI would like to know whether we could line up a group of researchers who commit to working on the data in AWS (option 2). This could be in addition to using the data with remote I/O (option 1).
Please leave a comment on this blog or email Justin Zook at NIST if you are interested in either possibly using AWS or committing to using AWS for processing consortium data. We are starting with existing data, but we hope to add new data collected for NA12878 and the PGP trios selected by the consortium.
Thanks!
Justin

Comments
Yes to AWS data access
I will definitely use the AWS data (raw reads) to run our pipelines on them. I am not sure that SRA toolkit access is suitable because of I/O (I will need more details). We will need to copy the data to local AWS EBS volumes anyway to use it. In fact, rather than having these data files in S3, it would be more useful to have a snapshot of Linux-formatted EBS volumes with the data, or even a snapshot of an NFS-mountable RAID that any user can instantiate and attach to a cluster (e.g. one instantiated with StarCluster) to run analyses from.
I imagine that other tools used in the project that need to run from the raw data (e.g. Cortex, LobSTR) could also benefit from running in AWS, because downloading data and moving it around is a major obstacle. In the long term, as reference data sets are established and used to validate new analytic pipelines, having access to them in an environment where new tools can be run for testing would be very valuable. Raw read data may be used less than aligned BAMs or, later on, VCFs, so I suggest you monitor usage and decide on that basis when to remove these files.
Francisco M. De La Vega, D.Sc.
VP Genome Science
Real Time Genomics, Inc.
Interested in AWS.
I think making the consortium data (raw reads, BAM files and consensus calls) available on AWS will be very useful for us. We have a custom NGS analysis pipeline, and we have used it on the 1000 Genomes data on AWS. Having access to validated reference data that we can use to test our tools will be very valuable. I agree that downloading data locally for processing is a huge obstacle.
Niru Chennagiri